Model Summary

Video-CCAM-4B-v1.2 is a lightweight video multimodal large language model (Video-MLLM) developed by the TencentQQ Multimedia Research Team, built upon Phi-3.5-mini-instruct and SigLIP SO400M. Compared to previous versions, it performs better on public benchmarks and supports Chinese responses.

Usage

Inference uses Hugging Face transformers on NVIDIA GPUs. The requirements below have been tested with Python 3.9/3.10.

pip install -U pip torch transformers accelerate peft decord pysubs2 imageio
# flash attention support
pip install flash-attn --no-build-isolation
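Optionally, a quick sanity check (a minimal sketch, not part of the original instructions) confirms that PyTorch sees the GPU and that flash-attn imports cleanly before loading the model:

import torch

print(torch.__version__)          # a recent 2.x release is expected
print(torch.cuda.is_available())  # should print True on an NVIDIA GPU
try:
    import flash_attn  # noqa: F401
    print('flash-attn available')
except ImportError:
    print('flash-attn missing: pass attn_implementation="sdpa" instead')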

Inference

import os
import torch
from huggingface_hub import snapshot_download
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

# load_decord comes from the Video-CCAM repository (eval.py); clone it first
from eval import load_decord

os.environ['TOKENIZERS_PARALLELISM'] = 'false'

# if you have already downloaded this model, replace the following line with your local path
model_path = snapshot_download(repo_id='JaronTHU/Video-CCAM-4B-v1.2')

videoccam = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map='cuda:0',
    attn_implementation='flash_attention_2'  # use 'sdpa' if flash-attn is not installed
)

tokenizer = AutoTokenizer.from_pretrained(model_path)

image_processor = AutoImageProcessor.from_pretrained(model_path)

messages = [
    [
        {
            'role': 'user',
            'content': '<image>\nDescribe this image in detail.'
        }
    ], [
        {
            'role': 'user',
            'content': '<video>\n请仔细描述这个视频。'  # "Please describe this video in detail."
        }
    ]
]

images = [
    [Image.open('assets/example_image.jpg').convert('RGB')],
    load_decord('assets/example_video.mp4', sample_type='uniform', num_frames=32)
]

# chat returns one generated reply per conversation in the batch
response = videoccam.chat(messages, images, tokenizer, image_processor, max_new_tokens=512, do_sample=False)

print(response)
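
load_decord lives in the Video-CCAM repository's eval.py. If you only need uniform frame sampling, a minimal stand-in can be sketched directly with decord (the name load_video_uniform and its defaults are illustrative, not the repository's API):

import numpy as np
from decord import VideoReader, cpu
from PIL import Image

def load_video_uniform(video_path: str, num_frames: int = 32) -> list:
    """Uniformly sample num_frames RGB frames from a video as PIL images."""
    vr = VideoReader(video_path, ctx=cpu(0))
    indices = np.linspace(0, len(vr) - 1, num=num_frames).astype(int)
    frames = vr.get_batch(indices).asnumpy()  # (num_frames, H, W, 3), uint8
    return [Image.fromarray(frame) for frame in frames]

The repository's load_decord also supports other sample_type strategies; this sketch covers only the uniform case.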

Please refer to the Video-CCAM repository for more details.

Benchmarks

| Benchmark                      | Video-CCAM-4B | Video-CCAM-4B-v1.1 | Video-CCAM-4B-v1.2 |
|--------------------------------|---------------|--------------------|--------------------|
| MVBench (32 frames)            | 57.43         | 62.80              | 66.28              |
| Video-MME (w/o sub, 96 frames) | 49.7          | 50.1               | 51.5               |
| Video-MME (w sub, 96 frames)   | 52.8          | 51.2               | 54.5               |
| MLVU (M-Avg, 96 frames)        | 57.3          | 56.5               | 61.0               |
| VideoVista (96 frames)         | 68.09         | 70.82              | 73.44              |

Acknowledgement

  • xtuner: Video-CCAM models are trained using the xtuner framework. Thanks for their excellent work!
  • Phi-3.5-mini-instruct: A powerful language model developed by Microsoft.
  • SigLIP SO400M: An outstanding vision encoder developed by Google.

License

The project is licensed under the Apache 2.0 License and is restricted to uses that comply with the license agreements of Phi-3.5-mini-instruct and SigLIP SO400M.
