---
license: mit
---
## Model Summary
Video-CCAM-14B is a lightweight video multimodal large language model (Video-MLLM) built on [Phi-3-medium-4k-instruct](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct) and [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384).
## Usage
Inference runs with Hugging Face Transformers on NVIDIA GPUs. The following requirements were tested with Python 3.10:
```
torch==2.1.0
torchvision==0.16.0
transformers==4.40.2
peft==0.10.0
```
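The checkpoint can then be loaded through the standard Transformers API. The snippet below is a minimal sketch only: it assumes the checkpoint ships custom modeling code loadable with `trust_remote_code=True`, the `model_id` is a placeholder, and the commented chat call is hypothetical. Consult the [Video-CCAM](https://github.com/QQ-MM/Video-CCAM) repository for the supported entry points.

```python
# Minimal loading sketch; see the Video-CCAM repository for official usage.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "Video-CCAM-14B"  # placeholder: substitute the actual Hub repo id

# Assumes the checkpoint ships custom modeling code (hence trust_remote_code).
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision so the 14B model fits on one GPU
    trust_remote_code=True,
).to("cuda").eval()

# Hypothetical call signature; the real interface is defined in the repo:
# response = model.chat(tokenizer, video_frames, "Describe the video.")
```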
## Inference & Evaluation
Please refer to [Video-CCAM](https://github.com/QQ-MM/Video-CCAM) for inference and evaluation instructions.
### Video-MME: 53.2 (w/o subtitles) / 57.4 (w/ subtitles), 96 frames
### MVBench: 61.43 (16 frames)
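Both scores depend on how many frames are sampled per video (96 for Video-MME, 16 for MVBench). The sketch below shows uniform frame sampling at those counts with the pinned torchvision, assuming PyAV is installed as the video backend; how the sampled frames are preprocessed and fed to the model is defined by the Video-CCAM repository, not here.

```python
# Sketch of uniform frame sampling at the evaluation frame counts.
import torch
from torchvision.io import read_video

def sample_frames(path: str, num_frames: int) -> torch.Tensor:
    # Decode the whole clip as a (T, H, W, C) uint8 tensor.
    frames, _, _ = read_video(path, pts_unit="sec", output_format="THWC")
    # Pick num_frames indices evenly spaced across the clip
    # (indices repeat if the clip has fewer than num_frames frames).
    idx = torch.linspace(0, frames.shape[0] - 1, num_frames).round().long()
    return frames[idx]

frames_96 = sample_frames("example.mp4", 96)  # Video-MME setting
frames_16 = sample_frames("example.mp4", 16)  # MVBench setting
```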
## Acknowledgement
* [xtuner](https://github.com/InternLM/xtuner): Video-CCAM-14B is trained with the xtuner framework. Thanks for their excellent work!
* [Phi-3-medium-4k-instruct](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct): A powerful language model developed by Microsoft.
* [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384): An outstanding vision encoder developed by Google.
## License
The model is licensed under the MIT license.