# The VILD Dataset (VIdeo and Long-Description)

This dataset is proposed in [VideoCLIP-XL](https://arxiv.org/abs/2410.00741). We establish an automatic data collection system designed to aggregate a large number of high-quality pairs from multiple data sources. We have successfully collected over 2M (VIdeo, Long Description) pairs, denoted as our VILD dataset.

# Format

```json
{
  "short_captions": ["..."],
  "long_captions": ["..."],
  "video_id": "..."
},
{
  ...
},
...
```

# Source

~~~
@misc{wang2024videoclipxladvancinglongdescription,
  title={VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models},
  author={Jiapeng Wang and Chengyu Wang and Kunzhe Huang and Jun Huang and Lianwen Jin},
  year={2024},
  eprint={2410.00741},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.00741}
}
~~~
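As a minimal sketch of working with records in the format shown above, the snippet below parses a single entry and accesses its fields. It assumes the dataset is stored as one JSON object per line (JSONL); the sample values and the `vild_000001` identifier are hypothetical, not taken from the actual dataset.

```python
import json

# Hypothetical VILD record, following the field names in the Format section.
# In practice this line would come from iterating over the dataset file.
sample_line = (
    '{"short_captions": ["a cat sits on a sofa"], '
    '"long_captions": ["A gray cat sits on a beige sofa in a sunlit room."], '
    '"video_id": "vild_000001"}'
)

record = json.loads(sample_line)

video_id = record["video_id"]              # identifier of the source video
short_caps = record["short_captions"]      # list of short captions
long_caps = record["long_captions"]        # list of long descriptions

print(video_id, len(short_caps), len(long_caps))
```

If the file instead stores a single JSON array, `json.load` over the whole file would replace the per-line parsing.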