jpWang committed on
Commit 72f77cf
1 parent: 394f89e

Upload 6 files

.gitattributes CHANGED
@@ -33,3 +33,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ LVDR-Benchmark-4x1.jsonl filter=lfs diff=lfs merge=lfs -text
+ LVDR-Benchmark-4x2.jsonl filter=lfs diff=lfs merge=lfs -text
+ LVDR-Benchmark-4x3.jsonl filter=lfs diff=lfs merge=lfs -text
+ LVDR-Benchmark-4x4.jsonl filter=lfs diff=lfs merge=lfs -text
+ LVDR-Benchmark-4x5.jsonl filter=lfs diff=lfs merge=lfs -text
LVDR-Benchmark-4x1.jsonl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c6d6ded11e07d8984dc2114970a3e1c32577bbc077fd2fbc1c45c6e9dde9dd40
+ size 10643868
LVDR-Benchmark-4x2.jsonl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:52c974ebabb22fdbccd1c6237658aa12533c6aa69cd901c4181502e3cde528d6
+ size 10651568
LVDR-Benchmark-4x3.jsonl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c59b6064566d37dad162205602a211173e9f00a8ac276d2f3adcd662856ed84e
+ size 10659244
LVDR-Benchmark-4x4.jsonl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d577501d8c5c971bc31768fa576b30dcef734971e808ad955e5623e93f518679
+ size 10665474
LVDR-Benchmark-4x5.jsonl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d68d3322edbb1e6b55ebb034f092e608877013437947c444dcf77ce95ba5a239
+ size 10673358
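
Each `.jsonl` file above is committed as a Git LFS pointer: three `key value` lines giving the spec version, the SHA-256 object ID, and the size in bytes. The sketch below shows one way to check a downloaded file against such a pointer. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and not part of any Git LFS tooling; only the pointer text is taken from this commit.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form '<algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer: dict) -> bool:
    """Return True if the file has the size and SHA-256 digest recorded in the pointer."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        # Hash in 1 MiB chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Pointer text for LVDR-Benchmark-4x1.jsonl, copied from the diff above.
POINTER_4X1 = """\
version https://git-lfs.github.com/spec/v1
oid sha256:c6d6ded11e07d8984dc2114970a3e1c32577bbc077fd2fbc1c45c6e9dde9dd40
size 10643868
"""
pointer = parse_lfs_pointer(POINTER_4X1)
```

Assuming the file has been downloaded to the working directory, `matches_pointer("LVDR-Benchmark-4x1.jsonl", pointer)` would report whether the local copy is the object this commit refers to.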
README.md ADDED
@@ -0,0 +1,32 @@
+ # The LVDR Benchmark (Long Video Description Ranking)
+
+ This benchmark was proposed in [VideoCLIP-XL](https://arxiv.org/abs/2410.00741).
+ For each video and its ground-truth description, we run a synthesis process for p − 1 iterations, replacing q words with hallucinated ones in each iteration. This yields p descriptions in total, with gradually increasing degrees of hallucination. We denote such a subset as p × q and construct five subsets: {4 × 1, 4 × 2, 4 × 3, 4 × 4, 4 × 5}. Given the video, a video CLIP model should rank these descriptions correctly, in descending order of similarity.
+
+ # Format
+ ```json
+ {
+ "long_captions": [
+ "...",
+ ],
+ "video_id": "..."
+ }
+ {
+ .....
+ },
+ .....
+ ```
+
+
+ # Source
+ ~~~
+ @misc{wang2024videoclipxladvancinglongdescription,
+ title={VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models},
+ author={Jiapeng Wang and Chengyu Wang and Kunzhe Huang and Jun Huang and Lianwen Jin},
+ year={2024},
+ eprint={2410.00741},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL},
+ url={https://arxiv.org/abs/2410.00741},
+ }
+ ~~~
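
The README's task, ranking the p descriptions of each video in descending order of similarity, can be sketched as follows. This is a minimal illustration, not the evaluation protocol from the paper. It assumes that each file holds one JSON object per line, and that `long_captions` is stored in order of increasing hallucination with the ground-truth description first; the README confirms neither. The function names, the toy records, and `toy_score` are hypothetical stand-ins. A real run would replace `toy_score` with a model's video-text similarity.

```python
import json
from typing import Callable, Iterable, Iterator, List


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-empty line (assumes one JSON object per line)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def is_correctly_ranked(scores: List[float]) -> bool:
    """True if scores strictly decrease, so less hallucinated captions score higher."""
    return all(a > b for a, b in zip(scores, scores[1:]))


def exact_order_accuracy(
    records: Iterable[dict], score_fn: Callable[[str, str], float]
) -> float:
    """Fraction of records whose captions are scored in strictly descending order.

    Assumes `long_captions` is ordered from least to most hallucinated.
    """
    records = list(records)
    hits = sum(
        is_correctly_ranked([score_fn(r["video_id"], c) for c in r["long_captions"]])
        for r in records
    )
    return hits / len(records) if records else 0.0


# Toy records in the README's format; "x" marks a hallucinated word.
sample = [
    '{"long_captions": ["a b c d", "a b c x", "a b x x", "a x x x"], "video_id": "v0"}',
    '{"long_captions": ["a b", "x b", "a b", "x x"], "video_id": "v1"}',
]


def toy_score(video_id: str, caption: str) -> float:
    # Hypothetical stand-in for a video-text similarity: counts unaltered words.
    return float(sum(word != "x" for word in caption.split()))


accuracy = exact_order_accuracy(read_jsonl(sample), toy_score)
```

To run it on a real subset, pass an open file handle instead of `sample`, for example `read_jsonl(open("LVDR-Benchmark-4x1.jsonl", encoding="utf-8"))`. Exact-order accuracy is only one possible metric; a rank-correlation measure would give partial credit for near-correct orderings.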