nan committed
Commit 0eb5b5e
1 parent: 59f9112

feat: update README

Files changed (1): README.md (+286 −2)

README.md CHANGED

---
license: cc-by-4.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
tags:
- ColBERT
- passage-retrieval

# Jina-ColBERT-v2

## Usage

### Installation

`jina-colbert-v2` is trained with flash attention, so using it requires the `einops` and `flash_attn` packages.

You can use the model through either the Stanford ColBERT library or the `ragatouille` package.

```bash
pip install -U einops flash_attn
pip install -U ragatouille
pip install -U colbert-ai
```

### RAGatouille

```python
from ragatouille import RAGPretrainedModel

# Load the pretrained checkpoint
RAG = RAGPretrainedModel.from_pretrained("jinaai/jina-colbert-v2")

docs = [
    "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
    "Jina-ColBERT is a ColBERT-style model based on JinaBERT, supporting an 8k context length as well as fast and accurate retrieval.",
]

# Build an index over the documents
RAG.index(docs, index_name="demo")

query = "What does ColBERT do?"

results = RAG.search(query)
```

### Stanford ColBERT
Typically you would run the following code on a GPU machine to build an index with the Stanford ColBERT library. See the [Stanford ColBERT](https://github.com/stanford-futuredata/ColBERT?tab=readme-ov-file#installation) installation notes for more details.

#### Indexing

```python
from colbert import Indexer
from colbert.infra import ColBERTConfig

if __name__ == "__main__":
    config = ColBERTConfig(
        doc_maxlen=512,  # maximum document length in tokens
        nbits=2,         # bits per dimension used to compress residuals
    )
    indexer = Indexer(
        checkpoint="jinaai/jina-colbert-v2",
        config=config,
    )
    docs = [
        "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
        "Jina-ColBERT is a ColBERT-style model based on JinaBERT, supporting an 8k context length as well as fast and accurate retrieval.",
    ]
    indexer.index(name="demo", collection=docs)
```

#### Searching

```python
from colbert import Searcher
from colbert.infra import ColBERTConfig

if __name__ == "__main__":
    config = ColBERTConfig(
        query_maxlen=128,  # maximum query length in tokens
    )
    searcher = Searcher(
        index="demo",
        config=config,
    )
    query = "What does ColBERT do?"
    results = searcher.search(query, k=10)
```

#### Creating vectors

```python
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

ckpt = Checkpoint("jinaai/jina-colbert-v2", colbert_config=ColBERTConfig())
docs = [
    "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
    "Jina-ColBERT is a ColBERT-style model based on JinaBERT, supporting an 8k context length as well as fast and accurate retrieval.",
]
# Encode documents into per-token vectors
doc_vectors = ckpt.docFromText(docs, bsize=2)
# Encode queries into per-token vectors
query_vectors = ckpt.queryFromText(["What does ColBERT do?"])
print(query_vectors)
```
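
ColBERT scores a query against a document with the late-interaction MaxSim operator: each query token vector takes its maximum similarity over all document token vectors, and those maxima are summed. A minimal NumPy sketch of that scoring step (the toy vectors stand in for real model outputs):

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction score: sum over query tokens of the maximum
    dot-product similarity with any document token."""
    sim = query_vecs @ doc_vecs.T        # (num_q_tokens, num_d_tokens)
    return float(sim.max(axis=1).sum())  # best doc token per query token

# Toy unit vectors standing in for per-token embeddings
query_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_vecs = np.array([[1.0, 0.0], [0.6, 0.8]])

print(maxsim_score(query_vecs, doc_vecs))  # 1.0 + 0.8 = 1.8
```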

## Evaluation Results

### Retrieval Benchmarks

#### BEIR

| **Dataset** | **jina-colbert-v2** | **jina-colbert-v1** | **ColBERTv2.0** | **BM25** |
|--------------------|---------------------|---------------------|-----------------|----------|
| **avg** | 0.531 | 0.502 | 0.496 | 0.440 |
| **nfcorpus** | 0.346 | 0.338 | 0.337 | 0.325 |
| **fiqa** | 0.408 | 0.368 | 0.354 | 0.236 |
| **trec-covid** | 0.834 | 0.750 | 0.726 | 0.656 |
| **arguana** | 0.366 | 0.494 | 0.465 | 0.315 |
| **quora** | 0.887 | 0.823 | 0.855 | 0.789 |
| **scidocs** | 0.186 | 0.169 | 0.154 | 0.158 |
| **scifact** | 0.678 | 0.701 | 0.689 | 0.665 |
| **webis-touche** | 0.274 | 0.270 | 0.260 | 0.367 |
| **dbpedia-entity** | 0.471 | 0.413 | 0.452 | 0.313 |
| **fever** | 0.805 | 0.795 | 0.785 | 0.753 |
| **climate-fever** | 0.239 | 0.196 | 0.176 | 0.213 |
| **hotpotqa** | 0.766 | 0.656 | 0.675 | 0.603 |
| **nq** | 0.640 | 0.549 | 0.524 | 0.329 |
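
BEIR results are conventionally reported as nDCG@10; assuming that convention for the scores above, here is a minimal sketch of the metric with binary relevance grades:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by the ideal DCG."""
    ideal_dcg = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# A ranking that places the only relevant document second instead of first
print(round(ndcg_at_k([0, 1, 0], k=10), 3))  # 0.631
```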


#### MS MARCO Passage Retrieval

| **Models** | **MRR@10** |
|---------------------|------------|
| **jina-colbert-v2** | 0.396 |
| **jina-colbert-v1** | 0.390 |
| **ColBERTv2.0** | **0.397** |
| **BM25** | 0.187 |
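
MRR@10, the metric in the table above, is the mean over queries of the reciprocal rank of the first relevant passage, counting only the top 10 results. A minimal sketch:

```python
def mrr_at_10(first_relevant_ranks):
    """Mean reciprocal rank at cutoff 10.

    `first_relevant_ranks` holds, per query, the 1-based rank of the first
    relevant passage, or None if no relevant passage is retrieved in the top 10.
    """
    total = sum(1.0 / r for r in first_relevant_ranks if r is not None and r <= 10)
    return total / len(first_relevant_ranks)

# Three queries: relevant hit at rank 1, rank 4, and none in the top 10
print(mrr_at_10([1, 4, None]))  # (1.0 + 0.25 + 0.0) / 3
```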

### Multilingual Benchmarks

#### MIRACL
We present our model's performance on MIRACL, a multilingual retrieval benchmark.

| **Language** | **jina-colbert-v2** | **mDPR (zero shot)** |
|---------|---------------------|----------------------|
| **avg** | 0.627 | 0.427 |
| **ar** | 0.753 | 0.499 |
| **bn** | 0.750 | 0.443 |
| **de** | 0.504 | 0.490 |
| **es** | 0.538 | 0.478 |
| **en** | 0.570 | 0.394 |
| **fa** | 0.563 | 0.480 |
| **fi** | 0.740 | 0.472 |
| **fr** | 0.541 | 0.435 |
| **hi** | 0.600 | 0.383 |
| **id** | 0.547 | 0.272 |
| **ja** | 0.632 | 0.439 |
| **ko** | 0.671 | 0.419 |
| **ru** | 0.643 | 0.407 |
| **sw** | 0.499 | 0.299 |
| **te** | 0.742 | 0.356 |
| **th** | 0.772 | 0.358 |
| **yo** | 0.623 | 0.396 |
| **zh** | 0.523 | 0.512 |

#### mMARCO

| **Language** | **jina-colbert-v2** | **BM25** |
|--------|---------------------|-----------|
| **ar** | 0.272 | 0.111 |
| **de** | 0.331 | 0.136 |
| **nl** | 0.330 | 0.140 |
| **es** | 0.341 | 0.158 |
| **fr** | 0.335 | 0.155 |
| **hi** | 0.309 | 0.134 |
| **id** | 0.319 | 0.149 |
| **it** | 0.337 | 0.153 |
| **ja** | 0.276 | 0.141 |
| **pt** | 0.337 | 0.152 |
| **ru** | 0.298 | 0.124 |
| **vi** | 0.287 | 0.136 |
| **zh** | 0.302 | |


### Matryoshka Representation Benchmarks

#### BEIR

| **dim** | **Average** | **nfcorpus** | **fiqa** | **trec-covid** | **hotpotqa** | **nq** |
|---------|-------------|--------------|----------|----------------|--------------|--------|
| **128** | 0.565 | 0.346 | 0.408 | 0.834 | 0.766 | 0.640 |
| **96** | 0.558 | 0.340 | 0.404 | 0.808 | 0.764 | 0.640 |
| **64** | 0.556 | 0.347 | 0.404 | 0.805 | 0.756 | 0.635 |

#### MSMARCO

| **dim** | **msmarco** |
|---------|-------------|
| **128** | 0.396 |
| **96** | 0.391 |
| **64** | 0.388 |
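
Matryoshka-style training means each 128-dimensional token vector can be truncated to its first 96 or 64 dimensions and re-normalized, trading a small amount of accuracy (as in the tables above) for smaller indexes. A minimal NumPy sketch of that truncation step, using random vectors as stand-ins for model outputs:

```python
import numpy as np

def truncate_token_vectors(vecs: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions of each token vector and
    re-normalize each row to unit length."""
    truncated = vecs[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

rng = np.random.default_rng(0)
token_vecs = rng.standard_normal((5, 128))  # 5 token vectors, 128-dim

small = truncate_token_vectors(token_vecs, 64)
print(small.shape)                                       # (5, 64)
print(np.allclose(np.linalg.norm(small, axis=1), 1.0))   # True
```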

## Other Models