---
license: cc-by-4.0
language:
  - multilingual
  - af
  - am
  - ar
  - as
  - az
  - be
  - bg
  - bn
  - br
  - bs
  - ca
  - cs
  - cy
  - da
  - de
  - el
  - en
  - eo
  - es
  - et
  - eu
  - fa
  - fi
  - fr
  - fy
  - ga
  - gd
  - gl
  - gu
  - ha
  - he
  - hi
  - hr
  - hu
  - hy
  - id
  - is
  - it
  - ja
  - jv
  - ka
  - kk
  - km
  - kn
  - ko
  - ku
  - ky
  - la
  - lo
  - lt
  - lv
  - mg
  - mk
  - ml
  - mn
  - mr
  - ms
  - my
  - ne
  - nl
  - 'no'
  - om
  - or
  - pa
  - pl
  - ps
  - pt
  - ro
  - ru
  - sa
  - sd
  - si
  - sk
  - sl
  - so
  - sq
  - sr
  - su
  - sv
  - sw
  - ta
  - te
  - th
  - tl
  - tr
  - ug
  - uk
  - ur
  - uz
  - vi
  - xh
  - yi
  - zh
tags:
- ColBERT
- passage-retrieval
---

<br><br>

<p align="center">
<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>


<p align="center">
<b>Trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>

# Jina-ColBERT-v2
Jina ColBERT v2 (`jina-colbert-v2`) is a new model based on the [Jina-ColBERT architecture](https://jina.ai/news/what-is-colbert-and-late-interaction-and-why-they-matter-in-search/) that expands on the capabilities and performance of the `jina-colbert-v1-en` model. Like the previous release, it has Jina AI’s 8192 token input context and the [improved efficiency, performance](https://jina.ai/news/what-is-colbert-and-late-interaction-and-why-they-matter-in-search/), and [explainability](https://jina.ai/news/ai-explainability-made-easy-how-late-interaction-makes-jina-colbert-transparent/) of token-level embeddings and late interaction. 
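
To make the late-interaction idea concrete, here is a minimal sketch of MaxSim scoring over token-level embeddings. The toy 2-dimensional vectors and the helper names (`dot`, `maxsim_score`) are illustrative assumptions, not part of the model's API; real token embeddings come from the model.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Late-interaction score: for each query token, take the similarity of its
    best-matching document token, then sum those maxima over the query."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings (made-up values, for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
doc_b = [[0.6, 0.8]]

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a > score_b)  # doc_a matches each query token more closely
```

Because every query token is matched to a specific document token, the per-token maxima can be inspected directly, which is where the explainability of late interaction comes from.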

This new release adds new functionality and performance improvements:

- Multilingual support for dozens of languages, with strong performance on major global languages.
- [Matryoshka embeddings](https://huggingface.co/blog/matryoshka), which allow users to trade between efficiency and precision flexibly.
- Superior retrieval performance when compared to the English-only `jina-colbert-v1-en`.
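
As a rough illustration of the Matryoshka trade-off mentioned above, the sketch below shortens a token vector by keeping its leading components. This card does not spell out the exact procedure, so both the leading-components convention and the re-normalization step are assumptions, and `truncate_and_normalize` is a hypothetical helper.

```python
import math
from typing import List, Sequence


def truncate_and_normalize(vec: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale them to unit L2 norm
    (assumed convention for Matryoshka-style embeddings)."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


# Toy 4-dimensional vector standing in for a 128-dimensional token embedding.
token_vec = [0.5, 0.5, 0.5, 0.5]
short_vec = truncate_and_normalize(token_vec, 2)
print(len(short_vec))
```

Halving the dimension (for example from 128 to 64) halves the storage needed per token vector before any index compression is applied.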

## Usage

### Installation

`jina-colbert-v2` is trained with flash attention and therefore requires `einops` and `flash_attn` to be installed.

To use the model, you can use either the Stanford ColBERT library or the `ragatouille` package.

```bash
pip install -U einops flash_attn
pip install -U ragatouille
pip install -U colbert-ai
```

### RAGatouille

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("jinaai/jina-colbert-v2")

docs = [
        "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
        "Jina-ColBERT is a ColBERT-style model but based on JinaBERT so it can support both 8k context length, fast and accurate retrieval."
    ]

RAG.index(docs, index_name="demo")

query = 'What does ColBERT do?'

results = RAG.search(query)
```

### Stanford ColBERT
The following examples use the Stanford ColBERT library to build and query an index, typically on a GPU machine. See the [Stanford ColBERT](https://github.com/stanford-futuredata/ColBERT?tab=readme-ov-file#installation) documentation for more details.

#### Indexing

```python
from colbert import Indexer
from colbert.infra import ColBERTConfig

if __name__ == "__main__":
    config = ColBERTConfig(
        doc_maxlen=512,
        nbits=2
    )
    indexer = Indexer(
        checkpoint="jinaai/jina-colbert-v2",
        config=config,
    )
    docs = [
        "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
        "Jina-ColBERT is a ColBERT-style model but based on JinaBERT so it can support both 8k context length, fast and accurate retrieval."
    ]
    indexer.index(name='demo', collection=docs)
```

#### Searching

```python
from colbert import Searcher
from colbert.infra import ColBERTConfig

k = 10

if __name__ == "__main__":
    config = ColBERTConfig(
        query_maxlen=128
    )
    searcher = Searcher(
        index='demo',
        config=config
    )
    query = 'What does ColBERT do?'
    results = searcher.search(query, k=k)

``` 
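
The search call returns its results as parallel sequences. The sketch below shows one way to pair them with the original texts. It assumes the `(passage_ids, ranks, scores)` tuple shape shown in the Stanford ColBERT README, and that passage ids are positions in the indexed list. `format_results` and the mock values are hypothetical and stand in for real searcher output.

```python
from typing import List, Sequence, Tuple

SearchResults = Tuple[Sequence[int], Sequence[int], Sequence[float]]


def format_results(results: SearchResults, collection: Sequence[str]) -> List[str]:
    """Turn (passage_ids, ranks, scores) into readable lines.
    Assumes each passage id is the position of the text in `collection`."""
    lines = []
    for pid, rank, score in zip(*results):
        lines.append(f"[{rank}] {score:.1f} {collection[pid]}")
    return lines


docs = [
    "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
    "Jina-ColBERT is a ColBERT-style model but based on JinaBERT.",
]

# Made-up values standing in for `searcher.search(query, k=k)` output.
mock_results = ([1, 0], [1, 2], [21.5, 14.2])

for line in format_results(mock_results, docs):
    print(line)
```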

#### Creating vectors

```python
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

ckpt = Checkpoint("jinaai/jina-colbert-v2", colbert_config=ColBERTConfig())
docs = [
        "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
        "Jina-ColBERT is a ColBERT-style model but based on JinaBERT so it can support both 8k context length, fast and accurate retrieval."
    ]
# Note: queryFromText applies the query-side encoding; `docs` is reused here only as sample input.
query_vectors = ckpt.queryFromText(docs, bsize=2)
print(query_vectors)
```

## Evaluation Results

### Retrieval Benchmarks

#### BEIR

| **NDCG@10**        | **jina-colbert-v2** | **jina-colbert-v1** | **ColBERTv2.0** | **BM25** |
|--------------------|---------------------|---------------------|-----------------|----------|
| **avg**            | 0.531               | 0.502               | 0.496           | 0.440    |
| **nfcorpus**       | 0.346               | 0.338               | 0.337           | 0.325    |
| **fiqa**           | 0.408               | 0.368               | 0.354           | 0.236    |
| **trec-covid**     | 0.834               | 0.750               | 0.726           | 0.656    |
| **arguana**        | 0.366               | 0.494               | 0.465           | 0.315    |
| **quora**          | 0.887               | 0.823               | 0.855           | 0.789    |
| **scidocs**        | 0.186               | 0.169               | 0.154           | 0.158    |
| **scifact**        | 0.678               | 0.701               | 0.689           | 0.665    |
| **webis-touche**   | 0.274               | 0.270               | 0.260           | 0.367    |
| **dbpedia-entity** | 0.471               | 0.413               | 0.452           | 0.313    |
| **fever**          | 0.805               | 0.795               | 0.785           | 0.753    |
| **climate-fever**  | 0.239               | 0.196               | 0.176           | 0.213    |
| **hotpotqa**       | 0.766               | 0.656               | 0.675           | 0.603    |
| **nq**             | 0.640               | 0.549               | 0.524           | 0.329    |
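
For readers unfamiliar with the metric in the table above, here is a minimal sketch of NDCG@10 for a single query. It uses the linear-gain formulation (`rel / log2(rank + 1)`). Evaluation toolkits differ in details such as the gain function, so this is an illustration of the idea and not necessarily the exact code behind the reported numbers. The document ids are made up.

```python
import math
from typing import Dict, List


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, int], k: int = 10) -> float:
    """NDCG@k for one query: DCG of the returned ranking divided by the DCG
    of the best possible ranking of the judged documents."""

    def dcg(gains: List[int]) -> float:
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Two relevant documents exist; only one is retrieved, at rank 2.
judgments = {"d1": 1, "d2": 1}
partial = ndcg_at_k(["d3", "d1", "d7"], judgments)
perfect = ndcg_at_k(["d1", "d2", "d3"], judgments)
print(round(partial, 3), perfect)
```

The reported benchmark values are this per-query score averaged over all queries in each dataset.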



#### MS MARCO Passage Retrieval

| **MRR@10**  | **jina-colbert-v2** | **jina-colbert-v1** | **ColBERTv2.0** | **BM25** |
|-------------|---------------------|---------------------|-----------------|----------|
| **MSMARCO** | 0.396               | 0.390               | 0.397           | 0.187    |
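
MRR@10, used in the table above, rewards placing the first relevant passage as high as possible. A minimal sketch, with made-up query and passage ids and a hypothetical helper name:

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank: for each query, 1 / rank of the first relevant
    passage within the top k (0 if none), averaged over all queries."""
    total = 0.0
    for qid, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids[:k], start=1):
            if doc_id in relevant.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(rankings) if rankings else 0.0


rankings = {"q1": ["p4", "p2", "p9"], "q2": ["p7", "p8"]}
relevant = {"q1": {"p2"}, "q2": {"p1"}}

mrr = mrr_at_k(rankings, relevant)
print(mrr)  # q1 contributes 1/2, q2 contributes 0
```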


### Multilingual Benchmarks

#### MIRACL

| **NDCG@10**    | **jina-colbert-v2** | **mDPR (zero shot)** |
|---------|---------------------|----------------------|
| **avg** | 0.627               | 0.427                |
| **ar**  | 0.753               | 0.499                |
| **bn**  | 0.750               | 0.443                |
| **de**  | 0.504               | 0.490                |
| **es**  | 0.538               | 0.478                |
| **en**  | 0.570               | 0.394                |
| **fa**  | 0.563               | 0.480                |
| **fi**  | 0.740               | 0.472                |
| **fr**  | 0.541               | 0.435                |
| **hi**  | 0.600               | 0.383                |
| **id**  | 0.547               | 0.272                |
| **ja**  | 0.632               | 0.439                |
| **ko**  | 0.671               | 0.419                |
| **ru**  | 0.643               | 0.407                |
| **sw**  | 0.499               | 0.299                |
| **te**  | 0.742               | 0.356                |
| **th**  | 0.772               | 0.358                |
| **yo**  | 0.623               | 0.396                |
| **zh**  | 0.523               | 0.512                |

#### mMARCO

| **MRR@10** | **jina-colbert-v2** | **BM25**  | **ColBERT-XM** |
|------------|---------------------|-----------|----------------|
| **avg**    | 0.313               | 0.141     | 0.254          |
| **ar**     | 0.272               | 0.111     | 0.195          |
| **de**     | 0.331               | 0.136     | 0.270          |
| **nl**     | 0.330               | 0.140     | 0.275          |
| **es**     | 0.341               | 0.158     | 0.285          |
| **fr**     | 0.335               | 0.155     | 0.269          |
| **hi**     | 0.309               | 0.134     | 0.238          |
| **id**     | 0.319               | 0.149     | 0.263          |
| **it**     | 0.337               | 0.153     | 0.265          |
| **ja**     | 0.276               | 0.141     | 0.241          |
| **pt**     | 0.337               | 0.152     | 0.276          |
| **ru**     | 0.298               | 0.124     | 0.251          |
| **vi**     | 0.287               | 0.136     | 0.226          |
| **zh**     | 0.302               |           | 0.246          |



### Matryoshka Representation Benchmarks

#### BEIR

| **NDCG@10**    | **dim=128** | **dim=96** | **dim=64** |
|----------------|-------------|------------|------------|
| **avg**    | 0.599       | 0.591      | 0.589      |
| **nfcorpus**   | 0.346       | 0.340      | 0.347      |
| **fiqa**       | 0.408       | 0.404      | 0.404      |
| **trec-covid** | 0.834       | 0.808      | 0.805      |
| **hotpotqa**   | 0.766       | 0.764      | 0.756      |
| **nq**         | 0.640       | 0.640      | 0.635      |


#### MS MARCO

| **MRR@10**     | **dim=128** | **dim=96** | **dim=64** |
|----------------|-------------|------------|------------|
| **msmarco**    | 0.396       | 0.391      | 0.388      |

## Other Models

Additionally, we provide the following embedding models, which you can also use for retrieval.

- [`jina-embeddings-v2-base-en`](https://huggingface.co/jinaai/jina-embeddings-v2-base-en): 137 million parameters.
- [`jina-embeddings-v2-base-zh`](https://huggingface.co/jinaai/jina-embeddings-v2-base-zh): 161 million parameters, Chinese-English bilingual model.
- [`jina-embeddings-v2-base-de`](https://huggingface.co/jinaai/jina-embeddings-v2-base-de): 161 million parameters, German-English bilingual model.
- [`jina-embeddings-v2-base-es`](https://huggingface.co/jinaai/jina-embeddings-v2-base-es): 161 million parameters, Spanish-English bilingual model.

## Contact

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.