ldwang committed on
Commit 97969ae
1 Parent(s): 19b0485
Files changed (1)
  1. README.md +34 -16
README.md CHANGED
@@ -21,6 +21,7 @@ language:
21
  <a href="#evaluation">Evaluation</a> |
22
  <a href="#train">Train</a> |
23
  <a href="#contact">Contact</a> |
 
24
  <a href="#license">License</a>
25
  <p>
26
  </h4>
@@ -34,6 +35,7 @@ FlagEmbedding can map any text to a low-dimensional dense vector which can be us
34
  It can also be used in vector databases for LLMs.
35
 
36
  ************* 🌟**Updates**🌟 *************
 
37
  - 09/12/2023: New Release:
38
  - **New reranker model**: released cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than embedding models. We recommend using or fine-tuning them to re-rank the top-k documents returned by embedding models.
39
  - **Updated embedding model**: released `bge-*-v1.5` embedding models to alleviate the issue of the similarity distribution and enhance retrieval ability without instruction.
@@ -68,10 +70,9 @@ And it also can be used in vector databases for LLMs.
68
 
69
  \*: If you need to search for passages relevant to a query, we suggest adding the instruction to the query; in other cases, no instruction is needed, and you can use the original query directly. In all cases, **no instruction** needs to be added to passages.
70
 
71
- \**: Different embedding model, reranker is a cross-encoder, which cannot be used to generate embedding. To balance the accuracy and time cost, cross-encoder is widely used to re-rank top-k documents retrieved by other simple models.
72
  For example, use the bge embedding model to retrieve the top 100 relevant documents, and then use the bge reranker to re-rank those 100 documents to get the final top-3 results.
73
 
74
-
75
  ## Frequently asked questions
76
 
77
  <details>
@@ -134,7 +135,9 @@ If it doesn't work for you, you can see [FlagEmbedding](https://github.com/FlagO
134
  from FlagEmbedding import FlagModel
135
  sentences_1 = ["样例数据-1", "样例数据-2"]
136
  sentences_2 = ["样例数据-3", "样例数据-4"]
137
- model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
 
 
138
  embeddings_1 = model.encode(sentences_1)
139
  embeddings_2 = model.encode(sentences_2)
140
  similarity = embeddings_1 @ embeddings_2.T
@@ -165,7 +168,7 @@ pip install -U sentence-transformers
165
  from sentence_transformers import SentenceTransformer
166
  sentences_1 = ["样例数据-1", "样例数据-2"]
167
  sentences_2 = ["样例数据-3", "样例数据-4"]
168
- model = SentenceTransformer('BAAI/bge-large-zh')
169
  embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
170
  embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
171
  similarity = embeddings_1 @ embeddings_2.T
@@ -180,7 +183,7 @@ queries = ['query_1', 'query_2']
180
  passages = ["样例文档-1", "样例文档-2"]
181
  instruction = "为这个句子生成表示以用于检索相关文章:"
182
 
183
- model = SentenceTransformer('BAAI/bge-large-zh')
184
  q_embeddings = model.encode([instruction+q for q in queries], normalize_embeddings=True)
185
  p_embeddings = model.encode(passages, normalize_embeddings=True)
186
  scores = q_embeddings @ p_embeddings.T
@@ -191,7 +194,7 @@ scores = q_embeddings @ p_embeddings.T
191
  You can use `bge` in LangChain like this:
192
  ```python
193
  from langchain.embeddings import HuggingFaceBgeEmbeddings
194
- model_name = "BAAI/bge-small-en"
195
  model_kwargs = {'device': 'cuda'}
196
  encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
197
  model = HuggingFaceBgeEmbeddings(
@@ -215,8 +218,8 @@ import torch
215
  sentences = ["样例数据-1", "样例数据-2"]
216
 
217
  # Load model from HuggingFace Hub
218
- tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh')
219
- model = AutoModel.from_pretrained('BAAI/bge-large-zh')
220
  model.eval()
221
 
222
  # Tokenize sentences
@@ -236,6 +239,7 @@ print("Sentence embeddings:", sentence_embeddings)
236
 
237
  ### Usage for Reranker
238
 
 
239
  You can get a relevance score by inputting a query and a passage to the reranker.
240
  The reranker is optimized with cross-entropy loss, so the relevance score is not bounded to a specific range.
241
 
@@ -245,10 +249,10 @@ The reranker is optimized based cross-entropy loss, so the relevance score is no
245
  pip install -U FlagEmbedding
246
  ```
247
 
248
- Get relevance score:
249
  ```python
250
  from FlagEmbedding import FlagReranker
251
- reranker = FlagReranker('BAAI/bge-reranker-base', use_fp16=True) #use fp16 can speed up computing
252
 
253
  score = reranker.compute_score(['query', 'passage'])
254
  print(score)
@@ -262,10 +266,10 @@ print(scores)
262
 
263
  ```python
264
  import torch
265
- from transformers import AutoModelForSequenceClassification, AutoTokenizer, BatchEncoding, PreTrainedTokenizerFast
266
 
267
- tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-base')
268
- model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-base')
269
  model.eval()
270
 
271
  pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
@@ -331,7 +335,7 @@ Please refer to [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C
331
  - **Reranking**:
332
  See [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/) for the evaluation script.
333
 
334
- | Model | T2Reranking | T2RerankingZh2En\* | T2RerankingEn2Zh\* | MmarcoReranking | CMedQAv1 | CMedQAv2 | Avg |
335
  |:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
336
  | text2vec-base-multilingual | 64.66 | 62.94 | 62.51 | 14.37 | 48.46 | 48.6 | 50.26 |
337
  | multilingual-e5-small | 65.62 | 60.94 | 56.41 | 29.91 | 67.26 | 66.54 | 57.78 |
@@ -344,13 +348,13 @@ See [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/) for
344
  | [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base) | 67.28 | 63.95 | 60.45 | 35.46 | 81.26 | 84.1 | 65.42 |
345
  | [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) | 67.6 | 64.03 | 61.44 | 37.16 | 82.15 | 84.18 | 66.09 |
346
 
347
- \* : T2RerankingZh2En and T2RerankingEn2Zh are cross-language retrieval task
348
 
349
  ## Train
350
 
351
  ### BAAI Embedding
352
 
353
- We pre-train the models using retromae and train them on large-scale pairs data using contrastive learning.
354
  **You can fine-tune the embedding model on your data following our [examples](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune).**
355
  We also provide a [pre-train example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain).
356
  Note that the goal of pre-training is to reconstruct the text; the pre-trained model cannot be used for similarity calculation directly and needs to be fine-tuned.
@@ -373,6 +377,20 @@ If you have any question or suggestion related to this project, feel free to ope
373
  You can also email Shitao Xiao (stxiao@baai.ac.cn) and Zheng Liu (liuzheng@baai.ac.cn).
374
 
375

376
  ## License
377
  FlagEmbedding is licensed under the [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.
378
 
 
21
  <a href="#evaluation">Evaluation</a> |
22
  <a href="#train">Train</a> |
23
  <a href="#contact">Contact</a> |
24
+ <a href="#citation">Citation</a> |
25
  <a href="#license">License</a>
26
  <p>
27
  </h4>
 
35
  It can also be used in vector databases for LLMs.
36
 
37
  ************* 🌟**Updates**🌟 *************
38
+ - 09/15/2023: Released the [paper](https://arxiv.org/pdf/2309.07597.pdf) and [dataset](https://data.baai.ac.cn/details/BAAI-MTP).
39
  - 09/12/2023: New Release:
40
  - **New reranker model**: released cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than embedding models. We recommend using or fine-tuning them to re-rank the top-k documents returned by embedding models.
41
  - **Updated embedding model**: released `bge-*-v1.5` embedding models to alleviate the issue of the similarity distribution and enhance retrieval ability without instruction.
 
70
 
71
  \*: If you need to search for passages relevant to a query, we suggest adding the instruction to the query; in other cases, no instruction is needed, and you can use the original query directly. In all cases, **no instruction** needs to be added to passages.
72
 
73
+ \**: Different from the embedding model, the reranker takes a question and a document as input and directly outputs a similarity score instead of an embedding. To balance accuracy and time cost, cross-encoders are widely used to re-rank the top-k documents retrieved by simpler models.
74
  For example, use the bge embedding model to retrieve the top 100 relevant documents, and then use the bge reranker to re-rank those 100 documents to get the final top-3 results.
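The two-stage flow described above can be sketched end to end as follows. This is an illustrative sketch, not code from this repository: the corpus, query, and top-k sizes are placeholders, and it assumes the `FlagModel.encode_queries` helper (which prepends the retrieval instruction to queries) alongside the `FlagReranker` usage shown later in this README.

```python
import numpy as np
from FlagEmbedding import FlagModel, FlagReranker

# Placeholder corpus and query; in practice the corpus is your document collection.
corpus = ["passage 1 ...", "passage 2 ...", "passage 3 ..."]
query = "what is panda?"

# Stage 1: retrieve candidates with the embedding (bi-encoder) model.
embedder = FlagModel('BAAI/bge-large-en-v1.5',
                     query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
                     use_fp16=True)
q_emb = embedder.encode_queries([query])   # the instruction is added to the query only
p_emb = embedder.encode(corpus)            # passages are encoded without any instruction
sims = (q_emb @ p_emb.T)[0]
candidates = np.argsort(-sims)[:100]       # keep up to the top 100 candidates

# Stage 2: re-rank the candidates with the cross-encoder reranker.
reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)
pairs = [[query, corpus[i]] for i in candidates]
rerank_scores = np.asarray(reranker.compute_score(pairs))
top3 = [corpus[i] for i in candidates[np.argsort(-rerank_scores)][:3]]  # final top-3
print(top3)
```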
75
 
 
76
  ## Frequently asked questions
77
 
78
  <details>
 
135
  from FlagEmbedding import FlagModel
136
  sentences_1 = ["样例数据-1", "样例数据-2"]
137
  sentences_2 = ["样例数据-3", "样例数据-4"]
138
+ model = FlagModel('BAAI/bge-large-zh-v1.5',
139
+ query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
140
+ use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
141
  embeddings_1 = model.encode(sentences_1)
142
  embeddings_2 = model.encode(sentences_2)
143
  similarity = embeddings_1 @ embeddings_2.T
 
168
  from sentence_transformers import SentenceTransformer
169
  sentences_1 = ["样例数据-1", "样例数据-2"]
170
  sentences_2 = ["样例数据-3", "样例数据-4"]
171
+ model = SentenceTransformer('BAAI/bge-large-zh-v1.5')
172
  embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
173
  embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
174
  similarity = embeddings_1 @ embeddings_2.T
 
183
  passages = ["样例文档-1", "样例文档-2"]
184
  instruction = "为这个句子生成表示以用于检索相关文章:"
185
 
186
+ model = SentenceTransformer('BAAI/bge-large-zh-v1.5')
187
  q_embeddings = model.encode([instruction+q for q in queries], normalize_embeddings=True)
188
  p_embeddings = model.encode(passages, normalize_embeddings=True)
189
  scores = q_embeddings @ p_embeddings.T
 
194
  You can use `bge` in LangChain like this:
195
  ```python
196
  from langchain.embeddings import HuggingFaceBgeEmbeddings
197
+ model_name = "BAAI/bge-large-en-v1.5"
198
  model_kwargs = {'device': 'cuda'}
199
  encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
200
  model = HuggingFaceBgeEmbeddings(
 
218
  sentences = ["样例数据-1", "样例数据-2"]
219
 
220
  # Load model from HuggingFace Hub
221
+ tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
222
+ model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5')
223
  model.eval()
224
 
225
  # Tokenize sentences
 
239
 
240
  ### Usage for Reranker
241
 
242
+ Different from the embedding model, the reranker takes a question and a document as input and directly outputs a similarity score instead of an embedding.
243
  You can get a relevance score by inputting a query and a passage to the reranker.
244
  The reranker is optimized with cross-entropy loss, so the relevance score is not bounded to a specific range.
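Because the score is an unbounded logit rather than a probability, a quick way to obtain a value in (0, 1) when one is convenient is to pass it through a sigmoid; this is a monotonic rescaling and does not change the ranking. A minimal sketch:

```python
import math

def sigmoid(logit: float) -> float:
    # Map an unbounded reranker logit to (0, 1); the ranking order is preserved.
    return 1.0 / (1.0 + math.exp(-logit))

print(sigmoid(2.3), sigmoid(-1.7))   # e.g. 0.909..., 0.154...
```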
245
 
 
249
  pip install -U FlagEmbedding
250
  ```
251
 
252
+ Get relevance scores (higher scores indicate more relevance):
253
  ```python
254
  from FlagEmbedding import FlagReranker
255
+ reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
256
 
257
  score = reranker.compute_score(['query', 'passage'])
258
  print(score)
 
266
 
267
  ```python
268
  import torch
269
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
270
 
271
+ tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large')
272
+ model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large')
273
  model.eval()
274
 
275
  pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
 
335
  - **Reranking**:
336
  See [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/) for the evaluation script.
337
 
338
+ | Model | T2Reranking | T2RerankingZh2En\* | T2RerankingEn2Zh\* | MMarcoReranking | CMedQAv1 | CMedQAv2 | Avg |
339
  |:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
340
  | text2vec-base-multilingual | 64.66 | 62.94 | 62.51 | 14.37 | 48.46 | 48.6 | 50.26 |
341
  | multilingual-e5-small | 65.62 | 60.94 | 56.41 | 29.91 | 67.26 | 66.54 | 57.78 |
 
348
  | [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base) | 67.28 | 63.95 | 60.45 | 35.46 | 81.26 | 84.1 | 65.42 |
349
  | [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) | 67.6 | 64.03 | 61.44 | 37.16 | 82.15 | 84.18 | 66.09 |
350
 
351
+ \* : T2RerankingZh2En and T2RerankingEn2Zh are cross-language retrieval tasks
352
 
353
  ## Train
354
 
355
  ### BAAI Embedding
356
 
357
+ We pre-train the models using [RetroMAE](https://github.com/staoxiao/RetroMAE) and train them on large-scale pair data using contrastive learning.
358
  **You can fine-tune the embedding model on your data following our [examples](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune).**
359
  We also provide a [pre-train example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain).
360
  Note that the goal of pre-training is to reconstruct the text; the pre-trained model cannot be used for similarity calculation directly and needs to be fine-tuned.
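For intuition about the contrastive stage, the sketch below computes an in-batch InfoNCE-style loss over (query, positive passage) embedding pairs; it is an illustration of the objective only, not this project's training code, and the temperature value is an arbitrary placeholder.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb: torch.Tensor, p_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    # q_emb, p_emb: [batch, dim]; row i of p_emb is the positive for row i of q_emb,
    # and the other rows in the batch act as in-batch negatives.
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    logits = q @ p.T / temperature                      # scaled cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors standing in for encoder outputs.
print(in_batch_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768)).item())
```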
 
377
  You can also email Shitao Xiao (stxiao@baai.ac.cn) and Zheng Liu (liuzheng@baai.ac.cn).
378
 
379
 
380
+ ## Citation
381
+
382
+ If you find our work helpful, please cite us:
383
+ ```
384
+ @misc{bge_embedding,
385
+ title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
386
+ author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
387
+ year={2023},
388
+ eprint={2309.07597},
389
+ archivePrefix={arXiv},
390
+ primaryClass={cs.CL}
391
+ }
392
+ ```
393
+
394
  ## License
395
  FlagEmbedding is licensed under the [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.
396