Shitao committed
Commit
442084e
1 Parent(s): ecc1ac1

Update README.md

Files changed (1): README.md +8 -37
README.md CHANGED
@@ -6,20 +6,7 @@ pipeline_tag: sentence-similarity
 ---
 
 <h1 align="center">FlagEmbedding</h1>
- <p align="center">
- <a href="https://www.python.org/">
- <img alt="Build" src="https://img.shields.io/badge/Contribution-Welcome-blue">
- </a>
- <a href="https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE">
- <img alt="License" src="https://img.shields.io/badge/LICENSE-MIT-green">
- </a>
- <a href="https://huggingface.co/C-MTEB">
- <img alt="Build" src="https://img.shields.io/badge/C_MTEB-🤗-yellow">
- </a>
- <a href="https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding">
- <img alt="Build" src="https://img.shields.io/badge/FlagEmbedding-1.0.1-red">
- </a>
- </p>
+
 
 <h4 align="center">
 <p>
@@ -27,16 +14,16 @@ pipeline_tag: sentence-similarity
 <a href=#usage>Usage</a> |
 <a href="#evaluation">Evaluation</a> |
 <a href="#train">Train</a> |
- <a href="#contact">Contact</a> |
 <a href="#license">License</a>
 <p>
 </h4>
 
+ For more details please refer to our GitHub repo: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding).
 
 [English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)
 
 FlagEmbedding can map any text to a low-dimensional dense vector which can be used for tasks like retrieval, classification, clustering, or semantic search.
- And it also can be used in vector database for LLMs.
+ And it also can be used in vector databases for LLMs.
 
 ************* 🌟**Updates**🌟 *************
 - 08/05/2023: Release base-scale and small-scale models, **best performance among the models of the same size 🤗**
@@ -103,7 +90,7 @@ embeddings = model.encode(sentences, normalize_embeddings=True)
 print(embeddings)
 ```
 For retrieval task,
- each query should start with a instruction (instructions see [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list)).
+ each query should start with an instruction (instructions see [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list)).
 ```python
 from sentence_transformers import SentenceTransformer
 queries = ["手机开不了机怎么办?"]
@@ -132,7 +119,7 @@ model = AutoModel.from_pretrained('BAAI/bge-large-zh')
 
 # Tokenize sentences
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
- # for retrieval task, add a instruction to query
+ # for retrieval task, add an instruction to query
 # encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')
 
 # Compute token embeddings
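The hunk ends before the snippet actually computes the sentence embedding. For reference, a hedged completion assuming the usual pattern for BERT-style embedding models, CLS-token pooling followed by L2 normalization (verify against the full README, which defines the pooling this model expects):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-zh")
model = AutoModel.from_pretrained("BAAI/bge-large-zh")
model.eval()

sentences = ["样例数据-1", "样例数据-2"]  # "sample data 1/2"
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    model_output = model(**encoded_input)
    # assumed pooling: hidden state of the first ([CLS]) token
    embeddings = model_output.last_hidden_state[:, 0]

# L2-normalize so that dot products are cosine similarities
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)
```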
@@ -176,7 +163,7 @@ More details and evaluation tools see our [scripts](https://github.com/FlagOpen/
 
 
 - **C-MTEB**:
- We create a benchmark C-MTEB for chinese text embedding which consists of 31 datasets from 6 tasks.
+ We create a benchmark C-MTEB for Chinese text embedding which consists of 31 datasets from 6 tasks.
 Please refer to [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/README.md) for a detailed introduction.
 
 | Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
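On running the C-MTEB evaluation mentioned in this hunk: the benchmark builds on the MTEB framework, so any model exposing an `encode` method (a `SentenceTransformer`, for example) can be scored. A rough sketch using the `mteb` package; the task name below is only a placeholder, the actual C-MTEB task definitions come from the linked C_MTEB directory:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-zh")

# "STSBenchmark" is a placeholder task name; substitute the C-MTEB tasks
# provided by the C_MTEB package for the Chinese benchmark.
evaluation = MTEB(tasks=["STSBenchmark"])
evaluation.run(model, output_folder="results/bge-large-zh")  # per-task scores are written as JSON
```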
@@ -204,7 +191,7 @@ and we provide some examples to do [pre-train](https://github.com/FlagOpen/FlagE
 We pre-train the model following the method [retromae](https://github.com/staoxiao/RetroMAE),
 which shows promising improvement in retrieval task ([paper](https://aclanthology.org/2022.emnlp-main.35.pdf)).
 The pre-training was conducted on 24 A100(40G) GPUs with a batch size of 720.
- In retromae, the mask ratio of encoder and decoder are 0.3, 0.5 respectively.
+ In retromae, the mask ratio of encoder and decoder are 0.3, and 0.5 respectively.
 We used the AdamW optimizer and the learning rate is 2e-5.
 
 **Pre-training data**:
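To make the asymmetric masking in this hunk concrete, a toy sketch of the idea (not the released RetroMAE code): the same sentence is corrupted twice, lightly (ratio 0.3) for the encoder and heavily (ratio 0.5) for the decoder, so the sentence embedding must carry enough information to reconstruct the heavily masked side.

```python
import torch

def random_mask(input_ids: torch.Tensor, ratio: float, mask_token_id: int = 103) -> torch.Tensor:
    """Replace a random `ratio` of token ids with the [MASK] id (toy illustration only)."""
    corrupted = input_ids.clone()
    corrupted[torch.rand(input_ids.shape) < ratio] = mask_token_id
    return corrupted

input_ids = torch.randint(1000, 30000, (1, 64))    # toy token ids for one sentence
encoder_input = random_mask(input_ids, ratio=0.3)  # light mask for the encoder
decoder_input = random_mask(input_ids, ratio=0.5)  # heavy mask for the decoder
```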
@@ -221,7 +208,7 @@ We used the AdamW optimizer and the learning rate is 2e-5.
 We fine-tune the model using a contrastive objective.
 The format of input data is a triple`(query, positive, negative)`.
 Besides the negative in the triple, we also adopt in-batch negatives strategy.
- We employ the cross-device negatives sharing method to sharing negatives among different GPUs,
+ We employ the cross-device negatives sharing method to share negatives among different GPUs,
 which can dramatically **increase the number of negatives**.
 
 We trained our model on 48 A100(40G) GPUs with a large batch size of 32,768 (so there are **65,535** negatives for each query in a batch).
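The contrastive objective with in-batch negatives referenced in this hunk is essentially InfoNCE over a query-passage similarity matrix: each query's own passage is the positive and every other passage in the (gathered) batch is a negative. A minimal sketch, where the temperature and other details are assumptions rather than the released training code; cross-device sharing would all-gather the passage embeddings from every GPU before the matrix is formed, which is what pushes the negative count up to the global batch size.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb: torch.Tensor, p_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives (illustrative; the temperature is an assumption)."""
    q_emb = F.normalize(q_emb, dim=-1)
    p_emb = F.normalize(p_emb, dim=-1)
    scores = q_emb @ p_emb.T / temperature                      # [B, B] similarity matrix
    labels = torch.arange(q_emb.size(0), device=q_emb.device)   # diagonal entries are the positives
    return F.cross_entropy(scores, labels)

# toy check: 4 queries, 4 passages
loss = in_batch_contrastive_loss(torch.randn(4, 768), torch.randn(4, 768))
print(loss.item())
```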
@@ -246,22 +233,6 @@ You can easily finetune your model with it.
 **The data collection is to be released in the future.**
 
 
- ## Schedule
- - [x] Chinese Massive Text Embedding Benchmark
- - [x] release baai-general-embedding models
- - [x] release codes for training
- - [ ] Training Datasets
- - [ ] Multilingual model
- - [ ] ...
-
- We will continually update the embedding models and training codes,
- hoping to promote the development of the embedding model community.
-
-
- ## Contact
- If you have any question or suggestion related to this project, feel free to open an issue or pull a request.
- You also can email Shitao Xiao(stxiao@baai.ac.cn) and Zheng Liu(liuzheng@baai.ac.cn).
-
 
 ## License
 FlagEmbedding is licensed under [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.
 