Update README.md
README.md CHANGED
@@ -6,20 +6,7 @@ pipeline_tag: sentence-similarity
 ---
 
 <h1 align="center">FlagEmbedding</h1>
-
-<a href="https://www.python.org/">
-    <img alt="Build" src="https://img.shields.io/badge/Contribution-Welcome-blue">
-</a>
-<a href="https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE">
-    <img alt="License" src="https://img.shields.io/badge/LICENSE-MIT-green">
-</a>
-<a href="https://huggingface.co/C-MTEB">
-    <img alt="Build" src="https://img.shields.io/badge/C_MTEB-🤗-yellow">
-</a>
-<a href="https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding">
-    <img alt="Build" src="https://img.shields.io/badge/FlagEmbedding-1.0.1-red">
-</a>
-</p>
+
 
 <h4 align="center">
     <p>
@@ -27,16 +14,16 @@ pipeline_tag: sentence-similarity
         <a href=#usage>Usage</a> |
         <a href="#evaluation">Evaluation</a> |
         <a href="#train">Train</a> |
-        <a href="#contact">Contact</a> |
         <a href="#license">License</a>
     <p>
 </h4>
 
+For more details, please refer to our GitHub repo: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding).
 
 [English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)
 
 FlagEmbedding can map any text to a low-dimensional dense vector which can be used for tasks like retrieval, classification, clustering, or semantic search.
-And it also can be used in vector
+It can also be used in vector databases for LLMs.
 
 ************* 🌟**Updates**🌟 *************
 - 08/05/2023: Release base-scale and small-scale models, **best performance among the models of the same size 🤗**
@@ -103,7 +90,7 @@ embeddings = model.encode(sentences, normalize_embeddings=True)
 print(embeddings)
 ```
 For retrieval task,
-each query should start with
+each query should start with an instruction (see [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list) for the instructions).
 ```python
 from sentence_transformers import SentenceTransformer
 queries = ["手机开不了机怎么办?"]
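As a concrete illustration of the instruction-prefixed queries mentioned in this hunk, here is a minimal retrieval sketch with `sentence_transformers`. The instruction string and the candidate passages are placeholders (use the instruction listed for your model in the Model List), and `BAAI/bge-large-zh` is taken from the Transformers example further down.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-large-zh')

# Placeholder instruction -- substitute the instruction given for your model in the Model List.
instruction = "为这个句子生成表示以用于检索相关文章："
queries = ["手机开不了机怎么办?"]
passages = ["样例文档-1", "样例文档-2"]  # hypothetical candidate passages

# Only queries get the instruction; passages are encoded as-is.
q_embeddings = model.encode([instruction + q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)

# With normalized embeddings, the inner product equals cosine similarity.
scores = q_embeddings @ p_embeddings.T
print(scores)
```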
@@ -132,7 +119,7 @@ model = AutoModel.from_pretrained('BAAI/bge-large-zh')
 
 # Tokenize sentences
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
-# for retrieval task, add
+# for retrieval tasks, add an instruction to each query
 # encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')
 
 # Compute token embeddings
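The hunk above stops at the tokenization step. A minimal sketch of the remaining steps (forward pass, pooling, normalization) could look like the following; it assumes the first token's hidden state is used as the sentence embedding, which is an illustration rather than the model card's prescribed pooling, and the sentences are placeholders.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh')
model = AutoModel.from_pretrained('BAAI/bge-large-zh')

sentences = ["样例数据-1", "样例数据-2"]  # placeholder sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings without tracking gradients.
with torch.no_grad():
    model_output = model(**encoded_input)
    # Assumption: take the first token's hidden state as the sentence embedding.
    sentence_embeddings = model_output[0][:, 0]

# Normalize so dot products behave like cosine similarities.
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print(sentence_embeddings.shape)
```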
@@ -176,7 +163,7 @@ More details and evaluation tools see our [scripts](https://github.com/FlagOpen/
 
 
 - **C-MTEB**:
-We create a benchmark C-MTEB for
+We create the benchmark C-MTEB for Chinese text embedding, which consists of 31 datasets across 6 tasks.
 Please refer to [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/README.md) for a detailed introduction.
 
 | Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
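For orientation, running one C-MTEB task could look roughly like the sketch below. It assumes the open-source `mteb` package is installed and that the C-MTEB task names (e.g. "T2Retrieval") are registered with it; the repository's own C_MTEB scripts linked above are the authoritative way to reproduce the table, and they also handle the query instruction prefix.

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Hypothetical setup: evaluate one Chinese retrieval task from C-MTEB.
model = SentenceTransformer('BAAI/bge-large-zh')
evaluation = MTEB(tasks=["T2Retrieval"])  # assumed C-MTEB task name
results = evaluation.run(model, output_folder="results/bge-large-zh")
print(results)
```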
@@ -204,7 +191,7 @@ and we provide some examples to do [pre-train](https://github.com/FlagOpen/FlagE
 We pre-train the model following the method [retromae](https://github.com/staoxiao/RetroMAE),
 which shows promising improvement in retrieval task ([paper](https://aclanthology.org/2022.emnlp-main.35.pdf)).
 The pre-training was conducted on 24 A100(40G) GPUs with a batch size of 720.
-In retromae, the mask ratio of encoder and decoder are 0.3, 0.5 respectively.
+In RetroMAE, the mask ratios of the encoder and decoder are 0.3 and 0.5, respectively.
 We used the AdamW optimizer and the learning rate is 2e-5.
 
 **Pre-training data**:
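This is not the repository's pre-training code, but the asymmetric masking mentioned above can be sketched with two standard MLM collators from `transformers`, one per mask ratio; the backbone checkpoint and the toy text are placeholders.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese')  # placeholder backbone

# RetroMAE masks the same text twice with different ratios:
# a lightly-masked view for the encoder and a heavily-masked view for the decoder.
encoder_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.3)
decoder_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.5)

batch = [tokenizer("样例文本", truncation=True)]  # a toy single-example batch
encoder_view = encoder_collator(batch)  # ~30% of tokens replaced by [MASK]
decoder_view = decoder_collator(batch)  # ~50% of tokens replaced by [MASK]
print(encoder_view["input_ids"], decoder_view["input_ids"])
```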
@@ -221,7 +208,7 @@ We used the AdamW optimizer and the learning rate is 2e-5.
 We fine-tune the model using a contrastive objective.
 The format of input data is a triple`(query, positive, negative)`.
 Besides the negative in the triple, we also adopt in-batch negatives strategy.
-We employ the cross-device negatives sharing method to
+We employ the cross-device negative sharing method to share negatives among different GPUs,
 which can dramatically **increase the number of negatives**.
 
 We trained our model on 48 A100(40G) GPUs with a large batch size of 32,768 (so there are **65,535** negatives for each query in a batch).
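A rough sketch of how the in-batch and cross-device negatives combine is shown below. It is illustrative rather than the repository's training code: the temperature value and helper names are made up, and gradients are kept only for the local shard by re-inserting it after the all-gather. Each GPU gathers the positives and hard negatives from every other GPU, so each query is scored against the full global pool and the number of negatives grows with the number of devices.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def gather_across_gpus(t: torch.Tensor) -> torch.Tensor:
    """All-gather a tensor from every GPU, keeping gradients for the local shard."""
    if not (dist.is_available() and dist.is_initialized()):
        return t
    gathered = [torch.zeros_like(t) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, t.contiguous())
    gathered[dist.get_rank()] = t  # re-insert the local tensor so it keeps its grad
    return torch.cat(gathered, dim=0)

def contrastive_loss(q_emb, pos_emb, neg_emb, temperature=0.05):
    """q_emb / pos_emb / neg_emb: (B, d) embeddings of queries, positives and hard negatives."""
    q_emb = F.normalize(q_emb, dim=-1)
    # Candidate pool = positives and hard negatives gathered from *all* GPUs.
    candidates = torch.cat([gather_across_gpus(F.normalize(pos_emb, dim=-1)),
                            gather_across_gpus(F.normalize(neg_emb, dim=-1))], dim=0)
    scores = q_emb @ candidates.T / temperature
    # The positive of local query i sits at index rank * B + i in the gathered positives.
    rank = dist.get_rank() if (dist.is_available() and dist.is_initialized()) else 0
    labels = torch.arange(q_emb.size(0), device=q_emb.device) + rank * q_emb.size(0)
    return F.cross_entropy(scores, labels)
```

Launched under a distributed runner such as `torchrun`, every query then sees the whole global batch as candidates, which is what makes the effective negative count scale with the number of GPUs.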
@@ -246,22 +233,6 @@ You can easily finetune your model with it.
 **The data collection is to be released in the future.**
 
 
-## Schedule
-- [x] Chinese Massive Text Embedding Benchmark
-- [x] release baai-general-embedding models
-- [x] release codes for training
-- [ ] Training Datasets
-- [ ] Multilingual model
-- [ ] ...
-
-We will continually update the embedding models and training codes,
-hoping to promote the development of the embedding model community.
-
-
-## Contact
-If you have any question or suggestion related to this project, feel free to open an issue or pull a request.
-You also can email Shitao Xiao(stxiao@baai.ac.cn) and Zheng Liu(liuzheng@baai.ac.cn).
-
 
 ## License
 FlagEmbedding is licensed under [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.