prithivida committed
Commit 2a2e10d • Parent(s): 2deee4b
Update README.md
README.md
CHANGED
@@ -52,7 +52,7 @@ pipeline_tag: sentence-similarity
 - [How can I reduce overall inference cost?](#how-can-i-reduce-overall-inference-cost)
 - [How do I reduce vector storage cost?](#how-do-i-reduce-vector-storage-cost)
 - [How do I offer hybrid search to improve accuracy?](#how-do-i-offer-hybrid-search-to-improve-accuracy)
-- [
+- [CMTEB numbers](#cmteb-numbers)
 - [Roadmap](#roadmap)
 - [Notes on Reproducing:](#notes-on-reproducing)
 - [Reference:](#reference)
@@ -165,9 +165,23 @@ The below numbers are with mDPR model, but miniMiracle_zh_v1 should give a even
 
 *Note: The MIRACL paper shows a different (higher) value for BM25 Chinese, so we take that value from the BGE-M3 paper; all other values are from the MIRACL paper.*
 
-####
+#### cMTEB numbers:
 CMTEB is a general-purpose embedding evaluation benchmark covering a wide range of tasks, but, like BGE-M3, the miniMiracle models are predominantly tuned for retrieval tasks aimed at search & IR use cases.
-
+
+We ran the retrieval slice of C-MTEB and report the scores here.
+
+We compared the performance of a few top general-purpose embedding models on the C-MTEB benchmark; please refer to the C-MTEB leaderboard for the full set.
+
+
+| Model Name | Model Size (GB) | Dimension | Sequence Length | Retrieval (8) | Remarks |
+|:----:|:---:|:---:|:---:|:---:|:---:|
+| [360Zhinao-search] | 0.61 (FP16) | 1024 | 512 | 75.06 | Top model as of Jun 2024 |
+| [piccolo-large-zh] | 0.65 (FP16) | 1024 | 512 | 70.93 | |
+| [bge-large-zh] | 1.3 | 1024 | 512 | 70.46 | |
+| [piccolo-base-zh] | 0.2 (FP16) | 768 | 512 | 71.2 | |
+| [bge-large-zh-no-instruct] | 1.3 | 1024 | 512 | 70.54 | |
+| [bge-base-zh] | 0.41 | 768 | 512 | 69.3 | |
+| [**miniMiracle_zh_v1**] | **0.47** | **384** | **512** | **59.91** | |
+
 
 
 # Roadmap
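The updated table lists miniMiracle_zh_v1 at 384 dimensions against mostly 1024-dimension competitors, which is what the "How do I reduce vector storage cost?" entry in the table of contents alludes to. A minimal back-of-envelope sketch of that storage difference (pure Python; the 10M-vector corpus size is an illustrative assumption, not a figure from the source):

```python
def index_size_gib(num_vectors: int, dim: int, bytes_per_float: int = 4) -> float:
    """Raw storage for a dense FP32 vector index, in GiB."""
    return num_vectors * dim * bytes_per_float / 2**30

# Hypothetical corpus of 10M documents, one embedding each.
ten_million = 10_000_000
size_1024 = index_size_gib(ten_million, 1024)  # e.g. a bge-large-zh-sized index
size_384 = index_size_gib(ten_million, 384)    # a miniMiracle_zh_v1-sized index

print(f"1024-dim FP32: {size_1024:.1f} GiB; 384-dim FP32: {size_384:.1f} GiB")
# → 1024-dim FP32: 38.1 GiB; 384-dim FP32: 14.3 GiB
```

Raw storage scales linearly with dimension, so 384-dim vectors need 384/1024 ≈ 37.5% of the space of 1024-dim vectors before any quantization or compression.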