prithivida committed
Commit 2a2e10d • Parent(s): 2deee4b
Update README.md
README.md
CHANGED
@@ -52,7 +52,7 @@ pipeline_tag: sentence-similarity
 - [How can I reduce overall inference cost?](#how-can-i-reduce-overall-inference-cost)
 - [How do I reduce vector storage cost?](#how-do-i-reduce-vector-storage-cost)
 - [How do I offer hybrid search to improve accuracy?](#how-do-i-offer-hybrid-search-to-improve-accuracy)
-- [
+- [CMTEB numbers](#cmteb-numbers)
 - [Roadmap](#roadmap)
 - [Notes on Reproducing:](#notes-on-reproducing)
 - [Reference:](#reference)
@@ -165,9 +165,23 @@ The below numbers are with mDPR model, but miniMiracle_zh_v1 should give a even
 
 *Note: The MIRACL paper shows a different (higher) value for BM25 Chinese, so we take that value from the BGE-M3 paper; all other values are from the MIRACL paper.*
 
-####
+#### cMTEB numbers:
 CMTEB is a general-purpose embedding evaluation benchmark covering a wide range of tasks, but, like BGE-M3, the miniMiracle models are predominantly tuned for retrieval tasks aimed at search & IR use cases.
-
+
+We ran the retrieval slice of C-MTEB and report the scores here.
+
+We compared the performance of a few top general-purpose embedding models on the C-MTEB benchmark; please refer to the C-MTEB leaderboard for the full set.
+
+
+| Model Name | Model Size (GB) | Dimension | Sequence Length | Retrieval (8) | Remarks |
+|:----:|:---:|:---:|:---:|:---:|:---:|
+| [360Zhinao-search] | 0.61 (FP16) | 1024 | 512 | 75.06 | Top model as of Jun 2024 |
+| [piccolo-large-zh] | 0.65 (FP16) | 1024 | 512 | 70.93 | |
+| [bge-large-zh] | 1.3 | 1024 | 512 | 70.46 | |
+| [piccolo-base-zh] | 0.2 (FP16) | 768 | 512 | 71.2 | |
+| [bge-large-zh-no-instruct] | 1.3 | 1024 | 512 | 70.54 | |
+| [bge-base-zh] | 0.41 | 768 | 512 | 69.3 | |
+| [**miniMiracle_zh_v1**] | **0.47** | **384** | **512** | **59.91** | |
+
 
 
 # Roadmap
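The updated table lists miniMiracle_zh_v1 at 384 dimensions against mostly 1024-dimension competitors, which is what the "How do I reduce vector storage cost?" entry in the table of contents alludes to. A minimal back-of-envelope sketch of that storage difference (pure Python; the 10M-vector corpus size is an illustrative assumption, not a figure from the source):

```python
def index_size_gib(num_vectors: int, dim: int, bytes_per_float: int = 4) -> float:
    """Raw storage for a dense FP32 vector index, in GiB."""
    return num_vectors * dim * bytes_per_float / 2**30

# Hypothetical corpus of 10M documents, one embedding each.
ten_million = 10_000_000
size_1024 = index_size_gib(ten_million, 1024)  # e.g. a bge-large-zh-sized index
size_384 = index_size_gib(ten_million, 384)    # a miniMiracle_zh_v1-sized index

print(f"1024-dim FP32: {size_1024:.1f} GiB; 384-dim FP32: {size_384:.1f} GiB")
# → 1024-dim FP32: 38.1 GiB; 384-dim FP32: 14.3 GiB
```

Raw storage scales linearly with dimension, so 384-dim vectors need 384/1024 ≈ 37.5% of the space of 1024-dim vectors before any quantization or compression.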