prithivida committed on
Commit
9a079cf
1 Parent(s): 0e7ac0a

Update README.md

Files changed (1)
  1. README.md +34 -9
README.md CHANGED
@@ -37,27 +37,31 @@ pipeline_tag: sentence-similarity
37
  </center>
38
 
39
 
40
- - [License and Terms:](#license-and-terms)
41
- - [Detailed comparison & Our Contribution:](#detailed-comparison--our-contribution)
42
- - [ONNX & GGUF Variants:](#detailed-comparison--our-contribution)
43
- - [Usage:](#usage)
44
  - [With Sentence Transformers:](#with-sentence-transformers)
45
  - [With Huggingface Transformers:](#with-huggingface-transformers)
 
 
46
  - [How do I optimise vector index cost?](#how-do-i-optimise-vector-index-cost)
47
  - [How do I offer hybrid search to address Vocabulary Mismatch Problem?](#how-do-i-offer)
 
 
48
  - [Notes on Reproducing:](#notes-on-reproducing)
49
  - [Reference:](#reference)
50
  - [Note on model bias](#note-on-model-bias)
51
 
52
 
53
- ## License and Terms:
54
 
55
  <center>
56
  <img src="./terms.png" width=200%/>
57
  </center>
58
 
59
 
60
- ## Detailed comparison & Our Contribution:
61
 
62
  English famously has the **all-minilm** series of models, which are great for quick experimentation and for certain production workloads. The idea is to offer the same for other popular languages, starting with Indo-Aryan and Indo-Dravidian languages. Our innovation is in bringing high-quality models that are easy to serve and whose embeddings are cheaper to store, without ANY pretraining or expensive finetuning. For instance, the **all-minilm** models are finetuned on 1 billion pairs. We offer a very lean model but with a huge vocabulary of around 250K.
63
  We will add more details here.
@@ -81,7 +85,7 @@ Full set of evaluation numbers for our model
81
 
82
  <br/>
83
 
84
- ## Usage:
85
 
86
  #### With Sentence Transformers:
87
 
@@ -138,10 +142,16 @@ for query, query_embedding in zip(queries, query_embeddings):
138
  #### With Huggingface Transformers:
139
  - T.B.A
140
 
 
 
 
 
 
 
141
  #### How do I optimise vector index cost?
142
  [Use Binary and Scalar Quantisation](https://huggingface.co/blog/embedding-quantization)
143
 
144
- <h4>How do I offer hybrid search to address Vocabulary Mismatch Problem?</h4>
145
  The MIRACL paper shows that simply combining with BM25 is a good starting point for a hybrid option:
146
  The numbers below are with the mDPR model, but miniMiracle_zh_v1 should give even better hybrid performance.
147
 
@@ -151,6 +161,21 @@ The below numbers are with mDPR model, but miniMiracle_zh_v1 should give a even
151
 
152
  *Note: The MIRACL paper shows a different (higher) value for BM25 Chinese, so we are taking that value from the BGE-M3 paper; all the rest are from the MIRACL paper.*
153
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
154
  # Notes on reproducing:
155
 
156
  We welcome anyone to reproduce our results. Here are some tips and observations:
@@ -168,7 +193,7 @@ Here are our numbers for the full hindi run on BGE-M3
168
  {'MRR@10': 0.60893, 'MRR@100': 0.615, 'MRR@1000': 0.6151}
169
  ```
170
 
171
- Fair warning: BGE-M3 is $ expensive to evaluate, which is probably why it's not part of any of the MTEB benchmarks.
172
 
173
 
174
  # Reference:
 
37
  </center>
38
 
39
 
40
+ - [License and Terms:](#license-and-terms)
41
+ - [Detailed comparison & Our Contribution:](#detailed-comparison--our-contribution)
42
+ - [ONNX & GGUF Variants:](#detailed-comparison--our-contribution)
43
+ - [Usage:](#usage)
44
  - [With Sentence Transformers:](#with-sentence-transformers)
45
  - [With Huggingface Transformers:](#with-huggingface-transformers)
46
+ - [FAQs](#faqs)
47
+ - [How can we run these models without heavy torch dependency?](#how-can-we-run-these-models-with-out-heavy-torch-dependency)
48
  - [How do I optimise vector index cost?](#how-do-i-optimise-vector-index-cost)
49
  - [How do I offer hybrid search to address Vocabulary Mismatch Problem?](#how-do-i-offer)
50
+ - [Why not run cMTEB?](#why-not-run-cmteb)
51
+ - [Roadmap](#roadmap)
52
  - [Notes on Reproducing:](#notes-on-reproducing)
53
  - [Reference:](#reference)
54
  - [Note on model bias](#note-on-model-bias)
55
 
56
 
57
+ # License and Terms:
58
 
59
  <center>
60
  <img src="./terms.png" width=200%/>
61
  </center>
62
 
63
 
64
+ # Detailed comparison & Our Contribution:
65
 
66
  English famously has the **all-minilm** series of models, which are great for quick experimentation and for certain production workloads. The idea is to offer the same for other popular languages, starting with Indo-Aryan and Indo-Dravidian languages. Our innovation is in bringing high-quality models that are easy to serve and whose embeddings are cheaper to store, without ANY pretraining or expensive finetuning. For instance, the **all-minilm** models are finetuned on 1 billion pairs. We offer a very lean model but with a huge vocabulary of around 250K.
67
  We will add more details here.
 
85
 
86
  <br/>
87
 
88
+ # Usage:
89
 
90
  #### With Sentence Transformers:
91
 
 
142
  #### With Huggingface Transformers:
143
  - T.B.A
144
 
145
+ # FAQs:
146
+
147
+ #### How can we run these models with out heavy torch dependency?
148
+
149
+ - You can use the ONNX flavours of these models via the [FlashRetrieve](https://github.com/PrithivirajDamodaran/FlashRetrieve) library.
150
+
151
  #### How do I optimise vector index cost?
152
  [Use Binary and Scalar Quantisation](https://huggingface.co/blog/embedding-quantization)
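
As a rough illustration of what binary and scalar quantisation do to a stored vector, here is a minimal pure-Python sketch. It is not taken from the linked blog post or from this model's code: the helper names `binary_quantize`, `hamming_distance` and `scalar_quantize_int8`, the toy 8-dimensional embedding, and the `[-3.0, 3.0]` calibration range are all assumptions made for illustration. In practice you would use a vectorised library rather than Python loops.

```python
def binary_quantize(vector):
    """Keep only the sign of each dimension, packing 8 dimensions per byte."""
    packed = bytearray((len(vector) + 7) // 8)
    for i, value in enumerate(vector):
        if value > 0:
            # Most significant bit first within each byte.
            packed[i // 8] |= 1 << (7 - i % 8)
    return bytes(packed)


def hamming_distance(a, b):
    """Count differing bits between two binary codes of equal length."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def scalar_quantize_int8(vector, lo, hi):
    """Linearly map floats from [lo, hi] onto the int8 range [-128, 127]."""
    scale = (hi - lo) / 255.0
    return [round((min(max(v, lo), hi) - lo) / scale) - 128 for v in vector]


# Hypothetical 8-dimensional embedding (real models use hundreds of dimensions).
embedding = [0.5, -1.0, 2.0, -0.1, 0.0, 3.0, -2.0, 1.0]

# Binary: 1 bit per dimension, so 8 float32 values (32 bytes) become 1 byte.
code = binary_quantize(embedding)

# Scalar (int8): 1 byte per dimension, so 32 bytes become 8 bytes.
# lo/hi are an assumed calibration range; normally estimated from sample data.
int8_values = scalar_quantize_int8(embedding, lo=-3.0, hi=3.0)
```

Binary codes are usually compared with Hamming distance and the top candidates rescored with higher-precision vectors, which is the trade-off the linked post discusses.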
153
 
154
+ #### How do I offer hybrid search to address Vocabulary Mismatch Problem?
155
  The MIRACL paper shows that simply combining with BM25 is a good starting point for a hybrid option:
156
  The numbers below are with the mDPR model, but miniMiracle_zh_v1 should give even better hybrid performance.
157
 
 
161
 
162
  *Note: MIRACL paper shows a different (higher) value for BM25 Chinese, So we are taking that value from BGE-M3 paper, rest all are form the MIRACL paper.*
163
 
164
+ #### Why not run cMTEB?
165
+ cMTEB is a general-purpose embedding evaluation benchmark covering a wide range of tasks, but like BGE-M3, miniMiracle models are predominantly tuned for retrieval tasks aimed at search & IR based use cases.
166
+ That said, we intend to run the retrieval slice of cMTEB and add the scores here.
167
+
168
+
169
+ # Roadmap
170
+ We will add the miniMiracle series of models for all popular languages in phases, as we see fit or based on community requests. Some of the languages on our list are:
171
+
172
+ - Spanish
173
+ - Tamil
174
+ - Arabic
175
+ - German
176
+ - English ?
177
+
178
+
179
  # Notes on reproducing:
180
 
181
  We welcome anyone to reproduce our results. Here are some tips and observations:
 
193
  {'MRR@10': 0.60893, 'MRR@100': 0.615, 'MRR@1000': 0.6151}
194
  ```
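
For readers checking numbers like the MRR@k values above, here is a minimal sketch of how MRR@k is typically computed. It is not the evaluation script used for these results; the function name `mrr_at_k` and the toy query and document ids are hypothetical.

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean reciprocal rank of the first relevant document within the top k.

    rankings: {query_id: [doc_id, ...]} ordered from best to worst.
    relevant: {query_id: set of relevant doc_ids}.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for query_id, ranking in rankings.items():
        gold = relevant.get(query_id, set())
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in gold:
                total += 1.0 / rank
                break  # Only the first relevant hit counts.
    return total / len(rankings)


# Hypothetical rankings for three queries.
rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y"], "q3": ["m", "n"]}
relevant = {"q1": {"b"}, "q2": {"x"}, "q3": {"z"}}

score_at_10 = mrr_at_k(rankings, relevant, k=10)
score_at_1 = mrr_at_k(rankings, relevant, k=1)
```

Queries with no relevant document in the top k contribute zero, which is why MRR@10, MRR@100 and MRR@1000 can only stay equal or grow as k increases.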
195
 
196
+ Fair warning: BGE-M3 is $ expensive to evaluate, which is probably why it's not part of any retrieval slice of the MTEB benchmarks.
197
 
198
 
199
  # Reference: