---
datasets:
- abokbot/wikipedia-first-paragraph
language:
- en
library_name: sentence-transformers
tags:
- bi-encoder
- MSMARCO
---
# Description
We use the MS MARCO bi-encoder msmarco-MiniLM-L-6-v3 from the sentence-transformers library to encode the text of the dataset [abokbot/wikipedia-first-paragraph](https://huggingface.co/datasets/abokbot/wikipedia-first-paragraph).

The dataset contains the first paragraph of each article in the "20220301.en" snapshot of the English [Wikipedia dataset](https://huggingface.co/datasets/wikipedia).

The output is an embedding tensor of shape [6458670, 384]: one 384-dimensional vector for each of the 6,458,670 first paragraphs.
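
If you download the tensor file from this repository, it can be loaded with `torch.load`; the filename below is hypothetical, so substitute the actual file hosted here:

```python
import torch

# Hypothetical filename; replace it with the actual tensor file in this repository.
wikipedia_embedding = torch.load("wikipedia_embedding.pt")
print(wikipedia_embedding.shape)  # expected: torch.Size([6458670, 384])
```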

# Code
The embedding tensor was obtained by running the following code.

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

# Load the first-paragraph dataset (a single "train" split).
dataset = load_dataset("abokbot/wikipedia-first-paragraph", split="train")

# MS MARCO bi-encoder; truncate inputs to 256 tokens.
bi_encoder = SentenceTransformer("msmarco-MiniLM-L-6-v3")
bi_encoder.max_seq_length = 256

# Encode all first paragraphs into a single [6458670, 384] tensor.
wikipedia_embedding = bi_encoder.encode(
    dataset["text"],
    convert_to_tensor=True,
    show_progress_bar=True,
)
```
This operation took 35 minutes on a Google Colab notebook with a GPU.
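
As a usage sketch (not part of the original pipeline), the tensor can serve as the corpus for semantic search with `sentence_transformers.util.semantic_search`, assuming `dataset` and `wikipedia_embedding` are still in memory from the code above:

```python
from sentence_transformers import SentenceTransformer, util

bi_encoder = SentenceTransformer("msmarco-MiniLM-L-6-v3")

# Encode a free-text query with the same bi-encoder used for the corpus.
query_embedding = bi_encoder.encode(
    "Who wrote The Old Man and the Sea?", convert_to_tensor=True
)

# Retrieve the 5 most similar first paragraphs by cosine similarity.
hits = util.semantic_search(query_embedding, wikipedia_embedding, top_k=5)[0]
for hit in hits:
    print(round(hit["score"], 3), dataset["text"][hit["corpus_id"]][:100])
```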

# Reference
More information on MS MARCO encoders is available at https://www.sbert.net/docs/pretrained-models/ce-msmarco.html