---
datasets:
- abokbot/wikipedia-first-paragraph
language:
- en
library_name: sentence-transformers
tags:
- bi-encoder
- MSMARCO
---
|
# Description

We use the MS MARCO bi-encoder [msmarco-MiniLM-L-6-v3](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L-6-v3) to encode the text of the [abokbot/wikipedia-first-paragraph](https://huggingface.co/datasets/abokbot/wikipedia-first-paragraph) dataset.

That dataset contains the first paragraph of each article in the English "20220301.en" snapshot of the [Wikipedia dataset](https://huggingface.co/datasets/wikipedia).
|
|
|
|
|
# Code

The embeddings were obtained by running the following code.

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

# load_dataset returns a DatasetDict; the paragraphs live in the "train" split.
dataset = load_dataset("abokbot/wikipedia-first-paragraph")

bi_encoder = SentenceTransformer('msmarco-MiniLM-L-6-v3')
bi_encoder.max_seq_length = 256  # truncate each paragraph to 256 tokens

wikipedia_embedding = bi_encoder.encode(
    dataset["train"]["text"],
    convert_to_tensor=True,
    show_progress_bar=True,
)
```
|
This operation took 35 minutes on a Google Colab notebook with a GPU.
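Once computed, these embeddings can serve as a semantic-search index: encode a query with the same bi-encoder, then rank paragraphs by cosine similarity. Below is a minimal sketch of the ranking step in NumPy, using random vectors in place of the real 384-dimensional MiniLM embeddings; the helper `top_k_cosine` is our own illustration, not part of any library.

```python
import numpy as np

def top_k_cosine(query_vec, corpus_embeddings, k=5):
    """Return indices of the k corpus rows most similar to the query."""
    # Normalize rows so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
    scores = c @ q
    # argsort is ascending: take the last k indices, then reverse for best-first.
    return np.argsort(scores)[-k:][::-1]

# Stand-ins for the real Wikipedia embeddings and an encoded query.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((1000, 384)).astype(np.float32)
query = corpus[42] + 0.01 * rng.standard_normal(384).astype(np.float32)

hits = top_k_cosine(query, corpus, k=5)
print(hits[0])  # row 42 should rank first, since the query is a noisy copy of it
```

With the real embeddings, you would replace `corpus` by `wikipedia_embedding` (moved to CPU/NumPy) and `query` by `bi_encoder.encode("your question")`.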