---
datasets:
- abokbot/wikipedia-first-paragraph
language:
- en
library_name: sentence-transformers
tags:
- bi-encoder
- MSMARCO
---
# Description
We use the MS MARCO bi-encoder msmarco-MiniLM-L-6-v3 from the sentence-transformers library to encode the text of the dataset [abokbot/wikipedia-first-paragraph](https://huggingface.co/datasets/abokbot/wikipedia-first-paragraph).

The dataset contains the first paragraphs of the English "20220301.en" version of the [Wikipedia dataset](https://huggingface.co/datasets/wikipedia).

The output is an embedding tensor of size [6458670, 384].

# Code
The embedding tensor was obtained by running the following code.

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

# Load the first-paragraph dataset (a single "train" split)
dataset = load_dataset("abokbot/wikipedia-first-paragraph")

# Bi-encoder trained on MS MARCO; truncate inputs to 256 tokens
bi_encoder = SentenceTransformer("msmarco-MiniLM-L-6-v3")
bi_encoder.max_seq_length = 256

# Encode all first paragraphs into a [6458670, 384] tensor
wikipedia_embedding = bi_encoder.encode(
    dataset["train"]["text"],
    convert_to_tensor=True,
    show_progress_bar=True,
)
```

This operation took about 35 minutes on a Google Colab notebook with a GPU.

# Reference
More information on MS MARCO encoders is available at https://www.sbert.net/docs/pretrained-models/ce-msmarco.html
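
# Usage example
A typical use of these embeddings is semantic search over the Wikipedia first paragraphs. The sketch below assumes the tensor produced above was saved with `torch.save` to a file named `wikipedia_embedding.pt` (a hypothetical file name, not part of this card) and retrieves the closest paragraphs for a query with `sentence_transformers.util.semantic_search`.

```python
import torch
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util

# Hypothetical path: assumes the tensor from the code above was saved with
# torch.save(wikipedia_embedding, "wikipedia_embedding.pt")
wikipedia_embedding = torch.load("wikipedia_embedding.pt")

# Reload the corpus so corpus_ids can be mapped back to paragraphs
dataset = load_dataset("abokbot/wikipedia-first-paragraph")["train"]

# Queries must be encoded with the same bi-encoder used for the corpus
bi_encoder = SentenceTransformer("msmarco-MiniLM-L-6-v3")
query_embedding = bi_encoder.encode("Who invented the telephone?", convert_to_tensor=True)

# Cosine-similarity search over all paragraph embeddings; returns one
# result list per query, each entry a dict with "corpus_id" and "score"
hits = util.semantic_search(query_embedding, wikipedia_embedding, top_k=5)[0]
for hit in hits:
    print(round(hit["score"], 3), dataset[hit["corpus_id"]]["text"][:100])
```

Because the corpus embeddings were computed once, only the query has to be encoded at search time, which is the main advantage of a bi-encoder setup over re-scoring every paragraph with a cross-encoder.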