abokbot/wikipedia-embedding

	---
	datasets:
	- abokbot/wikipedia-first-paragraph
	language:
	- en
	library_name: sentence-transformers
	tags:
	- bi-encoder
	- MSMARCO
	---
	# Description
	We use the MS MARCO bi-encoder `msmarco-MiniLM-L-6-v3` from the sentence-transformers library to encode the text of the dataset [abokbot/wikipedia-first-paragraph](https://huggingface.co/datasets/abokbot/wikipedia-first-paragraph).

	The dataset contains the first paragraphs of the English "20220301.en" version of the [Wikipedia dataset](https://huggingface.co/datasets/wikipedia).

	The output is an embedding tensor of size [6458670, 384]: one 384-dimensional vector for each of the 6,458,670 first paragraphs.
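
To get a feel for what that shape means in practice, the arithmetic below estimates the memory needed to hold the full tensor. It is a rough sketch: the 4 bytes per value (float32) is an assumption, since the card does not state the dtype.

```python
# Shape reported in the model card: one row per paragraph, one column per dimension.
NUM_PARAGRAPHS = 6_458_670
EMBEDDING_DIM = 384

# Assumption: 4 bytes per value (float32); the card does not state the dtype.
BYTES_PER_VALUE = 4

total_values = NUM_PARAGRAPHS * EMBEDDING_DIM
total_bytes = total_values * BYTES_PER_VALUE

print(f"values: {total_values:,}")
print(f"size:   {total_bytes / 1e9:.2f} GB ({total_bytes / 2**30:.2f} GiB)")
```

Under that assumption the tensor holds about 2.48 billion values and needs roughly 9.9 GB, so it may not fit in memory on smaller machines.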

	# Code
	The embedding was obtained by running the following code.

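	```python
	from datasets import load_dataset
	from sentence_transformers import SentenceTransformer

	# load_dataset returns a DatasetDict unless a split is given; "train" is assumed here
	dataset = load_dataset("abokbot/wikipedia-first-paragraph", split="train")
	bi_encoder = SentenceTransformer('msmarco-MiniLM-L-6-v3')
	# Truncate each paragraph to at most 256 tokens
	bi_encoder.max_seq_length = 256
	# Encode every paragraph; returns a tensor with one row per paragraph
	wikipedia_embedding = bi_encoder.encode(dataset["text"], convert_to_tensor=True, show_progress_bar=True)
	```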
	This operation took 35 minutes on a Google Colab notebook with a GPU.
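
The card does not show how the resulting tensor is used. A common use for bi-encoder embeddings is retrieval: encode a query with the same model, then rank the rows by similarity to the query vector. The sketch below illustrates that ranking step on toy 3-dimensional vectors in plain Python. Using cosine similarity as the score is an assumption, and `cosine_similarity` and `top_k` are hypothetical helper names, not part of any library.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], corpus: Sequence[Sequence[float]], k: int = 3) -> List[Tuple[int, float]]:
    """Return (row index, score) pairs for the k rows most similar to the query."""
    scored = [(i, cosine_similarity(query, row)) for i, row in enumerate(corpus)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for rows of the embedding tensor and for an encoded query.
toy_corpus = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
toy_query = [1.0, 0.1, 0.0]

ranked = top_k(toy_query, toy_corpus, k=2)
print(ranked)
```

On the real tensor a vectorized routine would be far faster than this loop; `sentence_transformers.util.semantic_search` is one such option.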

	# Reference
	More information on MS MARCO encoders is available here: https://www.sbert.net/docs/pretrained-models/ce-msmarco.html