NghiemAbe/Vi-Legal-Bi-Encoder-v2

This is a sentence-transformers model. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.

Usage (Sentence-Transformers)

Using this model is straightforward once sentence-transformers and pyvi (used in the examples for Vietnamese word segmentation) are installed:

pip install -U sentence-transformers pyvi

Then you can use the model like this:

from sentence_transformers import SentenceTransformer
from pyvi.ViTokenizer import tokenize
# pyvi's tokenize performs Vietnamese word segmentation before the text reaches the model
sentences = [tokenize("This is an example sentence"), tokenize("Each sentence is converted")]

model = SentenceTransformer('NghiemAbe/Vi-Legal-Bi-Encoder-v2')
embeddings = model.encode(sentences)
print(embeddings)
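
For semantic search, embeddings like these are usually compared with cosine similarity. The following is a minimal sketch and not part of the original model card: cosine and rank_documents are hypothetical helper names, and the toy 3-dimensional vectors stand in for the 768-dimensional rows that model.encode would return.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted by descending similarity."""
    scores = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for model.encode(...) output (assumption for illustration).
query = [1.0, 0.0, 1.0]
docs = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.9], [0.5, 0.5, 0.5]]
ranking = rank_documents(query, docs)
print([index for index, _ in ranking])  # document indices, best match first
```

With real data, query and docs would be the rows of the array returned by model.encode, and the top-ranked indices point back into the list of candidate passages.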

Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model as follows: first pass your input through the transformer model, then apply the appropriate pooling operation (here, mean pooling) on top of the contextualized word embeddings.

from pyvi.ViTokenizer import tokenize
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = [tokenize("This is an example sentence"), tokenize("Each sentence is converted")]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('NghiemAbe/Vi-Legal-Bi-Encoder-v2')
model = AutoModel.from_pretrained('NghiemAbe/Vi-Legal-Bi-Encoder-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
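
To make the arithmetic of the masked average concrete, here is a small dependency-free sketch on toy numbers. It is an illustration only: mean_pool is a hypothetical name, it handles a single sentence rather than a batch, and the 2-dimensional vectors are invented for the example.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, weighting each by its mask value (1 = real token, 0 = padding)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        for j in range(dim):
            summed[j] += vec[j] * keep  # padding tokens contribute 0
    count = max(sum(attention_mask), 1e-9)  # guard against division by zero
    return [s / count for s in summed]

# One sentence: two real tokens followed by one padding token (toy values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector is ignored entirely, which is why the attention mask has to be part of the average: a plain mean over all three rows would be dominated by the padding values.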

Evaluation Results

I evaluated the models on my Dev-Legal-Dataset. The results are below, where R@k is recall at k and MRR@k is mean reciprocal rank at k:

Model	R@1	R@5	R@10	R@20	R@100	MRR@5	MRR@10	MRR@20	MRR@100	Avg
keepitreal/vietnamese-sbert	0.278	0.552	0.649	0.734	0.842	0.396	0.409	0.415	0.417	0.521
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2	0.314	0.486	0.585	0.662	0.854	0.395	0.409	0.414	0.419	0.504
sentence-transformers/paraphrase-multilingual-mpnet-base-v2	0.354	0.553	0.646	0.750	0.896	0.449	0.461	0.468	0.472	0.561
intfloat/multilingual-e5-small	0.488	0.746	0.835	0.906	0.962	0.610	0.620	0.624	0.625	0.713
intfloat/multilingual-e5-base	0.466	0.740	0.840	0.907	0.952	0.596	0.608	0.612	0.613	0.704
bkai-foundation-models/vietnamese-bi-encoder	0.644	0.881	0.924	0.954	0.986	0.752	0.757	0.758	0.759	0.824
Vi-Legal-Bi-Encoder-v2	0.720	0.884	0.935	0.963	0.986	0.796	0.802	0.803	0.804	0.855
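
The card does not spell out how R@k and MRR@k were computed, so the sketch below uses one common convention as an assumption: R@k counts a query as a hit if any relevant document appears in the top k, and MRR@k uses the reciprocal rank of the first relevant document within the top k. The function names and toy queries are hypothetical and are not the author's evaluation script.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant id appears in the top k results, else 0.0 (hit-style recall, an assumption)."""
    return 1.0 if any(doc in relevant_ids for doc in ranked_ids[:k]) else 0.0

def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant id within the top k, else 0.0."""
    for rank, doc in enumerate(ranked_ids[:k], start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean(values):
    return sum(values) / len(values)

# Toy retrieval results: (ranked document ids, set of relevant ids) per query.
queries = [
    (["a", "b", "c"], {"b"}),  # first relevant document at rank 2
    (["d", "e", "f"], {"d"}),  # first relevant document at rank 1
]

r_at_1 = mean([recall_at_k(ranked, rel, 1) for ranked, rel in queries])
mrr_at_3 = mean([mrr_at_k(ranked, rel, 3) for ranked, rel in queries])
print(r_at_1, mrr_at_3)  # 0.5 0.75
```

Under this convention both metrics are averaged over queries, so each lies between 0 and 1 and higher is better, which matches the range of the values in the table.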