Building a Custom Arabic Semantic Search Model with Arabic Matryoshka Embeddings for RAG Using Sentence Transformers

Community Article Published September 25, 2024

Our Arabic semantic search model, powered by Matryoshka embeddings, has clinched the #1 spot on the MTEB leaderboard for the STS17 Arabic-Arabic (Ar-Ar) task, demonstrating state-of-the-art Arabic NLP performance.

Implementing a Retrieval Augmented Generation (RAG) pipeline by fine-tuning your own semantic search model is a powerful approach to enhance the accuracy and relevance of question-answering systems. This technique combines the strengths of both semantic search and generative AI, enabling the system to better understand user questions and generate more accurate and contextually relevant responses. By fine-tuning a semantic search model using Sentence Transformers, developers can tailor the model to their specific domain, improving the overall performance of the RAG pipeline.


What Is Semantic Search?

Semantic search in Natural Language Processing (NLP) refers to the ability of search engines to understand the contextual meaning of search queries and content, rather than relying solely on keyword matching. By leveraging techniques like word embeddings, knowledge graphs, and natural language understanding, semantic search aims to comprehend the intent behind a user's query and retrieve more relevant and accurate results. This approach is essential as it enhances the user experience by providing more intuitive and effective search outcomes, especially in complex or ambiguous queries. However, while significant advancements have been made in semantic search for widely spoken languages such as English, there remains a notable gap in the availability of reliable Arabic embeddings models. This lack of robust tools for Arabic hinders the development and application of semantic search in Arabic-speaking regions, underscoring the urgent need for improved language-specific NLP resources to support these applications.
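To make the idea concrete, here is a minimal sketch of semantic matching with Sentence Transformers. The model name is only an illustrative stand-in for any Arabic-capable embedding model, and the example sentences are invented for demonstration:

from sentence_transformers import SentenceTransformer, util

# Any Arabic-capable embedding model can be used here; this multilingual model
# is only an illustrative stand-in.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

query = "ما هي عاصمة فرنسا؟"  # "What is the capital of France?"
docs = [
    "باريس هي أكبر مدن فرنسا ومقر حكومتها.",   # relevant despite little keyword overlap
    "زرت متحفاً في جنوب فرنسا العام الماضي.",  # shares the keyword "فرنسا" but does not answer
]

embeddings = model.encode([query] + docs, convert_to_tensor=True)
scores = util.cos_sim(embeddings[0], embeddings[1:])
print(scores)  # the first document should typically score higher

Unlike keyword matching, the relevant passage wins even though it shares almost no words with the query; this is the behavior a semantic search model is trained to produce.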

What Is Retrieval Augmented Generation (RAG)?

Retrieval Augmented Generation (RAG) is a technique in NLP that enhances the performance of generative AI models by integrating them with external data sources. This method combines a retrieval mechanism with a generative model, enabling the system to search and retrieve relevant information from a database or knowledge base to supplement its internal knowledge during response generation. RAG ensures that the AI model can deliver contextually accurate and up-to-date answers, drawing on verified external sources.

This integration significantly improves the factual accuracy of AI-generated responses, making the information more reliable and verifiable. Additionally, it provides a means to update the AI's knowledge base without extensive retraining. RAG is particularly valuable in reducing the incidence of AI hallucinations—instances where the model generates incorrect or nonsensical information. For Arabic applications, the importance of RAG is even more pronounced. Given the current lack of robust Arabic embeddings models, incorporating external data sources can significantly enhance the quality and accuracy of generative AI outputs in Arabic. This approach can bridge the gap, ensuring that Arabic-speaking users benefit from reliable and contextually relevant AI interactions.

What Is the Matryoshka Embeddings Model?

Matryoshka Representation Learning (MRL) is a state-of-the-art technique for training text embedding models so that the most important information is concentrated in the leading dimensions of the embedding. Traditional embedding models produce fixed-size embeddings, and increasing that size to improve quality often reduces the efficiency of downstream tasks such as search or classification. Matryoshka embedding models address this issue by training embeddings to remain useful even when truncated, so a single model can produce effective embeddings at several different dimensionalities.

Key Features of the Matryoshka Embeddings Model:


The concept is inspired by "Matryoshka dolls," also known as "Russian nesting dolls," which are a set of wooden dolls of decreasing size placed inside one another. Similarly, Matryoshka embedding models store more critical information in the earlier dimensions and less important information in later dimensions. This characteristic allows the truncation of the original large embedding produced by the model, while still retaining sufficient information to perform well on downstream tasks. These variable-size embedding models can be highly valuable to practitioners in several ways:

  • Shortlisting and Reranking: Instead of performing downstream tasks (e.g., nearest neighbor search) on the full embeddings, you can shrink the embeddings to a smaller size for efficient shortlisting. Subsequently, the remaining embeddings can be processed using their full dimensionality.

  • Trade-offs: Matryoshka models enable scaling of embedding solutions according to desired storage cost, processing speed, and performance.

The core innovation of Matryoshka Representation Learning (MRL) lies in its ability to create adaptable, nested representations through explicit optimization. This flexibility is crucial for large-scale classification and retrieval tasks, where computational efficiency and accuracy are paramount.
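To make the shortlist-and-rerank idea concrete, here is a minimal sketch. It uses the Arabic Matryoshka model featured later in this article; the tiny corpus, the 64-dimension truncation, and the top_k values are illustrative choices, not tuned settings:

import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka")
corpus = [
    "تحليل الحمض النووي يستخدم اليوم في الطب الشرعي وتشخيص الأمراض.",
    "يفضل بعض الناس شرب القهوة في الصباح الباكر.",
    "تستخدم البصمة الوراثية لتحديد هوية الأشخاص في القضايا الجنائية.",
]
query = "أين يتم استخدام الحمض النووي اليوم؟"

full_corpus = model.encode(corpus, convert_to_tensor=True)
full_query = model.encode(query, convert_to_tensor=True)

# Step 1: truncate to the first 64 dimensions and re-normalize for a cheap shortlist.
dim = 64
small_corpus = F.normalize(full_corpus[:, :dim], p=2, dim=1)
small_query = F.normalize(full_query[:dim], p=2, dim=0)
shortlist = util.semantic_search(small_query, small_corpus, top_k=2)[0]
candidate_ids = [hit["corpus_id"] for hit in shortlist]

# Step 2: rerank only the shortlisted candidates with the full-dimensional embeddings.
rerank_scores = util.cos_sim(full_query, full_corpus[candidate_ids])[0]
best_passage = corpus[candidate_ids[int(torch.argmax(rerank_scores))]]
print(best_passage)

On a large corpus, the cheap truncated pass does most of the filtering while the expensive full-dimensional comparison only runs over a small candidate set, which is exactly the storage/speed/performance trade-off Matryoshka models enable.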

Our Contributions

In our quest to advance Arabic NLP, we made several significant contributions:

  • Translation of a wide range of sentence similarity datasets into Arabic, facilitating robust training and evaluation of our models. This step was essential to create models that can accurately process and understand Arabic text.

  • Using the Arabic Natural Language Inference (NLI) triplet dataset to train several Matryoshka embedding models on top of different base models. This dataset contains anchor, positive, and negative sentences, which helps the model learn to distinguish between similar and dissimilar sentences (a minimal training sketch follows this list).

  • By employing a hierarchical embedding strategy, our models capture complex semantic relationships within Arabic text, making them highly effective for various downstream tasks. This hierarchical approach allows the model to process both fine-grained and broad semantic contexts.
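A minimal sketch of this training setup, assuming sentence-transformers v3 or later and a triplet dataset with anchor/positive/negative columns. The base-model and dataset names below are illustrative placeholders, not the exact configuration used for our released models:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

# Base model and dataset names are illustrative; substitute your own Arabic base
# model and an (anchor, positive, negative) NLI triplet dataset.
model = SentenceTransformer("aubmindlab/bert-base-arabertv02")
train_dataset = load_dataset("Omartificial-Intelligence-Space/Arabic-NLi-Triplet", split="train")

# MultipleNegativesRankingLoss learns from the triplets; MatryoshkaLoss applies it
# at several truncated dimensionalities at once, so truncated embeddings stay useful.
base_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128, 64])

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
model.save("arabic-matryoshka-model")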

We are proud to announce that four of our Matryoshka Embedding Models have achieved the top four positions on the MTEB leaderboard for the STS17 Arabic-Arabic (Ar-Ar) task. This remarkable accomplishment highlights the effectiveness and accuracy of our models in capturing semantic textual similarity in Arabic. These models have set a new standard for Arabic NLP, demonstrating their superior performance in understanding and processing the intricacies of the Arabic language.


Creating Your Own Arabic Semantic Search Model

Creating your own semantic search model is an excellent way to achieve highly accurate results with minimal latency. This is especially effective when deploying your semantic search model on a GPU.

For this tutorial, we'll use the MLQA Arabic subset ("MLQA.ar.ar") of the "google/xtreme" dataset from Hugging Face, which contains questions paired with context passages. It offers a diverse range of contexts and questions, making it a rich source of data and an ideal fit for building our semantic search model.

Preparing the Dataset for RAG Application

We can prepare the dataset for the RAG application using the following code:

from datasets import load_dataset

# Load the MLQA Arabic validation split and keep its context passages
dataset = load_dataset("google/xtreme", "MLQA.ar.ar", split="validation")
passages = dataset['context']

Now that we have the dataset, we can encode the data using our Arabic model with Sentence Transformers. Create a Python script with the following (make sure that PyTorch and Sentence Transformers are installed).

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util
import torch

# Initialize the Hugging Face dataset and the SentenceTransformer model
model_name = 'Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka'
encoded_model_path = 'semantic_search_model.pt'

bi_encoder = SentenceTransformer(model_name)

# Load the dataset
dataset = load_dataset("google/xtreme", "MLQA.ar.ar", split="validation")
passages = dataset['context']

# Encode the passages
corpus_embeddings = bi_encoder.encode(
    passages, batch_size=32, convert_to_tensor=True, show_progress_bar=True)

# Save the encoded corpus embeddings
torch.save(corpus_embeddings, encoded_model_path)

The code downloads Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka (our best model, ranked first on the MTEB leaderboard for the STS17 Ar-Ar evaluation) and uses it to encode our data. You can choose among the Arabic Matryoshka embedding models in our collection (https://huggingface.co/collections/Omartificial-Intelligence-Space/arabic-matryoshka-embedding-models-666f764d3b570f44d7f77d4e), depending on your requirements such as model size and use case. Depending on your hardware, you may also want to adjust the batch_size parameter to speed up the encoding process. Once the embeddings are saved, you can run inference with the following Python script:

import torch
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util

# Reload the model and passages (same as in the encoding script)
bi_encoder = SentenceTransformer('Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka')
dataset = load_dataset("google/xtreme", "MLQA.ar.ar", split="validation")
passages = dataset['context']

# Load the pre-computed corpus embeddings
semantic_search_model = torch.load('semantic_search_model.pt')

# Encode the question and retrieve the top 3 matching passages
question_embedding = bi_encoder.encode(
    "أين يتم استخدام الحمض النووي اليوم؟", convert_to_tensor=True)
hits = util.semantic_search(
    question_embedding, semantic_search_model, top_k=3)
hits = hits[0]

result = {"search_results": [
    {"score": hit['score'], "text": passages[hit['corpus_id']]} for hit in hits]}

result["search_results"]

output:

[{'score': 0.4949629604816437,
  'text': 'تفاعل البوليميريز التسلسلي والذي يعرف اختصارًا بتحليل PCR (الحمض النووي - DNA) أصبح من الممكن إجراء تحليل DNA في هذه الأيام باستخدام كميات صغيرة جدًا من الدم: وعلى الرغم من أن هذا التحليل يستخدم كثيرًا في الطب الشرعي، فإنه أصبح الآن جزءًا من عملية تشخيص العديد من الاضطرابات.'},
 {'score': 0.4260385036468506,
  'text': 'هابلوغروب L1 الميتوكوندرية  (بالإنجليزية:  Haplogroup L1 (mtDNA)) هي مجموعة جينات مميزة من دنا متقدرة بشرية ، تتوارثها الذكور والإناث عن الأم ولكن لا تتوارث من الأب. عندما  تتابعها الدراسات فهي تدرس  تتابعها في الإناث ؛ أي توارث  تلك المجموعة الجينية للبنت من الأم من الجدة ...وهكذا في الزمن الماضي.'},
 {'score': 0.37338581681251526,
  'text': 'هذا وتتطلب بعض تحاليل الدم الصوم (أو الامتناع عن الأكل) قبل سحب عينة الدم بفترة تتراوح بين 8 و12 ساعة. ومن أمثلة هذه التحاليل تلك التحاليل التي تقيس نسبة الجلوكوز أو الكوليسترول في الدم، أو تلك التي تستخدم لتحديد وجود أي من الأمراض المنقولة بالاتصال الجنسي من عدمه.  وبالنسبة لغالبية تحاليل الدم، فإنه عادةً ما يتم الحصول على عينة الدم من وريد المريض. ومع ذلك، تتطلب تحاليل دم أخرى متخصصة، مثل تحليل غاز الدم الشرياني سحب الدم من الشريان. هذا ويستخدم تحليل غاز الدم الشرياني في المقام الأول في رصد مستويات غاز ثاني أكسيد الكربون وغاز الأكسجين المرتبطين بوظائف الرئة، ولكنه يستخدم أيضًا في قياس درجة الحموضة في الدم ومستويات البيكربونات في ظل ظروف أيض معينة.'}]

The top_k parameter determines how many results the semantic search returns. Each result contains the matching text from the dataset along with a confidence score. This score is important because it helps us decide whether to accept the response. Since our embedding model now powers the semantic search, the Retrieval part of the pipeline is complete. However, the main limitation of semantic search is that it returns raw text from the dataset without directly answering the question. This is where the Generation component comes into play. By using a generative LLM such as GPT-3.5, we can take the best result from the semantic search model and pass it as context to the LLM to generate a well-formed answer to the query.

import os
from openai import OpenAI

# Function to perform semantic search (bi_encoder, semantic_search_model and
# passages are defined in the previous scripts)
def semantic_search(query, top_k=3):
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(question_embedding, semantic_search_model, top_k=top_k)
    hits = hits[0]

    results = [{"score": hit['score'], "text": passages[hit['corpus_id']]} for hit in hits]
    return results

# Function to generate an answer using GPT-3.5 Turbo
def generate_response(context, query):
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # set your OpenAI API key in this environment variable
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are an assistant providing detailed and accurate information."},
            {"role": "user", "content": f"Question: {query}\n\nContext: {context}"}
        ]
    )
    return response.choices[0].message.content

With this code, we implement the retrieval and generation parts of the RAG pipeline. First, we loaded our pre-computed corpus embeddings and created a semantic_search function that takes a query and returns the most relevant passages from our dataset. The function encodes the query, performs the semantic search using Sentence Transformers, and retrieves the top results.

To address the limitation of semantic search returning raw text, we added a generation component using OpenAI's GPT-3.5 Turbo. The generate_response function takes the context from the top search results and generates a detailed answer to the user's query, combining the strengths of both retrieval and generative AI.
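Putting the two pieces together, a small helper can apply a score threshold before calling the LLM, so low-confidence retrievals are rejected rather than passed to generation. This helper and its 0.3 threshold are illustrative additions built on the two functions above, not a tuned configuration:

# A hypothetical end-to-end helper combining the two functions above.
def rag_answer(query, top_k=3, min_score=0.3):
    results = semantic_search(query, top_k=top_k)
    best = results[0]
    # Reject low-confidence retrievals instead of letting the LLM guess.
    if best["score"] < min_score:
        return "لم يتم العثور على سياق مناسب للإجابة على هذا السؤال."
    return generate_response(best["text"], query)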

Next, let’s try some queries to the model:

# Example query
query = "من هو ستيفن ميلر؟"
search_results = semantic_search(query)

# Get the highest scoring context
highest_scoring_context = search_results[0]['text']
# Generate response using GPT-3.5 Turbo
response = generate_response(highest_scoring_context, query)

# Print the results
print("Search Results:", search_results)

Search Results: [{'score': 0.4687502682209015, 'text': 'ستيفن ميلر (ولد في 23 أغسطس 1985) هو أمريكي من أقصى اليمين والناشط السياسي الذي يشغل منصب مستشار سياسات الرئيس دونالد ترامب. كان مدير الاتصالات سابقا ثم المسئول عن جلسات السيناتورألاباما جيف. كما كان السكرتير الصحفي لميشيل باخمان الممثل الجمهوري و جون شديج .'}, {'score': 0.3333923816680908, 'text': 'ألكسندر بوب (بالإنجليزية: Alexander Pope) \u200f(21 مايو 1688—30 مايو 1744) هو شاعر إنجليزي شهير من القرن الثامن عشر، واشتهر بمقاطع شعرية ساخرة وبترجمته لأعمال هوميروس. وهو ثالث كاتب يتم الاقتباس منه في قاموس أكسفورد للاقتباسات، بعد شكسبير وألفريد تنيسون. واشتهر بوب باستخدام مقطع الشعر البطولي.'}, {'score': 0.31834760308265686, 'text': 'كريستيان كريستيانوفيتش ستيفن (19 يناير (30) 1781، 30 أبريل 1863) (بالروسية: Христиан Христианович Стевен) هو عالم نبات وعالم حشرات روي.'}

This request returns a response similar to:

response
ستيفن ميلر هو ناشط السياسي الذي يشغل منصب مستشار سياسات الرئيس دونالد ترامب

What Is the Difference Between a Semantic Search Model and Storing Embeddings in a Vector Database?

In a Retrieval-Augmented Generation (RAG) system, two effective approaches for implementing semantic search are encoding your data locally or leveraging a vector database. When encoding your own data, you convert it into tensors that can be loaded onto a GPU for enhanced performance. This method is particularly advantageous for achieving very low latency, as the computations are significantly accelerated by the GPU's processing power. However, it requires re-encoding the data each time the dataset is updated, which can be cumbersome for rapidly changing data.

Alternatively, a vector database is a specialized system designed to store, index, and efficiently query high-dimensional vectors. This approach offers greater flexibility, allowing incremental updates without re-encoding the entire dataset. Vector databases, and extensions such as pgvector for PostgreSQL, are well suited to applications where the underlying data changes frequently, simplifying the task of keeping embeddings up to date. For applications that need minimal latency, loading encoded data onto a GPU is recommended; for environments with frequently changing data, a vector database provides a more manageable and scalable solution.
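As a rough illustration of the vector-database route, the sketch below stores embeddings in PostgreSQL with the pgvector extension and queries them by cosine distance. It assumes a running PostgreSQL instance with pgvector installed; the connection string, table name, and sample passage are illustrative, and the vector size matches the model's 768-dimensional output:

import psycopg2
from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka")

conn = psycopg2.connect("dbname=rag user=postgres")  # connection details are illustrative
register_vector(conn)
cur = conn.cursor()

# One-time setup: enable pgvector and create a table sized to the model's output dimension.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("CREATE TABLE IF NOT EXISTS passages (id bigserial PRIMARY KEY, text text, embedding vector(768));")

# Incremental insert: new passages can be added without re-encoding the whole corpus.
text = "يستخدم تحليل الحمض النووي اليوم في الطب الشرعي وتشخيص الأمراض."
cur.execute("INSERT INTO passages (text, embedding) VALUES (%s, %s)", (text, model.encode(text)))
conn.commit()

# Query: <=> is pgvector's cosine-distance operator (smaller is closer).
query_embedding = model.encode("أين يتم استخدام الحمض النووي اليوم؟")
cur.execute("SELECT text FROM passages ORDER BY embedding <=> %s LIMIT 3", (query_embedding,))
print(cur.fetchall())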

Conclusion

In this article, we demonstrated the process of creating a powerful semantic search model tailored for Arabic NLP using the Matryoshka embedding models. By leveraging advanced techniques in semantic search and Retrieval-Augmented Generation (RAG), we highlighted the strengths of our fine-tuned models, which have achieved top positions on the MTEB leaderboard for the STS17 ar-ar task. This achievement underscores the effectiveness of our approach in capturing semantic nuances unique to the Arabic language. We explored the use of diverse training data, focusing on the Arabic NLI triplet dataset to fine-tune our models and enhance their performance. Through this meticulous fine-tuning process, our models demonstrated superior accuracy and robustness across various NLP tasks.

Our journey with the Matryoshka embedding models and the successful integration of semantic search and generative AI techniques pave the way for more accurate and contextually relevant question-answering systems. As we continue to refine and expand our models, we remain committed to advancing Arabic NLP and addressing the unique challenges posed by its rich linguistic diversity.

Resources

[1] Content of this article and code is inspired by https://nlpcloud.com/fine-tuning-semantic-search-model-with-sentence-transformers-for-rag-application.html

[2] Read more about Matryoshka embedding models in the Hugging Face blog post: https://huggingface.co/blog/matryoshka

[3] Arabic Matryoshka embedding models collection: https://huggingface.co/collections/Omartificial-Intelligence-Space/arabic-matryoshka-embedding-models-666f764d3b570f44d7f77d4e

[4] Arabic Matryoshka embedding datasets collection: https://huggingface.co/collections/Omartificial-Intelligence-Space/arabic-nli-and-semantic-similarity-datasets-6671ba0a5e4cd3f5caca50c3

By: Omer Najar