Indexify: Bringing HuggingFace Models to Real-Time Pipelines for Production Applications

Community Article Published May 31, 2024

Whether you're developing e-commerce apps, customer support chatbots, or legal and financial RAG applications, Hugging Face has models for nearly every use case. New models tailored to specific tasks are added almost weekly, ensuring you always have the necessary tools. Prototyping an application with the ubiquitous transformers library is straightforward, but developers often face significant challenges when building data-intensive applications for the real world.

These challenges include:

  • Keeping up with constantly changing data.
  • Ensuring pipeline reliability by processing data consistently, even during transient failures in computing or model-serving infrastructure.
  • Adapting pipelines to newer models.

Thankfully, solving these challenges is easy with Indexify, an open-source data framework for building real-time, data-intensive applications. With Indexify, you can construct pipelines using one or more Hugging Face models that reliably handle tens of thousands of requests. The best part? Indexify can run on your laptop for prototyping and seamlessly scale to cloud infrastructure to handle any traffic volume in your production environment.

Creating Smarter Meeting Notes

If you want to dive into the code for this example, click here!

Imagine you are building an application to generate meeting notes. The typical workflow looks like this:

  1. Receive a feed of meeting recordings.
  2. Transcribe the recordings using an Automatic Speech Recognition model, preferably with diarization, to retain speaker identification.
  3. Summarize the transcriptions to create searchable meeting notes.
  4. Generate the final meeting notes from the summarized transcriptions.

To implement this with Indexify, you would start by creating a pipeline capable of handling these steps. In Indexify, we refer to these pipelines as Extraction Graphs. An Extraction Graph consists of one or more Extractors that transform unstructured data using models or other algorithms and then pass the processed data to another extractor or directly to a storage system.

from indexify import IndexifyClient, ExtractionGraph

client = IndexifyClient()

extraction_graph_spec = """
name: 'asrrag'
extraction_policies:
  - extractor: 'tensorlake/asrdiarization'
    name: 'sttextractor'
    input_params:
      batch_size: 24
  - extractor: 'tensorlake/chunk-extractor'
    name: 'chunker'
    input_params:
      chunk_size: 1000
      overlap: 100
    content_source: 'sttextractor'
  - extractor: 'tensorlake/arctic'
    name: 'embedder'
    content_source: 'chunker'
"""

extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)

  • The first extractor, tensorlake/asrdiarization, combines openai/whisper-large-v3 for transcription with pyannote/speaker-diarization-3.1 for speaker identification.
  • tensorlake/chunk-extractor segments text into 1,000-character chunks with 100-character overlaps for improved processing and retrieval.
  • tensorlake/arctic uses Snowflake's Arctic embedding model for fast and accurate semantic similarity searches.

Once configured:

  • Upload any volume of audio to Indexify.
  • Indexify consistently executes the pipeline, storing transcripts and summaries in blob storage and embeddings in vector databases.
  • The system operates continuously, producing summaries immediately after meetings and making them searchable.
  • Indexify autoscales to manage workload efficiently without infrastructure delays.
For example, a single call uploads a recording and triggers the pipeline:

content_id = client.upload_file("asrrag", "interview.mp3")
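
Once content flows through the graph, the embeddings become queryable. The sketch below assumes the index follows a graphname.policyname.embedding naming convention and that the client exposes a search_index method; verify both against the SDK version you are running.

# Semantic search over the index populated by the 'embedder' policy.
# NOTE: the index name and method signature are assumptions; check the SDK docs.
results = client.search_index(
    name="asrrag.embedder.embedding",
    query="What did we decide about the launch date?",
    top_k=3,
)
for result in results:
    print(result)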

Plug and Play with Hugging Face Models

Indexify works with all the open-source large language models available on Hugging Face. If you need an extractor based on any other open-source model, the process is simple: let's walk through creating a basic Named Entity Recognition extractor.

from typing import List, Union, Optional
from indexify_extractor_sdk import Content, Extractor, Feature
from pydantic import BaseModel, Field
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

class NERExtractorConfig(BaseModel):
    model_name: Optional[str] = Field(default="dslim/bert-base-NER")

class NERExtractor(Extractor):
    name = "tensorlake/ner"
    description = "An extractor that lets you do Named Entity Recognition."
    system_dependencies = []
    input_mime_types = ["text/plain"]

    def __init__(self):
        super().__init__()

    def extract(self, content: Content, params: NERExtractorConfig) -> List[Union[Feature, Content]]:
        text = content.data.decode("utf-8")
        model_name = params.model_name

        # Build an NER pipeline from the configured token-classification model.
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForTokenClassification.from_pretrained(model_name)
        nlp = pipeline("ner", model=model, tokenizer=tokenizer)

        ner_results = nlp(text)

        # The extractor contract expects a list of Feature or Content objects.
        return [Content.from_text(str(ner_results))]

    def sample_input(self) -> Content:
        return Content.from_text("My name is Wolfgang and I live in Berlin")

If you want to dive into the code for this extractor, click here!
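
Before registering the extractor with Indexify, you can sanity-check it locally; the snippet below uses only the classes defined above.

# Run the extractor directly against its built-in sample input.
extractor = NERExtractor()
results = extractor.extract(extractor.sample_input(), NERExtractorConfig())
print(results)  # expect entities such as 'Wolfgang' (PER) and 'Berlin' (LOC)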

Why Do Reliability and Real-Time Processing Matter for AI Applications?

AI applications need to respond to change; in most businesses, documents are created and updated continuously. Consider a scenario where users rely on your meeting summarization application to share updates with their teams: if the pipeline fails to execute, the user experience degrades and trust erodes. Another problem arises when an application becomes popular and adoption spikes, but it cannot keep up because the data processing infrastructure was not designed to scale horizontally.

Most successful AI applications operate within the workflows of consumers or decision-makers, so the data feeding the models must be refreshed within the time constraints of those workflows. That requires infrastructure robust enough to handle failures, such as nodes going down in a cluster or model APIs becoming slow or returning errors.

Key Features That Make Indexify Ideal for Building Production Applications

Multi-Modality and Extensibility

Indexify is multi-modal by default. You can work with documents, text, videos, and audio in your pipelines. You don’t have to reach for another tool or framework when working with a new data type in your applications.
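
Because ingestion is uniform across modalities, the upload call you saw earlier works the same way for any supported file type. In the sketch below, the graph names other than asrrag are hypothetical.

# One ingestion API for every modality; these graph names are illustrative only.
client.upload_file("pdf-pipeline", "quarterly-report.pdf")
client.upload_file("video-pipeline", "all-hands-recording.mp4")
client.upload_file("asrrag", "interview.mp3")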

LLM Framework and Database Compatibility

Indexify's retrieval API is easy to incorporate into your applications, and we have already built integrations with frameworks such as Langchain and DSPy.
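
As a quick sketch of the Langchain integration: the example assumes an indexify_langchain package exposing an IndexifyRetriever class, and the index name and parameters are illustrative; check the current docs for exact names.

# Expose an Indexify index as a Langchain retriever.
# NOTE: package, class, and parameter names are assumptions; verify against the docs.
from indexify import IndexifyClient
from indexify_langchain import IndexifyRetriever

client = IndexifyClient()
params = {"name": "asrrag.embedder.embedding", "top_k": 3}
retriever = IndexifyRetriever(client=client, params=params)

docs = retriever.get_relevant_documents("action items from the last meeting")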

Indexify supports a wide range of vector and structured data storage databases, including Qdrant, Pinecone, PgVector, LanceDB, Postgres, and SQLite. More integrations with backends like Cassandra, MongoDB, and Weaviate are on the way.

Multi-Geography Deployments

Indexify can be deployed in a multi-datacenter mode, allowing data extraction and querying from anywhere. This is advantageous if you have additional GPU or compute capacity in Azure or GCP while your data and applications reside in AWS. You can deploy Indexify’s ingestion server and scheduler in AWS and run the extractors in other clouds or vice versa. This flexibility optimizes capacity, cost, and performance, ensuring that your LLM applications scale seamlessly as data volumes increase.

Indexify uses mTLS to encrypt all traffic between the control and data planes to secure data movement over the internet. This robust security measure protects your sensitive data and ensures compliance with data privacy regulations.

Start Using Indexify

Visit our website and GitHub repository to learn more about the project and how to use it.