
Model Card for philipchung/bge-m3-onnx

This is the BAAI/BGE-M3 inference model converted to ONNX format, for use with Optimum ONNX Runtime with CPU acceleration. The model outputs all three BGE-M3 embedding types (dense, sparse, and ColBERT).

No ONNX optimizations are applied to this model. If you want to apply optimizations, use the export script included in this repo to generate an optimized version of the ONNX model.

Some of the code is adapted from aapot/bge-m3-onnx. The model in this repo inherits from PreTrainedModel, so the ONNX model can be downloaded from the Hugging Face Hub and used directly with the from_pretrained() method.

How to Use

from collections import defaultdict
from typing import Any

import numpy as np
from optimum.onnxruntime import ORTModelForCustomTasks
from transformers import AutoTokenizer

# Download ONNX model from Huggingface Hub
onnx_model = ORTModelForCustomTasks.from_pretrained("philipchung/bge-m3-onnx")
tokenizer = AutoTokenizer.from_pretrained("philipchung/bge-m3-onnx")
# Inference forward pass
sentences = ["First test sentence.", "Second test sentence."]
inputs = tokenizer(
    sentences,
    padding="longest",
    return_tensors="np",
)
outputs = onnx_model.forward(**inputs)

def process_token_weights(
    token_weights: np.ndarray, input_ids: list[int]
) -> defaultdict[str, float]:
    """Convert sparse token weights into dictionary of token indices and corresponding weights.

    Function is taken from the original FlagEmbedding.bge_m3.BGEM3FlagModel from the
    _process_token_weights() function defined within the encode() method.
    """
    # convert to dict
    result = defaultdict(int)
    unused_tokens = set(
        [
            tokenizer.cls_token_id,
            tokenizer.eos_token_id,
            tokenizer.pad_token_id,
            tokenizer.unk_token_id,
        ]
    )
    for w, idx in zip(token_weights, input_ids, strict=False):
        if idx not in unused_tokens and w > 0:
            idx = str(idx)
            if w > result[idx]:
                result[idx] = w
    return result

# Each sentence results in a dict[str, list[float] | dict[str, float] | list[list[float]]],
# which corresponds to a dict with dense, sparse, and colbert embeddings.
embeddings_list = []
for input_ids, dense_vec, sparse_vec, colbert_vec in zip(
    inputs["input_ids"],
    outputs["dense_vecs"],
    outputs["sparse_vecs"],
    outputs["colbert_vecs"],
    strict=False,
):
    # Convert token weights into dictionary of token indices and corresponding weights
    token_weights = sparse_vec.astype(float).squeeze(-1)
    sparse_embeddings = process_token_weights(
        token_weights,
        input_ids.tolist(),
    )
    multivector_embedding = {
        "dense": dense_vec.astype(float).tolist(),  # (1024)
        "sparse": dict(sparse_embeddings),  # dict[token_index, weight]
        "colbert": colbert_vec.astype(float).tolist(),  # (token len, 1024)
    }
    embeddings_list.append(multivector_embedding)
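Once embeddings are computed, sentence pairs can be scored per embedding type. The sketch below follows the scoring conventions used by FlagEmbedding's BGEM3FlagModel (dense: dot product of the L2-normalized dense vectors; sparse: sum of weight products over shared token indices; ColBERT: mean over query tokens of the maximum similarity to any passage token). The function names here are illustrative helpers, not part of this repo or the FlagEmbedding library.

```python
import numpy as np


def dense_score(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity when vectors are L2-normalized,
    as BGE-M3 dense vectors are."""
    return float(np.dot(a, b))


def sparse_score(a: dict[str, float], b: dict[str, float]) -> float:
    """Sum of weight products over token indices present in both sparse embeddings."""
    return sum(w * b[token] for token, w in a.items() if token in b)


def colbert_score(q_vecs: list[list[float]], p_vecs: list[list[float]]) -> float:
    """MaxSim: mean over query tokens of the max similarity to any passage token."""
    sim = np.asarray(q_vecs) @ np.asarray(p_vecs).T  # (query_tokens, passage_tokens)
    return float(sim.max(axis=1).mean())


# Toy 2-dimensional example (real BGE-M3 vectors are 1024-dimensional):
q = {"dense": [1.0, 0.0], "sparse": {"5": 0.5, "9": 0.2}, "colbert": [[1.0, 0.0], [0.0, 1.0]]}
p = {"dense": [0.6, 0.8], "sparse": {"5": 0.4}, "colbert": [[1.0, 0.0]]}
print(dense_score(q["dense"], p["dense"]))      # 0.6
print(sparse_score(q["sparse"], p["sparse"]))   # 0.5 * 0.4 = 0.2
print(colbert_score(q["colbert"], p["colbert"]))  # mean(max per query token) = 0.5
```

With the real model, pass e.g. embeddings_list[0]["dense"] and embeddings_list[1]["dense"] to dense_score, and likewise for the sparse and colbert entries.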