mxbai-embed-large-v1-financial-rag-matryoshka

This is a sentence-transformers model fine-tuned from mixedbread-ai/mxbai-embed-large-v1. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: mixedbread-ai/mxbai-embed-large-v1
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
  • Language: en
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
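
The Pooling module above has pooling_mode_cls_token set to True, so the sentence embedding is taken from the first ([CLS]) token rather than from an average over tokens. The following is a minimal NumPy sketch of that difference. The random array stands in for the BertModel output, and the function names and shapes are illustrative assumptions, not part of the library.

```python
import numpy as np

def cls_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Return the first-token ([CLS]) vector of each sequence.

    token_embeddings has shape (batch, seq_len, hidden).
    """
    return token_embeddings[:, 0, :]

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mask-aware mean over tokens, shown only for contrast (disabled in this model)."""
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    return (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 1024))   # stand-in for the transformer output
mask = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 0, 0]])       # second sequence ends with 2 padding tokens

cls_vectors = cls_pool(hidden)
mean_vectors = mean_pool(hidden, mask)
print(cls_vectors.shape)   # (2, 1024)
print(mean_vectors.shape)  # (2, 1024)
```

CLS pooling ignores the attention mask entirely, which is why only the mean-pooling helper takes one.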

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("rbhatia46/mxbai-embed-large-v1-financial-rag-matryoshka")
# Run inference
sentences = [
    'Microsoft, in their latest press release, revealed that they are anticipating a revenue growth of approximately 12% for the fiscal year ending in 2024.',
    "What is Microsoft's projected revenue growth for fiscal year 2024?",
    "What was the impact of COVID-19 on Zoom's profits?",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
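
Because the model was trained with MatryoshkaLoss at 1024, 768, 512, 256, 128 and 64 dimensions, an embedding can be shortened to one of those sizes by keeping its leading components and re-normalizing. The sketch below shows this with NumPy. Random vectors stand in for the output of model.encode so that it runs without downloading the model, and the helper names are hypothetical. Recent Sentence Transformers releases also accept a truncate_dim argument on the SentenceTransformer constructor for the same purpose; check that your installed version supports it.

```python
import numpy as np

def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and rescale each row to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

def cosine_similarity_matrix(unit_embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity for rows that are already unit length."""
    return unit_embeddings @ unit_embeddings.T

rng = np.random.default_rng(0)
full = rng.normal(size=(3, 1024))   # placeholder for model.encode(sentences)

for dim in (1024, 768, 512, 256, 128, 64):
    small = truncate_and_normalize(full, dim)
    sims = cosine_similarity_matrix(small)
    print(dim, small.shape, sims.shape)
```

Smaller dimensions reduce index size and search cost; the Evaluation section reports how much retrieval quality each size retains.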

Evaluation

Metrics

Information Retrieval (dim_1024)

Metric Value
cosine_accuracy@1 0.8456
cosine_accuracy@3 0.9392
cosine_accuracy@5 0.9671
cosine_accuracy@10 0.9899
cosine_precision@1 0.8456
cosine_precision@3 0.3131
cosine_precision@5 0.1934
cosine_precision@10 0.099
cosine_recall@1 0.8456
cosine_recall@3 0.9392
cosine_recall@5 0.9671
cosine_recall@10 0.9899
cosine_ndcg@10 0.9212
cosine_mrr@10 0.8989
cosine_map@100 0.8994

Information Retrieval (dim_768)

Metric Value
cosine_accuracy@1 0.8456
cosine_accuracy@3 0.9392
cosine_accuracy@5 0.9671
cosine_accuracy@10 0.9899
cosine_precision@1 0.8456
cosine_precision@3 0.3131
cosine_precision@5 0.1934
cosine_precision@10 0.099
cosine_recall@1 0.8456
cosine_recall@3 0.9392
cosine_recall@5 0.9671
cosine_recall@10 0.9899
cosine_ndcg@10 0.9217
cosine_mrr@10 0.8995
cosine_map@100 0.8999

Information Retrieval (dim_512)

Metric Value
cosine_accuracy@1 0.8405
cosine_accuracy@3 0.9367
cosine_accuracy@5 0.9646
cosine_accuracy@10 0.9899
cosine_precision@1 0.8405
cosine_precision@3 0.3122
cosine_precision@5 0.1929
cosine_precision@10 0.099
cosine_recall@1 0.8405
cosine_recall@3 0.9367
cosine_recall@5 0.9646
cosine_recall@10 0.9899
cosine_ndcg@10 0.9186
cosine_mrr@10 0.8955
cosine_map@100 0.8959

Information Retrieval (dim_256)

Metric Value
cosine_accuracy@1 0.8456
cosine_accuracy@3 0.9392
cosine_accuracy@5 0.9646
cosine_accuracy@10 0.9899
cosine_precision@1 0.8456
cosine_precision@3 0.3131
cosine_precision@5 0.1929
cosine_precision@10 0.099
cosine_recall@1 0.8456
cosine_recall@3 0.9392
cosine_recall@5 0.9646
cosine_recall@10 0.9899
cosine_ndcg@10 0.9201
cosine_mrr@10 0.8976
cosine_map@100 0.898

Information Retrieval (dim_128)

Metric Value
cosine_accuracy@1 0.8405
cosine_accuracy@3 0.9418
cosine_accuracy@5 0.9646
cosine_accuracy@10 0.9848
cosine_precision@1 0.8405
cosine_precision@3 0.3139
cosine_precision@5 0.1929
cosine_precision@10 0.0985
cosine_recall@1 0.8405
cosine_recall@3 0.9418
cosine_recall@5 0.9646
cosine_recall@10 0.9848
cosine_ndcg@10 0.9171
cosine_mrr@10 0.8949
cosine_map@100 0.8957

Information Retrieval (dim_64)

Metric Value
cosine_accuracy@1 0.8405
cosine_accuracy@3 0.9316
cosine_accuracy@5 0.957
cosine_accuracy@10 0.9823
cosine_precision@1 0.8405
cosine_precision@3 0.3105
cosine_precision@5 0.1914
cosine_precision@10 0.0982
cosine_recall@1 0.8405
cosine_recall@3 0.9316
cosine_recall@5 0.957
cosine_recall@10 0.9823
cosine_ndcg@10 0.9153
cosine_mrr@10 0.8935
cosine_map@100 0.8943
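
The numbers above are consistent with each query having exactly one relevant passage: accuracy@k and recall@k are identical, and precision@k equals accuracy@k divided by k (for example, 0.9392 / 3 ≈ 0.3131 in the first table). The sketch below computes the metrics under that assumption. The ranks are made-up toy data, not the evaluation set, and metrics_at_k is a hypothetical helper rather than the evaluator used for this card.

```python
def metrics_at_k(ranks, k):
    """Compute accuracy, precision, recall and MRR at cutoff k.

    ranks[i] is the 1-based rank of the single relevant passage for
    query i, or None if it was not retrieved at all.
    """
    n = len(ranks)
    hits = [r is not None and r <= k for r in ranks]
    accuracy = sum(hits) / n
    precision = accuracy / k   # one relevant passage among k retrieved
    recall = accuracy          # one relevant passage in total per query
    mrr = sum(1.0 / r for r, hit in zip(ranks, hits) if hit) / n
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "mrr": mrr}

# Toy ranks for five hypothetical queries (not the model's real results).
toy_ranks = [1, 1, 2, 4, None]
for k in (1, 3, 5):
    print(k, metrics_at_k(toy_ranks, k))

# Consistency check against the first table: precision@k = accuracy@k / k.
print(round(0.9392 / 3, 4), round(0.9671 / 5, 4), round(0.9899 / 10, 4))
# 0.3131 0.1934 0.099
```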

Training Details

Training Dataset

Unnamed Dataset

  • Size: 3,550 training samples
  • Columns: positive and anchor
  • Approximate statistics based on the first 1000 samples:
              positive        anchor
    type      string          string
    min       17 tokens       10 tokens
    mean      44.69 tokens    18.26 tokens
    max       105 tokens      30 tokens
  • Samples:
    positive: The total revenue for Google as of 2021 stands at approximately $181 billion, primarily driven by the performance of its advertising and cloud segments, hailing from the Information Technology sector.
    anchor: What is the total revenue of Google as of 2021?

    positive: In Q4 2021, Amazon.com Inc. reported a significant increase in net income, reaching $14.3 billion, due to the surge in online shopping during the pandemic.
    anchor: What was the Net Income of Amazon.com Inc. in Q4 2021?

    positive: Coca-Cola reported full-year 2021 revenue of $37.3 billion, a rise of 13% compared to $33.0 billion in 2020. This was primarily due to strong volume growth as well as improved pricing and mix.
    anchor: How did Coca-Cola's revenue performance in 2021 measure against its previous year?
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            1024,
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
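
The configuration above wraps MultipleNegativesRankingLoss in MatryoshkaLoss: the base loss is evaluated on the first 1024, 768, 512, 256, 128 and 64 embedding components, and the results are combined using the listed weights. The NumPy sketch below illustrates the idea only; it is not the library implementation. The scale of 20 and the cosine scoring are assumptions about the base loss defaults, all dimensions are used at every step (which is how n_dims_per_step: -1 is read here), and the random vectors are toy data.

```python
import numpy as np

def mnr_loss(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """In-batch-negatives ranking loss.

    Row i of `positives` is the correct match for row i of `anchors`;
    every other row in the batch acts as a negative.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                            # (batch, batch) scaled cosine scores
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # cross-entropy with diagonal targets

def matryoshka_loss(anchors, positives, dims, weights) -> float:
    """Weighted sum of the base loss over truncated embedding prefixes."""
    return sum(w * mnr_loss(anchors[:, :d], positives[:, :d])
               for d, w in zip(dims, weights))

dims = [1024, 768, 512, 256, 128, 64]
weights = [1, 1, 1, 1, 1, 1]

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 1024))
positives = anchors + 0.1 * rng.normal(size=(8, 1024))   # noisy copies as toy positives
print(matryoshka_loss(anchors, positives, dims, weights))
```

Training every prefix against the same ranking objective is what lets the shorter embeddings remain usable on their own.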
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 16
  • gradient_accumulation_steps: 16
  • learning_rate: 2e-05
  • num_train_epochs: 10
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • bf16: True
  • tf32: True
  • load_best_model_at_end: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates
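
With lr_scheduler_type: cosine and warmup_ratio: 0.1, the learning rate rises linearly to 2e-05 and then decays along a half cosine. The following pure-Python sketch shows that shape. It assumes 60 optimizer steps in total, which is the last step shown in the Training Logs table, and a single device, so the effective batch size is 32 × 16 = 512. The function name is hypothetical and the formula is a common formulation of this schedule, not code taken from the training run.

```python
import math

def cosine_lr_with_warmup(step: int, base_lr: float,
                          warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to base_lr, then half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

BASE_LR = 2e-05
TOTAL_STEPS = 60                              # assumption: last step in the training log
WARMUP_STEPS = math.ceil(0.1 * TOTAL_STEPS)   # warmup_ratio: 0.1

EFFECTIVE_BATCH = 32 * 16   # per-device batch size x accumulation steps, one device assumed

for step in (0, 3, 6, 33, 60):
    print(step, cosine_lr_with_warmup(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS))
```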

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 16
  • eval_accumulation_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 10
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: True
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss dim_1024_cosine_map@100 dim_128_cosine_map@100 dim_256_cosine_map@100 dim_512_cosine_map@100 dim_64_cosine_map@100 dim_768_cosine_map@100
0.8649 6 - 0.8783 0.8651 0.8713 0.8783 0.8439 0.8809
1.4414 10 0.7682 - - - - - -
1.8739 13 - 0.8918 0.8827 0.8875 0.8918 0.8729 0.8933
2.8829 20 0.1465 0.8948 0.8896 0.8928 0.8961 0.8884 0.8953
3.8919 27 - 0.8930 0.8884 0.8917 0.8959 0.8900 0.8945
4.3243 30 0.0646 - - - - - -
4.9009 34 - 0.8972 0.8883 0.8947 0.8955 0.8925 0.8970
5.7658 40 0.0397 - - - - - -
5.9099 41 - 0.8964 0.8915 0.8953 0.8943 0.8926 0.8979
6.9189 48 - 0.8994 0.8930 0.8966 0.8955 0.8932 0.8974
7.2072 50 0.0319 - - - - - -
7.9279 55 - 0.8998 0.8945 0.8967 0.8961 0.8943 0.8999
8.6486 60 0.0296 0.8994 0.8957 0.898 0.8959 0.8943 0.8999
  • The saved checkpoint is the last row (step 60); its scores match the Evaluation section above.
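
The fractional Epoch values in the table follow from the dataset size and batch settings. With 3,550 samples and a batch size of 32 there are 111 batches per epoch, and each optimizer step consumes 16 of them because of gradient accumulation. The short calculation below reproduces the logged values; it assumes a single training device, and the variable names are illustrative.

```python
import math

TRAIN_SAMPLES = 3550
BATCH_SIZE = 32    # per_device_train_batch_size, one device assumed
GRAD_ACCUM = 16    # gradient_accumulation_steps
EPOCHS = 10        # num_train_epochs

batches_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)   # 111
updates_per_epoch = batches_per_epoch // GRAD_ACCUM         # 6
total_updates = updates_per_epoch * EPOCHS                  # 60

def epoch_at_step(step: int) -> float:
    """Fraction of epochs completed after `step` optimizer updates."""
    return step * GRAD_ACCUM / batches_per_epoch

for step in (6, 10, 20, 60):
    print(step, round(epoch_at_step(step), 4))
# 6 0.8649
# 10 1.4414
# 20 2.8829
# 60 8.6486
```

This also explains why the run ends at epoch 8.6486 rather than 10: only 6 full accumulation cycles fit into each epoch's 111 batches, giving 60 updates in total.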

Framework Versions

  • Python: 3.10.6
  • Sentence Transformers: 3.0.1
  • Transformers: 4.41.2
  • PyTorch: 2.1.2+cu121
  • Accelerate: 0.32.1
  • Datasets: 2.19.1
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning}, 
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply}, 
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Model size: 335M params (Safetensors, tensor type F32)