Tom Aarsen

tomaarsen

AI & ML interests

NLP: text embeddings, named entity recognition, few-shot text classification

Articles

Organizations

Posts 11

view post
Post
2261
@Omartificial-Intelligence-Space has trained and released 6 Arabic embedding models for semantic similarity. 4 of them outperform all previous models on the STS17 Arabic-Arabic task!

📚 Trained on a large dataset of 558k Arabic triplets translated from the AllNLI triplet dataset: Omartificial-Intelligence-Space/Arabic-NLi-Triplet
6️⃣ 6 different base models: AraBERT, MarBERT, LaBSE, MiniLM, paraphrase-multilingual-mpnet-base, mpnet-base, ranging from 109M to 471M parameters.
🪆 Trained with a Matryoshka loss, allowing you to truncate embeddings with minimal performance loss: smaller embeddings are faster to compare.
📈 Outperforms all commonly used multilingual models like intfloat/multilingual-e5-large, sentence-transformers/paraphrase-multilingual-mpnet-base-v2, and sentence-transformers/LaBSE.

Check them out here:
- Omartificial-Intelligence-Space/Arabic-mpnet-base-all-nli-triplet
- Omartificial-Intelligence-Space/Arabic-all-nli-triplet-Matryoshka
- Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka
- Omartificial-Intelligence-Space/Arabic-labse-Matryoshka
- Omartificial-Intelligence-Space/Marbert-all-nli-triplet-Matryoshka
- Omartificial-Intelligence-Space/Arabic-MiniLM-L12-v2-all-nli-triplet
Or the collection with all: Omartificial-Intelligence-Space/arabic-matryoshka-embedding-models-666f764d3b570f44d7f77d4e

My personal favourite is likely Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka: a very efficient 135M parameters & scores #1 on mteb/leaderboard.
view post
Post
2443
I just published Sentence Transformers v3.0.1: the first patch release since v3 from last week. It introduces gradient checkpointing, pushing model checkpoints to Hugging Face while training, model card improvements and fixes. Details:

1️⃣ Gradient checkpointing allows for much less memory usage at a cost of ~20% training speed. Seems to allow for higher batch sizes, which is quite important for loss functions with in-batch negatives.
2️⃣ You can specify args.push_to_hub=True and args.hub_model_id to upload your model checkpoints to Hugging Face while training. It also uploads your emissions (if codecarbon is installed) and your Tensorboard logs (if tensorboard is installed)
3️⃣ Model card improvements: improved automatic widget examples, better tags, and the default of "sentence_transformers_model_id" now gets replaced when possible.
4️⃣ Several evaluator fixes, see release notes for details.
5️⃣ Fixed a bug with MatryoshkaLoss throwing an error if the supplied Matryoshka dimensions are ascending instead of descending.
6️⃣ Full Safetensors support; even the uncommon modules can now save and load "model.safetensors" files: no more pickle risks.

Check out the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.0.1

And let me know what kind of features you'd like to see next! I have some plans already (ONNX, Sparse models, ColBERT, PEFT), but I don't yet know how I should prioritize everything.