---
language:
- de
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- loss:MatryoshkaLoss
base_model: aari1995/gbert-large-2
metrics:
- pearson_cosine
- spearman_cosine
- pearson_manhattan
- spearman_manhattan
- pearson_euclidean
- spearman_euclidean
- pearson_dot
- spearman_dot
- pearson_max
- spearman_max
widget:
- source_sentence: Bundeskanzler.
  sentences:
  - Angela Merkel.
  - Olaf Scholz.
  - Tino Chrupalla.
- source_sentence: Corona.
  sentences:
  - Virus.
  - Krone.
  - Bier.
- source_sentence: Ein Mann übt Boxen
  sentences:
  - Ein Affe praktiziert Kampfsportarten.
  - Eine Person faltet ein Blatt Papier.
  - Eine Frau geht mit ihrem Hund spazieren.
- source_sentence: Zwei Frauen laufen.
  sentences:
  - Frauen laufen.
  - Die Frau prüft die Augen des Mannes.
  - Ein Mann ist auf einem Dach
pipeline_tag: sentence-similarity
---

# German Semantic V3

The successor of German_Semantic_STS_V2 is here and comes with loads of cool new features!

**Note:** To run this model properly, you need to set `trust_remote_code=True`. See "Usage".

## Major updates and USPs:

- **Flexibility:** Trained with flexible sequence length and embedding truncation, flexibility is a core feature of the model. Smaller dimensions come with a minor trade-off in quality.
- **Sequence length:** Embed up to 8192 tokens (16 times more than V2 and other models).
- **Matryoshka Embeddings:** The model is trained for embedding sizes from 1024 down to 64, allowing you to store much smaller embeddings with little quality loss (see the sketch after this list).
- **German only:** This model is German-only and has rich cultural knowledge about Germany and German topics. This also allows the model to learn more efficiently thanks to its tokenizer, deal better with shorter queries, and generally be more nuanced in many scenarios.
- **Updated knowledge and quality data:** The backbone of this model is gbert-large by deepset. Stage-2 pretraining on 1 billion tokens of German fineweb by occiglot ensures up-to-date knowledge.
- **Typo and Casing:** The model was trained to be robust against minor typos and casing variations. This leads to slightly weaker benchmark performance and learning during training, but to more robust embeddings.
- **Pooling Function:** Moving away from mean pooling towards using the CLS token, which generally seems to learn better after the stage-2 pretraining and allows for more flexibility.
- **License:** Apache 2.0
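The Matryoshka property also means you can truncate full-size embeddings after encoding, rather than (or in addition to) setting `truncate_dim` at load time. The following is a minimal sketch, not part of the original card: the example sentences and the 256-dimension cutoff are illustrative, and it assumes the model is loaded as in the "Usage" section below.

```python
import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the full model (1024-dim output); trust_remote_code is required for the JinaBert backbone
model = SentenceTransformer("aari1995/German_Semantic_V3", trust_remote_code=True)

# Illustrative sentence pair (not from the card)
sentences = ["Der Hund spielt im Garten.", "Ein Hund tobt draußen."]

# Full 1024-dimensional embeddings
full = model.encode(sentences, convert_to_tensor=True)

# Keep only the leading 256 dimensions and re-normalize them for cosine similarity
truncated = torch.nn.functional.normalize(full[:, :256], p=2, dim=1)

print(cos_sim(full[0], full[1]))            # similarity from the full embeddings
print(cos_sim(truncated[0], truncated[1]))  # similarity from the truncated embeddings
```

At a quarter of the storage cost, the 256-dimensional vectors should yield scores close to the full ones; how close is worth verifying on your own data.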
## Usage:

```python
from sentence_transformers import SentenceTransformer

matryoshka_dim = 1024  # How big your embeddings should be; choose from: 64, 128, 256, 512, 768, 1024
model = SentenceTransformer("aari1995/German_Semantic_V3", trust_remote_code=True, truncate_dim=matryoshka_dim)

# model.truncate_dim = 64  # truncation dimensions can also be changed after loading
# model.max_seq_length = 512  # optionally, set your maximum sequence length lower if your hardware is limited

# Run inference
sentences = [
    'Eine Flagge weht.',
    'Die Flagge bewegte sich in der Luft.',
    'Zwei Personen beobachten das Wasser.',
]
embeddings = model.encode(sentences)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
```

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: JinaBertModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```

## Evaluation

Evaluation to come.

## Thank You and Credits

- To [occiglot](https://huggingface.co/occiglot) and OSCAR for their data used to pre-train the model
- To [deepset](https://huggingface.co/deepset) for gbert-large, which is a really great model
- To [jinaAI](https://huggingface.co/jinaai) for their BERT implementation that is used here, especially ALiBi
- To [Tom](https://huggingface.co/tomaarsen), especially for sentence-transformers, and to [Björn and Jan from ellamind](https://ellamind.com/de/) for the consultation
- To [Meta](https://huggingface.co/facebook) for XNLI

### BibTeX

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MatryoshkaLoss

```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```