aari1995 committed on
Commit bda68eb
1 Parent(s): 21fe783

Update README.md

Files changed (1):
  1. README.md +17 -71
README.md CHANGED
@@ -20,26 +20,21 @@ metrics:
  - pearson_max
  - spearman_max
  widget:
- - source_sentence: Ein Mann spricht.
+ - source_sentence: Bundeskanzler.
  sentences:
- - Ein Mann spricht in ein Mikrofon.
- - Der Mann spielt auf den Tastaturen.
- - Zwei Mädchen gehen im Ozean spazieren.
- - source_sentence: Eine Flagge weht.
+ - Angela Merkel.
+ - Olaf Scholz.
+ - Tino Chrupalla.
+ - source_sentence: Corona.
  sentences:
- - Die Flagge bewegte sich in der Luft.
- - Ein Hund fährt auf einem Skateboard.
- - Zwei Frauen sitzen in einem Cafe.
+ - Virus.
+ - Krone.
+ - Bier.
  - source_sentence: Ein Mann übt Boxen
  sentences:
  - Ein Affe praktiziert Kampfsportarten.
  - Eine Person faltet ein Blatt Papier.
  - Eine Frau geht mit ihrem Hund spazieren.
- - source_sentence: Das Tor ist gelb.
- sentences:
- - Das Tor ist blau.
- - Die Frau hält die Hände des Mannes.
- - NATO-Soldat bei afghanischem Angriff getötet
  - source_sentence: Zwei Frauen laufen.
  sentences:
  - Frauen laufen.
@@ -50,14 +45,14 @@ pipeline_tag: sentence-similarity

  # German Semantic V3

- The successor of German_Semantic_STS_V2 is here and comes with loads of cool new features.
+ The successor of German_Semantic_STS_V2 is here and comes with loads of cool new features!

  **Note:** To run this model properly, you need to set "trust_remote_code=True". See "Usage".

  ## Major updates and USPs:

  - **Flexibility:** Trained with flexible sequence-length and embedding truncation, flexibility is a core feature of the model. Yet, smaller dimensions bring a minor trade-off in quality.
- - **Sequence length:** 8192, (16 times more than V2 and other models) -> thanks to the ALiBi implementation of Jina-Team!
+ - **Sequence length:** Embed up to 8192 tokens (16 times more than V2 and other models)
  - **Matryoshka Embeddings:** The model is trained for embedding sizes from 1024 down to 64, allowing you to store much smaller embeddings with little quality loss.
  - **German only:** This model is German-only and has rich cultural knowledge about Germany and German topics. This also allows the model to learn more efficiently thanks to its tokenizer, to deal better with shorter queries, and to be more nuanced in many scenarios.
  - **Updated knowledge and quality data:** The backbone of this model is gbert-large by deepset. With Stage-2 pretraining on 1 billion tokens of German fineweb by occiglot, up-to-date knowledge is ensured.
@@ -100,67 +95,18 @@ SentenceTransformer(
  )
  ```

- ## Usage
-
- ### Direct Usage (Sentence Transformers)
-
- First install the Sentence Transformers library:
-
- ```bash
- pip install -U sentence-transformers
- ```
-
- Then you can load this model and run inference.
- ```python
- from sentence_transformers import SentenceTransformer
-
- # Download from the 🤗 Hub
- model = SentenceTransformer("aari1995/gbert-large-2-cls-pawsx-nli-sts")
- # Run inference
- sentences = [
-     'Zwei Frauen laufen.',
-     'Frauen laufen.',
-     'Die Frau prüft die Augen des Mannes.',
- ]
- embeddings = model.encode(sentences)
- print(embeddings.shape)
- # [3, 1024]
-
- # Get the similarity scores for the embeddings
- similarities = model.similarity(embeddings, embeddings)
- print(similarities.shape)
- # [3, 3]
- ```
-
- <!--
- ### Direct Usage (Transformers)
-
- <details><summary>Click to see the direct usage in Transformers</summary>
-
- </details>
- -->
-
- <!--
- ### Downstream Usage (Sentence Transformers)
-
- You can finetune this model on your own dataset.
-
- <details><summary>Click to expand</summary>
-
- </details>
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->

  ## Evaluation

  Evaluation to come.

- ## Citation
+ ## Thank You and Credits
+
+ - To [occiglot](https://huggingface.co/occiglot) and OSCAR for their data used to pre-train the model
+ - To [deepset](https://huggingface.co/deepset) for gbert-large, which is a really great model
+ - To [jinaAI](https://huggingface.co/jinaai) for their BERT implementation that is used here, especially ALiBi
+ - To [Tom](https://huggingface.co/tomaarsen), especially for sentence-transformers, and to [Björn and Jan from ellamind](https://ellamind.com/de/) for the consultation
+ - To [Meta](https://huggingface.co/facebook) for XNLI

  ### BibTeX
 
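For quick reference, here is a minimal usage sketch matching the updated card: it adapts the Sentence Transformers snippet removed above and adds the `trust_remote_code=True` flag that the note in the diff requires. The repo id `aari1995/German_Semantic_V3` is an assumption taken from the card's title, not something stated in the diff itself.

```python
from sentence_transformers import SentenceTransformer

# Repo id assumed from the card title "German Semantic V3"; adjust if the model
# is published under a different name. trust_remote_code is required per the card.
model = SentenceTransformer("aari1995/German_Semantic_V3", trust_remote_code=True)

sentences = [
    "Zwei Frauen laufen.",
    "Frauen laufen.",
    "Die Frau prüft die Augen des Mannes.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 1024)

# Pairwise similarity scores between all sentence embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # (3, 3)
```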
 
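The Matryoshka and sequence-length bullets above correspond to two knobs in Sentence Transformers. The sketch below assumes the same repo id and a recent sentence-transformers release (`truncate_dim` was introduced around v2.7); whether the published model already defaults to the full 8192-token window is not stated in the diff, so the explicit setting is illustrative.

```python
from sentence_transformers import SentenceTransformer

# Matryoshka embeddings: keep only the first 256 dimensions (the card says the
# model is trained for sizes from 1024 down to 64).
model = SentenceTransformer(
    "aari1995/German_Semantic_V3",  # assumed repo id
    trust_remote_code=True,
    truncate_dim=256,
)

# Long inputs: the card states the backbone can embed up to 8192 tokens.
model.max_seq_length = 8192

emb = model.encode(["Ein langes deutsches Dokument ..."])
print(emb.shape)  # (1, 256)
```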