---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- transformers
- TAACO
language: ko
---

# TAACO_Similarity

This model is based on [Sentence-transformers](https://www.SBERT.net) and was trained on the STS (Sentence Textual Similarity) dataset from KLUE.

It was built to measure semantic cohesion between sentences, one of the metrics of K-TAACO (working title), a cohesion-measurement tool for Korean text that the author is developing.

Further training is planned with additional data, such as the sentence-similarity data from the Modu Corpus (모두의 말뭉치).

## Train Data
- KLUE-sts-v1.1._train.json
- NLI-sts-train.tsv
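The KLUE-STS release stores each pair as a JSON record with a 0-5 similarity label, while sentence-transformers expects scores in [0, 1]. A minimal loading sketch (the field names are assumed from the public KLUE-STS format, not stated in this card):

```python
import json

def load_sts_pairs(path):
    # Each record is assumed to hold "sentence1", "sentence2" and a
    # "labels" dict whose "label" entry is a 0-5 similarity score.
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    pairs = []
    for rec in records:
        score = rec["labels"]["label"] / 5.0  # normalize to [0, 1]
        pairs.append((rec["sentence1"], rec["sentence2"], score))
    return pairs
```

The resulting (sentence1, sentence2, score) triples can then be wrapped in sentence-transformers `InputExample` objects for training.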
## Usage (Sentence-Transformers)

Using this model requires [Sentence-transformers](https://www.SBERT.net) to be installed:

```
pip install -U sentence-transformers
```

To use the model, refer to the code below.

```python
from sentence_transformers import SentenceTransformer, models

sentences = ["This is an example sentence", "Each sentence is converted"]

embedding_model = models.Transformer(
    model_name_or_path="KDHyun08/TAACO_STS",
    max_seq_length=256,
    do_lower_case=True
)

pooling_model = models.Pooling(
    embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
)
model = SentenceTransformer(modules=[embedding_model, pooling_model])

embeddings = model.encode(sentences)
print(embeddings)
```

## Usage (comparing similarity between actual sentences)

After installing [Sentence-transformers](https://www.SBERT.net), you can compare the similarity between sentences as shown below.

The `query` variable holds the source sentence that the comparison is based on, and the sentences to compare against it go into the list `docs`.

```python
from sentence_transformers import SentenceTransformer, models, util
import torch

embedding_model = models.Transformer(
    model_name_or_path="KDHyun08/TAACO_STS",
    max_seq_length=256,
    do_lower_case=True
)

pooling_model = models.Pooling(
    embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
)
model = SentenceTransformer(modules=[embedding_model, pooling_model])

docs = ['어제는 아내의 생일이었다', '생일을 맞이하여 아침을 준비하겠다고 오전 8시 30분부터 음식을 준비하였다. 주된 메뉴는 스테이크와 낙지볶음, 미역국, 잡채, 소야 등이었다', '스테이크는 자주 하는 음식이어서 자신이 준비하려고 했다', '앞뒤를 1분씩 3번 뒤집고 레스팅을 잘 하면 육즙이 가득한 스테이크가 준비된다', '아내도 그런 스테이크를 좋아한다. 그런데 상상도 못한 일이 벌어지고 말았다', '보통 시즈닝이 되지 않은 원육을 사서 스테이크를 했는데, 이번에는 시즈닝이 된 부채살을 구입해서 했다', '그런데 케이스 안에 방부제가 들어있는 것을 인지하지 못하고 방부제와 동시에 프라이팬에 올려놓은 것이다', '그것도 인지 못한 체... 앞면을 센 불에 1분을 굽고 뒤집는 순간 방부제가 함께 구워진 것을 알았다', '아내의 생일이라 맛있게 구워보고 싶었는데 어처구니없는 상황이 발생한 것이다', '방부제가 센 불에 녹아서 그런지 물처럼 흘러내렸다', ' 고민을 했다. 방부제가 묻은 부문만 제거하고 다시 구울까 했는데 방부제에 절대 먹지 말라는 문구가 있어서 아깝지만 버리는 방향을 했다', '너무나 안타까웠다', '아침 일찍 아내가 좋아하는 스테이크를 준비하고 그것을 맛있게 먹는 아내의 모습을 보고 싶었는데 전혀 생각지도 못한 상황이 발생해서... 하지만 정신을 추스르고 바로 다른 메뉴로 변경했다', '소야, 소시지 야채볶음..', '아내가 좋아할지 모르겠지만 냉장고 안에 있는 핫도그소세지를 보니 바로 소야를 해야겠다는 생각이 들었다. 음식은 성공적으로 완성이 되었다', '40번째를 맞이하는 아내의 생일은 성공적으로 준비가 되었다', '맛있게 먹어 준 아내에게도 감사했다', '매년 아내의 생일을 맞이하면 아침마다 생일을 차려야겠다. 오늘도 즐거운 하루가 되었으면 좋겠다', '생일이니까~']

# Encode each document sentence into a vector
document_embeddings = model.encode(docs)

query = '생일을 맞이하여 아침을 준비하겠다고 오전 8시 30분부터 음식을 준비하였다'
query_embedding = model.encode(query)

top_k = min(10, len(docs))

# Compute cosine similarity between the query and every document
cos_scores = util.pytorch_cos_sim(query_embedding, document_embeddings)[0]

# Extract the top sentences in order of cosine similarity
top_results = torch.topk(cos_scores, k=top_k)

print(f"입력 문장: {query}")
print(f"\n<입력 문장과 유사한 {top_k} 개의 문장>\n")

for i, (score, idx) in enumerate(zip(top_results[0], top_results[1])):
    print(f"{i+1}: {docs[idx]} (유사도: {score:.4f})\n")
```
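`torch.topk` above simply ranks the cosine scores and keeps the `k` best, returning a `(values, indices)` pair. The same selection can be sketched in plain Python (an illustrative helper, not part of the card's code):

```python
def top_k_matches(scores, k):
    # Pair each score with its index, sort by score descending, keep the k best.
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

For example, `top_k_matches([0.1, 0.9, 0.5], 2)` picks indices 1 and 2 with their scores, mirroring the order in which the loop above prints the matched documents.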

## Evaluation Results

Running the Usage example above produces the results below. The closer the score is to 1, the more similar the sentences are.

|
104 |
+
μ
λ ₯ λ¬Έμ₯: μμΌμ λ§μ΄νμ¬ μμΉ¨μ μ€λΉνκ² λ€κ³ μ€μ 8μ 30λΆλΆν° μμμ μ€λΉνμλ€
|
105 |
+
|
106 |
+
<μ
λ ₯ λ¬Έμ₯κ³Ό μ μ¬ν 10 κ°μ λ¬Έμ₯>
|
107 |
|
108 |
+
1: μμΌμ λ§μ΄νμ¬ μμΉ¨μ μ€λΉνκ² λ€κ³ μ€μ 8μ 30λΆλΆν° μμμ μ€λΉνμλ€. μ£Όλ λ©λ΄λ μ€ν
μ΄ν¬μ λμ§λ³Άμ, λ―Έμκ΅, μ‘μ±, μμΌ λ±μ΄μλ€ (μ μ¬λ: 0.6687)
|
109 |
|
110 |
+
2: 맀λ
μλ΄μ μμΌμ λ§μ΄νλ©΄ μμΉ¨λ§λ€ μμΌμ μ°¨λ €μΌκ² λ€. μ€λλ μ¦κ±°μ΄ νλ£¨κ° λμμΌλ©΄ μ’κ² λ€ (μ μ¬λ: 0.6468)
|
111 |
+
|
112 |
+
3: 40λ²μ§Έλ₯Ό λ§μ΄νλ μλ΄μ μμΌμ μ±κ³΅μ μΌλ‘ μ€λΉκ° λμλ€ (μ μ¬λ: 0.4647)
|
113 |
+
|
114 |
+
4: μλ΄μ μμΌμ΄λΌ λ§μκ² κ΅¬μλ³΄κ³ μΆμλλ° μ΄μ²κ΅¬λμλ μν©μ΄ λ°μν κ²μ΄λ€ (μ μ¬λ: 0.4469)
|
115 |
+
|
116 |
+
5: μμΌμ΄λκΉ~ (μ μ¬λ: 0.4218)
|
117 |
+
|
118 |
+
6: μ΄μ λ μλ΄μ μμΌμ΄μλ€ (μ μ¬λ: 0.4192)
|
119 |
+
|
120 |
+
7: μμΉ¨ μΌμ° μλ΄κ° μ’μνλ μ€ν
μ΄ν¬λ₯Ό μ€λΉνκ³ κ·Έκ²μ λ§μκ² λ¨Ήλ μλ΄μ λͺ¨μ΅μ λ³΄κ³ μΆμλλ° μ ν μκ°μ§λ λͺ»ν μν©μ΄ λ°μν΄μ... νμ§λ§ μ μ μ μΆμ€λ₯΄κ³ λ°λ‘ λ€λ₯Έ λ©λ΄λ‘ λ³κ²½νλ€ (μ μ¬λ: 0.4156)
|
121 |
+
|
122 |
+
8: λ§μκ² λ¨Ήμ΄ μ€ μλ΄μκ²λ κ°μ¬νλ€ (μ μ¬λ: 0.3093)
|
123 |
+
|
124 |
+
9: μλ΄κ° μ’μνλμ§ λͺ¨λ₯΄κ² μ§λ§ λμ₯κ³ μμ μλ νλν¬μμΈμ§λ₯Ό 보λ λ°λ‘ μμΌλ₯Ό ν΄μΌκ² λ€λ μκ°μ΄ λ€μλ€. μμμ μ±κ³΅μ μΌλ‘ μμ±μ΄ λμλ€ (μ μ¬λ: 0.2259)
|
125 |
+
|
126 |
+
10: μλ΄λ κ·Έλ° μ€ν
μ΄ν¬λ₯Ό μ’μνλ€. κ·Έλ°λ° μμλ λͺ»ν μΌμ΄ λ²μ΄μ§κ³ λ§μλ€ (μ μ¬λ: 0.1967)
|
127 |
+
```
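The 유사도 values above are cosine similarities between the query embedding and each document embedding. What `util.pytorch_cos_sim` computes for a single pair can be sketched without torch:

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|); 1.0 means the vectors point the same way,
    # so higher scores mean more similar sentence embeddings.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```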

## Training
The model was trained with the parameters:

**DataLoader**:

...

Parameters of the fit()-Method:
```
{
    "epochs": 4,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'transformers.optimization.AdamW'>",
    ...
}
```

## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
|