---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
language: ko
---

# TAACO_Similarity

This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

<!--- Describe your model here -->

## Usage (Sentence-Transformers)

Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('KDHyun08/TAACO_STS')
embeddings = model.encode(sentences)
print(embeddings)
```
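
As noted above, the embeddings can also be used for clustering. A minimal sketch, assuming scikit-learn is installed; the sentences and cluster count are purely illustrative:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('KDHyun08/TAACO_STS')

# Illustrative sentences only; replace with your own corpus
sentences = [
    "This is an example sentence",
    "Each sentence is converted",
    "A completely unrelated statement",
    "Another sentence on a different topic",
]
embeddings = model.encode(sentences)

# Group the 768-dimensional embeddings into two clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)

for sentence, label in zip(sentences, labels):
    print(label, sentence)
```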



## Usage (Comparing similarity between actual sentences)
After installing [sentence-transformers](https://www.SBERT.net), you can compare the similarity between sentences as shown below.
The `query` variable holds the source sentence that the comparison is based on, and the sentences to compare against should be provided as a list in `docs`.

```python
from sentence_transformers import SentenceTransformer, util
import torch

model = SentenceTransformer("KDHyun08/TAACO_STS")

docs = ['μ–΄μ œλŠ” μ•„λ‚΄μ˜ μƒμΌμ΄μ—ˆλ‹€', '생일을 λ§žμ΄ν•˜μ—¬ 아침을 μ€€λΉ„ν•˜κ² λ‹€κ³  μ˜€μ „ 8μ‹œ 30λΆ„λΆ€ν„° μŒμ‹μ„ μ€€λΉ„ν•˜μ˜€λ‹€. 주된 λ©”λ‰΄λŠ” μŠ€ν…Œμ΄ν¬μ™€ λ‚™μ§€λ³ΆμŒ, λ―Έμ—­κ΅­, μž‘μ±„, μ†Œμ•Ό λ“±μ΄μ—ˆλ‹€', 'μŠ€ν…Œμ΄ν¬λŠ” 자주 ν•˜λŠ” μŒμ‹μ΄μ–΄μ„œ μžμ‹ μ΄ μ€€λΉ„ν•˜λ €κ³  ν–ˆλ‹€', 'μ•žλ’€λ„ 1λΆ„μ”© 3번 뒀집고 λž˜μŠ€νŒ…μ„ 잘 ν•˜λ©΄ μœ‘μ¦™μ΄ κ°€λ“ν•œ μŠ€ν…Œμ΄ν¬κ°€ μ€€λΉ„λ˜λ‹€', '아내도 그런 μŠ€ν…Œμ΄ν¬λ₯Ό μ’‹μ•„ν•œλ‹€. 그런데 상상도 λͺ»ν•œ 일이 λ²Œμ΄μ§€κ³  λ§μ•˜λ‹€', '보톡 μ‹œμ¦ˆλ‹μ΄ λ˜μ§€ μ•Šμ€ μ›μœ‘μ„ μ‚¬μ„œ μŠ€ν…Œμ΄ν¬λ₯Ό ν–ˆλŠ”λ°, μ΄λ²ˆμ—λŠ” μ‹œμ¦ˆλ‹μ΄ 된 뢀챗살을 κ΅¬μž…ν•΄μ„œ ν–ˆλ‹€', '그런데 μΌ€μ΄μŠ€ μ•ˆμ— λ°©λΆ€μ œκ°€ λ“€μ–΄μžˆλŠ” 것을 μΈμ§€ν•˜μ§€ λͺ»ν•˜κ³  λ°©λΆ€μ œμ™€ λ™μ‹œμ— ν”„λΌμ΄νŒ¬μ— μ˜¬λ €λ†“μ„ 것이닀', '그것도 인지 λͺ»ν•œ 체... μ•žλ©΄μ„ μ„Ό λΆˆμ— 1뢄을 κ΅½κ³  λ’€μ§‘λŠ” μˆœκ°„ λ°©λΆ€μ œκ°€ ν•¨κ»˜ ꡬ어진 것을 μ•Œμ•˜λ‹€', 'μ•„λ‚΄μ˜ 생일이라 λ§›μžˆκ²Œ κ΅¬μ›Œλ³΄κ³  μ‹Άμ—ˆλŠ”λ° μ–΄μ²˜κ΅¬λ‹ˆμ—†λŠ” 상황이 λ°œμƒν•œ 것이닀', 'λ°©λΆ€μ œκ°€ μ„Ό λΆˆμ— λ…Ήμ•„μ„œ κ·ΈλŸ°μ§€ 물처럼 ν˜λŸ¬λ‚΄λ Έλ‹€', ' 고민을 ν–ˆλ‹€. λ°©λΆ€μ œκ°€ 묻은 λΆ€λ¬Έλ§Œ μ œκ±°ν•˜κ³  λ‹€μ‹œ ꡬ울까 ν–ˆλŠ”λ° λ°©λΆ€μ œμ— μ ˆλŒ€ 먹지 λ§λΌλŠ” 문ꡬ가 μžˆμ–΄μ„œ μ•„κΉμ§€λ§Œ λ²„λ¦¬λŠ” λ°©ν–₯을 ν–ˆλ‹€', 'λ„ˆλ¬΄λ‚˜ μ•ˆνƒ€κΉŒμ› λ‹€', 'μ•„μΉ¨ 일찍 μ•„λ‚΄κ°€ μ’‹μ•„ν•˜λŠ” μŠ€ν…Œμ΄ν¬λ₯Ό μ€€λΉ„ν•˜κ³  그것을 λ§›μžˆκ²Œ λ¨ΉλŠ” μ•„λ‚΄μ˜ λͺ¨μŠ΅μ„ 보고 μ‹Άμ—ˆλŠ”λ° μ „ν˜€ 생각지도 λͺ»ν•œ 상황이 λ°œμƒν•΄μ„œ... ν•˜μ§€λ§Œ 정신을 μΆ”μŠ€λ₯΄κ³  λ°”λ‘œ λ‹€λ₯Έ λ©”λ‰΄λ‘œ λ³€κ²½ν–ˆλ‹€', 'μ†Œμ•Ό, μ†Œμ‹œμ§€ μ•Όμ±„λ³ΆμŒ..', 'μ•„λ‚΄κ°€ μ’‹μ•„ν•˜λŠ”μ§€ λͺ¨λ₯΄κ² μ§€λ§Œ 냉μž₯κ³  μ•ˆμ— μžˆλŠ” ν›„λž‘ν¬μ†Œμ„Έμ§€λ₯Ό λ³΄λ‹ˆ λ°”λ‘œ μ†Œμ•Όλ₯Ό ν•΄μ•Όκ² λ‹€λŠ” 생각이 λ“€μ—ˆλ‹€. μŒμ‹μ€ μ„±κ³΅μ μœΌλ‘œ 완성이 λ˜μ—ˆλ‹€', '40번째λ₯Ό λ§žμ΄ν•˜λŠ” μ•„λ‚΄μ˜ 생일은 μ„±κ³΅μ μœΌλ‘œ μ€€λΉ„κ°€ λ˜μ—ˆλ‹€', 'λ§›μžˆκ²Œ λ¨Ήμ–΄ μ€€ μ•„λ‚΄μ—κ²Œλ„ κ°μ‚¬ν–ˆλ‹€', '맀년 μ•„λ‚΄μ˜ 생일에 λ§žμ΄ν•˜λ©΄ μ•„μΉ¨λ§ˆλ‹€ 생일을 μ°¨λ €μ•Όκ² λ‹€. μ˜€λŠ˜λ„ 즐거운 ν•˜λ£¨κ°€ λ˜μ—ˆμœΌλ©΄ μ’‹κ² λ‹€', 'μƒμΌμ΄λ‹ˆκΉŒ~']
# Encode each sentence in docs into an embedding vector
document_embeddings = model.encode(docs)

query = '생일을 λ§žμ΄ν•˜μ—¬ 아침을 μ€€λΉ„ν•˜κ² λ‹€κ³  μ˜€μ „ 8μ‹œ 30λΆ„λΆ€ν„° μŒμ‹μ„ μ€€λΉ„ν•˜μ˜€λ‹€'
query_embedding = model.encode(query)

top_k = min(10, len(docs))

# Compute cosine similarity between the query and every document
cos_scores = util.pytorch_cos_sim(query_embedding, document_embeddings)[0]

# Extract the sentences in order of cosine similarity
top_results = torch.topk(cos_scores, k=top_k)

print(f"μž…λ ₯ λ¬Έμž₯: {query}")
print(f"\n<μž…λ ₯ λ¬Έμž₯κ³Ό μœ μ‚¬ν•œ {top_k} 개의 λ¬Έμž₯>\n")

for i, (score, idx) in enumerate(zip(top_results[0], top_results[1])):
    print(f"{i+1}: {docs[idx]} {'(μœ μ‚¬λ„: {:.4f})'.format(score)}\n")
```
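
The same top-k retrieval can also be written with `util.semantic_search`, which performs the cosine-similarity ranking internally. A short sketch reusing `query_embedding`, `document_embeddings`, and `docs` from the block above:

```python
from sentence_transformers import util

# Each hit is a dict with the corpus index ('corpus_id') and cosine score ('score')
hits = util.semantic_search(query_embedding, document_embeddings, top_k=top_k)[0]

for rank, hit in enumerate(hits, start=1):
    print(f"{rank}: {docs[hit['corpus_id']]} (similarity: {hit['score']:.4f})")
```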



## Evaluation Results

<!--- Describe how your model was evaluated -->

For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=KDHyun08/TAACO_STS)
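
For a quick local check, the evaluator used during training (`EmbeddingSimilarityEvaluator`, see the training parameters below) can be run on any labelled sentence-pair data. A minimal sketch with made-up pairs and gold scores in [0, 1], purely for illustration; depending on your sentence-transformers version, the call returns either a single correlation score or a dict of metrics:

```python
from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("KDHyun08/TAACO_STS")

# Hypothetical labelled pairs; replace with a real STS dev set
dev_examples = [
    InputExample(texts=["생일 아침에 μŒμ‹μ„ μ€€λΉ„ν–ˆλ‹€", "아침에 생일상을 μ°¨λ Έλ‹€"], label=0.9),
    InputExample(texts=["생일 아침에 μŒμ‹μ„ μ€€λΉ„ν–ˆλ‹€", "였늘 λ‚ μ”¨κ°€ 맑닀"], label=0.1),
]

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_examples, name="sts-dev")
print(evaluator(model))
```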


## Training
The model was trained with the parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 142 with parameters:
```
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss` 

Parameters of the fit()-Method:
```
{
    "epochs": 4,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'transformers.optimization.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 10000,
    "weight_decay": 0.01
}
```
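
Put together, the configuration above corresponds roughly to the following training call. This is a sketch, not the original training script; `train_examples` stands in for the actual (unpublished) training pairs, and the starting checkpoint shown here is only a placeholder:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("KDHyun08/TAACO_STS")  # or the base checkpoint you want to fine-tune

# Hypothetical sentence pairs with gold similarity labels in [0, 1]
train_examples = [
    InputExample(texts=["λ¬Έμž₯ A", "λ¬Έμž₯ B"], label=0.8),
    InputExample(texts=["λ¬Έμž₯ C", "λ¬Έμž₯ D"], label=0.2),
]

# batch_size=32 with shuffle=True matches the RandomSampler/BatchSampler setup listed above
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=4,
    warmup_steps=10000,
    optimizer_params={"lr": 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
    evaluation_steps=1000,
)
```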


## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
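
Because the model is a BertModel followed by mean pooling (see above), the same embeddings can also be computed with plain Hugging Face `transformers`. A minimal sketch, assuming the tokenizer and weights load directly from the repository id:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("KDHyun08/TAACO_STS")
bert = AutoModel.from_pretrained("KDHyun08/TAACO_STS")

sentences = ["This is an example sentence", "Each sentence is converted"]
encoded = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    token_embeddings = bert(**encoded).last_hidden_state  # (batch, seq_len, 768)

# Mean pooling over non-padding tokens, mirroring the Pooling module above
mask = encoded["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(sentence_embeddings.shape)  # torch.Size([2, 768])
```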

## Citing & Authors

<!--- Describe where people can find more information -->