KDHyun08 committed on
Commit 53f72d8 • 1 Parent(s): 752d6bc

Upload with huggingface_hub

Files changed (1)
  1. README.md +83 -38
README.md CHANGED
@@ -2,85 +2,130 @@
  pipeline_tag: sentence-similarity
  tags:
  - sentence-transformers
- - feature-extraction
  - sentence-similarity
  - transformers
  ---

- # {MODEL_NAME}

- This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.

- <!--- Describe your model here -->

  ## Usage (Sentence-Transformers)

- Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

  ```
  pip install -U sentence-transformers
  ```

- Then you can use the model like this:

  ```python
- from sentence_transformers import SentenceTransformer
  sentences = ["This is an example sentence", "Each sentence is converted"]

- model = SentenceTransformer('{MODEL_NAME}')
  embeddings = model.encode(sentences)
  print(embeddings)
  ```

-
- ## Usage (HuggingFace Transformers)
- Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.

  ```python
- from transformers import AutoTokenizer, AutoModel
- import torch

- # Mean Pooling - Take attention mask into account for correct averaging
- def mean_pooling(model_output, attention_mask):
-     token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
-     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
-     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

- # Sentences we want sentence embeddings for
- sentences = ['This is an example sentence', 'Each sentence is converted']

- # Load model from HuggingFace Hub
- tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
- model = AutoModel.from_pretrained('{MODEL_NAME}')

- # Tokenize sentences
- encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

- # Compute token embeddings
- with torch.no_grad():
-     model_output = model(**encoded_input)

- # Perform pooling. In this case, mean pooling.
- sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

- print("Sentence embeddings:")
- print(sentence_embeddings)
  ```


  ## Evaluation Results

- <!--- Describe how your model was evaluated -->

- For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})

- ## Training
- The model was trained with the parameters:

  **DataLoader**:

@@ -97,7 +142,7 @@ Parameters of the fit()-Method:
  ```
  {
      "epochs": 4,
-     "evaluation_steps": 4538,
      "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
      "max_grad_norm": 1,
      "optimizer_class": "<class 'transformers.optimization.AdamW'>",
@@ -115,7 +160,7 @@ Parameters of the fit()-Method:
  ## Full Model Architecture
  ```
  SentenceTransformer(
- (0): Transformer({'max_seq_length': 256, 'do_lower_case': True}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  )
  ```
 
  pipeline_tag: sentence-similarity
  tags:
  - sentence-transformers
  - sentence-similarity
  - transformers
+ - TAACO
+ language: ko
  ---

+ # TAACO_Similarity

+ This model is based on [Sentence-transformers](https://www.SBERT.net) and was trained on the KLUE STS (Semantic Textual Similarity) dataset.
+ It was built to measure semantic cohesion between sentences, one of the indices of K-TAACO (working title), a tool the author is developing for measuring cohesion between Korean sentences.
+ Additional training is also planned once further data, such as the inter-sentence similarity data from the Modu Corpus, has been collected.

+ ## Train Data
+ KLUE-sts-v1.1._train.json
+ NLI-sts-train.tsv
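
+ Below is a minimal, hypothetical sketch of how the KLUE STS file listed above could be turned into sentence-transformers training examples. The field names ("sentence1", "sentence2", labels["label"]) and the 0-5 score scale are assumptions about the data format, not something stated in this card.

+ ```python
+ import json
+ from sentence_transformers import InputExample
+
+ def load_klue_sts(path):
+     # Assumed KLUE STS fields: "sentence1", "sentence2", and a 0-5 score under labels["label"].
+     examples = []
+     with open(path, encoding="utf-8") as f:
+         for row in json.load(f):
+             score = row["labels"]["label"] / 5.0  # normalise to 0-1 for a cosine-similarity loss
+             examples.append(InputExample(texts=[row["sentence1"], row["sentence2"]], label=score))
+     return examples
+
+ train_examples = load_klue_sts("KLUE-sts-v1.1._train.json")
+ ```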

  ## Usage (Sentence-Transformers)

+ To use this model, you first need to install [Sentence-transformers](https://www.SBERT.net):

  ```
  pip install -U sentence-transformers
  ```

+ Then refer to the code below to use the model.

  ```python
+ from sentence_transformers import SentenceTransformer, models
  sentences = ["This is an example sentence", "Each sentence is converted"]

+ embedding_model = models.Transformer(
+     model_name_or_path="KDHyun08/TAACO_STS",
+     max_seq_length=256,
+     do_lower_case=True
+ )
+
+ pooling_model = models.Pooling(
+     embedding_model.get_word_embedding_dimension(),
+     pooling_mode_mean_tokens=True,
+     pooling_mode_cls_token=False,
+     pooling_mode_max_tokens=False,
+ )
+ model = SentenceTransformer(modules=[embedding_model, pooling_model])
+
  embeddings = model.encode(sentences)
  print(embeddings)
  ```
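
+ If the sentence-transformers configuration stored with the model on the Hub matches the settings above, the manual module assembly can likely be replaced by a one-line load. A minimal sketch, assuming the hosted configuration is complete:

+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Load the transformer and pooling configuration saved with the model on the Hub.
+ model = SentenceTransformer("KDHyun08/TAACO_STS")
+ embeddings = model.encode(["This is an example sentence", "Each sentence is converted"])
+ print(embeddings.shape)
+ ```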

+ ## Usage (comparing similarity between actual sentences)
+ After installing [Sentence-transformers](https://www.SBERT.net), you can compare the similarity between sentences as shown below.
+ The `query` variable holds the source sentence that the comparison is based on, and the sentences to compare against are given as a list in `docs`.

  ```python
+ import torch
+ from sentence_transformers import SentenceTransformer, models, util
+
+ embedding_model = models.Transformer(
+     model_name_or_path="KDHyun08/TAACO_STS",
+     max_seq_length=256,
+     do_lower_case=True
+ )
+
+ pooling_model = models.Pooling(
+     embedding_model.get_word_embedding_dimension(),
+     pooling_mode_mean_tokens=True,
+     pooling_mode_cls_token=False,
+     pooling_mode_max_tokens=False,
+ )
+ model = SentenceTransformer(modules=[embedding_model, pooling_model])
+
+ docs = [
+     "Yesterday was my wife's birthday",
+     "To make a birthday breakfast, I started preparing food at 8:30 a.m. The main menu was steak, stir-fried octopus, seaweed soup, japchae, soya, and so on",
+     "Steak is a dish I cook often, so I felt confident preparing it myself",
+     "If you flip it three times, one minute per side, and let it rest properly, you get a steak full of juices",
+     "My wife also likes steak done that way. But then something I never imagined happened",
+     "I usually buy unseasoned cuts for steak, but this time I bought a pre-seasoned flat iron cut",
+     "However, I did not notice the preservative packet inside the case and put it on the frying pan together with the meat",
+     "Still not realizing it... I seared the first side over high heat for a minute, and the moment I flipped it I saw that the preservative packet had been grilled along with it",
+     "It was my wife's birthday and I wanted to grill it nicely, but an absurd situation had happened",
+     "Perhaps because the preservative melted over the high heat, it ran down like water",
+     "I thought it over. I considered cutting away only the part the preservative had touched and grilling it again, but the packet said never to eat it, so although it felt like a waste I decided to throw it away",
+     "It was such a shame",
+     "I had wanted to prepare the steak my wife loves early in the morning and watch her enjoy it, but a situation I had never imagined came up... Still, I pulled myself together and switched to a different menu right away",
+     "Soya, stir-fried sausage and vegetables..",
+     "I was not sure whether my wife would like it, but seeing the frankfurter sausages in the fridge I decided to make soya right away. The food came out successfully",
+     "My wife's 40th birthday was prepared successfully",
+     "I was grateful to my wife for eating it so well",
+     "Every year on my wife's birthday, I should set out a birthday breakfast like this. I hope today is another happy day",
+     "Because it's her birthday~",
+ ]
+ # Encode each sentence into a vector
+ document_embeddings = model.encode(docs)
+
+ query = "To make a birthday breakfast, I started preparing food at 8:30 a.m."
+ query_embedding = model.encode(query)
+
+ top_k = min(10, len(docs))
+
+ # Compute cosine similarity between the query and every document
+ cos_scores = util.pytorch_cos_sim(query_embedding, document_embeddings)[0]
+
+ # Pick the sentences with the highest cosine similarity
+ top_results = torch.topk(cos_scores, k=top_k)
+
+ print(f"Input sentence: {query}")
+ print(f"\n<Top {top_k} sentences similar to the input sentence>\n")
+
+ for i, (score, idx) in enumerate(zip(top_results[0], top_results[1])):
+     print(f"{i+1}: {docs[idx]} (similarity: {score:.4f})\n")
  ```
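
+ The same top-k ranking can also be done with the library's built-in helper. A brief sketch, assuming the `docs`, `document_embeddings`, `query_embedding`, and `top_k` variables from the block above:

+ ```python
+ from sentence_transformers import util
+
+ # util.semantic_search ranks the corpus embeddings by cosine similarity to the query.
+ hits = util.semantic_search(query_embedding, document_embeddings, top_k=top_k)[0]
+ for rank, hit in enumerate(hits, start=1):
+     print(f"{rank}: {docs[hit['corpus_id']]} (similarity: {hit['score']:.4f})")
+ ```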

  ## Evaluation Results

+ Running the usage example above produces the output below. The closer the score is to 1, the more similar the sentence.
+
+ ```
+ Input sentence: To make a birthday breakfast, I started preparing food at 8:30 a.m.
+
+ <Top 10 sentences similar to the input sentence>
+
+ 1: To make a birthday breakfast, I started preparing food at 8:30 a.m. The main menu was steak, stir-fried octopus, seaweed soup, japchae, soya, and so on (similarity: 0.6687)
+
+ 2: Every year on my wife's birthday, I should set out a birthday breakfast like this. I hope today is another happy day (similarity: 0.6468)
+
+ 3: My wife's 40th birthday was prepared successfully (similarity: 0.4647)
+
+ 4: It was my wife's birthday and I wanted to grill it nicely, but an absurd situation had happened (similarity: 0.4469)
+
+ 5: Because it's her birthday~ (similarity: 0.4218)
+
+ 6: Yesterday was my wife's birthday (similarity: 0.4192)
+
+ 7: I had wanted to prepare the steak my wife loves early in the morning and watch her enjoy it, but a situation I had never imagined came up... Still, I pulled myself together and switched to a different menu right away (similarity: 0.4156)
+
+ 8: I was grateful to my wife for eating it so well (similarity: 0.3093)
+
+ 9: I was not sure whether my wife would like it, but seeing the frankfurter sausages in the fridge I decided to make soya right away. The food came out successfully (similarity: 0.2259)
+
+ 10: My wife also likes steak done that way. But then something I never imagined happened (similarity: 0.1967)
+ ```

  **DataLoader**:

  ```
  {
      "epochs": 4,
+     "evaluation_steps": 1000,
      "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
      "max_grad_norm": 1,
      "optimizer_class": "<class 'transformers.optimization.AdamW'>",

  ## Full Model Architecture
  ```
  SentenceTransformer(
+ (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  )
  ```