KDHyun08 committed on
Commit dc35e72 • 1 Parent(s): 199153e

Upload with huggingface_hub

Files changed (1)
  1. README.md +81 -39
README.md CHANGED
@@ -2,89 +2,131 @@
  pipeline_tag: sentence-similarity
  tags:
  - sentence-transformers
- - feature-extraction
  - sentence-similarity
  - transformers
  ---

- # {MODEL_NAME}

- This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
-
- <!--- Describe your model here -->

  ## Usage (Sentence-Transformers)

- Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

  ```
  pip install -U sentence-transformers
  ```

- Then you can use the model like this:

  ```python
- from sentence_transformers import SentenceTransformer
  sentences = ["This is an example sentence", "Each sentence is converted"]

- model = SentenceTransformer('{MODEL_NAME}')
  embeddings = model.encode(sentences)
  print(embeddings)
  ```


-
- ## Usage (HuggingFace Transformers)
- Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.

  ```python
- from transformers import AutoTokenizer, AutoModel
- import torch

- #Mean Pooling - Take attention mask into account for correct averaging
- def mean_pooling(model_output, attention_mask):
-     token_embeddings = model_output[0] #First element of model_output contains all token embeddings
-     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
-     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
-
- # Sentences we want sentence embeddings for
- sentences = ['This is an example sentence', 'Each sentence is converted']

- # Load model from HuggingFace Hub
- tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
- model = AutoModel.from_pretrained('{MODEL_NAME}')

- # Tokenize sentences
- encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

- # Compute token embeddings
- with torch.no_grad():
-     model_output = model(**encoded_input)

- # Perform pooling. In this case, mean pooling.
- sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

- print("Sentence embeddings:")
- print(sentence_embeddings)
  ```


  ## Evaluation Results

- <!--- Describe how your model was evaluated -->

- For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})

  ## Training
  The model was trained with the parameters:

  **DataLoader**:

- `torch.utils.data.dataloader.DataLoader` of length 25255 with parameters:
  ```
  {'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
  ```
@@ -97,7 +139,7 @@ Parameters of the fit()-Method:
  ```
  {
      "epochs": 4,
-     "evaluation_steps": 808146,
      "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
      "max_grad_norm": 1,
      "optimizer_class": "<class 'transformers.optimization.AdamW'>",
@@ -115,7 +157,7 @@ Parameters of the fit()-Method:
  ## Full Model Architecture
  ```
  SentenceTransformer(
-   (0): Transformer({'max_seq_length': 256, 'do_lower_case': True}) with Transformer model: BertModel
    (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  )
  ```
 
  pipeline_tag: sentence-similarity
  tags:
  - sentence-transformers
  - sentence-similarity
  - transformers
+ - TAACO
+ language: ko
  ---

+ # TAACO_Similarity

+ This model is based on [Sentence-transformers](https://www.SBERT.net) and was trained on the STS (Sentence Textual Similarity) dataset from KLUE.
+ It was built to measure the semantic cohesion between sentences, one of the indices of K-TAACO (working title), a tool for measuring cohesion between Korean sentences that the author is developing.
+ Additional training is planned on further data, such as the sentence-similarity data of the Modu Corpus.

  ## Usage (Sentence-Transformers)

+ To use this model, you need to have [Sentence-transformers](https://www.SBERT.net) installed.

  ```
  pip install -U sentence-transformers
  ```

+ To use the model, refer to the code below.

  ```python
+ from sentence_transformers import SentenceTransformer, models
  sentences = ["This is an example sentence", "Each sentence is converted"]

+ embedding_model = models.Transformer(
+     model_name_or_path="KDHyun08/TAACO_STS",
+     max_seq_length=256,
+     do_lower_case=True
+ )
+
+ pooling_model = models.Pooling(
+     embedding_model.get_word_embedding_dimension(),
+     pooling_mode_mean_tokens=True,
+     pooling_mode_cls_token=False,
+     pooling_mode_max_tokens=False,
+ )
+ model = SentenceTransformer(modules=[embedding_model, pooling_model])
+
  embeddings = model.encode(sentences)
  print(embeddings)
  ```
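
The snippet above rebuilds the encoder and pooling modules by hand. The card that this commit replaces loaded the checkpoint in a single call, and the same shortcut should still apply here; the following is only a minimal sketch, assuming the uploaded repository includes the full sentence-transformers configuration rather than just the transformer weights.

```python
from sentence_transformers import SentenceTransformer

# Read the stored module configuration (Transformer + mean Pooling) directly from the Hub.
model = SentenceTransformer("KDHyun08/TAACO_STS")

embeddings = model.encode(["This is an example sentence", "Each sentence is converted"])
print(embeddings.shape)  # expected: (2, 768), matching the pooling dimension in the architecture section
```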


+ ## Usage (comparing the similarity of actual sentences)
+ After installing [Sentence-transformers](https://www.SBERT.net), you can compare the similarity between sentences as shown below.
+ The query variable is the source sentence that the comparison is based on; the sentences to compare against it go into docs as a list.

  ```python
+ from sentence_transformers import SentenceTransformer, models, util
+ import torch
+
+ embedding_model = models.Transformer(
+     model_name_or_path="KDHyun08/TAACO_STS",
+     max_seq_length=256,
+     do_lower_case=True
+ )
+
+ pooling_model = models.Pooling(
+     embedding_model.get_word_embedding_dimension(),
+     pooling_mode_mean_tokens=True,
+     pooling_mode_cls_token=False,
+     pooling_mode_max_tokens=False,
+ )
+ model = SentenceTransformer(modules=[embedding_model, pooling_model])
+ docs = ["Yesterday was my wife's birthday", "To prepare breakfast for her birthday, I started cooking at 8:30 in the morning. The main menu was steak, stir-fried octopus, seaweed soup, japchae and soya", "Steak is a dish I cook often, so I wanted to prepare it myself", "If you flip it three times for one minute per side and rest it well, you get a steak full of juice", "My wife also loves steak like that. But then something I never imagined happened", "I usually buy unseasoned meat for steak, but this time I bought pre-seasoned flat iron steak", "However, I did not notice the preservative packet inside the case and put it on the frying pan together with the meat", "Without even realizing it... I seared the front side over high heat for one minute, and only when I flipped it did I see that the preservative had been grilled along with it", "I had wanted to grill it nicely for my wife's birthday, but an absurd situation had occurred", "Perhaps because the preservative melted over the high heat, it ran down like water", "I agonized over it. I thought about removing only the part the preservative had touched and grilling it again, but the packet said never to eat it, so reluctantly I threw it away", "It was such a shame", "I had wanted to prepare the steak my wife loves early in the morning and watch her enjoy it, but a completely unexpected situation came up... Still, I pulled myself together and switched to a different menu right away", "Soya, stir-fried sausage and vegetables..", "I was not sure whether my wife would like it, but seeing the frankfurter sausages in the fridge, I decided to make soya right away. The dish was completed successfully", "Breakfast for my wife's 40th birthday was prepared successfully", "I was also grateful to my wife for eating it with relish", "Every year on my wife's birthday, I should prepare a birthday breakfast in the morning. I hope today is another happy day", "Because it's her birthday~"]
+ # Encode each sentence into a vector
+ document_embeddings = model.encode(docs)
+
+ query = 'To prepare breakfast for her birthday, I started cooking at 8:30 in the morning'
+ query_embedding = model.encode(query)
+
+ top_k = min(10, len(docs))
+
+ # Compute the cosine similarity scores
+ cos_scores = util.pytorch_cos_sim(query_embedding, document_embeddings)[0]
+
+ # Pick out the sentences with the highest cosine similarity
+ top_results = torch.topk(cos_scores, k=top_k)
+
+ print(f"Input sentence: {query}")
+ print(f"\n<Top {top_k} sentences most similar to the input sentence>\n")
+
+ for i, (score, idx) in enumerate(zip(top_results[0], top_results[1])):
+     print(f"{i+1}: {docs[idx]} {'(Similarity: {:.4f})'.format(score)}\n")
  ```


  ## Evaluation Results

+ Running the Usage code above produces the output below. The closer the similarity is to 1, the more similar the sentences are.
+
+ ```
+ Input sentence: To prepare breakfast for her birthday, I started cooking at 8:30 in the morning
+
+ <Top 10 sentences most similar to the input sentence>
+
+ 1: To prepare breakfast for her birthday, I started cooking at 8:30 in the morning. The main menu was steak, stir-fried octopus, seaweed soup, japchae and soya (Similarity: 0.6687)
+
+ 2: Every year on my wife's birthday, I should prepare a birthday breakfast in the morning. I hope today is another happy day (Similarity: 0.6468)
+
+ 3: Breakfast for my wife's 40th birthday was prepared successfully (Similarity: 0.4647)
+
+ 4: I had wanted to grill it nicely for my wife's birthday, but an absurd situation had occurred (Similarity: 0.4469)
+
+ 5: Because it's her birthday~ (Similarity: 0.4218)
+
+ 6: Yesterday was my wife's birthday (Similarity: 0.4192)
+
+ 7: I had wanted to prepare the steak my wife loves early in the morning and watch her enjoy it, but a completely unexpected situation came up... Still, I pulled myself together and switched to a different menu right away (Similarity: 0.4156)
+
+ 8: I was also grateful to my wife for eating it with relish (Similarity: 0.3093)
+
+ 9: I was not sure whether my wife would like it, but seeing the frankfurter sausages in the fridge, I decided to make soya right away. The dish was completed successfully (Similarity: 0.2259)
+
+ 10: My wife also loves steak like that. But then something I never imagined happened (Similarity: 0.1967)
+ ```

  ## Training
  The model was trained with the parameters:

  **DataLoader**:

+ `torch.utils.data.dataloader.DataLoader` of length 142 with parameters:
  ```
  {'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
  ```
 
  ```
  {
      "epochs": 4,
+     "evaluation_steps": 1000,
      "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
      "max_grad_norm": 1,
      "optimizer_class": "<class 'transformers.optimization.AdamW'>",
 
  ## Full Model Architecture
  ```
  SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
    (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  )
  ```
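
The stored configuration above (max_seq_length 512, no lower-casing, mean pooling) differs from the values hard-coded in the usage snippets (256, lower-casing), so it may be worth inspecting what the Hub repository actually ships. A minimal sketch, assuming the repository contains the full sentence-transformers module configuration:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("KDHyun08/TAACO_STS")

print(model)                                     # module list; expected to match the architecture above
print(model.max_seq_length)                      # expected: 512, per the stored Transformer config
print(model.get_sentence_embedding_dimension())  # expected: 768, per the Pooling config
```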