---
license: mit
language:
- ko
- en
base_model:
- openchat/openchat_3.5
pipeline_tag: text-generation
metrics:
- accuracy
library_name: adapter-transformers
tags:
- finance
- biology
- legal
- art
- text-generation-inference
---

### β›± ktdsbaseLM v0.11 was developed on openchat3.5 as its Foundation model so that it can be
### applied to the Korean language and to Korea's diverse culture. It was built with self-produced
### Korean data covering 53 domains, and it is a model that understands Korean social values and culture. ✌

                

# ❢ Model Description
- Model name and key features:
  KTDSbaseLM v0.11 is based on the OpenChat 3.5 model (itself built on Mistral 7B) and was fine-tuned with supervised fine-tuning (SFT).
  It is designed to understand Korean and the many cultural contexts of Korea, and it reflects the values and culture of Korean society
  by drawing on self-produced Korean data covering 135 domains.
  Its key capabilities include text generation, conversational inference, document summarization, question answering, sentiment analysis,
  and a range of other NLP tasks, with applications in law, finance, science, education, business, cultural research, and many other fields.
- Model architecture: KTDSBaseLM v0.11 is a high-performance language model with 7 billion parameters (7B), built on the Mistral 7B architecture.
  It uses OpenChat 3.5 as its foundation model and was trained with SFT (supervised fine-tuning) to specialize in the Korean language and Korean culture.
  Mistral 7B's lightweight design provides fast inference and memory efficiency, and the model is optimized for a wide range of natural language processing tasks.
  This architecture performs strongly on tasks such as text generation, question answering, document summarization, and sentiment analysis.

# ❷ Training Data
- ktdsbaseLM v0.11 was trained on 3.6GB of self-produced data comprising 2.33 million examples in total, covering QnA, summarization, and classification tasks.
  Of these, 1.33 million are multiple-choice questions spanning 53 domains, including Korean history, social studies, finance, law, tax, mathematics, biology, physics, and chemistry,
  and were trained with a Chain of Thought approach. A further 1.3 million short-answer questions cover 38 domains such as Korean history, finance, law, tax, and mathematics.
  The training data also includes examples that teach the model Korean social values, human emotions, and how to follow the instructions it is given.
- Training Instruction Datasets Format (a loading sketch follows below):
  <pre><code>{"prompt": "prompt text", "completion": "ideal generated text"}</code></pre>
    
# ❸ Use Cases
  ktdsbaseLM v0.11 can be applied in a wide range of fields. For example:
- Education: question answering and explanation generation for study material in history, mathematics, science, and other subjects.
- Business: answers to legal, financial, and tax-related questions, along with document summarization.
- Research and culture: NLP tasks tailored to Korean society and culture, sentiment analysis, document generation, and translation.
- Customer service: conversation generation and personalized responses for users.

This model offers high utility across a wide variety of natural language processing tasks.

# ❹ Limitations β›ˆβ›ˆ
- ktdsBaseLM v0.11 is specialized for the Korean language and Korean culture.
  However, because of limited data in certain areas (for example, the latest international material or highly specialized fields),
  the accuracy of its responses about other languages or cultures may drop.
  It may also show limited reasoning ability on problems that require complex logical thinking,
  and if biased data is included in training, there is a possibility that biased responses will be generated.

# ❺ How to Use
  <pre><code>
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

  device = 'auto' #@param {type: "string"}
  model_name = '' #@param {type: "string"}  # local path or Hub ID of the ktdsbaseLM v0.11 checkpoint

  # Optional 4-bit quantization; drop quantization_config below to load in full precision
  bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_compute_dtype=torch.float16,
  )

  # Load the fine-tuned model and its tokenizer
  model = AutoModelForCausalLM.from_pretrained(
      model_name,
      quantization_config=bnb_config,
      device_map=device)

  tokenizer = AutoTokenizer.from_pretrained(model_name)

  # Run a simple generation request
  input_text = "μ•ˆλ…•ν•˜μ„Έμš”."  # "Hello."
  inputs = tokenizer(input_text, return_tensors="pt")
  inputs = inputs.to("cuda:0")

  with torch.no_grad():
      outputs = model.generate(**inputs, max_length=1024)

  result = tokenizer.decode(outputs[0], skip_special_tokens=True)
  print(result)
</code></pre>
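OpenChat 3.5 checkpoints normally ship a chat template; if the fine-tuned tokenizer inherits it, a conversational prompt can be built with `apply_chat_template` as in the sketch below. This is a hedged example, not an official recipe: it reuses the `model` and `tokenizer` loaded above, and the question is arbitrary.

<pre><code>
  # Optional: conversational prompting via the tokenizer's chat template,
  # assuming the fine-tuned tokenizer keeps OpenChat 3.5's template
  messages = [{"role": "user", "content": "ν•œκ΅­μ˜ μ „ν†΅ λͺ…μ ˆμ— λŒ€ν•΄ μ„€λͺ…ν•΄ μ£Όμ„Έμš”."}]  # "Please explain Korea's traditional holidays."
  prompt_ids = tokenizer.apply_chat_template(
      messages, add_generation_prompt=True, return_tensors="pt"
  ).to(model.device)

  with torch.no_grad():
      outputs = model.generate(prompt_ids, max_new_tokens=512)

  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
</code></pre>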

## βœ… In addition to openchat-based models, ktds plans to release LLMs fine-tuned on Korean culture and knowledge across many domains, built on representative LLMs such as LLaMA, Polyglot, and EEVE. These models will be tailored to better understand and generate content specific to Korean contexts.