KOJUNSEO committed
Commit 7858a36 • 1 Parent(s): e141cff

JEJUMA-001-README

  1. README.md +101 -3
README.md CHANGED
@@ -1,3 +1,101 @@
- ---
- license: mit
- ---
+ # JEJUMA-001
+ Preserving our disappearing dialects with LLMs, project 1: the Jeju Island dialect
+
+ ## Why did we start this?
+ ### A rapidly disappearing regional dialect: Jeju Island
+ * Many regional dialects, and the Jeju Island dialect in particular, are disappearing quickly.
+ * UNESCO has classified the Jeju language (Jeju dialect) as a **critically endangered language**.
+ * Only **36.1% of Jeju residents know the Jeju language**.
+ * In particular, as exchange with other regions has increased, younger people tend to prefer standard Korean over the Jeju language.
+
+ ### Language models are weak on regional dialects
+ * Online sources are written largely in standard Korean, so models know little about regional dialects, for which data is scarce.
+ * Because the Jeju language differs so much from standard Korean, models cannot understand it beyond a few well-known words and sentences.
+
+ ## How did we address this?
+ * A language model converts hard-to-understand Jeju language into standard Korean, so that the Jeju language is not forgotten.
+ * A language model generates the Jeju-language version of standard Korean text, so that it can be checked.
+ * We chose a language model so that the broad knowledge it has already learned carries over as is.
+
+ ## About the model
+ * Llama3.1 is fine-tuned on Jeju dialect data so that it can perform several tasks related to the Jeju dialect.
+ * `JEJUMA-001` can currently convert between dialect and standard Korean and detect dialect.
+ * To train `JEJUMA-001`, we collected about 1.05 million Jeju dialect-Seoul speech sentence pairs and selected the 170,000 pairs in which the Jeju language is most clearly evident.
+ * From these, we generated training data covering 4 tasks, about 340,000 examples in total.
+ * The model was trained with LoRA using LlamaFactory, for 1 epoch over all the data.
+ * On difficult Jeju speech, it shows higher translation accuracy than gpt4o and the Korean models Upstage Solar and Naver HCX.
+
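The README does not say how the selected sentence pairs were turned into the four-task training set. The following is a minimal sketch under stated assumptions: the prompt prefixes are copied from the `JejuPromptTemplate` class in the usage section, while `candidates_for_pair`, `build_dataset`, the record layout, the detection target wording, and the choice of two sampled tasks per pair (picked only because about 340,000 examples came from 170,000 pairs) are all hypothetical.

```python
import random

# Prompt prefixes copied from JejuPromptTemplate in the usage section.
PROMPTS = {
    "dialect_to_standard": "Convert the following sentence or word which is Jeju island dialect to standard Korean: ",
    "standard_to_dialect": "Convert the following sentence or word which is standard Korean to Jeju island dialect: ",
    "detect_dialect": "Detect the following sentence or word is Jeju island dialect or standard Korean: ",
    "detect_dialect_and_convert": "Detect the following sentence or word is Jeju island dialect or standard Korean and convert the following sentence or word to Jeju island dialect or standard Korean: ",
}


def candidates_for_pair(dialect, standard):
    """Return one candidate example per task for a (dialect, standard) pair."""
    return [
        {"task": "dialect_to_standard",
         "instruction": PROMPTS["dialect_to_standard"] + dialect,
         "output": standard},
        {"task": "standard_to_dialect",
         "instruction": PROMPTS["standard_to_dialect"] + standard,
         "output": dialect},
        # Assumption: the target wording for the detection tasks is invented here.
        {"task": "detect_dialect",
         "instruction": PROMPTS["detect_dialect"] + dialect,
         "output": "Jeju island dialect"},
        {"task": "detect_dialect_and_convert",
         "instruction": PROMPTS["detect_dialect_and_convert"] + standard,
         "output": "standard Korean -> " + dialect},
    ]


def build_dataset(pairs, per_pair=2, seed=0):
    """Sample `per_pair` task examples from every usable sentence pair."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    dataset = []
    for dialect, standard in pairs:
        if dialect == standard:
            # Identical sides carry no dialect signal, so skip the pair.
            continue
        dataset.extend(rng.sample(candidates_for_pair(dialect, standard), per_pair))
    return dataset


pairs = [
    ("미깡낭 경 가심 너네 아방 좀 데령", "귤나무에 그냥 가서 너네 아버지좀 찾아와라."),
    ("자이 폴에 독솔 막 난 거 보난 언 생이우다", "재 팔에 닭살이 막 난 거 보니, 추운 모양이다."),
]
dataset = build_dataset(pairs)
print(len(dataset), "examples from", len(pairs), "pairs")
```

Records shaped like this would still need to be mapped onto whatever dataset format the training tool expects.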
+ ### Jeju language -> Standard Korean
+
+ | **Input sentence** | 자이 폴에 독솔 막 난 거 보난 언 생이우다 |
+ |-------------------------------|------------------------------------------------------------------------------------|
+ | **Reference** | **재 팔에 닭살이 막 난 거 보니, 추운 모양이다.** (Seeing the goosebumps all over that kid's arm, they must be cold.) |
+ | Upstage Solar output | 그 바위에 뱀이 나타나는 걸 보니까 정말 놀랍다. (Seeing a snake appear on that rock is really surprising.) |
+ | Naver HCX output | 재의 풀에 독초가 마구 난 것을 보니 어린 소나무입니다. (Roughly: seeing poisonous weeds growing all over the grass, it is a young pine tree.) |
+ | GPT-4o output | 저기 바위에 독사가 막 나타난 걸 보니까 정말 놀랍다. (Seeing a venomous snake suddenly appear on that rock over there is really surprising.) |
+ | **JEJUMA-001 output** | |
+
+ ### Standard Korean -> Jeju language
+
+ | **Input sentence** | 귤나무에 그냥 가서 너네 아버지좀 찾아와라. (Just go to the tangerine tree and bring your father back.) |
+ |-------------------------------|------------------------------------------------------------------------------------|
+ | **Reference** | 미깡낭 경 가심 너네 아방 좀 데령 |
+ | Upstage Solar output | 귤 나무에 가서 네 아버지를 좀 찾아와. |
+ | Naver HCX output | 귤낭에 강 느네 아방 좀 데령오라. |
+ | GPT-4o output | 귤나무에 걍 가서 햄신 아방 좀 찾아와라. |
+ | **JEJUMA-001 output** | |
+
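The README reports higher translation accuracy but does not name the metric behind that claim. One common way to score short sentence pairs is character error rate (edit distance divided by reference length), so the sketch below shows how the outputs in the Jeju-to-standard table could be compared against the reference. This is an assumed scoring method, not the authors' evaluation code, and `levenshtein` and `cer` are illustrative names.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute each cost 1)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def cer(hypothesis, reference):
    """Character error rate of `hypothesis` against a non-empty `reference`."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(hypothesis, reference) / len(reference)


# Sentences taken from the Jeju language -> Standard Korean table.
reference = "재 팔에 닭살이 막 난 거 보니, 추운 모양이다."
candidates = {
    "Upstage Solar": "그 바위에 뱀이 나타나는 걸 보니까 정말 놀랍다.",
    "Naver HCX": "재의 풀에 독초가 마구 난 것을 보니 어린 소나무입니다.",
    "GPT-4o": "저기 바위에 독사가 막 나타난 걸 보니까 정말 놀랍다.",
}

for name, text in candidates.items():
    print(f"{name}: CER = {cer(text, reference):.2f}")
```

Each precomposed Hangul syllable counts as one character here, so the score is effectively a syllable-level error rate; lower is better, and 0 means an exact match.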
+ ## How do I use it?
+ * Use one of the predefined templates: `dialect_to_standard`, `standard_to_dialect`, `detect_dialect`, or `detect_dialect_and_convert`.
+   * `dialect_to_standard`: convert Jeju language to standard Korean
+   * `standard_to_dialect`: convert standard Korean to Jeju language
+   * `detect_dialect`: detect whether the input is Jeju language or standard Korean
+   * `detect_dialect_and_convert`: automatically detect whether the input is Jeju language or standard Korean and convert it to the other
+
+ ```python
+ import transformers
+ import torch
+
+ model_id = "JEJUMA/JEJUMA-001"
+
+ # Load the model as a chat-style text-generation pipeline in bfloat16.
+ pipeline = transformers.pipeline(
+     "text-generation",
+     model=model_id,
+     model_kwargs={"torch_dtype": torch.bfloat16},
+     device_map="auto",
+ )
+
+ # Stop generation at either the EOS token or the end-of-turn token.
+ terminators = [
+     pipeline.tokenizer.eos_token_id,
+     pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
+ ]
+
+ class JejuPromptTemplate:
+     @staticmethod
+     def dialect_to_standard(text):
+         return [{"role":"user", "content":"Convert the following sentence or word which is Jeju island dialect to standard Korean: " + text},]
+
+     @staticmethod
+     def standard_to_dialect(text):
+         return [{"role":"user", "content":"Convert the following sentence or word which is standard Korean to Jeju island dialect: " + text},]
+
+     @staticmethod
+     def detect_dialect(text):
+         return [{"role":"user", "content":"Detect the following sentence or word is Jeju island dialect or standard Korean: " + text},]
+
+     @staticmethod
+     def detect_dialect_and_convert(text):
+         return [{"role":"user", "content":"Detect the following sentence or word is Jeju island dialect or standard Korean and convert the following sentence or word to Jeju island dialect or standard Korean: " + text},]
+
+
+ # The input sentence is Jeju dialect, so the dialect_to_standard template applies.
+ outputs = pipeline(
+     JejuPromptTemplate.dialect_to_standard("자이 폴에 독솔 막 난 거 보난 언 생이우다"),
+     max_new_tokens=512,
+     eos_token_id=terminators,
+     do_sample=True,
+     temperature=0.1,
+     top_p=0.9,
+ )
+
+ # The last message in the returned conversation is the model's reply.
+ print(outputs[0]["generated_text"][-1])
+ ```
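The snippet above prints the entire final chat message, a dict with `role` and `content`. A minimal sketch of returning only the reply text follows. `extract_reply`, `convert`, and `fake_pipe` are hypothetical names that do not appear in the README; `fake_pipe` only imitates the list-of-messages shape that the `text-generation` pipeline returns for chat input, so the sketch runs without downloading the model.

```python
def extract_reply(outputs):
    """Return only the assistant's text from a chat-style pipeline result."""
    last_message = outputs[0]["generated_text"][-1]
    if last_message.get("role") != "assistant":
        raise ValueError("expected the final message to be the assistant turn")
    return last_message["content"].strip()


def convert(pipe, messages, **generate_kwargs):
    """Run one chat through `pipe` and return the reply text.

    Greedy decoding (do_sample=False) is an assumption made here for
    repeatable output; the README samples with temperature=0.1 instead.
    """
    options = {"max_new_tokens": 512, "do_sample": False}
    options.update(generate_kwargs)
    return extract_reply(pipe(messages, **options))


def fake_pipe(messages, **kwargs):
    # Stand-in that mimics the pipeline's return shape; not real model output.
    reply = {"role": "assistant", "content": " <reply> "}
    return [{"generated_text": messages + [reply]}]


messages = [{
    "role": "user",
    "content": "Convert the following sentence or word which is Jeju island dialect "
               "to standard Korean: 자이 폴에 독솔 막 난 거 보난 언 생이우다",
}]
print(convert(fake_pipe, messages))
```

With the real pipeline, `convert(pipeline, messages, eos_token_id=terminators)` would take the place of the `fake_pipe` call, reusing the objects defined in the README snippet.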