---
language:
- ko
- en
pipeline_tag: text-generation
inference: false
tags:
- solar
- mistral
- pytorch
- solar-ko
library_name: transformers
license: apache-2.0
base_model: upstage/SOLAR-10.7B-v1.0
---

<img src="https://cdn-uploads.huggingface.co/production/uploads/5e56829137cb5b49818287ea/WuiaS45EAWDurGTOtjR_d.png" style="max-width:250px;margin:0 auto;" />

**Update Log**

- 2024.07.01: Released Solar-Ko-Recovery & uploaded benchmark scores
- 2024.05.16: Released a preview of Solar-Ko-Recovery

# **Solar-Ko-Recovery-11B** 🌟❤️‍🩹

Solar-Ko-Recovery-11B aims to recover Solar's Korean capability by re-arranging the embedding and LM head layers, featuring an expanded vocabulary and continued pretraining on a Korean+English corpus for enhanced representation.
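The model can be loaded with the 🤗 Transformers library in the usual way. The minimal sketch below assumes the hub repository id `beomi/Solar-Ko-Recovery-11B`; adjust it if the actual path differs.

```python
# Minimal loading sketch (the hub id below is an assumption; adjust if needed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/Solar-Ko-Recovery-11B"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # ~22GB of weights in bf16
    device_map="auto",            # shard across available GPUs
)

inputs = tokenizer("대한민국의 수도는", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```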

## Model Details

**Model Developers:** Junbum Lee (Beomi)

**Variations:** Solar-Ko-Recovery is available in a single parameter size: 11B (10.99B🤣).

**Input:** The model accepts only text input.

**Output:** The model produces text output exclusively.

**Model Architecture:** 

Solar-Ko-Recovery is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.

| |Training Data|Parameters|Content Length|GQA|Tokens|Learning Rate|
|---|---|---|---|---|---|---|
|Solar-Ko-Recovery|*A curated mix of Korean+English Corpora*|11B(10.99B)|4k|O|>100B*|5e-5|

> NOTE: A two-step training process was used:
>
> 1) Only the embedding layer and LM head are trained
> 2) All parameters are trained
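The staged schedule can be approximated in plain PyTorch/Transformers by freezing everything except the embedding and LM head during stage 1. The sketch below is illustrative only and is not the exact training code used for this model.

```python
# Illustrative stage-1 freezing sketch (not the actual training script).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("upstage/SOLAR-10.7B-v1.0")

# Resize to the expanded 64k vocabulary before training the new embeddings.
model.resize_token_embeddings(64000)

# Stage 1: freeze everything, then unfreeze only input embeddings and LM head.
for param in model.parameters():
    param.requires_grad = False
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True
for param in model.get_output_embeddings().parameters():
    param.requires_grad = True

# Stage 2: unfreeze all parameters for full-parameter training.
# for param in model.parameters():
#     param.requires_grad = True
```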

**Vocab Expansion**

Vocab expansion was conducted on an edited [upstage/solar-1-mini-tokenizer](https://huggingface.co/upstage/solar-1-mini-tokenizer), which is a superset of the original Solar tokenizer.

| Model Name | Vocabulary Size | Description | 
| --- | --- | --- |
| Original Solar | 32000 | Sentencepiece BPE |
| **solar-1-mini-tokenizer** | 64000 | Sentencepiece BPE. Added Ko/JP vocabs |

**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."**

- SOLAR-10.7B: 26 tokens
- Solar-Ko-Recovery: 7 tokens

| Model | Tokens |
| --- | --- |
| SOLAR-10.7B | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.']` |
| Solar-Ko-Recovery | `['▁안녕하세요', ',', '▁오늘은', '▁날씨가', '▁좋', '네요', '.']` |

**Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"**

- SOLAR-10.7B: 22 tokens
- Solar-Ko-Recovery: 22 tokens

| Model | Tokens |
| --- | --- |
| SOLAR-10.7B | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
| Solar-Ko-Recovery | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
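
The comparison above can be reproduced with a short script. The sketch below assumes both tokenizers are available on the Hugging Face Hub; the Solar-Ko-Recovery repository id is an assumption.

```python
# Sketch for reproducing the token-count comparison (Recovery repo id is assumed).
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0")
recovery = AutoTokenizer.from_pretrained("beomi/Solar-Ko-Recovery-11B")  # assumed id

texts = [
    "안녕하세요, 오늘은 날씨가 좋네요.",
    "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!",
]
for text in texts:
    print(text)
    print(f"  SOLAR-10.7B:       {len(base.tokenize(text))} tokens")
    print(f"  Solar-Ko-Recovery: {len(recovery.tokenize(text))} tokens")
```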

# LICENSE

Apache 2.0

# **Model Benchmark**

## LM Eval Harness - Korean

- Used EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
- 5-shot scores
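
The scores can be reproduced approximately through the harness's Python entrypoint. The sketch below is a rough guide: the API shown corresponds to recent (v0.4.x) harness releases, the repository id is assumed, and exact task names and batch settings may need adjustment.

```python
# Rough reproduction sketch for the 5-shot Korean evaluation
# (repo id, dtype, and batch size are assumptions).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=beomi/Solar-Ko-Recovery-11B,dtype=bfloat16",
    tasks=["haerae", "kmmlu_direct", "kobest_boolq", "kobest_copa",
           "kobest_hellaswag", "kobest_sentineg"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```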

|                          Tasks                           |  Metric   |  Value  |   |  Stderr |
|----------------------------------------------------------|-----------|--------:|---|--------:|
|haerae                                                    |acc_norm   | 0.7874  |±  | 0.0118  |
| - haerae_general_knowledge                               |acc        | 0.5000  |±  | 0.0378  |
| - haerae_history                                         |acc        | 0.8723  |±  | 0.0244  |
| - haerae_loan_word                                       |acc        | 0.8402  |±  | 0.0283  |
| - haerae_rare_word                                       |acc        | 0.8346  |±  | 0.0185  |
| - haerae_standard_nomenclature                           |acc        | 0.8301  |±  | 0.0305  |
|kmmlu_direct                                              |exact_match| 0.4205  |±  | 0.0026  |
| - kmmlu_direct_accounting                                |exact_match| 0.3700  |±  | 0.0485  |
| - kmmlu_direct_agricultural_sciences                     |exact_match| 0.3140  |±  | 0.0147  |
| - kmmlu_direct_aviation_engineering_and_maintenance      |exact_match| 0.3870  |±  | 0.0154  |
| - kmmlu_direct_biology                                   |exact_match| 0.3510  |±  | 0.0151  |
| - kmmlu_direct_chemical_engineering                      |exact_match| 0.3910  |±  | 0.0154  |
| - kmmlu_direct_chemistry                                 |exact_match| 0.4000  |±  | 0.0200  |
| - kmmlu_direct_civil_engineering                         |exact_match| 0.4010  |±  | 0.0155  |
| - kmmlu_direct_computer_science                          |exact_match| 0.6520  |±  | 0.0151  |
| - kmmlu_direct_construction                              |exact_match| 0.3080  |±  | 0.0146  |
| - kmmlu_direct_criminal_law                              |exact_match| 0.3100  |±  | 0.0328  |
| - kmmlu_direct_ecology                                   |exact_match| 0.4660  |±  | 0.0158  |
| - kmmlu_direct_economics                                 |exact_match| 0.5385  |±  | 0.0439  |
| - kmmlu_direct_education                                 |exact_match| 0.6200  |±  | 0.0488  |
| - kmmlu_direct_electrical_engineering                    |exact_match| 0.3000  |±  | 0.0145  |
| - kmmlu_direct_electronics_engineering                   |exact_match| 0.4740  |±  | 0.0158  |
| - kmmlu_direct_energy_management                         |exact_match| 0.3560  |±  | 0.0151  |
| - kmmlu_direct_environmental_science                     |exact_match| 0.2980  |±  | 0.0145  |
| - kmmlu_direct_fashion                                   |exact_match| 0.4470  |±  | 0.0157  |
| - kmmlu_direct_food_processing                           |exact_match| 0.3690  |±  | 0.0153  |
| - kmmlu_direct_gas_technology_and_engineering            |exact_match| 0.3000  |±  | 0.0145  |
| - kmmlu_direct_geomatics                                 |exact_match| 0.3820  |±  | 0.0154  |
| - kmmlu_direct_health                                    |exact_match| 0.5700  |±  | 0.0498  |
| - kmmlu_direct_industrial_engineer                       |exact_match| 0.3830  |±  | 0.0154  |
| - kmmlu_direct_information_technology                    |exact_match| 0.6090  |±  | 0.0154  |
| - kmmlu_direct_interior_architecture_and_design          |exact_match| 0.5440  |±  | 0.0158  |
| - kmmlu_direct_korean_history                            |exact_match| 0.3800  |±  | 0.0488  |
| - kmmlu_direct_law                                       |exact_match| 0.4670  |±  | 0.0158  |
| - kmmlu_direct_machine_design_and_manufacturing          |exact_match| 0.3960  |±  | 0.0155  |
| - kmmlu_direct_management                                |exact_match| 0.5030  |±  | 0.0158  |
| - kmmlu_direct_maritime_engineering                      |exact_match| 0.4283  |±  | 0.0202  |
| - kmmlu_direct_marketing                                 |exact_match| 0.7460  |±  | 0.0138  |
| - kmmlu_direct_materials_engineering                     |exact_match| 0.4020  |±  | 0.0155  |
| - kmmlu_direct_math                                      |exact_match| 0.2867  |±  | 0.0262  |
| - kmmlu_direct_mechanical_engineering                    |exact_match| 0.3490  |±  | 0.0151  |
| - kmmlu_direct_nondestructive_testing                    |exact_match| 0.3760  |±  | 0.0153  |
| - kmmlu_direct_patent                                    |exact_match| 0.3700  |±  | 0.0485  |
| - kmmlu_direct_political_science_and_sociology           |exact_match| 0.5300  |±  | 0.0289  |
| - kmmlu_direct_psychology                                |exact_match| 0.4470  |±  | 0.0157  |
| - kmmlu_direct_public_safety                             |exact_match| 0.3520  |±  | 0.0151  |
| - kmmlu_direct_railway_and_automotive_engineering        |exact_match| 0.3220  |±  | 0.0148  |
| - kmmlu_direct_real_estate                               |exact_match| 0.4350  |±  | 0.0351  |
| - kmmlu_direct_refrigerating_machinery                   |exact_match| 0.3240  |±  | 0.0148  |
| - kmmlu_direct_social_welfare                            |exact_match| 0.4970  |±  | 0.0158  |
| - kmmlu_direct_taxation                                  |exact_match| 0.3800  |±  | 0.0344  |
| - kmmlu_direct_telecommunications_and_wireless_technology|exact_match| 0.5480  |±  | 0.0157  |
|kobest_boolq                                              |acc        | 0.9202  |±  | 0.0072  |
|                                                          |f1         | 0.9202  |±  |N/A      |
|kobest_copa                                               |acc        | 0.8680  |±  | 0.0107  |
|                                                          |f1         | 0.8678  |±  |N/A      |
|kobest_hellaswag                                          |acc        | 0.5560  |±  | 0.0222  |
|                                                          |f1         | 0.5520  |±  |N/A      |
|                                                          |acc_norm   | 0.6540  |±  | 0.0213  |
|kobest_sentineg                                           |acc        | 0.9824  |±  | 0.0066  |
|                                                          |f1         | 0.9824  |±  |N/A      |



## Citation

TBD

## Acknowledgements

- Training support was provided by the [TPU Research Cloud](https://sites.research.google/trc/) program.