---
language:
- ko
tags:
- pytorch
- causal-lm
license: apache-2.0

---

# GPT-NeoX-Ko-1.3B

## Model Description
GPT-NeoX-Ko is a Korean autoregressive language model made by the EleutherAI multilingual team. We collected about 1.2TB of Korean data for this work in collaboration with [TUNiB](https://tunib.ai/). We used the [GPT-NeoX framework](https://github.com/EleutherAI/gpt-neox) to train the model and added several Korean tasks to [LM-Evaluation-Harness](https://github.com/EleutherAI/lm-evaluation-harness) for model evaluation.

| Hyperparameter       | Value                                                                                                                                  |
|----------------------|----------------------------------------------------------------------------------------------------------------------------------------|
| \\(n_{parameters}\\) | 1,331,810,304                                                                                                                          |
| \\(n_{layers}\\)     | 24                                                                                                                                     |
| \\(d_{model}\\)      | 2048                                                                                                                                   |
| \\(d_{ff}\\)         | 8192                                                                                                                                   |
| \\(n_{heads}\\)      | 16                                                                                                                                     |
| \\(d_{head}\\)       | 128                                                                                                                                    |
| \\(n_{ctx}\\)        | 2048                                                                                                                                   |
| \\(n_{vocab}\\)      | 30,000 / 30,080                                                                                                                        |
| Positional Encoding  | [Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864)                                                                   |
| RoPE Dimensions      | [64](https://github.com/kingoflolz/mesh-transformer-jax/blob/f2aa66e0925de6593dcbb70e72399b97b4130482/mesh_transformer/layers.py#L223) |

The model consists of 24 transformer layers with a model dimension of 2048 and a feedforward dimension of 8192. The model
dimension is split into 16 heads, each with a dimension of 128. Rotary Position Embedding (RoPE) is applied to 64
dimensions of each head. The model is trained with a tokenization vocabulary of 30,000.
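
If you want to confirm these hyperparameters against the released checkpoint, a minimal sketch is shown below; it assumes the checkpoint exposes the standard GPT-NeoX configuration fields in `transformers`.

```python
from transformers import AutoConfig

# Load only the configuration; no model weights are downloaded for this check.
config = AutoConfig.from_pretrained("EleutherAI/gpt-neox-ko-1.3b")

print(config.num_hidden_layers)        # expected: 24
print(config.hidden_size)              # expected: 2048
print(config.num_attention_heads)      # expected: 16
print(config.max_position_embeddings)  # expected: 2048
print(config.vocab_size)               # expected: 30,000 (or 30,080 with padding)
```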

## Training data

GPT-NeoX-Ko was trained on a 1.2TB Korean dataset, a large-scale curated corpus created by [TUNiB](https://tunib.ai/).

## Training procedure

GPT-NeoX-Ko was trained on 213 billion tokens over 102,000 steps on 256 A100 GPUs. It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token correctly.
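
As a rough back-of-the-envelope check (assuming all 213 billion tokens count toward optimization and sequences are packed to the full 2,048-token context), the implied global batch size is

\\[ \frac{213 \times 10^{9}\ \text{tokens}}{102{,}000\ \text{steps} \times 2{,}048\ \text{tokens/sequence}} \approx 1{,}020\ \text{sequences per step}, \\]

which is consistent with a global batch size on the order of 1,024 sequences.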

## How to use

This model can be easily loaded using the `AutoModelForCausalLM` functionality:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-ko-1.3b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-ko-1.3b")
```
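
Once loaded, the model can be used for open-ended generation. The snippet below is a minimal sketch; the Korean prompt and the sampling parameters are illustrative choices, not recommendations from the model authors.

```python
# Continuation of the loading snippet above.
prompt = "인공지능은"  # "Artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=64,   # length of the generated continuation
    do_sample=True,      # sample instead of greedy decoding
    top_p=0.9,
    temperature=0.8,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```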

## Privacy considerations and Limitations

GPT-NeoX-Ko learns an inner representation of the Korean language that can be used to extract features useful for downstream tasks. However, the model is best at what it was pretrained for, which is generating text from a prompt.

### Privacy considerations
Pretrained language models are prone to memorizing personal information that appears in their training data. To mitigate this privacy risk, we added the following tokens to the vocabulary and replaced much of the personal information with these tokens during data preprocessing.

* `<|acc|>` : bank account number
* `<|rrn|>` : resident registration number
* `<|tell|>` : phone number
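
If needed, you can check that these placeholder tokens are present in the released tokenizer. The sketch below assumes the tokens are stored under exactly these surface forms in the vocabulary.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-ko-1.3b")
vocab = tokenizer.get_vocab()

# Placeholder tokens used to mask personal information during preprocessing.
for token in ["<|acc|>", "<|rrn|>", "<|tell|>"]:
    print(token, "in vocabulary:", token in vocab)
```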

### Limitations and Biases

The core functionality of GPT-NeoX-Ko is taking a string of text and predicting the next token. While language models are widely used for other tasks, there are many unknowns in this work. When prompting GPT-NeoX-Ko, it is important to remember that the statistically most likely next token is often not the token that produces the most "accurate" text. Never depend on GPT-NeoX-Ko to produce factually accurate output. Depending on the use case, GPT-NeoX-Ko may produce socially unacceptable text.

As with all language models, it is hard to predict in advance how GPT-NeoX-Ko will respond to particular prompts, and offensive content may occur without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.

## Evaluation results
We evaluated the model on the [KOBEST dataset](https://arxiv.org/abs/2204.04541), which consists of five Korean downstream tasks.
We added the corresponding tasks to [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and used the prompt templates described in the paper.
The following tables show the evaluation results for varying numbers of few-shot examples. You can reproduce these results using the [multilingual-ko branch of lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/multilingual-ko).
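
For reference, the sketch below shows how such an evaluation might be launched through the harness's Python API. The task names and the adapter name are assumptions and may differ in the multilingual-ko branch; consult that branch for the exact identifiers.

```python
from lm_eval import evaluator

# Hypothetical KOBEST task names; check the multilingual-ko branch for the real ones.
results = evaluator.simple_evaluate(
    model="gpt2",  # Hugging Face causal-LM adapter in the harness
    model_args="pretrained=EleutherAI/gpt-neox-ko-1.3b",
    tasks=["kobest_boolq", "kobest_copa", "kobest_wic",
           "kobest_hellaswag", "kobest_sentineg"],
    num_fewshot=10,
)
print(results["results"])
```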

- the number of few-shot examples = 1

| Model                                                                                        | parameters | boolq | copa   | wic    | hellaswag | sentineg | average |
|----------------------------------------------------------------------------------------------|------------|-------|--------|--------|-----------|----------|---------|
| [skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) &dagger; | 1.2B       |       |        |        |           |          |         |
| [kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) &ast;                            | 6.0B       |       |        |        |           |          |         |
| [EleutherAI/gpt-neox-ko-1.3b](https://huggingface.co/EleutherAI/gpt-neox-ko-1.3b) (ours)     | 1.3B       | 0.659 | 0.6993 | 0.6292 | 0.3884    | 0.8427   | 0.64372 |

- the number of few-shot examples = 5

| Model                                                                                        | parameters | boolq  | copa   | wic   | hellaswag | sentineg | average |
|----------------------------------------------------------------------------------------------|------------|--------|--------|-------|-----------|----------|---------|
| [skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) &dagger; | 1.2B       |        |        |       |           |          |         |
| [kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) &ast;                            | 6.0B       |        |        |       |           |          |         |
| [EleutherAI/gpt-neox-ko-1.3b](https://huggingface.co/EleutherAI/gpt-neox-ko-1.3b) (ours)     | 1.3B       | 0.6309 | 0.7053 | 0.656 | 0.3984    | 0.7979   | 0.6337  |

- the number of few-shot examples = 10

| Model                                                                                        | parameters | boolq      | copa       | wic        | hellaswag  | sentineg   | average    |
|----------------------------------------------------------------------------------------------|------------|------------|------------|------------|------------|------------|------------|
| [skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) &dagger; | 1.2B       | **0.6663** | 0.6222     | 0.656      | 0.4011     | 0.3534     | 0.5398     |
| [kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) &ast;                            | 6.0B       | 0.3241     | **0.719**  | 0.1356     | **0.4616** | 0.8056     | 0.48936    |
| [EleutherAI/gpt-neox-ko-1.3b](https://huggingface.co/EleutherAI/gpt-neox-ko-1.3b) (ours)     | 1.3B       | 0.5174     | 0.7072     | **0.6567** | 0.417      | **0.8444** | **0.5468** |

- the number of few-shot examples = 50

| Model                                                                                        | parameters | boolq | copa   | wic    | hellaswag | sentineg | average |
|----------------------------------------------------------------------------------------------|------------|-------|--------|--------|-----------|----------|---------|
| [skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) &dagger; | 1.2B       |       |        |        |           |          |         |
| [kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) &ast;                            | 6.0B       |       |        |        |           |          |         |
| [EleutherAI/gpt-neox-ko-1.3b](https://huggingface.co/EleutherAI/gpt-neox-ko-1.3b) (ours)     | 1.3B       | 0.49  | 0.7097 | 0.5834 | 0.4416    | 0.7382   | 0.59258 |

- the number of few-shot examples = 100

| Model                                                                                        | parameters | boolq  | copa   | wic    | hellaswag | sentineg | average |
|----------------------------------------------------------------------------------------------|------------|--------|--------|--------|-----------|----------|---------|
| [skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) &dagger; | 1.2B       |        |        |        |           |          |         |
| [kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) &ast;                            | 6.0B       |        |        |        |           |          |         |
| [EleutherAI/gpt-neox-ko-1.3b](https://huggingface.co/EleutherAI/gpt-neox-ko-1.3b) (ours)     | 1.3B       | 0.4867 | 0.7207 | 0.5877 | 0.5877    | 0.7407   | 0.59234 |

<p><strong>&dagger;</strong> The model card for this model provides evaluation results on the KOBEST dataset, but when we evaluated the model with the prompts described in the paper, we could not obtain similar results. We then checked the KOBEST paper and found that the reported numbers were close to its fine-tuning results. Because we evaluated by prompt-based generation without fine-tuning the model, our results may differ from those given in the model card for this model.</p>

<p><strong>&ast;</strong> Since this model does not provide evaluation results on the KOBEST dataset, we evaluated the model ourselves using lm-evaluation-harness. You can reproduce this result using the source code included in the multilingual-ko branch of lm-evaluation-harness.</p>

## Citation and Related Information

### BibTeX entry

If you find our work useful, please consider citing:

```bibtex
@misc{gpt-neox-ko,
  title = {{GPT-NeoX-Ko: Open-Source Korean Autoregressive Language Model}},
  author = {Ko, Hyunwoong and Yang, Kichang and Ryu, Minho and Kim, Taekyun and Yang, Seungmu and Hyun, Jiwoong and Park, Sungho and Ryu, Myunghyun and Keum, Bitna and Oh, Saechan and Kim, Soohwan and Park, Kyubyong},
  url = {https://www.github.com/eleutherai/multilingual},
  month = {9},
  year = {2022},
}
```

### Acknowledgements

This project would not have been possible without the compute generously provided by [Stability.ai](https://stability.ai); we thank them for providing a large amount of GPU resources. Thanks also go to [TUNiB](https://tunib.ai) for providing a large-scale Korean dataset for this work.