hyunwoongko committed on
Commit
8f9904d
1 Parent(s): 632aab2

Update README.md

Files changed (1)
  1. README.md +79 -67
README.md CHANGED
@@ -6,122 +6,134 @@ tags:
6
  - causal-lm
7
  license: apache-2.0
8
  datasets:
9
- - Created by tunib.
10
 
11
  ---
12
 
13
  # GPT-NeoX-Ko-1.3B
14
 
15
  ## Model Description
16
-
17
- We firstly targeted Korean language because most of our contributors were Korean when we started our research. We collected about 1.2TB Korean dataset for this work, which was done with TUNiB. In addition, we used the GPT-NeoX framework for model training and added 8 Korean tasks to LM-Evaluation-Harness for model evaluation.
18
-
19
-
20
- <figure>
21
-
22
- | Hyperparameter | Value |
23
- |----------------------|------------|
24
- | \\(n_{parameters}\\) | 1331810304 |
25
- | \\(n_{layers}\\) | 24&ast; |
26
- | \\(d_{model}\\) | 2048 |
27
- | \\(d_{ff}\\) | 8192 |
28
- | \\(n_{heads}\\) | 16 |
29
- | \\(d_{head}\\) | 128 |
30
- | \\(n_{ctx}\\) | 2048 |
31
- | \\(n_{vocab}\\) | 30080/30000&dagger; |
32
- | Positional Encoding | [Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864) |
33
  | RoPE Dimensions | [64](https://github.com/kingoflolz/mesh-transformer-jax/blob/f2aa66e0925de6593dcbb70e72399b97b4130482/mesh_transformer/layers.py#L223) |
34
- <figcaption><p><strong>&ast;</strong> Each layer consists of one feedforward block and one self attention block.</p>
35
 
36
- The model consists of 24 layers with a model dimension of 2048, and a feedforward dimension of 8192. The model
37
  dimension is split into 16 heads, each with a dimension of 128. Rotary Position Embedding (RoPE) is applied to 64
38
- dimensions of each head. The model is trained with a tokenization vocabulary of 30080.
39
 
40
  ## Training data
41
 
42
- GPT-NeoX-Ko was trained on 1.2TB Korean Dataset, a large-scale curated dataset created by [tunib-ai](https://tunib.ai/).
43
 
44
  ## Training procedure
45
 
46
- This model was trained for 402 billion tokens over 102,000 steps on A100 x 256 pods. It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token correctly.
47
-
48
- ## Intended Use and Limitations
49
 
50
- GPT-NeoX-Ko learns an inner representation of the Korean language that can be used to extract features useful for downstream tasks. The model is best at what it was pretrained for however, which is generating text from a prompt.
51
-
52
- ### How to use
53
 
54
  This model can be easily loaded using the `AutoModelForCausalLM` functionality:
55
 
56
  ```python
57
  from transformers import AutoTokenizer, AutoModelForCausalLM
58
 
59
- tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-ko-1.3B")
60
- model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-ko-1.3B")
61
  ```
62
 
63
- ### Limitations and Biases
64
 
65
- The core functionality of GPT-NeoX-Ko is taking a string of text and predicting the next token. While language models are widely used for tasks other than this, there are a lot of unknowns with this work. When prompting GPT-NeoX-Ko it is important to remember that the statistically most likely next token is often not the token that produces the most "accurate" text. Never depend upon GPT-NeoX-Ko to produce factually accurate output.
66
 
67
- GPT-NeoX-Ko was trained on the Large Scale Korean Datsets, a dataset known to contain profanity, lewd, and otherwise abrasive language. Depending upon use case GPT-NeoX-Ko may produce socially unacceptable text.
68
 
69
  As with all language models, it is hard to predict in advance how GPT-NeoX-Ko will respond to particular prompts and offensive content may occur without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.
70
 
71
  ## Evaluation results
72
 
73
- <figure>
74
 
75
- | Model | Public | Training FLOPs | kobest_boolq ↓ | kobest_copa ↑ | kobest_wic ↑ | kobest_hellaswag ↑ | kobest_sentineg ↑ | Dataset Size (GB) |
76
- |--------------------------|-------------|----------------|--- |--- |--- |--- |--- |-------------------|
77
- | KoGPT-trinity&ddagger; | &cross; | ----- | 0.6663 | 0.6222 | 0.656 | 0.4011 | 0.3534 | ----- |
78
- | KoGPT-KakaoBrain&ddagger; | &cross; | ----- | 0.3241 | 0.719 | 0.1356 | 0.4616 | 0.8065 | ----- |
79
- | GPT-NeoX-Ko-1.3B(ours)&ddagger; | &cross; | ----- | 0.5174 | 0.7072 | 0.6567 | 0.417 | 0.8444 | ----- |
80
 
 
81
 
82
- <figcaption><p>Models roughly sorted by performance, or by FLOPs if not available.</p>
83
 
84
- <p><strong>&ast;</strong> Evaluation numbers reported by their respective authors. All other numbers are provided by
85
- running <a href="https://github.com/EleutherAI/lm-evaluation-harness/"><code>lm-evaluation-harness</code></a> either with released
86
- weights or with API access. Due to subtle implementation differences as well as different zero shot task framing, these
87
- might not be directly comparable. See <a href="https://blog.eleuther.ai/gpt3-model-sizes/">this blog post</a> for more
88
- details.</p>
89
 
90
- <p><strong>†</strong> Megatron-11B provides no comparable metrics, and several implementations using the released weights do not
91
- reproduce the generation quality and evaluations. (see <a href="https://github.com/huggingface/transformers/pull/10301">1</a>
92
- <a href="https://github.com/pytorch/fairseq/issues/2358">2</a> <a href="https://github.com/pytorch/fairseq/issues/2719">3</a>)
93
- Thus, evaluation was not attempted.</p>
94
 
95
- <p><strong>‡</strong> These models have been trained with data which contains possible test set contamination. The OpenAI GPT-3 models
96
- failed to deduplicate training data for certain test sets, while the GPT-Neo models as well as this one is
97
- trained on the Pile, which has not been deduplicated against any test sets.</p></figcaption></figure>
98
 
99
  ## Citation and Related Information
100
 
101
  ### BibTeX entry
102
 
103
- To cite this model:
104
  ```bibtex
105
  @misc{gpt-neox-ko,
106
  title = {{GPT-NeoX-Ko: Open-Source Korean Autoregressive Language Model}},
107
- author = {Hyunwoong, Ko and Kichang, Yang and Minho, Ryu and Taekyun, Kim, ...},
108
  url = {https://www.github.com/eleutherai/multilingual},
109
  month = {9},
110
  year = {2022},
111
  }
112
  ```
113
 
114
- If you use this model, we would love to hear about it! Reach out on [GitHub](https://github.com/kingoflolz/mesh-transformer-jax), Discord, or shoot Ben an email.
115
-
116
- ## Acknowledgements
117
-
118
- This project would not have been possible without compute generously provided by Google through the
119
- [TPU Research Cloud](https://sites.research.google/trc/), as well as the Cloud TPU team for providing early access to the [Cloud TPU VM](https://cloud.google.com/blog/products/compute/introducing-cloud-tpu-vms) Alpha.
120
 
121
- Thanks to everyone who have helped out one way or another (listed alphabetically):
122
- - [James Bradbury](https://twitter.com/jekbradbury) for valuable assistance with debugging JAX issues.
123
- - [Stella Biderman](https://www.stellabiderman.com), [Eric Hallahan](https://twitter.com/erichallahan), [Kurumuz](https://github.com/kurumuz/), and [Finetune](https://github.com/finetuneanon/) for converting the model to be compatible with the `transformers` package.
124
- - [Leo Gao](https://twitter.com/nabla_theta) for running zero shot evaluations for the baseline models for the table.
125
- - [Laurence Golding](https://github.com/researcher2/) for adding some features to the web demo.
126
- - [Aran Komatsuzaki](https://twitter.com/arankomatsuzaki) for advice with experiment design and writing the blog posts.
127
- - [Janko Prester](https://github.com/jprester/) for creating the web demo frontend.
 
6
  - causal-lm
7
  license: apache-2.0
8
  datasets:
9
+ - Large-scale Korean dataset created by TUNiB.
10
 
11
  ---
12
 
13
  # GPT-NeoX-Ko-1.3B
14
 
15
  ## Model Description
16
+ GPT-NeoX-Ko is a Korean autoregressive language model made by the EleutherAI multilingual team. We collected about 1.2TB of Korean data for this work in collaboration with [TUNiB](https://tunib.ai/). In addition, we used the GPT-NeoX framework for model training and added several Korean tasks to LM-Evaluation-Harness for model evaluation.
17
+
18
+ | Hyperparameter | Value |
19
+ |----------------------|----------------------------------------------------------------------------------------------------------------------------------------|
20
+ | \\(n_{parameters}\\) | 1,331,810,304 |
21
+ | \\(n_{layers}\\) | 24 |
22
+ | \\(d_{model}\\) | 2048 |
23
+ | \\(d_{ff}\\) | 8192 |
24
+ | \\(n_{heads}\\) | 16 |
25
+ | \\(d_{head}\\) | 128 |
26
+ | \\(n_{ctx}\\) | 2048 |
27
+ | \\(n_{vocab}\\) | 30,000 / 30,080 |
28
+ | Positional Encoding | [Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864) |
29
  | RoPE Dimensions | [64](https://github.com/kingoflolz/mesh-transformer-jax/blob/f2aa66e0925de6593dcbb70e72399b97b4130482/mesh_transformer/layers.py#L223) |
30
 
31
+ The model consists of 24 transformer layers with a model dimension of 2048, and a feedforward dimension of 8192. The model
32
  dimension is split into 16 heads, each with a dimension of 128. Rotary Position Embedding (RoPE) is applied to 64
33
+ dimensions of each head. The model is trained with a tokenization vocabulary of 30,000.
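+
+ As a rough sanity check on the parameter count in the table, the sketch below recomputes it from these hyperparameters, assuming the standard GPT-NeoX layout: untied input/output embeddings over the 30,080-entry vocabulary (the larger of the two figures above), biases on every linear projection, and two LayerNorms per layer plus a final LayerNorm. RoPE itself adds no parameters.
+
+ ```python
+ # Back-of-the-envelope parameter count from the hyperparameters above.
+ # Assumes the usual GPT-NeoX layout; not taken from the released training code.
+ d_model, d_ff, n_layers, n_vocab = 2048, 8192, 24, 30080
+
+ embeddings = 2 * n_vocab * d_model                                  # untied embed_in + embed_out
+ attention  = 3 * d_model * (d_model + 1) + d_model * (d_model + 1)  # QKV + output projection, with biases
+ ffn        = d_model * (d_ff + 1) + d_ff * (d_model + 1)            # up- and down-projections, with biases
+ layernorms = 2 * 2 * d_model                                        # two LayerNorms per layer (weight + bias)
+ per_layer  = attention + ffn + layernorms
+
+ total = embeddings + n_layers * per_layer + 2 * d_model             # plus the final LayerNorm
+ print(f"{total:,}")  # 1,331,810,304
+ ```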
34
 
35
  ## Training data
36
 
37
+ GPT-NeoX-Ko was trained on a 1.2TB Korean dataset, a large-scale curated corpus created by [TUNiB](https://tunib.ai/).
38
 
39
  ## Training procedure
40
 
41
+ GPT-NeoX-Ko was trained for 213 billion tokens over 102,000 steps on 256 A100 GPUs. It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token correctly.
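+
+ The snippet below only illustrates that objective through the `transformers` API (the actual training used the GPT-NeoX framework): when the labels are set to the input ids, the library shifts them internally so that each position is scored on predicting the following token, and `loss` is the mean cross-entropy.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-ko-1.3b")
+ model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-ko-1.3b")
+
+ # Any Korean sentence works here; this one is only an example.
+ batch = tokenizer("제주도는 한국에서 가장 큰 섬이다.", return_tensors="pt")
+ outputs = model(**batch, labels=batch["input_ids"])  # next-token cross-entropy
+ print(outputs.loss)
+ ```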
42
 
43
+ ## How to use
44
 
45
  This model can be easily loaded using the `AutoModelForCausalLM` functionality:
46
 
47
  ```python
48
  from transformers import AutoTokenizer, AutoModelForCausalLM
49
 
50
+ tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-ko-1.3b")
51
+ model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-ko-1.3b")
52
  ```
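+
+ Continuing from the snippet above, text can then be generated with the standard `generate` API; the prompt and sampling settings below are only illustrative.
+
+ ```python
+ prompt = "인공지능이란"  # "Artificial intelligence is ..."
+ inputs = tokenizer(prompt, return_tensors="pt")
+
+ # Sampling settings are a reasonable starting point, not a recommendation.
+ output_ids = model.generate(
+     **inputs,
+     max_new_tokens=64,
+     do_sample=True,
+     top_p=0.95,
+     temperature=0.8,
+ )
+ print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
+ ```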
53
 
54
+ ## Privacy considerations and Limitations
55
+
56
+ GPT-NeoX-Ko learns an inner representation of the Korean language that can be used to extract features useful for downstream tasks. The model is, however, best at what it was pretrained for: generating text from a prompt.
57
 
58
+ ### Privacy considerations
59
+ Pretrained language models tend to memorize personal information that appears in their training data. To mitigate this privacy risk, we added the following tokens to the vocabulary and replaced much of the personal information in the data with these tokens during preprocessing; a simplified sketch of this masking follows the list below.
60
 
61
+ * `<|acc|>` : bank account number
62
+ * `<|rrn|>` : resident registration number
63
+ * `<|tell|>` : phone number
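+
+ The sketch below only illustrates the idea with simplified, hypothetical regular expressions; it is not the actual preprocessing pipeline, and real Korean PII formats are more varied.
+
+ ```python
+ import re
+
+ # Hypothetical patterns for illustration only; the patterns and pipeline used
+ # for GPT-NeoX-Ko's preprocessing are not reproduced here.
+ PII_PATTERNS = [
+     (re.compile(r"\b\d{2,3}-\d{3,4}-\d{4}\b"), "<|tell|>"),  # phone number
+     (re.compile(r"\b\d{6}-\d{7}\b"), "<|rrn|>"),             # resident registration number
+     (re.compile(r"\b\d{3}-\d{2,6}-\d{6,8}\b"), "<|acc|>"),   # bank account number (formats vary)
+ ]
+
+ def mask_pii(text: str) -> str:
+     for pattern, token in PII_PATTERNS:
+         text = pattern.sub(token, text)
+     return text
+
+ print(mask_pii("문의: 010-1234-5678 입니다."))  # -> 문의: <|tell|> 입니다.
+ ```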
64
+
65
+ ### Limitations and Biases
66
+
67
+ The core functionality of GPT-NeoX-Ko is taking a string of text and predicting the next token. While language models are widely used for tasks other than this, there are a lot of unknowns with this work. When prompting GPT-NeoX-Ko, it is important to remember that the statistically most likely next token is often not the token that produces the most "accurate" text. Never depend upon GPT-NeoX-Ko to produce factually accurate output. Depending upon the use case, GPT-NeoX-Ko may produce socially unacceptable text.
68
 
69
  As with all language models, it is hard to predict in advance how GPT-NeoX-Ko will respond to particular prompts and offensive content may occur without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.
70
 
71
  ## Evaluation results
72
+ We used the [KOBEST dataset](https://arxiv.org/abs/2204.04541), which consists of five Korean downstream tasks for model evaluation.
73
+ We added the corresponding tasks to [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and utilized prompt templates described in the paper.
74
+ The following tables show the evaluation results with various numbers of few-shot examples. You can reproduce these results using the [multilingual-ko branch of lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/multilingual-ko).
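+
+ A minimal sketch of such a run through the harness's Python API follows; the `gpt2` model adapter name and the `kobest_*` task names are assumptions (the task names follow the columns used below), so consult the multilingual-ko branch for the exact registered names.
+
+ ```python
+ # Assumes lm-evaluation-harness is installed from the multilingual-ko branch.
+ from lm_eval import evaluator
+
+ results = evaluator.simple_evaluate(
+     model="gpt2",  # the harness's Hugging Face causal-LM adapter
+     model_args="pretrained=EleutherAI/gpt-neox-ko-1.3b",
+     tasks=["kobest_boolq", "kobest_copa", "kobest_wic",
+            "kobest_hellaswag", "kobest_sentineg"],
+     num_fewshot=5,
+ )
+ print(results["results"])
+ ```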
75
+
76
+ - the number of few-shot examples = 1
77
+
78
+ | Model | parameters | boolq | copa | wic | hellaswag | sentineg | average |
79
+ |----------------------------------------------------------------------------------------------|------------|-------|--------|--------|-----------|----------|---------|
80
+ | [skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) &dagger; | 1.2B | | | | | | |
81
+ | [kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) &ast; | 6.0B | | | | | | |
82
+ | [EleutherAI/gpt-neox-ko-1.3b](https://huggingface.co/EleutherAI/gpt-neox-ko-1.3b) (ours) | 1.3B | 0.659 | 0.6993 | 0.6292 | 0.3884 | 0.8427 | 0.64372 |
83
+
84
+ - the number of few-shot examples = 5
85
+
86
+ | Model | parameters | boolq | copa | wic | hellaswag | sentineg | average |
87
+ |----------------------------------------------------------------------------------------------|------------|--------|--------|-------|-----------|----------|---------|
88
+ | [skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) &dagger; | 1.2B | | | | | | |
89
+ | [kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) &ast; | 6.0B | | | | | | |
90
+ | [EleutherAI/gpt-neox-ko-1.3b](https://huggingface.co/EleutherAI/gpt-neox-ko-1.3b) (ours) | 1.3B | 0.6309 | 0.7053 | 0.656 | 0.3984 | 0.7979 | 0.6337 |
91
 
92
+ - the number of few-shot examples = 10
93
 
94
+ | Model | parameters | boolq | copa | wic | hellaswag | sentineg | average |
95
+ |----------------------------------------------------------------------------------------------|------------|------------|------------|------------|------------|------------|------------|
96
+ | [skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) &dagger; | 1.2B | **0.6663** | 0.6222 | 0.656 | 0.4011 | 0.3534 | 0.5398 |
97
+ | [kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) &ast; | 6.0B | 0.3241 | 0.719 | 0.1356 | **0.4616** | 0.8065 | 0.48936 |
98
+ | [EleutherAI/gpt-neox-ko-1.3b](https://huggingface.co/EleutherAI/gpt-neox-ko-1.3b) (ours) | 1.3B | 0.5174 | **0.7072** | **0.6567** | 0.417 | **0.8444** | **0.5468** |
99
 
100
+ - the number of few-shot examples = 50
101
 
102
+ | Model | parameters | boolq | copa | wic | hellaswag | sentineg | average |
103
+ |----------------------------------------------------------------------------------------------|------------|-------|--------|--------|-----------|----------|---------|
104
+ | [skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) &dagger; | 1.2B | | | | | | |
105
+ | [kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) &ast; | 6.0B | | | | | | |
106
+ | [EleutherAI/gpt-neox-ko-1.3b](https://huggingface.co/EleutherAI/gpt-neox-ko-1.3b) (ours) | 1.3B | 0.49 | 0.7097 | 0.5834 | 0.4416 | 0.7382 | 0.59258 |
107
 
108
+ - the number of few-shot examples = 100
109
 
110
+ | Model | parameters | boolq | copa | wic | hellaswag | sentineg | average |
111
+ |----------------------------------------------------------------------------------------------|------------|--------|--------|--------|-----------|----------|---------|
112
+ | [skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) &dagger; | 1.2B | | | | | | |
113
+ | [kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) &ast; | 6.0B | | | | | | |
114
+ | [EleutherAI/gpt-neox-ko-1.3b](https://huggingface.co/EleutherAI/gpt-neox-ko-1.3b) (ours) | 1.3B | 0.4867 | 0.7207 | 0.5877 | 0.5877 | 0.7407 | 0.59234 |
115
 
116
+ <p><strong>&dagger;</strong> The model card for this model reports evaluation results on the KOBEST dataset, but when we evaluated it with the prompts described in the paper, we could not reproduce similar numbers. Checking the KOBEST paper, we found that the reported results resemble its fine-tuning results. Because we evaluate prompt-based generation without fine-tuning the model, our numbers may differ from those given in that model card.</p>
117
+
118
+ <p><strong>&ast;</strong> Since this model does not report evaluation results on the KOBEST dataset, we evaluated it ourselves using lm-evaluation-harness. You can reproduce these results using the source code included in the multilingual-ko branch of lm-evaluation-harness.</p>
119
 
120
  ## Citation and Related Information
121
 
122
  ### BibTeX entry
123
 
124
+ If you find our work useful, please consider citing:
125
+
126
  ```bibtex
127
  @misc{gpt-neox-ko,
128
  title = {{GPT-NeoX-Ko: Open-Source Korean Autoregressive Language Model}},
129
+ author = {Ko, Hyunwoong and Yang, Kichang and Ryu, Minho and Kim, Taekyun and Yang, Seungmu and Hyun, Jiwoong and Park, Sungho and Ryu, Myunghyun and Keum, Bitna and Oh, Saechan and Kim, Soohwan and Park, Kyubyong},
130
  url = {https://www.github.com/eleutherai/multilingual},
131
  month = {9},
132
  year = {2022},
133
  }
134
  ```
135
 
136
+ ### Acknowledgements
137
 
138
+ This project would not have been possible without the compute generously provided by [Stability.ai](https://stability.ai); we thank them for providing a large amount of GPU resources for this work.
139
+ Thanks also go to [TUNiB](https://tunib.ai) for providing a large-scale Korean dataset for this work.