jason9693 committed on
Commit 03fc19a • 1 Parent(s): e60ec83

modified description

Files changed (1)
  1. README.md +16 -29
README.md CHANGED
@@ -33,22 +33,21 @@ We firstly targeted Korean language because most of our contributors were Korean
  | RoPE Dimensions | [64](https://github.com/kingoflolz/mesh-transformer-jax/blob/f2aa66e0925de6593dcbb70e72399b97b4130482/mesh_transformer/layers.py#L223) |
  <figcaption><p><strong>&ast;</strong> Each layer consists of one feedforward block and one self attention block.</p>

- The model consists of 32 layers with a model dimension of 4096, and a feedforward dimension of 16384. The model
- dimension is split into 16 heads, each with a dimension of 256. Rotary Position Embedding (RoPE) is applied to 64
- dimensions of each head. The model is trained with a tokenization vocabulary of 50257, using the same set of BPEs as
- GPT-2/GPT-3.
+ The model consists of 24 layers with a model dimension of 2048, and a feedforward dimension of 8192. The model
+ dimension is split into 16 heads, each with a dimension of 128. Rotary Position Embedding (RoPE) is applied to 64
+ dimensions of each head. The model is trained with a tokenization vocabulary of 30080.
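
For illustration, the hyperparameters in the added lines above map onto a Hugging Face model configuration roughly as follows. This is a minimal sketch assuming the `GPTNeoXConfig` field names from recent `transformers` releases; the 2048-token context length is an assumption, not stated in the card.

```python
from transformers import GPTNeoXConfig

# Rough sketch of the architecture described above; values not stated in
# the card (e.g. max_position_embeddings) are placeholders.
config = GPTNeoXConfig(
    vocab_size=30080,              # tokenization vocabulary
    hidden_size=2048,              # model dimension
    num_hidden_layers=24,          # transformer layers
    num_attention_heads=16,        # 16 heads x 128 dims = 2048
    intermediate_size=8192,        # feedforward dimension
    rotary_pct=0.5,                # RoPE applied to 64 of the 128 head dims
    max_position_embeddings=2048,  # assumption: not stated in the card
)
```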
 
  ## Training data

- GPT-J 6B was trained on [the Pile](https://pile.eleuther.ai), a large-scale curated dataset created by [EleutherAI](https://www.eleuther.ai).
+ GPT-NeoX-Ko was trained on a large-scale curated Korean dataset of 1.5 TB created by [tunib-ai](https://tunib.ai/).

  ## Training procedure

- This model was trained for 402 billion tokens over 383,500 steps on TPU v3-256 pod. It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token correctly.
+ This model was trained for 402 billion tokens over 102,000 steps on 256 A100 GPUs. It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token correctly.
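
The training objective in the added line above is ordinary next-token prediction. A minimal sketch of that loss, assuming PyTorch and logits of shape `(batch, seq_len, vocab)`:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between each position's prediction and the token that follows it."""
    shift_logits = logits[:, :-1, :]   # predictions for positions 0..T-2
    shift_labels = input_ids[:, 1:]    # the tokens that actually come next
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```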
 
  ## Intended Use and Limitations

- GPT-J learns an inner representation of the English language that can be used to extract features useful for downstream tasks. The model is best at what it was pretrained for however, which is generating text from a prompt.
+ GPT-NeoX-Ko learns an inner representation of the Korean language that can be used to extract features useful for downstream tasks. The model is, however, best at what it was pretrained for: generating text from a prompt.

  ### How to use
 
@@ -57,17 +56,17 @@ This model can be easily loaded using the `AutoModelForCausalLM` functionality:
  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM

- tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
- model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
+ tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-ko-1.3B")
+ model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-ko-1.3B")
  ```
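
Once loaded, the checkpoint can be used for generation through the usual `transformers` API. The prompt and sampling settings below are illustrative assumptions only:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-ko-1.3B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-ko-1.3B")

# Example prompt ("Artificial intelligence is ..."); sampling settings are arbitrary.
inputs = tokenizer("인공지능은", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```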
 
  ### Limitations and Biases

- The core functionality of GPT-J is taking a string of text and predicting the next token. While language models are widely used for tasks other than this, there are a lot of unknowns with this work. When prompting GPT-J it is important to remember that the statistically most likely next token is often not the token that produces the most "accurate" text. Never depend upon GPT-J to produce factually accurate output.
+ The core functionality of GPT-NeoX-Ko is taking a string of text and predicting the next token. While language models are widely used for tasks other than this, there are a lot of unknowns with this work. When prompting GPT-NeoX-Ko, it is important to remember that the statistically most likely next token is often not the token that produces the most "accurate" text. Never depend upon GPT-NeoX-Ko to produce factually accurate output.

- GPT-J was trained on the Pile, a dataset known to contain profanity, lewd, and otherwise abrasive language. Depending upon use case GPT-J may produce socially unacceptable text. See [Sections 5 and 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a more detailed analysis of the biases in the Pile.
+ GPT-NeoX-Ko was trained on a large-scale Korean dataset known to contain profanity, lewd, and otherwise abrasive language. Depending upon the use case, GPT-NeoX-Ko may produce socially unacceptable text.

- As with all language models, it is hard to predict in advance how GPT-J will respond to particular prompts and offensive content may occur without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.
+ As with all language models, it is hard to predict in advance how GPT-NeoX-Ko will respond to particular prompts, and offensive content may occur without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.

  ## Evaluation results
 
@@ -75,23 +74,11 @@ As with all language models, it is hard to predict in advance how GPT-J will res

  | Model | Public | Training FLOPs | LAMBADA PPL ↓ | LAMBADA Acc ↑ | Winogrande ↑ | Hellaswag ↑ | PIQA ↑ | Dataset Size (GB) |
  |--------------------------|-------------|----------------|--- |--- |--- |--- |--- |-------------------|
- | Random Chance | &check; | 0 | ~a lot | ~0% | 50% | 25% | 25% | 0 |
- | GPT-3 Ada&ddagger; | &cross; | ----- | 9.95 | 51.6% | 52.9% | 43.4% | 70.5% | ----- |
- | GPT-2 1.5B | &check; | ----- | 10.63 | 51.21% | 59.4% | 50.9% | 70.8% | 40 |
- | GPT-Neo 1.3B&ddagger; | &check; | 3.0e21 | 7.50 | 57.2% | 55.0% | 48.9% | 71.1% | 825 |
- | Megatron-2.5B&ast; | &cross; | 2.4e21 | ----- | 61.7% | ----- | ----- | ----- | 174 |
- | GPT-Neo 2.7B&ddagger; | &check; | 6.8e21 | 5.63 | 62.2% | 56.5% | 55.8% | 73.0% | 825 |
- | GPT-3 1.3B&ast;&ddagger; | &cross; | 2.4e21 | 5.44 | 63.6% | 58.7% | 54.7% | 75.1% | ~800 |
- | GPT-3 Babbage&ddagger; | &cross; | ----- | 5.58 | 62.4% | 59.0% | 54.5% | 75.5% | ----- |
- | Megatron-8.3B&ast; | &cross; | 7.8e21 | ----- | 66.5% | ----- | ----- | ----- | 174 |
- | GPT-3 2.7B&ast;&ddagger; | &cross; | 4.8e21 | 4.60 | 67.1% | 62.3% | 62.8% | 75.6% | ~800 |
- | Megatron-11B&dagger; | &check; | 1.0e22 | ----- | ----- | ----- | ----- | ----- | 161 |
- | **GPT-J 6B&ddagger;** | **&check;** | **1.5e22** | **3.99** | **69.7%** | **65.3%** | **66.1%** | **76.5%** | **825** |
- | GPT-3 6.7B&ast;&ddagger; | &cross; | 1.2e22 | 4.00 | 70.3% | 64.5% | 67.4% | 78.0% | ~800 |
- | GPT-3 Curie&ddagger; | &cross; | ----- | 4.00 | 69.3% | 65.6% | 68.5% | 77.9% | ----- |
- | GPT-3 13B&ast;&ddagger; | &cross; | 2.3e22 | 3.56 | 72.5% | 67.9% | 70.9% | 78.5% | ~800 |
- | GPT-3 175B&ast;&ddagger; | &cross; | 3.1e23 | 3.00 | 76.2% | 70.2% | 78.9% | 81.0% | ~800 |
- | GPT-3 Davinci&ddagger; | &cross; | ----- | 3.0 | 75% | 72% | 78% | 80% | ----- |
+ | KoGPT-trinity&ddagger; | &cross; | ----- | 3.0 | 75% | 72% | 78% | 80% | ----- |
+ | KoGPT-KakaoBrain&ddagger; | &cross; | ----- | 3.0 | 75% | 72% | 78% | 80% | ----- |
+ | GPT-NeoX-Ko-1.3B (ours)&ddagger; | &cross; | ----- | 3.0 | 75% | 72% | 78% | 80% | ----- |
  <figcaption><p>Models roughly sorted by performance, or by FLOPs if not available.</p>

  <p><strong>&ast;</strong> Evaluation numbers reported by their respective authors. All other numbers are provided by
 