jason9693 committed on
Commit 03fc19a • 1 Parent(s): e60ec83

modified description

Files changed (1)
  1. README.md +16 -29
README.md CHANGED
@@ -33,22 +33,21 @@ We firstly targeted Korean language because most of our contributors were Korean
  | RoPE Dimensions | [64](https://github.com/kingoflolz/mesh-transformer-jax/blob/f2aa66e0925de6593dcbb70e72399b97b4130482/mesh_transformer/layers.py#L223) |
  <figcaption><p><strong>&ast;</strong> Each layer consists of one feedforward block and one self attention block.</p>

- The model consists of 32 layers with a model dimension of 4096, and a feedforward dimension of 16384. The model
- dimension is split into 16 heads, each with a dimension of 256. Rotary Position Embedding (RoPE) is applied to 64
- dimensions of each head. The model is trained with a tokenization vocabulary of 50257, using the same set of BPEs as
- GPT-2/GPT-3.
+ The model consists of 24 layers with a model dimension of 2048, and a feedforward dimension of 8192. The model
+ dimension is split into 16 heads, each with a dimension of 128. Rotary Position Embedding (RoPE) is applied to 64
+ dimensions of each head. The model is trained with a tokenization vocabulary of 30080.
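
For illustration, the hyperparameters in the added lines above map onto a Hugging Face model configuration roughly as follows. This is a minimal sketch assuming the `GPTNeoXConfig` field names from recent `transformers` releases; the 2048-token context length is an assumption, not stated in the card.

```python
from transformers import GPTNeoXConfig

# Rough sketch of the architecture described above; values not stated in
# the card (e.g. max_position_embeddings) are placeholders.
config = GPTNeoXConfig(
    vocab_size=30080,              # tokenization vocabulary
    hidden_size=2048,              # model dimension
    num_hidden_layers=24,          # transformer layers
    num_attention_heads=16,        # 16 heads x 128 dims = 2048
    intermediate_size=8192,        # feedforward dimension
    rotary_pct=0.5,                # RoPE applied to 64 of the 128 head dims
    max_position_embeddings=2048,  # assumption: not stated in the card
)
```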
 
  ## Training data

- GPT-J 6B was trained on [the Pile](https://pile.eleuther.ai), a large-scale curated dataset created by [EleutherAI](https://www.eleuther.ai).
+ GPT-NeoX-Ko was trained on a large-scale curated Korean dataset of 1.5 TB created by [tunib-ai](https://tunib.ai/).

  ## Training procedure

- This model was trained for 402 billion tokens over 383,500 steps on TPU v3-256 pod. It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token correctly.
+ This model was trained for 402 billion tokens over 102,000 steps on 256 A100 GPUs. It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token correctly.
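
The training objective in the added line above is ordinary next-token prediction. A minimal sketch of that loss, assuming PyTorch and logits of shape `(batch, seq_len, vocab)`:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between each position's prediction and the token that follows it."""
    shift_logits = logits[:, :-1, :]   # predictions for positions 0..T-2
    shift_labels = input_ids[:, 1:]    # the tokens that actually come next
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```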
 
  ## Intended Use and Limitations

- GPT-J learns an inner representation of the English language that can be used to extract features useful for downstream tasks. The model is best at what it was pretrained for however, which is generating text from a prompt.
+ GPT-NeoX-Ko learns an inner representation of the Korean language that can be used to extract features useful for downstream tasks. The model is, however, best at what it was pretrained for: generating text from a prompt.

  ### How to use
 
@@ -57,17 +56,17 @@ This model can be easily loaded using the `AutoModelForCausalLM` functionality:
  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM

- tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
- model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
+ tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-ko-1.3B")
+ model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-ko-1.3B")
  ```
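
Once loaded, the checkpoint can be used for generation through the usual `transformers` API. The prompt and sampling settings below are illustrative assumptions only:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-ko-1.3B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-ko-1.3B")

# Example prompt ("Artificial intelligence is ..."); sampling settings are arbitrary.
inputs = tokenizer("인공지능은", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```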
 
  ### Limitations and Biases

- The core functionality of GPT-J is taking a string of text and predicting the next token. While language models are widely used for tasks other than this, there are a lot of unknowns with this work. When prompting GPT-J it is important to remember that the statistically most likely next token is often not the token that produces the most "accurate" text. Never depend upon GPT-J to produce factually accurate output.
+ The core functionality of GPT-NeoX-Ko is taking a string of text and predicting the next token. While language models are widely used for tasks other than this, there are a lot of unknowns with this work. When prompting GPT-NeoX-Ko, it is important to remember that the statistically most likely next token is often not the token that produces the most "accurate" text. Never depend upon GPT-NeoX-Ko to produce factually accurate output.

- GPT-J was trained on the Pile, a dataset known to contain profanity, lewd, and otherwise abrasive language. Depending upon use case GPT-J may produce socially unacceptable text. See [Sections 5 and 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a more detailed analysis of the biases in the Pile.
+ GPT-NeoX-Ko was trained on a large-scale Korean dataset known to contain profanity, lewd, and otherwise abrasive language. Depending upon the use case, GPT-NeoX-Ko may produce socially unacceptable text.

- As with all language models, it is hard to predict in advance how GPT-J will respond to particular prompts and offensive content may occur without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.
+ As with all language models, it is hard to predict in advance how GPT-NeoX-Ko will respond to particular prompts, and offensive content may occur without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.

  ## Evaluation results
 
@@ -75,23 +74,11 @@ As with all language models, it is hard to predict in advance how GPT-J will res

  | Model | Public | Training FLOPs | LAMBADA PPL ↓ | LAMBADA Acc ↑ | Winogrande ↑ | Hellaswag ↑ | PIQA ↑ | Dataset Size (GB) |
  |--------------------------|-------------|----------------|--- |--- |--- |--- |--- |-------------------|
- | Random Chance | &check; | 0 | ~a lot | ~0% | 50% | 25% | 25% | 0 |
- | GPT-3 Ada&ddagger; | &cross; | ----- | 9.95 | 51.6% | 52.9% | 43.4% | 70.5% | ----- |
- | GPT-2 1.5B | &check; | ----- | 10.63 | 51.21% | 59.4% | 50.9% | 70.8% | 40 |
- | GPT-Neo 1.3B&ddagger; | &check; | 3.0e21 | 7.50 | 57.2% | 55.0% | 48.9% | 71.1% | 825 |
- | Megatron-2.5B&ast; | &cross; | 2.4e21 | ----- | 61.7% | ----- | ----- | ----- | 174 |
- | GPT-Neo 2.7B&ddagger; | &check; | 6.8e21 | 5.63 | 62.2% | 56.5% | 55.8% | 73.0% | 825 |
- | GPT-3 1.3B&ast;&ddagger; | &cross; | 2.4e21 | 5.44 | 63.6% | 58.7% | 54.7% | 75.1% | ~800 |
- | GPT-3 Babbage&ddagger; | &cross; | ----- | 5.58 | 62.4% | 59.0% | 54.5% | 75.5% | ----- |
- | Megatron-8.3B&ast; | &cross; | 7.8e21 | ----- | 66.5% | ----- | ----- | ----- | 174 |
- | GPT-3 2.7B&ast;&ddagger; | &cross; | 4.8e21 | 4.60 | 67.1% | 62.3% | 62.8% | 75.6% | ~800 |
- | Megatron-11B&dagger; | &check; | 1.0e22 | ----- | ----- | ----- | ----- | ----- | 161 |
- | **GPT-J 6B&ddagger;** | **&check;** | **1.5e22** | **3.99** | **69.7%** | **65.3%** | **66.1%** | **76.5%** | **825** |
- | GPT-3 6.7B&ast;&ddagger; | &cross; | 1.2e22 | 4.00 | 70.3% | 64.5% | 67.4% | 78.0% | ~800 |
- | GPT-3 Curie&ddagger; | &cross; | ----- | 4.00 | 69.3% | 65.6% | 68.5% | 77.9% | ----- |
- | GPT-3 13B&ast;&ddagger; | &cross; | 2.3e22 | 3.56 | 72.5% | 67.9% | 70.9% | 78.5% | ~800 |
- | GPT-3 175B&ast;&ddagger; | &cross; | 3.1e23 | 3.00 | 76.2% | 70.2% | 78.9% | 81.0% | ~800 |
- | GPT-3 Davinci&ddagger; | &cross; | ----- | 3.0 | 75% | 72% | 78% | 80% | ----- |
+ | KoGPT-trinity&ddagger; | &cross; | ----- | 3.0 | 75% | 72% | 78% | 80% | ----- |
+ | KoGPT-KakaoBrain&ddagger; | &cross; | ----- | 3.0 | 75% | 72% | 78% | 80% | ----- |
+ | GPT-NeoX-Ko-1.3B (ours)&ddagger; | &cross; | ----- | 3.0 | 75% | 72% | 78% | 80% | ----- |
  <figcaption><p>Models roughly sorted by performance, or by FLOPs if not available.</p>

  <p><strong>&ast;</strong> Evaluation numbers reported by their respective authors. All other numbers are provided by
 