nicholasKluge committed on
Commit 4010b31
1 Parent(s): 3ee9a3e

Update README.md

Files changed (1):
  1. README.md +200 -1

README.md CHANGED

---
license: apache-2.0
datasets:
- nicholasKluge/portuguese-corpus-v3
language:
- pt
metrics:
- perplexity
library_name: transformers
pipeline_tag: text-generation
tags:
- text-generation-inference
widget:
- text: "A PUCRS é uma universidade "
  example_title: Exemplo
- text: "A muitos anos atrás, em uma galáxia muito distante, vivia uma raça de"
  example_title: Exemplo
- text: "Em meio a um escândalo, a frente parlamentar pediu ao Senador Silva para"
  example_title: Exemplo
inference:
  parameters:
    repetition_penalty: 1.2
    temperature: 0.2
    top_k: 20
    top_p: 0.2
    max_new_tokens: 150
co2_eq_emissions:
  emissions: 41.1
  source: CodeCarbon
  training_type: pre-training
  geographical_location: Germany
  hardware_used: NVIDIA A100-SXM4-40GB
---

# TeenyTinyLlama-460m

<img src="./logo.png" alt="A little llama wearing a mushroom hat and a monocle." height="200">

## Model Summary

Given the lack of monolingual foundational models in non-English languages, and given that some of the most used and downloaded models in the community are those small enough for individual researchers and hobbyists to run in low-resource environments, we developed TeenyTinyLlama: _a pair of small foundational models trained in Brazilian Portuguese._

TeenyTinyLlama is a pair of compact language models based on the Llama 2 architecture. These models are designed to deliver efficient natural language processing capabilities while being resource-conscious.

The TeenyTinyLlama models were also trained by leveraging [scaling laws](https://arxiv.org/abs/2203.15556) to determine a compute-optimal number of tokens per parameter, while incorporating [preference pre-training](https://arxiv.org/abs/2112.00861).
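
As a rough sanity check on the scaling-law claim (assuming the commonly cited ratio of roughly 20 training tokens per parameter from the scaling-laws paper linked above): 468 million parameters × 20 ≈ 9.4B tokens, which is close to the ~9.3B tokens actually seen during training (a 6.2B-token dataset over 1.5 epochs, per the details and training set-up below).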

## Details

- **Architecture:** a Transformer-based model pre-trained via causal language modeling
- **Size:** 468,239,360 parameters
- **Context length:** 2048 tokens
- **Dataset:** [Portuguese-Corpus-v3](https://huggingface.co/datasets/nicholasKluge/portuguese-corpus-v3) (6.2B tokens)
- **Language:** Portuguese
- **Number of steps:** 1,200,000
- **GPU:** 1 NVIDIA A100-SXM4-40GB
- **Training time:** ~280 hours
- **Emissions:** 41.1 KgCO2 (Germany)
- **Total energy consumption:** 115.69 kWh
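
The context length and other architecture details listed above can be read directly from the released configuration, without downloading the weights; a minimal check:

```python
from transformers import AutoConfig

# Fetch only the configuration file of the released checkpoint
config = AutoConfig.from_pretrained("nicholasKluge/TeenyTinyLlama-460m")

print(config.model_type)               # expected: "llama"
print(config.max_position_embeddings)  # context length, expected: 2048
print(config.vocab_size)               # expected: 32000
```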

This repository has the [source code](https://github.com/Nkluge-correa/Aira) used to train this model. The main libraries used are:

- [Transformers](https://github.com/huggingface/transformers)
- [PyTorch](https://github.com/pytorch/pytorch)
- [Datasets](https://github.com/huggingface/datasets)
- [Tokenizers](https://github.com/huggingface/tokenizers)
- [Sentencepiece](https://github.com/google/sentencepiece)
- [Accelerate](https://github.com/huggingface/accelerate)
- [Codecarbon](https://github.com/mlco2/codecarbon)
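
[Codecarbon](https://github.com/mlco2/codecarbon), listed above, is the source of the energy and emissions figures reported in this card. A minimal sketch of how such tracking is usually wrapped around a training loop (the project name, output directory, and `train()` placeholder are illustrative, not taken from the actual training script):

```python
from codecarbon import EmissionsTracker

def train():
    # Placeholder for the actual training loop
    pass

tracker = EmissionsTracker(project_name="TeenyTinyLlama-460m", output_dir="logs")
tracker.start()
try:
    train()
finally:
    emissions_kg = tracker.stop()  # estimated emissions in kg CO2eq
    print(f"Estimated emissions: {emissions_kg} kg CO2eq")
```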

## Training Set-up

These are the main arguments used in the training of this model:

| Arguments                   | Value             |
|-----------------------------|-------------------|
| vocabulary size             | 32000             |
| hidden dimension size       | 1024              |
| intermediate dimension size | 4096              |
| context length              | 2048              |
| nº attention heads          | 16                |
| nº hidden layers            | 24                |
| nº key value heads          | 16                |
| nº training samples         | 3,033,690         |
| nº validation samples       | 30,000            |
| nº epochs                   | 1.5               |
| evaluation steps            | 100,000           |
| train batch size            | 2                 |
| eval batch size             | 4                 |
| gradient accumulation steps | 2                 |
| optimizer                   | torch.optim.AdamW |
| learning rate               | 0.0003            |
| adam epsilon                | 1e-8              |
| weight decay                | 0.01              |
| scheduler type              | "cosine"          |
| warmup steps                | 10,000            |
| gradient checkpointing      | false             |
| seed                        | 42                |
| mixed precision             | "no"              |
| torch dtype                 | "float32"         |
| tf32                        | true              |
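
For illustration, the architecture hyperparameters in the table above can be expressed as a standard `LlamaConfig`; instantiating it reproduces the parameter count reported in the details section (this is a sketch, not the exact configuration object used by the training script):

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Architecture hyperparameters taken from the table above
config = LlamaConfig(
    vocab_size=32000,
    hidden_size=1024,
    intermediate_size=4096,
    max_position_embeddings=2048,
    num_attention_heads=16,
    num_hidden_layers=24,
    num_key_value_heads=16,
)

model = LlamaForCausalLM(config)
print(f"{model.num_parameters():,}")  # 468,239,360
```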

## Intended Uses

The primary intended use of TeenyTinyLlama is to research the behavior, functionality, and limitations of large language models. Checkpoints saved during training are intended to provide a controlled setting for performing scientific experiments. You may also further fine-tune and adapt TeenyTinyLlama-460m for deployment, as long as your use is in accordance with the Apache 2.0 license. If you decide to use the pre-trained TeenyTinyLlama-460m as a basis for your fine-tuned model, please conduct your own risk and bias assessment.

## Basic usage

Using the `pipeline`:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="nicholasKluge/TeenyTinyLlama-460m")

# Sampling is enabled so that more than one sequence can be returned
completions = generator("Astronomia é a ciência", do_sample=True, num_return_sequences=2, max_new_tokens=100)

for comp in completions:
    print(f"🤖 {comp['generated_text']}")
```

Using the `AutoTokenizer` and `AutoModelForCausalLM`:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the model and the tokenizer
tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/TeenyTinyLlama-460m", revision='main')
model = AutoModelForCausalLM.from_pretrained("nicholasKluge/TeenyTinyLlama-460m", revision='main')

# Pass the model to your device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.eval()
model.to(device)

# Tokenize the inputs and pass them to the device
inputs = tokenizer("Astronomia é a ciência", return_tensors="pt").to(device)

# Generate some text (sampling is required when requesting more than one sequence)
completions = model.generate(**inputs, do_sample=True, num_return_sequences=2, max_new_tokens=100)

# Print the generated text
for i, completion in enumerate(completions):
    print(f'🤖 {tokenizer.decode(completion)}')
```
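
The widget at the top of this card uses the generation settings declared in the card metadata (repetition penalty 1.2, temperature 0.2, top-k 20, top-p 0.2, 150 new tokens). A short sketch of passing the same settings explicitly:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="nicholasKluge/TeenyTinyLlama-460m")

# Generation settings taken from the card metadata
output = generator(
    "Astronomia é a ciência",
    do_sample=True,
    repetition_penalty=1.2,
    temperature=0.2,
    top_k=20,
    top_p=0.2,
    max_new_tokens=150,
)

print(output[0]["generated_text"])
```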

## Limitations

- **Hallucinations:** This model can produce content that can be mistaken for truth but is, in fact, misleading or entirely false, i.e., hallucination.

- **Biases and Toxicity:** This model inherits the social and historical stereotypes from the data used to train it. Given these biases, the model can produce toxic content, i.e., content that is harmful, offensive, or detrimental to individuals, groups, or communities.

- **Unreliable Code:** The model may produce incorrect code snippets and statements. These code generations should not be treated as suggestions or accurate solutions.

- **Language Limitations:** The model is primarily designed to understand standard Brazilian Portuguese. Other languages might challenge its comprehension, leading to potential misinterpretations or errors in response.

- **Repetition and Verbosity:** The model may get stuck on repetition loops (especially if the repetition penalty during generation is set to a low value) or produce verbose responses unrelated to the prompt it was given.

## Evaluations

| Steps | Evaluation Loss | Perplexity | Total Energy Consumption | Emissions |
|-----------|-----------------|------------|--------------------------|---------------|

- Note: Each evaluation consumed around 0.26 kWh of energy (~0.09 KgCO2eq), totaling 3.12 kWh (~1.11 KgCO2eq).
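
The perplexity column is just the exponential of the evaluation (cross-entropy) loss; a one-line check, with a placeholder loss value since the per-checkpoint rows are not reproduced here:

```python
import math

eval_loss = 2.5  # placeholder: substitute a loss value from the table above
print(round(math.exp(eval_loss), 2))  # perplexity = exp(loss)
```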

## Benchmarks

| Models | Average | [ARC](https://arxiv.org/abs/1803.05457) | [Hellaswag](https://arxiv.org/abs/1905.07830) | [MMLU](https://arxiv.org/abs/2009.03300) | [TruthfulQA](https://arxiv.org/abs/2109.07958) |
|--------------------------------------------------------------------------------------|---------|-------|-------|-------|-------|
| [TeenyTinyLlama-460m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-460m) | 33.01 | 29.40 | 33.00 | 28.55 | 41.10 |
| [TeenyTinyLlama-160m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-160m) | 31.16 | 26.15 | 29.29 | 28.11 | 41.12 |
| [Pythia-160m](https://huggingface.co/EleutherAI/pythia-160m-deduped)* | 31.16 | 24.06 | 31.39 | 24.86 | 44.34 |
| [OPT-125m](https://huggingface.co/facebook/opt-125m)* | 30.80 | 22.87 | 31.47 | 26.02 | 42.87 |
| [Gpt2-portuguese-small](https://huggingface.co/pierreguillou/gpt2-small-portuguese) | 30.22 | 22.48 | 29.62 | 27.36 | 41.44 |
| [Gpt2-small](https://huggingface.co/gpt2)* | 29.97 | 21.48 | 31.60 | 25.79 | 40.65 |
| [Xglm-564M](https://huggingface.co/facebook/xglm-564M)* | 31.20 | 24.57 | 34.64 | 25.18 | 40.43 |
| [Bloom-560m](https://huggingface.co/bigscience/bloom-560m)* | 32.13 | 24.74 | 37.15 | 24.22 | 42.44 |
| [Multilingual GPT](https://huggingface.co/ai-forever/mGPT)* | 28.73 | 23.81 | 26.37 | 25.17 | 39.62 |

- Evaluations on benchmarks were performed using the [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) (by [EleutherAI](https://www.eleuther.ai/)). Thanks to [Laiviet](https://github.com/laiviet/lm-evaluation-harness) for translating some of the tasks in the LM-Evaluation-Harness. The results of models marked with an "*" were retrieved from the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
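
For reference, a hedged sketch of how such scores can be produced with the harness's Python API (lm-evaluation-harness v0.4+). The task names below are the standard English ones; the card's numbers rely on the translated tasks from Laiviet's fork, so treat this only as an illustration:

```python
import lm_eval

# Task names, few-shot setting, and batch size are illustrative,
# not the exact setup behind the table above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=nicholasKluge/TeenyTinyLlama-460m",
    tasks=["arc_challenge", "hellaswag"],
    num_fewshot=0,
    batch_size=8,
)

print(results["results"])
```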

## Fine-Tuning Comparisons

| Models | [IMDB](https://huggingface.co/datasets/christykoh/imdb_pt) | [FaQuAD-NLI](https://huggingface.co/datasets/ruanchaves/faquad-nli) | [HateBr](https://huggingface.co/datasets/ruanchaves/hatebr) | [Assin2](https://huggingface.co/datasets/assin2) | [AgNews](https://huggingface.co/datasets/maritaca-ai/ag_news_pt) |
|----------------------------------------------------------------------------------------------|-------|-------|-------|-------|-------|
| [Bert-large-portuguese-cased](https://huggingface.co/neuralmind/bert-large-portuguese-cased) | 93.58 | 92.26 | 91.57 | 88.97 | 94.11 |
| [Bert-base-portuguese-cased](https://huggingface.co/neuralmind/bert-base-portuguese-cased) | 92.22 | 93.07 | 91.28 | 87.45 | 94.19 |
| [TeenyTinyLlama-160m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-160m) | 91.14 | 90.00 | 90.71 | 85.78 | 94.05 |
| [Gpt2-small-portuguese](https://huggingface.co/pierreguillou/gpt2-small-portuguese) | 91.60 | 86.46 | 87.42 | 86.11 | 94.07 |
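
The encoder baselines above are fine-tuned as sequence classifiers; TeenyTinyLlama can be adapted the same way via `AutoModelForSequenceClassification`. A minimal sketch on the IMDB (pt) dataset from the table (split names, column names, and hyperparameters are assumptions, not the exact fine-tuning recipe):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Portuguese IMDB, as linked in the table above ("text"/"label" columns assumed)
dataset = load_dataset("christykoh/imdb_pt")

tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/TeenyTinyLlama-460m")
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

model = AutoModelForSequenceClassification.from_pretrained(
    "nicholasKluge/TeenyTinyLlama-460m", num_labels=2
)
model.config.pad_token_id = tokenizer.pad_token_id

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

# Illustrative hyperparameters only
args = TrainingArguments(output_dir="ttl-460m-imdb", per_device_train_batch_size=8,
                         num_train_epochs=1, learning_rate=4e-5)

trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=tokenized["train"], eval_dataset=tokenized["test"])
trainer.train()
print(trainer.evaluate())
```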

## Cite as 🤗

```latex
@misc{nicholas22llama,
  doi = {10.5281/zenodo.6989727},
  url = {https://huggingface.co/nicholasKluge/TeenyTinyLlama-460m},
  author = {Nicholas Kluge Corrêa},
  title = {TeenyTinyLlama},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
}
```

## Funding

This repository was built as part of the RAIES ([Rede de Inteligência Artificial Ética e Segura](https://www.raies.org/)) initiative, a project supported by FAPERGS ([Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul](https://fapergs.rs.gov.br/inicial)), Brazil.

## License

TeenyTinyLlama-460m is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for more details.