Muennighoff committed on
Commit 7b5134c
1 Parent(s): ef09465
Files changed (1)
  1. README.md +8 -15
README.md CHANGED
@@ -122,13 +122,15 @@ Please see [the BLOOM training README](https://github.com/bigscience-workshop/bi

  * ALiBI positional encodings (see [paper](https://arxiv.org/pdf/2108.12409.pdf)), with GeLU activation functions

- * 350 million parameters:
+ * 559,214,592 parameters:
+
+ * 256,901,120 embedding parameters

  * 24 layers, 16 attention heads

  * Hidden layers are 1024-dimensional

- * Sequence length of 2048 tokens used (see [BLOOM tokenizer](https://huggingface.co/bigscience/tokenizer), [tokenizer description](#tokenization))
+ * Sequence length of 2048 tokens (see [BLOOM tokenizer](https://huggingface.co/bigscience/tokenizer), [tokenizer description](#tokenization))

  **Objective Function:** Cross Entropy with mean reduction (see [API documentation](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss)).

@@ -165,18 +167,9 @@ Please see [the BLOOM training README](https://github.com/bigscience-workshop/bi

  #### **Training**

-
- _In progress._
-
- Current training logs: [Tensorboard link](https://huggingface.co/tensorboard/bigscience/tr11-176B-ml-logs/)
-
- - Checkpoint size:
-
- - Bf16 weights: 329GB
-
- - Full checkpoint with optimizer states: 2.3TB
+ Training logs: [Tensorboard link](https://huggingface.co/bigscience/tr11e-350M-logs)

- - Training throughput: About 150 TFLOP per GPU per second
+ - Training throughput: About 150 TFLOPs per GPU

  - Number of epochs: 1 (*current target*)

@@ -184,9 +177,9 @@ Current training logs: [Tensorboard link](https://huggingface.co/tensorboard/big

  - Started 11th March, 2022 11:42am PST

- - Estimated end: 5th July, 2022
+ - Ended 5th July, 2022

- - Estimated cost of training: Equivalent of $2-5M in cloud computing (including preliminary experiments)
+ - Estimated cost of training: Equivalent of $2-5M in cloud computing (including preliminary experiments and other model sizes)

  - Server training location: Île-de-France, France

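The parameter figures added in this change can be cross-checked against the listed hyperparameters. Below is a minimal sanity-check sketch, not part of the model card, assuming the BLOOM-style block layout: word embeddings tied with the LM head over the 250,880-token vocabulary implied by 256,901,120 / 1024, an embedding LayerNorm and a final LayerNorm, attention projections with biases, and a 4x GeLU MLP.

```python
# Hypothetical sanity check: reconstruct the parameter counts quoted in the
# diff from the architecture hyperparameters. The block layout below is an
# assumption based on the BLOOM architecture, not taken from the model card.
n_layers, d_model = 24, 1024
vocab_size = 250_880                          # implied by 256,901,120 embedding params / 1024

embedding = vocab_size * d_model              # word embeddings, tied with the LM head
embedding_ln = 2 * d_model                    # embedding LayerNorm (weight + bias)

per_layer = (
    4 * (d_model * d_model + d_model)         # Q, K, V and output projections, with biases
    + (d_model * 4 * d_model + 4 * d_model)   # MLP up-projection to 4*d_model, with bias
    + (4 * d_model * d_model + d_model)       # MLP down-projection back to d_model, with bias
    + 2 * 2 * d_model                         # two LayerNorms per block
)
final_ln = 2 * d_model

print(embedding)                                                    # 256901120
print(embedding + embedding_ln + n_layers * per_layer + final_ln)   # 559214592
```

Under these assumptions the totals match the 256,901,120 embedding and 559,214,592 overall parameters exactly.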
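For the ALiBi positional encoding referenced in the architecture section, attention logits receive a static, head-specific linear penalty on key distance instead of learned position embeddings. A minimal sketch of that bias for 16 heads and a 2048-token window, following the linked paper (illustrative only, not the training code):

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Head-specific slopes: a geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8),
    # as defined in the ALiBi paper for n_heads that is a power of two.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    positions = torch.arange(seq_len)
    # distance[i, j] = j - i, so keys further in the past get a larger negative bias;
    # future positions (j > i) are assumed to be removed by the causal mask.
    distance = positions[None, :] - positions[:, None]
    return slopes[:, None, None] * distance[None, :, :]   # (n_heads, seq_len, seq_len)

bias = alibi_bias(n_heads=16, seq_len=2048)  # added to attention logits before softmax
print(bias.shape)                            # torch.Size([16, 2048, 2048])
```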
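The objective function entry links to torch.nn.CrossEntropyLoss; the snippet below is a small illustration of cross entropy with mean reduction over next-token logits, using toy shapes rather than the model's real vocabulary or batch size:

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration only; the real model uses the ~250k-token
# BLOOM vocabulary and 2048-token sequences.
batch, seq_len, vocab = 2, 8, 1000
logits = torch.randn(batch, seq_len, vocab)          # model outputs, one row per position
labels = torch.randint(0, vocab, (batch, seq_len))   # next-token targets

# Cross entropy with mean reduction: per-token losses are averaged into one scalar.
loss_fn = nn.CrossEntropyLoss(reduction="mean")
loss = loss_fn(logits.view(-1, vocab), labels.view(-1))
print(loss.item())
```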