# Historic Language Models (HLMs)

## Languages

Our Historic Language Models Zoo covers the following languages, listed together with their training data sources:

| Language | Training data | Size
| -------- | ------------- | ----
| German | [Europeana](http://www.europeana-newspapers.eu/) | 13-28GB (filtered)
| French | [Europeana](http://www.europeana-newspapers.eu/) | 11-31GB (filtered)
| English | [British Library](https://data.bl.uk/digbks/db14.html) | 24GB (year filtered)
| Finnish | [Europeana](http://www.europeana-newspapers.eu/) | 1.2GB
| Swedish | [Europeana](http://www.europeana-newspapers.eu/) | 1.1GB

## Models

At the moment, the following models are available on the model hub:

| Model identifier | Model Hub link
| --------------------------------------------- | --------------------------------------------------------------------------
| `dbmdz/bert-base-historic-multilingual-cased` | [here](https://huggingface.co/dbmdz/bert-base-historic-multilingual-cased)
| `dbmdz/bert-base-historic-english-cased` | [here](https://huggingface.co/dbmdz/bert-base-historic-english-cased)
| `dbmdz/bert-base-finnish-europeana-cased` | [here](https://huggingface.co/dbmdz/bert-base-finnish-europeana-cased)
| `dbmdz/bert-base-swedish-europeana-cased` | [here](https://huggingface.co/dbmdz/bert-base-swedish-europeana-cased)

We also released smaller variants of the multilingual model:

| Model identifier | Model Hub link
| ----------------------------------------------- | ---------------------------------------------------------------------------
| `dbmdz/bert-tiny-historic-multilingual-cased` | [here](https://huggingface.co/dbmdz/bert-tiny-historic-multilingual-cased)
| `dbmdz/bert-mini-historic-multilingual-cased` | [here](https://huggingface.co/dbmdz/bert-mini-historic-multilingual-cased)
| `dbmdz/bert-small-historic-multilingual-cased` | [here](https://huggingface.co/dbmdz/bert-small-historic-multilingual-cased)
| `dbmdz/bert-medium-historic-multilingual-cased` | [here](https://huggingface.co/dbmdz/bert-medium-historic-multilingual-cased)

**Notice**: We have previously released language models for Historic German and French that were trained on noisier data - see
[this repo](https://github.com/stefan-it/europeana-bert) for more information:

| Model identifier | Model Hub link
| --------------------------------------------- | --------------------------------------------------------------------------
| `dbmdz/bert-base-german-europeana-cased` | [here](https://huggingface.co/dbmdz/bert-base-german-europeana-cased)
| `dbmdz/bert-base-french-europeana-cased` | [here](https://huggingface.co/dbmdz/bert-base-french-europeana-cased)

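All models can be used with the [Transformers](https://github.com/huggingface/transformers) library. As a quick usage sketch (not part of the original training setup; the example sentence is purely illustrative), the multilingual base model can be loaded and queried via the fill-mask pipeline:

```python
from transformers import AutoModel, AutoTokenizer, pipeline

model_name = "dbmdz/bert-base-historic-multilingual-cased"  # any identifier from the tables above works

# Plain model + tokenizer loading, e.g. for feature extraction or fine-tuning
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Quick masked-LM sanity check (the input sentence is an illustrative example only)
fill_mask = pipeline("fill-mask", model=model_name)
for prediction in fill_mask("Der [MASK] war ein wichtiges Ereignis."):
    print(prediction["token_str"], prediction["score"])
```
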
# Corpora Stats

## German Europeana Corpus

We provide statistics for different OCR confidence thresholds, which are used to shrink the corpus size
and reduce noise:

| OCR confidence | Size
| -------------- | ----
| **0.60** | 28GB
| 0.65 | 18GB
| 0.70 | 13GB

For the final corpus we use an OCR confidence threshold of 0.6 (28GB). The following plot shows the tokens-per-year distribution:

![German Europeana Corpus Stats](stats/figures/german_europeana_corpus_stats.png)
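
The extraction and filtering code is not part of this README; purely as an illustration, and assuming a hypothetical tab-separated dump with one document per line (mean OCR confidence in the first column, text in the second - the real Europeana exports may be structured differently), a confidence filter could look like this:

```python
import sys

# Illustrative sketch only: keep documents whose mean OCR confidence
# is at or above the chosen threshold. The input format is an assumption.
def filter_by_ocr_confidence(in_path: str, out_path: str, threshold: float = 0.60) -> None:
    with open(in_path, encoding="utf-8") as f_in, open(out_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            confidence, _, text = line.rstrip("\n").partition("\t")
            if text and float(confidence) >= threshold:
                f_out.write(text + "\n")

if __name__ == "__main__":
    filter_by_ocr_confidence(sys.argv[1], sys.argv[2])
```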

## French Europeana Corpus

As for German, we use different OCR confidence thresholds:

| OCR confidence | Size
| -------------- | ----
| 0.60 | 31GB
| 0.65 | 27GB
| **0.70** | 27GB
| 0.75 | 23GB
| 0.80 | 11GB

For the final corpus we use an OCR confidence threshold of 0.7 (27GB). The following plot shows the tokens-per-year distribution:

![French Europeana Corpus Stats](stats/figures/french_europeana_corpus_stats.png)

## British Library Corpus

Metadata is taken from [here](https://data.bl.uk/digbks/DB21.html). Statistics, including year filtering:

| Years | Size
| ----------------- | ----
| ALL | 24GB
| >= 1800 && < 1900 | 24GB

We use the year-filtered variant. The following plot shows the tokens-per-year distribution:

![British Library Corpus Stats](stats/figures/bl_corpus_stats.png)

## Finnish Europeana Corpus

| OCR confidence | Size
| -------------- | ----
| 0.60 | 1.2GB

The following plot shows the tokens-per-year distribution:

![Finnish Europeana Corpus Stats](stats/figures/finnish_europeana_corpus_stats.png)

## Swedish Europeana Corpus

| OCR confidence | Size
| -------------- | ----
| 0.60 | 1.1GB

The following plot shows the tokens-per-year distribution:

![Swedish Europeana Corpus Stats](stats/figures/swedish_europeana_corpus_stats.png)

## All Corpora

The following plot shows the tokens-per-year distribution of the complete training corpus:

![All Corpora Stats](stats/figures/all_corpus_stats.png)

# Multilingual Vocab generation

For the first attempt, we use the first 10GB of each pretraining corpus. We upsample both Finnish and Swedish to ~10GB.
The following table shows the exact sizes used for generating the 32k and 64k subword vocabs (a small generation sketch follows the table):

| Language | Size
| -------- | ----
| German | 10GB
| French | 10GB
| English | 10GB
| Finnish | 9.5GB
| Swedish | 9.7GB

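The exact vocab-generation script is not included here; as a minimal sketch, assuming the Hugging Face `tokenizers` library and placeholder file names for the (partially upsampled) per-language corpora, a cased WordPiece vocab of the two sizes could be generated like this:

```python
import os

from tokenizers import BertWordPieceTokenizer

# Minimal sketch, not the exact script used for the released vocabs.
# The training files are placeholders for the ~10GB per-language corpora from the table above.
training_files = [
    "german_10GB.txt",
    "french_10GB.txt",
    "english_10GB.txt",
    "finnish_upsampled.txt",
    "swedish_upsampled.txt",
]

for vocab_size in (32_000, 64_000):
    tokenizer = BertWordPieceTokenizer(lowercase=False)  # cased models
    tokenizer.train(files=training_files, vocab_size=vocab_size)

    output_dir = f"vocab-{vocab_size // 1000}k"
    os.makedirs(output_dir, exist_ok=True)
    tokenizer.save_model(output_dir)  # writes vocab.txt
```
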
We then calculate the subword fertility rate and the proportion of `[UNK]` tokens over the following NER corpora (a sketch of this computation is shown after the tables below):

| Language | NER corpora
| -------- | ------------------
| German | CLEF-HIPE, NewsEye
| French | CLEF-HIPE, NewsEye
| English | CLEF-HIPE
| Finnish | NewsEye
| Swedish | NewsEye

Breakdown of subword fertility rate and unknown portion per language for the 32k vocab:

| Language | Subword fertility | Unknown portion
| -------- | ------------------ | ---------------
| German | 1.43 | 0.0004
| French | 1.25 | 0.0001
| English | 1.25 | 0.0
| Finnish | 1.69 | 0.0007
| Swedish | 1.43 | 0.0

Breakdown of subword fertility rate and unknown portion per language for the 64k vocab:

| Language | Subword fertility | Unknown portion
| -------- | ------------------ | ---------------
| German | 1.31 | 0.0004
| French | 1.16 | 0.0001
| English | 1.17 | 0.0
| Finnish | 1.54 | 0.0007
| Swedish | 1.32 | 0.0

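Here, subword fertility is the average number of subwords the tokenizer produces per input token, and the unknown portion is (in this sketch) the fraction of subwords that map to `[UNK]`. A minimal way to compute both for a given tokenizer and token list, using `transformers` (the token list below is a placeholder, not the actual NER data), is:

```python
from transformers import AutoTokenizer

def fertility_and_unk(tokenizer, words):
    """Compute subword fertility and [UNK] portion over a list of words."""
    num_words = 0
    num_subwords = 0
    num_unks = 0
    for word in words:
        subwords = tokenizer.tokenize(word)
        num_words += 1
        num_subwords += len(subwords)
        num_unks += sum(1 for s in subwords if s == tokenizer.unk_token)
    return num_subwords / num_words, num_unks / num_subwords

# Illustrative usage: `words` would be all tokens of e.g. the German CLEF-HIPE + NewsEye data.
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-historic-multilingual-cased")
words = ["Ein", "Beiſpiel", "aus", "hiſtoriſchem", "Text"]  # placeholder token list
fertility, unk_portion = fertility_and_unk(tokenizer, words)
print(f"fertility={fertility:.2f}  unknown portion={unk_portion:.4f}")
```
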
# Final pretraining corpora

We upsample Swedish and Finnish to ~27GB. The final stats for all pretraining corpora can be seen here:

| Language | Size
| -------- | ----
| German | 28GB
| French | 27GB
| English | 24GB
| Finnish | 27GB
| Swedish | 27GB

The total size is around 130GB.

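The upsampling procedure itself is not spelled out in this README (judging by the per-language commands below, it may simply correspond to training for more epochs); as one explicit interpretation, sketched with an input name taken from the pretraining commands further down and a placeholder output name, a smaller corpus can be duplicated until it reaches roughly the target size:

```python
# Illustrative sketch only: upsample a corpus file by repeated whole-file duplication
# until it reaches roughly the target size (here ~27GB, as for Finnish and Swedish).
TARGET_BYTES = 27 * 1024**3

def upsample(in_path: str, out_path: str, target_bytes: int = TARGET_BYTES) -> None:
    written = 0
    with open(out_path, "wb") as f_out:
        while written < target_bytes:
            with open(in_path, "rb") as f_in:
                for chunk in iter(lambda: f_in.read(1 << 20), b""):
                    f_out.write(chunk)
                    written += len(chunk)

upsample("extracted_content_Swedish_0.6.txt", "swedish_upsampled.txt")
```
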
# Smaller multilingual models

Inspired by the ["Well-Read Students Learn Better: On the Importance of Pre-training Compact Models"](https://arxiv.org/abs/1908.08962)
paper, we train smaller models (with different numbers of layers and hidden sizes) and report the number of parameters and pre-training cost:

| Model (Layer / Hidden size) | Parameters | Pre-Training time
| --------------------------- | ----------: | ----------------------:
| hmBERT Tiny ( 2/128) | 4.58M | 4.3 sec / 1,000 steps
| hmBERT Mini ( 4/256) | 11.55M | 10.5 sec / 1,000 steps
| hmBERT Small ( 4/512) | 29.52M | 20.7 sec / 1,000 steps
| hmBERT Medium ( 8/512) | 42.13M | 35.0 sec / 1,000 steps
| hmBERT Base (12/768) | 110.62M | 80.0 sec / 1,000 steps

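The parameter counts above follow from the BERT architecture with the 32k vocab. As a rough cross-check (a sketch, not the script used to fill the table; it assumes 512 position embeddings, 2 token types, an intermediate size of 4x the hidden size and the pooler layer), they can be recomputed from the layer count and hidden size:

```python
# Approximate BERT parameter count from (num_layers, hidden_size), assuming the 32k vocab.
def bert_parameters(num_layers: int, hidden: int, vocab_size: int = 32_000) -> int:
    embeddings = (vocab_size + 512 + 2) * hidden + 2 * hidden           # word/pos/type embeddings + LayerNorm
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden             # Q, K, V, output projections + LayerNorm
    ffn = 2 * (hidden * 4 * hidden) + 4 * hidden + hidden + 2 * hidden  # two dense layers + biases + LayerNorm
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * (attention + ffn) + pooler

for name, layers, hidden in [("Tiny", 2, 128), ("Mini", 4, 256), ("Small", 4, 512),
                             ("Medium", 8, 512), ("Base", 12, 768)]:
    print(f"hmBERT {name}: {bert_parameters(layers, hidden) / 1e6:.2f}M parameters")
```
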
We then perform downstream evaluations on the multilingual [NewsEye](https://zenodo.org/record/4573313#.Ya3oVr-ZNzU) dataset:

![NewsEye hmBERT Evaluation](stats/figures/newseye-hmbert-evaluation.png)

# Pretraining

## Multilingual model - hmBERT Base

We train a multilingual BERT model using the 32k vocab with the official BERT implementation
on a v3-32 TPU using the following parameters:

```bash
python3 run_pretraining.py --input_file gs://histolectra/historic-multilingual-tfrecords/*.tfrecord \
--output_dir gs://histolectra/bert-base-historic-multilingual-cased \
--bert_config_file ./config.json \
--max_seq_length=512 \
--max_predictions_per_seq=75 \
--do_train=True \
--train_batch_size=128 \
--num_train_steps=3000000 \
--learning_rate=1e-4 \
--save_checkpoints_steps=100000 \
--keep_checkpoint_max=20 \
--use_tpu=True \
--tpu_name=electra-2 \
--num_tpu_cores=32
```

The following plot shows the pretraining loss curve:

![Training loss curve](stats/figures/pretraining_loss_historic-multilingual.png)

## Smaller multilingual models

We use the same parameters as for training the base model.

### hmBERT Tiny

The following plot shows the pretraining loss curve for the tiny model:

![Training loss curve](stats/figures/pretraining_loss_hmbert-tiny.png)

### hmBERT Mini

The following plot shows the pretraining loss curve for the mini model:

![Training loss curve](stats/figures/pretraining_loss_hmbert-mini.png)

### hmBERT Small

The following plot shows the pretraining loss curve for the small model:

![Training loss curve](stats/figures/pretraining_loss_hmbert-small.png)

### hmBERT Medium

The following plot shows the pretraining loss curve for the medium model:

![Training loss curve](stats/figures/pretraining_loss_hmbert-medium.png)

## English model

The English BERT model - with texts from the British Library corpus - was trained with the Hugging Face
JAX/FLAX implementation for 10 epochs (approx. 1M steps) on a v3-8 TPU, using the following command:

```bash
python3 run_mlm_flax.py --model_type bert \
--config_name /mnt/datasets/bert-base-historic-english-cased/ \
--tokenizer_name /mnt/datasets/bert-base-historic-english-cased/ \
--train_file /mnt/datasets/bl-corpus/bl_1800-1900_extracted.txt \
--validation_file /mnt/datasets/bl-corpus/english_validation.txt \
--max_seq_length 512 \
--per_device_train_batch_size 16 \
--learning_rate 1e-4 \
--num_train_epochs 10 \
--preprocessing_num_workers 96 \
--output_dir /mnt/datasets/bert-base-historic-english-cased-512-noadafactor-10e \
--save_steps 2500 \
--eval_steps 2500 \
--warmup_steps 10000 \
--line_by_line \
--pad_to_max_length
```

The following plot shows the pretraining loss curve:

![Training loss curve](stats/figures/pretraining_loss_historic_english.png)

## Finnish model

The BERT model - with texts from the Finnish part of Europeana - was trained with the Hugging Face
JAX/FLAX implementation for 40 epochs (approx. 1M steps) on a v3-8 TPU, using the following command:

```bash
python3 run_mlm_flax.py --model_type bert \
--config_name /mnt/datasets/bert-base-finnish-europeana-cased/ \
--tokenizer_name /mnt/datasets/bert-base-finnish-europeana-cased/ \
--train_file /mnt/datasets/hlms/extracted_content_Finnish_0.6.txt \
--validation_file /mnt/datasets/hlms/finnish_validation.txt \
--max_seq_length 512 \
--per_device_train_batch_size 16 \
--learning_rate 1e-4 \
--num_train_epochs 40 \
--preprocessing_num_workers 96 \
--output_dir /mnt/datasets/bert-base-finnish-europeana-cased-512-dupe1-noadafactor-40e \
--save_steps 2500 \
--eval_steps 2500 \
--warmup_steps 10000 \
--line_by_line \
--pad_to_max_length
```

The following plot shows the pretraining loss curve:

![Training loss curve](stats/figures/pretraining_loss_finnish_europeana.png)

## Swedish model

The BERT model - with texts from the Swedish part of Europeana - was trained with the Hugging Face
JAX/FLAX implementation for 40 epochs (approx. 660K steps) on a v3-8 TPU, using the following command:

```bash
python3 run_mlm_flax.py --model_type bert \
--config_name /mnt/datasets/bert-base-swedish-europeana-cased/ \
--tokenizer_name /mnt/datasets/bert-base-swedish-europeana-cased/ \
--train_file /mnt/datasets/hlms/extracted_content_Swedish_0.6.txt \
--validation_file /mnt/datasets/hlms/swedish_validation.txt \
--max_seq_length 512 \
--per_device_train_batch_size 16 \
--learning_rate 1e-4 \
--num_train_epochs 40 \
--preprocessing_num_workers 96 \
--output_dir /mnt/datasets/bert-base-swedish-europeana-cased-512-dupe1-noadafactor-40e \
--save_steps 2500 \
--eval_steps 2500 \
--warmup_steps 10000 \
--line_by_line \
--pad_to_max_length
```

The following plot shows the pretraining loss curve:

![Training loss curve](stats/figures/pretraining_loss_swedish_europeana.png)

# Acknowledgments

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC) program, previously known as
TensorFlow Research Cloud (TFRC). Many thanks for providing access to the TRC ❤️

Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
it is possible to download both cased and uncased models from their S3 storage 🤗