---
license: mit
widget:
- text: Universis presentes [MASK] inspecturis
- text: eandem [MASK] per omnia parati observare
- text: yo [MASK] rey de Galicia, de las Indias
- text: en avant contre les choses [MASK] contenues
datasets:
- cc100
- bigscience-historical-texts/Open_Medieval_French
- latinwikipedia
language:
- la
- fr
- es
---
|
|
|
## Model Details |
|
|
|
This is a fine-tuned version of the multilingual BERT model, trained on medieval texts. The model is intended to serve as a foundation for downstream NLP and HTR (Handwritten Text Recognition) tasks.
|
|
|
The training dataset contains 650M tokens drawn from Classical and Medieval Latin, Old French, and Old Spanish texts, covering a period ranging from the 5th century BC to the 16th century.
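As a masked-language model, the checkpoint can be queried with the `fill-mask` pipeline from `transformers`. A minimal sketch, assuming a hypothetical checkpoint identifier (`medieval-multilingual-bert` is a placeholder; substitute this repository's actual model id):

```python
# Hypothetical checkpoint id -- replace with the actual repository name.
MODEL_ID = "medieval-multilingual-bert"

def predict_mask(text: str, model_id: str = MODEL_ID, top_k: int = 5):
    """Return (token, score) pairs for the single [MASK] token in `text`."""
    if text.count("[MASK]") != 1:
        raise ValueError("input must contain exactly one [MASK] token")
    # Imported lazily so the input check above runs without loading model code.
    from transformers import pipeline
    fill = pipeline("fill-mask", model=model_id)
    return [(r["token_str"], r["score"]) for r in fill(text, top_k=top_k)]

# Example (requires the checkpoint to be downloadable):
# predict_mask("Universis presentes [MASK] inspecturis")
```

The example sentence is taken from the widget samples above; any of the Latin, Old French, or Old Spanish widget texts can be used the same way.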
|
|
|
Several large corpora were cleaned and transformed for use during training:
|
|
|
| Dataset | Size | Language |
| ------------- |:-------------:| -----:|
| CC100 | 3.2 GB | la |
| Corpus Corporum | 3.0 GB | la |
| CEMA | 320 MB | la+fro |
| HOME | 38 MB | la+fro |
| BFM | 34 MB | fro |
| AND | 19 MB | fro |
| CODEA | 13 MB | spa |
| **Total (raw)** | ~6.5 GB | |
| **Total (after cleaning)** | 650M tokens (4.5 GB) | |
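As a quick sanity check, the per-corpus sizes listed above can be summed; a minimal snippet with the sizes transcribed from the table (the exact sum is about 6.6 GB, in line with the approximate total reported):

```python
# Per-corpus raw sizes from the table, expressed in GB (MB entries / 1000).
corpus_sizes_gb = {
    "CC100": 3.2,
    "Corpus Corporum": 3.0,
    "CEMA": 0.320,
    "HOME": 0.038,
    "BFM": 0.034,
    "AND": 0.019,
    "CODEA": 0.013,
}

total_gb = sum(corpus_sizes_gb.values())
print(f"total raw size: {total_gb:.2f} GB")  # -> total raw size: 6.62 GB
```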
|
|
|
|