---
license: mit
widget:
- text: Universis presentes [MASK] inspecturis
- text: eandem [MASK] per omnia parati observare
- text: yo [MASK] rey de Galicia, de las Indias
- text: en avant contre les choses [MASK] contenues
datasets:
- cc100
- bigscience-historical-texts/Open_Medieval_French
- latinwikipedia
language:
- la
- fr
- es
---
## Model Details
This is a fine-tuned version of the multilingual BERT model on medieval texts. The model is intended to serve as a foundation for downstream ML tasks in NLP and HTR environments.
The training dataset contains 650M tokens drawn from texts in classical and medieval Latin, Old French, and old Spanish, spanning a period from the 5th century BC to the 16th century AD.
Several large corpora were cleaned and transformed for use during the training process:
| Dataset | Size | Language |
| ------------- |:-------------:| -----:|
| CC100 | 3.2 GB | la |
| Corpus Corporum | 3.0 GB | la |
| CEMA | 320 MB | la+fro |
| HOME | 38 MB | la+fro |
| BFM | 34 MB | fro |
| AND | 19 MB | fro |
| CODEA | 13 MB | spa |
| Total (raw) | ~6.5 GB | |
| Total (training set) | 650M tokens (4.5 GB) | |
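
The widget prompts above can be reproduced locally with the `transformers` fill-mask pipeline. A minimal sketch follows; the `MODEL_ID` string is a placeholder (this card does not state the repository id), so substitute this repository's actual id before running.

```python
"""Sketch: querying a masked-language model fine-tuned on medieval texts."""

# Hypothetical repository id -- replace with this model's actual Hub id.
MODEL_ID = "user/medieval-multilingual-bert"

# The example prompts from this model card (Latin, Latin, Spanish, Old French).
PROMPTS = [
    "Universis presentes [MASK] inspecturis",
    "eandem [MASK] per omnia parati observare",
    "yo [MASK] rey de Galicia, de las Indias",
    "en avant contre les choses [MASK] contenues",
]


def top_predictions(model_id: str = MODEL_ID, k: int = 5) -> dict:
    """Return the top-k fill-mask candidates for each prompt."""
    # Imported lazily so the module loads even without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    return {
        prompt: [cand["token_str"] for cand in fill_mask(prompt, top_k=k)]
        for prompt in PROMPTS
    }
```

Calling `top_predictions()` downloads the model weights on first use and returns a dict mapping each prompt to its five most likely completions for the `[MASK]` position.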