---
license: mit
widget:
- text: Universis presentes [MASK] inspecturis
- text: eandem [MASK] per omnia parati observare
- text: yo [MASK] rey de Galicia, de las Indias
- text: en avant contre les choses [MASK] contenues
datasets:
- cc100
- bigscience-historical-texts/Open_Medieval_French
- latinwikipedia
language:
- la
- fr
- es
---

# Model Details
This is a fine-tuned version of the multilingual BERT model on medieval texts. The model is intended to serve as a foundation for other machine-learning tasks in NLP and HTR environments.
The training dataset contains 650M tokens drawn from texts in classical and medieval Latin, Old French, and Old Spanish, covering a period from the 5th century BC to the 16th century.
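Since the model is meant as a foundation for downstream tasks, it can be loaded as a backbone and topped with a task head. A minimal sketch, using the base multilingual BERT checkpoint as a stand-in for this repository's id (which is not given here), and a hypothetical 5-label token-classification head:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Stand-in checkpoint: the base model this card fine-tunes.
# Replace with this repository's id to use the medieval-text weights.
ckpt = "bert-base-multilingual-cased"
tok = AutoTokenizer.from_pretrained(ckpt)

# Example downstream head: token classification (e.g. NER on charters).
# The label count (5) is illustrative, not part of this card.
model = AutoModelForTokenClassification.from_pretrained(ckpt, num_labels=5)

enc = tok("In nomine Domini amen", return_tensors="pt")
out = model(**enc)
print(out.logits.shape)  # (batch, sequence_length, num_labels)
```

The classification head is randomly initialized and would still need task-specific fine-tuning; only the encoder weights come from the pretrained checkpoint.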
Several large corpora were cleaned and transformed for use during training:
| dataset | size | lang |
|---|---|---|
| CC100 | 3.2 GB | la |
| Corpus Corporum | 3.0 GB | la |
| CEMA | 320 MB | la+fro |
| HOME | 38 MB | la+fro |
| BFM | 34 MB | fro |
| AND | 19 MB | fro |
| CODEA | 13 MB | spa |
| total | ~6.5 GB | |
| after cleaning | 650M tokens (4.5 GB) | |
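As a masked-language model, it can be queried through the `fill-mask` pipeline using the `[MASK]` examples shown in the widget above. A sketch, again using the base multilingual BERT checkpoint as a stand-in for this repository's id:

```python
from transformers import pipeline

# Stand-in checkpoint; replace with this repository's id to query
# the fine-tuned medieval-text weights.
fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

# One of the widget examples above: a Latin charter formula.
preds = fill("Universis presentes [MASK] inspecturis")
for p in preds:
    print(p["token_str"], round(p["score"], 3))
```

Each prediction is a dict with the candidate token (`token_str`), its probability (`score`), and the completed sentence (`sequence`); by default the pipeline returns the top 5 candidates.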