---
license: mit
widget:
- text: Universis presentes [MASK] inspecturis
- text: eandem [MASK] per omnia parati observare
- text: yo [MASK] rey de Galicia, de las Indias
- text: en avant contre les choses [MASK] contenues
datasets:
- cc100
- bigscience-historical-texts/Open_Medieval_French
- latinwikipedia
language:
- la
- fr
- es
---
# Model Details
This is a fine-tuned version of the multilingual BERT model on medieval texts. The model is intended to serve as a foundation for downstream NLP and HTR tasks.
The training dataset contains 650M tokens drawn from classical and medieval Latin, Old French, and Old Spanish texts spanning the 5th century BC to the 16th century.
Several large corpora were cleaned and transformed for use in the training process:
| dataset | size | Lang | dates |
|---|---|---|---|
| CC100 | 3.2 GB | la | 5th c. BC - 18th c. |
| Corpus Corporum | 3.0 GB | la | 5th c. BC - 16th c. |
| CEMA | 320 MB | la+fro | 9th - 15th c. |
| HOME | 38 MB | la+fro | 12th - 15th c. |
| BFM | 34 MB | fro | 13th - 15th c. |
| AND | 19 MB | fro | 13th - 15th c. |
| CODEA | 13 MB | spa | 12th - 16th c. |
| **Total** | ~6.5 GB raw; 650M tokens (4.5 GB) after cleaning | | |
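Since the model is a BERT masked-language model, it can be queried directly with the `fill-mask` pipeline from the `transformers` library. The sketch below is a minimal usage example; the checkpoint id is an assumption (this card does not state the published model name), so the base model `bert-base-multilingual-cased` is used as a stand-in and should be replaced with the actual fine-tuned checkpoint.

```python
from transformers import pipeline

# Hypothetical checkpoint id: replace with the fine-tuned model's actual
# Hub id. "bert-base-multilingual-cased" is only the base model this
# card says was fine-tuned.
fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

# One of the widget examples from this card (medieval Latin).
results = fill_mask("Universis presentes [MASK] inspecturis")

# Each result carries the predicted token and its score.
for r in results:
    print(r["token_str"], round(r["score"], 3))
```

The pipeline returns the top candidate fillings for the `[MASK]` position, which is also what the inference widget on the model page displays.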