---
license: mit
widget:
  - text: Universis presentes [MASK] inspecturis
  - text: eandem [MASK] per omnia parati observare
  - text: yo [MASK] rey de Galicia, de las Indias
  - text: en avant contre les choses [MASK] contenues
datasets:
  - cc100
  - bigscience-historical-texts/Open_Medieval_French
  - latinwikipedia
language:
  - la
  - fr
  - es
---

## Model Details

This is a fine-tuned version of the multilingual BERT model on medieval texts. The model is intended to serve as a foundation for other ML tasks in NLP and HTR environments.
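As a masked-language model, the checkpoint can be queried with the Transformers fill-mask pipeline. A minimal sketch follows; the model id shown is a placeholder (the card does not state this repository's id), so substitute the actual repo id when using it:

```python
from transformers import pipeline

def predict_mask(text, model_id="bert-base-multilingual-cased", k=5):
    """Return the k most likely fillers for the [MASK] token in `text`.

    `model_id` is a placeholder default; replace it with this repository's id.
    """
    fill_mask = pipeline("fill-mask", model=model_id)
    # Each prediction carries the candidate token string and its probability.
    return [(pred["token_str"], pred["score"]) for pred in fill_mask(text, top_k=k)]

# One of the widget prompts above, a Latin charter formula:
# predict_mask("Universis presentes [MASK] inspecturis")
```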

The training dataset comprises 650M tokens drawn from texts in classical and medieval Latin, Old French, and Old Spanish, spanning a period from the 5th century BC to the 16th century.

Several large corpora were cleaned and transformed for use during the training process:

| Dataset         | Size    | Lang   | Dates         |
|-----------------|---------|--------|---------------|
| CC100           | 3.2 GB  | la     | 5th BC - 18th |
| Corpus Corporum | 3.0 GB  | la     | 5th BC - 16th |
| CEMA            | 320 MB  | la+fro | 9th - 15th    |
| HOME            | 38 MB   | la+fro | 12th - 15th   |
| BFM             | 34 MB   | fro    | 13th - 15th   |
| AND             | 19 MB   | fro    | 13th - 15th   |
| CODEA           | 13 MB   | spa    | 12th - 16th   |
| **Total**       | ~6.5 GB |        | 650M tokens (4.5 GB) |
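Since the model is meant as a foundation for downstream work, a typical next step is to load it with a fresh task head and fine-tune it, e.g. for named-entity recognition or lemmatization on medieval text. A hedged sketch, where the repo id and label count are placeholders to be filled in by the user:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

def load_for_token_classification(model_id, num_labels):
    """Load the pretrained encoder with a randomly initialized
    token-classification head on top, ready for fine-tuning."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(
        model_id, num_labels=num_labels
    )
    return tokenizer, model

# e.g. load_for_token_classification("<this-repo-id>", num_labels=9)
```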