---
license: mit
widget:
- text: Universis presentes [MASK] inspecturis
- text: eandem [MASK] per omnia parati observare
- text: yo [MASK] rey de Galicia, de las Indias
- text: en avant contre les choses [MASK] contenues
datasets:
- cc100
- bigscience-historical-texts/Open_Medieval_French
- latinwikipedia
language:
- la
- fr
- es
---
|
|
|
## Model Details |
|
|
|
This is a fine-tuned version of the multilingual BERT model, trained on medieval texts. The model is intended to serve as a foundation for downstream NLP and HTR (Handwritten Text Recognition) tasks.
|
|
|
The training dataset contains 650M tokens drawn from Classical and Medieval Latin, Old French, and Old Spanish texts, covering a period ranging from the 5th century BC to the 16th century.
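As a masked-language model, the checkpoint can be queried with the `fill-mask` pipeline from `transformers`. A minimal sketch, assuming a hypothetical checkpoint identifier (`medieval-multilingual-bert` is a placeholder; substitute this repository's actual model id):

```python
# Hypothetical checkpoint id -- replace with the actual repository name.
MODEL_ID = "medieval-multilingual-bert"

def predict_mask(text: str, model_id: str = MODEL_ID, top_k: int = 5):
    """Return (token, score) pairs for the single [MASK] token in `text`."""
    if text.count("[MASK]") != 1:
        raise ValueError("input must contain exactly one [MASK] token")
    # Imported lazily so the input check above runs without loading model code.
    from transformers import pipeline
    fill = pipeline("fill-mask", model=model_id)
    return [(r["token_str"], r["score"]) for r in fill(text, top_k=top_k)]

# Example (requires the checkpoint to be downloadable):
# predict_mask("Universis presentes [MASK] inspecturis")
```

The example sentence is taken from the widget samples above; any of the Latin, Old French, or Old Spanish widget texts can be used the same way.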
|
|
|
Several large corpora were cleaned and transformed for use during training:
|
|
|
| Dataset | Size | Language |
| ------------- |:-------------:| -----:|
| CC100 | 3.2 GB | la |
| Corpus Corporum | 3.0 GB | la |
| CEMA | 320 MB | la+fro |
| HOME | 38 MB | la+fro |
| BFM | 34 MB | fro |
| AND | 19 MB | fro |
| CODEA | 13 MB | spa |
| **Total (raw)** | ~6.5 GB | |
| **Total (after cleaning)** | 650M tokens (4.5 GB) | |
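As a quick sanity check, the per-corpus sizes listed above can be summed; a minimal snippet with the sizes transcribed from the table (the exact sum is about 6.6 GB, in line with the approximate total reported):

```python
# Per-corpus raw sizes from the table, expressed in GB (MB entries / 1000).
corpus_sizes_gb = {
    "CC100": 3.2,
    "Corpus Corporum": 3.0,
    "CEMA": 0.320,
    "HOME": 0.038,
    "BFM": 0.034,
    "AND": 0.019,
    "CODEA": 0.013,
}

total_gb = sum(corpus_sizes_gb.values())
print(f"total raw size: {total_gb:.2f} GB")  # -> total raw size: 6.62 GB
```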
|
|
|
|