magistermilitum committed on
Commit
b5923cc
1 Parent(s): f91cfa4

Update README.md

  1. README.md +17 -8
README.md CHANGED
@@ -25,13 +25,22 @@ Several big corpora were cleaned and transformed to be used during the process t
25
 
26
  | dataset | size | Lang | dates |
27
  | ------------- |:-------------:| -----:|-----:|
28
- | CC100 | 3,2Gb | la | 5th BC - 18th|
29
- | Corpus Corporum | 3,0Gb | la | 5th BC - 16th |
30
- | CEMA | 320Mb | la+fro |9th - 15th |
31
- | HOME | 38Mb | la+fro | 12th - 15th |
32
- | BFM | 34Mb | fro | 13th - 15th|
33
- | AND | 19Mb | fro | 13th - 15th|
34
- | CODEA | 13Mb | spa |12th - 16th |
35
  | | ~6,5Gb | | |
36
- | | 650M tk (4,5Gb) | | |
37
 
 
25
 
26
  | dataset | size | Lang | dates |
27
  | ------------- |:-------------:| -----:|-----:|
28
+ | CC100 [1] | 3,2Gb | la | 5th BC - 18th|
29
+ | Corpus Corporum [2] | 3,0Gb | la | 5th BC - 16th |
30
+ | CEMA [3] | 320Mb | la+fro |9th - 15th |
31
+ | HOME-Alcar [4] | 38Mb | la+fro | 12th - 15th |
32
+ | BFM [5] | 34Mb | fro | 13th - 15th|
33
+ | AND [6] | 19Mb | fro | 13th - 15th|
34
+ | CODEA [7] | 13Mb | spa |12th - 16th |
35
  | | ~6,5Gb | | |
36
+ | | 650M tokens (4,5Gb) | | |
37
+
38
+ [1] CC-NET Repository : https://huggingface.co/datasets/cc100
39
+ [2] Repositorium operum latinorum apud universitatem Turicensem : https://mlat.uzh.ch/
40
+ [3] Cartae Europae Medii Aevi (5th-15th c.) : https://cema.lamop.fr/
41
+ [4] History of Medieval Europe : https://doi.org/10.5281/zenodo.5600884
42
+ [5] Base de Français Médiéval : https://txm-bfm.huma-num.fr/txm/
43
+ [6] Anglo-Norman Dictionary : https://anglo-norman.net/
44
+ [7] Corpus de Documentos Españoles anteriores a 1900 : https://www.corpuscodea.es/
45
+
46