magistermilitum committed on
Commit
70411c7
1 Parent(s): b5923cc

Update README.md

  1. README.md +10 -1
README.md CHANGED
@@ -33,14 +33,23 @@ Several big corpora were cleaned and transformed to be used during the process t
  | AND [6] | 19Mb | fro | 13th - 15th|
  | CODEA [7] | 13Mb | spa |12th - 16th |
  | | ~6,5Gb | |
- | | 650M tokens (4,5Gb) | | |
+ | | 650M tokens (4,5Gb)* | | |
+
+
+ * A significant amount of overlapping text was detected across the corpora, especially in the medieval collections. In addition, synthetic text ("Lorem ipsum dolor...") was iteratively deleted.
 
  [1] CC-NET Repository : https://huggingface.co/datasets/cc100
+
  [2] Repositorium operum Latinorum apud universitatem Turicensem : https://mlat.uzh.ch/
+
  [3] Cartae Europae Medii Aevi (5th-15th c.) : https://cema.lamop.fr/
+
  [4] History of Medieval Europe : https://doi.org/10.5281/zenodo.5600884
+
  [5] Base du Français Médiéval : https://txm-bfm.huma-num.fr/txm/
+
  [6] Anglo-Norman Dictionary : https://anglo-norman.net/
+
  [7] Corpus de Documentos Españoles anteriores a 1900 : https://www.corpuscodea.es/
 
 
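
The footnote added in this commit mentions two cleaning steps: removing text that overlaps across corpora and deleting "Lorem ipsum" filler. The README does not say how either step was done, so the following is only a minimal sketch of one possible approach, assuming line-level exact deduplication after whitespace and case normalization. The names `LOREM_RE`, `normalize`, and `clean_corpus` are hypothetical and do not come from the repository.

```python
import hashlib
import re

# Hypothetical pattern: matches the classic filler text anywhere in a line.
LOREM_RE = re.compile(r"lorem\s+ipsum", re.IGNORECASE)


def normalize(line: str) -> str:
    """Lowercase and collapse whitespace so near-identical lines hash alike."""
    return " ".join(line.lower().split())


def clean_corpus(lines):
    """Yield lines that are neither filler text nor repeats of an earlier line."""
    seen = set()
    for line in lines:
        if LOREM_RE.search(line):
            continue  # drop synthetic filler text
        norm = normalize(line)
        if not norm:
            continue  # skip empty lines
        key = hashlib.sha1(norm.encode("utf-8")).hexdigest()
        if key in seen:
            continue  # drop text already seen in this or another corpus
        seen.add(key)
        yield line


# Toy input standing in for lines drawn from several corpora.
sample = [
    "In nomine Domini amen.",
    "in nomine  Domini amen.",
    "Lorem ipsum dolor sit amet",
    "Sepan quantos esta carta vieren",
]
cleaned = list(clean_corpus(sample))
```

Storing hashes rather than full lines keeps memory use bounded on a multi-gigabyte corpus. Catching partial overlaps, such as the same charter edited in two collections, would need a fuzzier method than this exact-match sketch.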