anegda committed on
Commit
c074354
1 Parent(s): 7adec6d

Update README.md

  1. README.md +1 -1
README.md CHANGED
@@ -74,7 +74,7 @@ The 11,489,433 sentence pairs of synthetic parallel data were created by transla
 
  #### Preprocessing
 
- After concatenation, all datasets are cleaned and deduplicated using [bifixer](https://github.com/bitextor/bifixer) and [bicleaner](https://github.com/bitextor/bicleaner) tools [(Ramírez-Sánchez et al., 2020)](https://aclanthology.org/2020.eamt-1.31/). Any sentence pair with a classification score of less than 0.5 is removed. The filtered corpus is composed of 9,033,998 parallel sentences.
+ After concatenation, all datasets are cleaned and deduplicated using [bifixer](https://github.com/bitextor/bifixer) [(Ramírez-Sánchez et al., 2020)](https://aclanthology.org/2020.eamt-1.31/) to identify repetitions and fix encoding problems, and LaBSE embeddings are used to filter out misaligned sentences. Any sentence pair with a LaBSE similarity score of less than 0.5 is removed. The filtered corpus is composed of 9,033,998 parallel sentences.
 
  #### Tokenization
  All data is tokenized using sentencepiece, with a 32,000 token sentencepiece model learned from the combination of all filtered training data. This model is included.
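
The similarity filter described in the updated Preprocessing text could look roughly like the sketch below. It is not the authors' script. It assumes the similarity score is the cosine similarity between the source and target sentence embeddings, and that those embeddings were computed beforehand (for example with the `sentence-transformers/LaBSE` model). The function names `cosine_similarity` and `filter_pairs` are hypothetical, and the toy vectors stand in for real LaBSE embeddings.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(
    pairs: List[Tuple[str, str]],
    src_vecs: List[Vector],
    tgt_vecs: List[Vector],
    threshold: float = 0.5,  # threshold taken from the README text
) -> List[Tuple[str, str]]:
    """Keep sentence pairs whose embedding similarity is at least `threshold`."""
    if not (len(pairs) == len(src_vecs) == len(tgt_vecs)):
        raise ValueError("pairs and embedding lists must have the same length")
    return [
        pair
        for pair, src, tgt in zip(pairs, src_vecs, tgt_vecs)
        if cosine_similarity(src, tgt) >= threshold
    ]


# Toy 2-dimensional vectors standing in for real sentence embeddings (assumption).
pairs = [("source sentence A", "target sentence A"),
         ("source sentence B", "unrelated target")]
src_vecs = [[1.0, 0.0], [1.0, 0.0]]
tgt_vecs = [[0.9, 0.1], [0.0, 1.0]]

kept = filter_pairs(pairs, src_vecs, tgt_vecs)
print(kept)
```

In this toy run only the first pair survives, because the second pair's vectors are orthogonal and score 0.0. On a corpus of millions of pairs the same comparison would normally be done in batches on normalized embedding matrices instead of a Python loop.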