milmor committed
Commit 3a806c4
1 Parent(s): 55fac30

Update README.md

Files changed (1)
  1. README.md +6 -3
README.md CHANGED
@@ -6,7 +6,7 @@ tags:
 ---
 
 # t5-small-spanish-nahuatl
-Nahuatl is the most widely spoken indigenous language in Mexico. However, training a neural network for the task of neural machine tranlation is hard due to the lack of structured data. The most popular datasets such as the Axolot dataset and the bible-corpus only consists of ~16,000 and ~7,000 samples respectivly. Moreover, there are multiple variants of Nahuatl, which makes this task even more difficult. For example, a single word from the Axolot dataset can be found written in more than three different ways. Therefore, in this work we leverage the T5 text-to-text sufix training strategy to compensate the lack of data. We first teach the multilingual model Spanish unsing English, then we make the transition to Spanish-Nahuatl. The resulting model successfully translates short sentences from Spanish to Nahuatl. We report Chrf and BLEU results.
+Nahuatl is the most widely spoken indigenous language in Mexico. However, training a neural network for neural machine translation is hard due to the lack of structured data. The most popular datasets, such as the Axolotl corpus and the bible-corpus, consist of only ~16,000 and ~7,000 samples respectively. Moreover, there are multiple variants of Nahuatl, which makes this task even more difficult. For example, a single word from the Axolotl corpus can be found written in more than three different ways. Therefore, in this work we leverage the T5 text-to-text prefix training strategy to compensate for the lack of data. We first teach the multilingual model Spanish using English, then we make the transition to Spanish-Nahuatl. The resulting model successfully translates short sentences from Spanish to Nahuatl. We report ChrF and BLEU results.
 
 
 ## Model description
@@ -31,7 +31,7 @@ outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
 
 ## Approach
 ### Dataset
-Since the Axolotl corpus contains misaligments, we just select the best samples (~8,000 samples). We also use the [bible-sorpus](https://github.com/christos-c/bible-corpus) (7,821 samples).
+Since the Axolotl corpus contains misalignments, we select only the best-aligned samples (~8,000). We also use the [bible-corpus](https://github.com/christos-c/bible-corpus) (7,821 samples).
 
 | Axolotl best aligned books |
 |:-----------------------------------------------------:|
@@ -45,7 +45,10 @@ Since the Axolotl corpus contains misaligments, we just select the best samples
 | Nican Mopohua |
 | Quinta Relación (Libro las ocho relaciones) |
 | Recetario Nahua de Milpa Alta D.F |
-| Tercera Relación (Libro las ocho relaciones) |
+| Testimonios de la antigua palabra |
+| Trece Poetas del Mundo Azteca |
+| Una tortillita nomás - Se taxkaltsin saj |
+| Vida económica de Tenochtitlan |
 
 
 ### Model and training
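
The second hunk header quotes the README's usage snippet (`outputs = tokenizer.batch_decode(...)`), which the diff itself doesn't show in full. A minimal sketch of that usage, assuming the Hugging Face repo id `milmor/t5-small-spanish-nahuatl` (inferred from the committer's username) and the task prefix `translate Spanish to Nahuatl: `; both are assumptions, as neither is visible in this diff:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Repo id is an assumption based on the committer's username; adjust if needed.
model_name = "milmor/t5-small-spanish-nahuatl"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval()

# T5 selects the task through a text prefix, which is what the README's
# "text-to-text prefix training strategy" refers to.
sentence = "muchas flores son blancas"  # "many flowers are white"
input_ids = tokenizer("translate Spanish to Nahuatl: " + sentence,
                      return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=64)

# Same decoding call that appears as context in the hunk header above.
translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(translation)
```

The prefix is also how the two training stages coexist in one checkpoint: Spanish-English pairs first teach the multilingual model Spanish, and the same prefixed input format then carries over to the scarcer Spanish-Nahuatl data.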
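
The README says ChrF and BLEU results are reported, but the diff doesn't show how they were computed. A plausible sketch with `sacrebleu` (the choice of library is an assumption; no real scores are implied):

```python
import sacrebleu  # pip install sacrebleu

# Placeholder lists; substitute real model outputs and gold translations.
hypotheses = ["<model translation 1>", "<model translation 2>"]
references = [["<gold translation 1>", "<gold translation 2>"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)  # corpus-level BLEU
chrf = sacrebleu.corpus_chrf(hypotheses, references)  # character n-gram F-score
print(f"BLEU: {bleu.score:.2f}  ChrF: {chrf.score:.2f}")
```

ChrF is a sensible complement to BLEU here: it scores character n-grams, so it is more forgiving of the spelling variation across Nahuatl variants that the README paragraph describes.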