milmor committed
Commit 3a806c4
1 Parent(s): 55fac30

Update README.md

Files changed (1)
  1. README.md +6 -3
README.md CHANGED
@@ -6,7 +6,7 @@ tags:
 ---
 
 # t5-small-spanish-nahuatl
-Nahuatl is the most widely spoken indigenous language in Mexico. However, training a neural network for the task of neural machine tranlation is hard due to the lack of structured data. The most popular datasets such as the Axolot dataset and the bible-corpus only consists of ~16,000 and ~7,000 samples respectivly. Moreover, there are multiple variants of Nahuatl, which makes this task even more difficult. For example, a single word from the Axolot dataset can be found written in more than three different ways. Therefore, in this work we leverage the T5 text-to-text sufix training strategy to compensate the lack of data. We first teach the multilingual model Spanish unsing English, then we make the transition to Spanish-Nahuatl. The resulting model successfully translates short sentences from Spanish to Nahuatl. We report Chrf and BLEU results.
+Nahuatl is the most widely spoken indigenous language in Mexico. However, training a neural network for neural machine translation is hard due to the lack of structured data. The most popular datasets, such as the Axolotl corpus and the bible-corpus, consist of only ~16,000 and ~7,000 samples respectively. Moreover, there are multiple variants of Nahuatl, which makes this task even more difficult. For example, a single word from the Axolotl corpus can be found written in more than three different ways. Therefore, in this work we leverage the T5 text-to-text prefix training strategy to compensate for the lack of data. We first teach the multilingual model Spanish using English, then we make the transition to Spanish-Nahuatl. The resulting model successfully translates short sentences from Spanish to Nahuatl. We report ChrF and BLEU results.
 
 
 ## Model description
@@ -31,7 +31,7 @@ outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
 
 ## Approach
 ### Dataset
-Since the Axolotl corpus contains misaligments, we just select the best samples (~8,000 samples). We also use the [bible-sorpus](https://github.com/christos-c/bible-corpus) (7,821 samples).
+Since the Axolotl corpus contains misalignments, we select only the best-aligned samples (~8,000). We also use the [bible-corpus](https://github.com/christos-c/bible-corpus) (7,821 samples).
 
 | Axolotl best aligned books |
 |:-----------------------------------------------------:|
@@ -45,7 +45,10 @@ Since the Axolotl corpus contains misaligments, we just select the best samples
 | Nican Mopohua |
 | Quinta Relación (Libro las ocho relaciones) |
 | Recetario Nahua de Milpa Alta D.F |
-| Tercera Relación (Libro las ocho relaciones) |
+| Testimonios de la antigua palabra |
+| Trece Poetas del Mundo Azteca |
+| Una tortillita nomás - Se taxkaltsin saj |
+| Vida económica de Tenochtitlan |
 
 
 ### Model and training
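
The second hunk header quotes the README's usage snippet (`outputs = tokenizer.batch_decode(...)`), which the diff itself doesn't show in full. A minimal sketch of that usage, assuming the Hugging Face repo id `milmor/t5-small-spanish-nahuatl` (inferred from the committer's username) and the task prefix `translate Spanish to Nahuatl: `; both are assumptions, as neither is visible in this diff:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Repo id is an assumption based on the committer's username; adjust if needed.
model_name = "milmor/t5-small-spanish-nahuatl"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval()

# T5 selects the task through a text prefix, which is what the README's
# "text-to-text prefix training strategy" refers to.
sentence = "muchas flores son blancas"  # "many flowers are white"
input_ids = tokenizer("translate Spanish to Nahuatl: " + sentence,
                      return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=64)

# Same decoding call that appears as context in the hunk header above.
translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(translation)
```

The prefix is also how the two training stages coexist in one checkpoint: Spanish-English pairs first teach the multilingual model Spanish, and the same prefixed input format then carries over to the scarcer Spanish-Nahuatl data.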
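
The README says ChrF and BLEU results are reported, but the diff doesn't show how they were computed. A plausible sketch with `sacrebleu` (the choice of library is an assumption; no real scores are implied):

```python
import sacrebleu  # pip install sacrebleu

# Placeholder lists; substitute real model outputs and gold translations.
hypotheses = ["<model translation 1>", "<model translation 2>"]
references = [["<gold translation 1>", "<gold translation 2>"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)  # corpus-level BLEU
chrf = sacrebleu.corpus_chrf(hypotheses, references)  # character n-gram F-score
print(f"BLEU: {bleu.score:.2f}  ChrF: {chrf.score:.2f}")
```

ChrF is a sensible complement to BLEU here: it scores character n-grams, so it is more forgiving of the spelling variation across Nahuatl variants that the README paragraph describes.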