Update README.md

README.md

@@ -6,7 +6,7 @@ tags:
---

# t5-small-spanish-nahuatl

Nahuatl is the most widely spoken indigenous language in Mexico. However, training a neural network for neural machine translation is hard due to the lack of structured data. The most popular datasets, the Axolotl corpus and the bible-corpus, contain only ~16,000 and ~7,000 samples respectively. Moreover, there are multiple variants of Nahuatl, which makes this task even more difficult: for example, a single word from the Axolotl corpus can be found written in more than three different ways. Therefore, in this work we leverage the T5 text-to-text prefix training strategy to compensate for the lack of data. We first teach the multilingual model Spanish using English, then we make the transition to Spanish-Nahuatl. The resulting model successfully translates short sentences from Spanish to Nahuatl. We report chrF and BLEU results.
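
With the prefix strategy, every example in both stages becomes a plain text-to-text pair whose input carries a task prefix, so the two stages share one training format. A minimal sketch of this idea follows; the sample pairs and the exact prefix strings are illustrative assumptions, not the project's actual training data or script:

```python
# Minimal sketch of the T5 text-to-text prefix setup (illustrative only;
# the pairs and prefix wording below are assumptions).

# Stage 1: teach the multilingual model Spanish, using English-Spanish pairs.
stage1_pairs = [("thank you very much", "muchas gracias")]  # hypothetical sample
# Stage 2: transition to Spanish-Nahuatl pairs.
stage2_pairs = [("muchas gracias", "tlazohcamati")]         # hypothetical sample

def to_examples(pairs, prefix):
    """Turn (source, target) pairs into prefixed text-to-text examples."""
    return [{"input": prefix + src, "target": tgt} for src, tgt in pairs]

train_examples = (
    to_examples(stage1_pairs, "translate English to Spanish: ")
    + to_examples(stage2_pairs, "translate Spanish to Nahuatl: ")
)
print(train_examples[-1]["input"])  # translate Spanish to Nahuatl: muchas gracias
```

At inference time, the input sentence is prepended with the same style of task prefix before being passed to the model.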

## Model description

@@ -31,7 +31,7 @@ outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

## Approach
### Dataset
Since the Axolotl corpus contains misalignments, we select only the best-aligned samples (~8,000 samples; a selection sketch follows the table). We also use the [bible-corpus](https://github.com/christos-c/bible-corpus) (7,821 samples).

| Axolotl best aligned books |
|:-----------------------------------------------------:|
@@ -45,7 +45,10 @@ Since the Axolotl corpus contains misalignments, we select only the best-aligned samples
| Nican Mopohua |
| Quinta Relación (Libro las ocho relaciones) |
| Recetario Nahua de Milpa Alta D.F |
| Testimonios de la antigua palabra |
| Trece Poetas del Mundo Azteca |
| Una tortillita nomás - Se taxkaltsin saj |
| Vida económica de Tenochtitlan |
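
Concretely, the selection step amounts to filtering the corpus down to these books. Here is a minimal, hypothetical sketch: the (book, Spanish, Nahuatl) record layout and the sample are assumptions about how the Axolotl pairs might be loaded, and the set below covers only the rows visible in this excerpt of the table:

```python
# Hypothetical sketch of the Axolotl selection step. The record layout
# (book, spanish, nahuatl) is an assumption; the actual corpus files
# may be organized differently.

BEST_ALIGNED = {  # best-aligned books from the table above (abridged)
    "Nican Mopohua",
    "Quinta Relación (Libro las ocho relaciones)",
    "Recetario Nahua de Milpa Alta D.F",
    "Testimonios de la antigua palabra",
    "Trece Poetas del Mundo Azteca",
    "Una tortillita nomás - Se taxkaltsin saj",
    "Vida económica de Tenochtitlan",
}

def select_best_samples(records):
    """Keep only Spanish-Nahuatl pairs that come from the curated books."""
    return [(spanish, nahuatl)
            for book, spanish, nahuatl in records
            if book in BEST_ALIGNED]

# Dummy record for illustration:
sample = [("Nican Mopohua", "en el año 1531", "ipan xihuitl 1531")]
print(select_best_samples(sample))
```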

### Model and training