---

license: apache-2.0
language: es
tags:
- translation
- spanish
- nahuatl
---


# t5-small-spanish-nahuatl
Nahuatl is the most widely spoken indigenous language in Mexico. However, training a neural network for neural machine translation is hard due to the lack of structured data. The most popular datasets, the Axolotl corpus and the bible-corpus, consist of only ~16,000 and ~7,000 samples respectively. Moreover, there are multiple variants of Nahuatl, which makes the task even more difficult; for example, a single word from the Axolotl corpus can be found written in more than three different ways. Therefore, in this work we leverage T5's text-to-text prefix training strategy to compensate for the lack of data. We first teach the multilingual model Spanish using English, then we make the transition to Spanish-Nahuatl. The resulting model successfully translates short sentences from Spanish to Nahuatl. We report chrF and BLEU results.


## Model description
This model is a T5 Transformer ([t5-small](https://huggingface.co/t5-small)) fine-tuned on Spanish and Nahuatl sentences collected from the web. The dataset is normalized using the 'sep' normalization from [py-elotl](https://github.com/ElotlMX/py-elotl).
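Orthographic normalization matters because the same Nahuatl word may appear under several spellings across sources. The toy normalizer below illustrates the idea with a few hand-picked substitutions; these rules are a simplified sketch for illustration only, not py-elotl's actual 'sep' rules:

```python
# Toy illustration of orthographic normalization for Nahuatl.
# The substitution rules below are simplified examples, NOT the
# actual rules applied by py-elotl's 'sep' normalizer.
SUBSTITUTIONS = [
    ("hu", "w"),  # /w/ often written "hu" in classical orthography
    ("qu", "k"),  # /k/ before e/i
    ("c", "k"),   # remaining hard "c"
    ("z", "s"),   # /s/
]

def normalize(word: str) -> str:
    """Map a word to a single canonical spelling."""
    word = word.lower()
    for old, new in SUBSTITUTIONS:
        word = word.replace(old, new)
    return word

# Two historical spellings collapse to the same canonical form:
variants = {normalize(w) for w in ("zacatl", "sakatl")}
# variants == {"sakatl"}
```

Collapsing variants this way reduces the effective vocabulary the model has to learn from very little data.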


## Usage
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
tokenizer = AutoTokenizer.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
model.eval()

sentence = 'muchas flores son blancas'
input_ids = tokenizer('translate Spanish to Nahuatl: ' + sentence, return_tensors='pt').input_ids
outputs = model.generate(input_ids)
outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
# outputs: 'miak xochitl istak'
```

## Approach
### Dataset
Since the Axolotl corpus contains misalignments, we select only the best-aligned samples (~8,000). We also use the [bible-corpus](https://github.com/christos-c/bible-corpus) (7,821 samples).

| Axolotl books best aligned                            | 
|:-----------------------------------------------------:|
| Anales de Tlatelolco                                  | 
| Diario                                                |  
| Documentos nauas de la Ciudad de México del siglo XVI |  
| Historia de México narrada en náhuatl y español       |  
| La tinta negra y roja (antología de poesía náhuatl)   |  
| Memorial Breve (Libro las ocho relaciones)            |  
| Método auto-didáctico náhuatl-español                 |  
| Nican Mopohua                                         | 
| Quinta Relación (Libro las ocho relaciones)           |   
| Recetario Nahua de Milpa Alta D.F                     | 
| Testimonios de la antigua palabra                     |
| Trece Poetas del Mundo Azteca                         |
| Una tortillita nomás - Se taxkaltsin saj              |
| Vida económica de Tenochtitlan                        |
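One common heuristic for spotting misaligned pairs is to compare source and target lengths. The sketch below keeps only pairs whose word-count ratio is plausible; this is a generic illustrative filter, not necessarily the criterion used to select the ~8,000 Axolotl samples:

```python
# Illustrative filter for misaligned sentence pairs based on the
# ratio of source to target word counts. This is a generic
# heuristic, not necessarily how the Axolotl corpus was cleaned.
def keep_pair(spanish: str, nahuatl: str, max_ratio: float = 2.5) -> bool:
    """Keep a pair only if neither side is suspiciously longer."""
    src, tgt = len(spanish.split()), len(nahuatl.split())
    if src == 0 or tgt == 0:
        return False
    return max(src, tgt) / min(src, tgt) <= max_ratio

pairs = [
    ("muchas flores son blancas", "miak xochitl istak"),   # plausible
    ("hola", "inin tlahtolli huehca ohtli ipan mocaqui"),  # suspicious
]
filtered = [p for p in pairs if keep_pair(*p)]
# filtered keeps only the first pair
```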


### Model and training
We employ a two-stage training procedure using a multilingual T5-small. This model was chosen because it can handle different vocabularies and prefixes, and it is pretrained on multiple tasks and languages (French, Romanian, English, German).

### Training-stage 1 (learning Spanish)
In training stage 1 we first introduce Spanish to the model. The objective is to learn a new, data-rich language (Spanish) without losing the knowledge previously acquired. We use the English-Spanish [Anki](https://www.manythings.org/anki/) dataset, which consists of 118,964 text pairs. We train the model until convergence, adding the prefix "Translate Spanish to English: ".
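In T5's text-to-text setup, the task is signaled by prepending a natural-language prefix to each input. A minimal sketch of how the Anki pairs might be formatted (the helper name is ours, not from the original training code):

```python
# Format (source, target) pairs with a T5-style task prefix.
# The function name and structure are illustrative, not taken
# from the original training code.
def make_examples(pairs, prefix="Translate Spanish to English: "):
    """Return (input_text, target_text) tuples with the task prefix."""
    return [(prefix + src, tgt) for src, tgt in pairs]

anki_sample = [("muchas flores son blancas", "many flowers are white")]
examples = make_examples(anki_sample)
# examples[0][0] == "Translate Spanish to English: muchas flores son blancas"
```

Because the task is encoded in plain text, the same model can later be pointed at Spanish-Nahuatl simply by changing the prefix.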


### Training-stage 2 (learning Nahuatl)
We use the pretrained Spanish-English model to learn Spanish-Nahuatl. Since the number of Nahuatl pairs is limited, we also add 20,000 samples from the English-Spanish Anki dataset to our dataset. This two-task training avoids overfitting and makes the model more robust.
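Mixing a capped number of auxiliary Spanish-English pairs into the Spanish-Nahuatl data can be sketched as follows. The sampling and shuffling details here are assumptions for illustration; the original work simply adds 20,000 Anki samples:

```python
import random

# Mix a fixed number of auxiliary Spanish-English pairs into the
# Spanish-Nahuatl training data. The sampling strategy is an
# assumption; the original work adds 20,000 Anki samples.
def build_mixture(nahuatl_pairs, anki_pairs, n_aux=20_000, seed=0):
    rng = random.Random(seed)
    aux = rng.sample(anki_pairs, min(n_aux, len(anki_pairs)))
    mixture = list(nahuatl_pairs) + aux
    rng.shuffle(mixture)
    return mixture
```

Capping the auxiliary data keeps the low-resource Nahuatl pairs from being drowned out while still regularizing the model with the second task.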

### Training setup
We train the models on the same datasets for 660k steps with a batch size of 16 and a learning rate of 2e-5.


## Evaluation results
For a fair comparison, the models are evaluated on the same 505 Nahuatl validation sentences. We report results using the Hugging Face implementations of the chrF and sacreBLEU metrics:

| English-Spanish pretraining | Validation loss | BLEU | chrF  |
|:---------------------------:|:---------------:|:----:|:-----:|
| False                       | 1.34            | 6.17 | 26.96 |
| True                        | 1.31            | 6.18 | 28.21 |

The English-Spanish pretrained model improves BLEU and chrF, and leads to faster convergence.
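chrF is an F-score over character n-gram overlap (β = 2, n-grams up to order 6 in the standard setting). The sketch below conveys the idea for a single sentence pair; it is a simplified illustration, not the official sacreBLEU implementation, and it pools all orders together instead of averaging per-order scores:

```python
from collections import Counter

# Simplified single-sentence chrF illustration: an F-beta score
# over character n-gram overlap. NOT the official sacreBLEU
# implementation (which averages per-order scores and handles
# whitespace differently); it only conveys the idea of the metric.
def char_ngrams(text, n):
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    matches = hyp_total = ref_total = 0
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        matches += sum((hyp & ref).values())  # clipped n-gram matches
        hyp_total += sum(hyp.values())
        ref_total += sum(ref.values())
    if hyp_total == 0 or ref_total == 0:
        return 0.0
    precision, recall = matches / hyp_total, matches / ref_total
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```

Because β = 2, recall is weighted more heavily than precision, which rewards hypotheses that cover the reference's character n-grams.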

## References
- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.

- Ximena Gutierrez-Vasques, Gerardo Sierra, and Isaac Hernandez. 2016. Axolotl: a Web Accessible Parallel Corpus for Spanish-Nahuatl. In International Conference on Language Resources and Evaluation (LREC).


## Team members
- Emilio Alejandro Morales [(milmor)](https://huggingface.co/milmor)
- Rodrigo Martínez Arzate  [(rockdrigoma)](https://huggingface.co/rockdrigoma)
- Luis Armando Mercado [(luisarmando)](https://huggingface.co/luisarmando)
- Jacobo del Valle [(jjdv)](https://huggingface.co/jjdv)