---
license: apache-2.0
language: es
tags:
  - translation Spanish Nahuatl
---

t5-small-spanish-nahuatl

Nahuatl is the most widely spoken indigenous language in Mexico. However, training a neural network for neural machine translation is hard due to the lack of structured data. The most popular datasets, the Axolotl corpus and the bible-corpus, consist of only ~16,000 and ~7,000 samples respectively. Moreover, there are multiple variants of Nahuatl, which makes this task even more difficult: for example, a single word from the Axolotl corpus can be found written in more than three different ways. In this work we leverage the T5 text-to-text training strategy to compensate for the lack of data. The resulting model successfully translates short sentences from Spanish to Nahuatl. We report ChrF and BLEU results.

Model description

This model is a T5 Transformer (t5-small) fine-tuned on Spanish and Nahuatl sentences collected from the web. The dataset is normalized using the 'sep' normalization from py-elotl.
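For reference, this is roughly how that normalization step can be applied; a minimal sketch, assuming the Normalizer API from the py-elotl package (the exact call is not shown in this card, so treat the names below as an assumption):

# pip install elotl
import elotl.nahuatl.orthography

# 'sep' is one of the orthography normalization schemes shipped with py-elotl.
normalizer = elotl.nahuatl.orthography.Normalizer('sep')

# Map variant spellings to a single canonical orthography.
print(normalizer.normalize('tlahtolli'))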

Usage

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
tokenizer = AutoTokenizer.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')

model.eval()
sentence = 'muchas flores son blancas'
# T5 expects a task prefix; this model was trained with 'translate Spanish to Nahuatl: '.
input_ids = tokenizer('translate Spanish to Nahuatl: ' + sentence, return_tensors='pt').input_ids
outputs = model.generate(input_ids)
outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
# outputs == 'miak xochitl istak'

Approach

Since the Axolotl corpus contains misalignments, we select only the best-aligned samples (~10,000). We use the bible-corpus (7,821 samples) to compensate for the lack of Nahuatl data.
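The card does not describe how the best-aligned samples were selected. Purely as an illustration, a common heuristic for dropping misaligned pairs from a parallel corpus is a length-ratio filter; this is a hypothetical sketch, not the actual selection procedure:

# Hypothetical length-ratio filter (illustrative only; the card does not
# state how the ~10,000 Axolotl samples were actually chosen).
corpus = [
    ('muchas flores son blancas', 'miak xochitl istak'),
    ('hola', 'se tlajtolli tlen amo kinamiki in tlahkuilolli'),
]

def keep_pair(es, nah, max_ratio=2.5, min_len=2):
    es_len, nah_len = len(es.split()), len(nah.split())
    if min(es_len, nah_len) < min_len:
        return False  # drop pairs that are too short to judge
    return max(es_len, nah_len) / min(es_len, nah_len) <= max_ratio

filtered = [(es, nah) for es, nah in corpus if keep_pair(es, nah)]
# keeps the first pair, drops the second (suspicious length ratio)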

Evaluation results

The model is evaluated on 505 validation sentences. We report the results computed with the ChrF and SacreBLEU Hugging Face metrics (a sketch of the metric computation follows the list below):

  • Validation loss: 1.31
  • BLEU: 6.18
  • ChrF: 28.21
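For reference, these metrics can be computed with the Hugging Face evaluate library; a minimal sketch with placeholder predictions and references rather than the actual 505 validation sentences:

# pip install evaluate sacrebleu
import evaluate

chrf = evaluate.load('chrf')
sacrebleu = evaluate.load('sacrebleu')

predictions = ['miak xochitl istak']   # model outputs
references = [['miak xochitl istak']]  # one list of reference translations per prediction

print(chrf.compute(predictions=predictions, references=references))
print(sacrebleu.compute(predictions=predictions, references=references))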

References

  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer.

  • Ximena Gutierrez-Vasques, Gerardo Sierra, and Isaac Hernandez. 2016. Axolotl: a web accessible parallel corpus for Spanish-Nahuatl. In International Conference on Language Resources and Evaluation (LREC).

Team members