English to Persian Colloquial Translation Model

Overview

This model is based on Facebook's M2M100 418M model and was fine-tuned on approximately 1.6 million English–Persian sentence pairs, with the Persian side produced by human translators. The training data consists of colloquial sentences, resulting in fluent, colloquial translations. The model is designed to facilitate the translation of informal English into Persian.

Model Details

  • Architecture: Facebook's M2M100 418M model.
  • Configurations: The model's config.json includes the specifications below (see the sketch after this list):
    • Model type: m2m_100
    • Maximum sequence length: 200 tokens
    • Number of hidden layers: 12
    • Model dimensions (d_model): 1024
    • Encoder and decoder layers: 12
    • Attention and dropout rates: 0.1
    • Beam search: Enabled with 5 beams
    • Initialization standard deviation: 0.02
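
A minimal sketch of reading these values from the published configuration; the attribute names follow the standard Transformers M2M100Config, and the values in the comments are the ones listed above:

from transformers import M2M100Config

# Load the configuration shipped with the checkpoint
config = M2M100Config.from_pretrained("mittynem/m2m100_418M_en2fa_colloquial")

print(config.model_type)      # m2m_100
print(config.max_length)      # 200
print(config.d_model)         # 1024
print(config.encoder_layers)  # 12
print(config.decoder_layers)  # 12
print(config.dropout)         # 0.1
print(config.num_beams)       # 5
print(config.init_std)        # 0.02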

Training Procedure

  • Training Data: The model was trained on a dataset of colloquial sentence pairs, each consisting of an English sentence and its corresponding human Persian translation.
  • Training Arguments: The following training arguments were used (see the sketch after this list):
    • Batch size: 16
    • Learning rate: 2e-5
    • Weight decay: 0.01
    • Number of training epochs: 2
    • Evaluation strategy: After each epoch
    • Save total limit: 3 checkpoints
    • Predict with generate: Enabled
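
A minimal sketch of how these settings map onto Hugging Face's Seq2SeqTrainingArguments; the output directory is a placeholder, and any argument not listed above is left at its default:

from transformers import Seq2SeqTrainingArguments

# Hyperparameters taken from the list above; output_dir is a hypothetical path
training_args = Seq2SeqTrainingArguments(
    output_dir="m2m100_418M_en2fa_colloquial",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=2,
    evaluation_strategy="epoch",  # renamed to eval_strategy in recent Transformers releases
    save_total_limit=3,
    predict_with_generate=True,
)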

Performance

The model's performance was evaluated with the sacrebleu metric on a held-out test set (see the sketch after this list). The results are as follows:

  • BLEU Score: 15.31
  • Precision (1-gram to 4-gram): 42.87, 21.00, 11.41, 6.27
  • Brevity Penalty: 0.961
  • System Length: 896,975 tokens
  • Reference Length: 932,792 tokens
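
A minimal sketch of reproducing this kind of evaluation with the sacrebleu Python package; the sentences below are placeholders, not the actual test set:

import sacrebleu

# Placeholder system outputs and references; substitute your own test data
hypotheses = ["مهمونی دیشب عالی بود!"]
references = [["مهمونی دیشب خیلی خوب بود!"]]

result = sacrebleu.corpus_bleu(hypotheses, references)
print(result.score)                    # corpus-level BLEU
print(result.precisions)               # 1-gram to 4-gram precisions
print(result.bp)                       # brevity penalty
print(result.sys_len, result.ref_len)  # system and reference lengths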

Example Usage

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "mittynem/m2m100_418M_en2fa_colloquial"
model = M2M100ForConditionalGeneration.from_pretrained(model_name)
tokenizer = M2M100Tokenizer.from_pretrained(model_name)

# Set the source language so the tokenizer prepends the English language token
tokenizer.src_lang = "en"

# English text to translate
english_text = "Last night's party was lit!"

# Tokenize and translate, forcing Persian as the target language (standard M2M100 usage)
encoded = tokenizer(english_text, return_tensors="pt")
translated_ids = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.get_lang_id("fa"),
)
translated_text = tokenizer.batch_decode(translated_ids, skip_special_tokens=True)[0]

print(translated_text)  # Output: "مهمونی دیشب عالی بود!"

Fine-tuning and Transfer Learning

You can further fine-tune this model on specific downstream tasks or domains by using Hugging Face's Transformers library. Refer to the official documentation for guidance on fine-tuning procedures and techniques.
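
A minimal fine-tuning sketch using Seq2SeqTrainer, assuming your data is a set of English–Persian sentence pairs; the toy dataset, column names, and output path below are placeholders to be replaced with your own:

from datasets import Dataset
from transformers import (
    DataCollatorForSeq2Seq,
    M2M100ForConditionalGeneration,
    M2M100Tokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "mittynem/m2m100_418M_en2fa_colloquial"
model = M2M100ForConditionalGeneration.from_pretrained(model_name)
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
tokenizer.src_lang, tokenizer.tgt_lang = "en", "fa"

# Toy in-memory dataset; replace with your own English-Persian pairs
raw = Dataset.from_dict({
    "en": ["Last night's party was lit!"],
    "fa": ["مهمونی دیشب عالی بود!"],
})

def preprocess(batch):
    # Tokenize sources and targets together; the column names are placeholders
    return tokenizer(batch["en"], text_target=batch["fa"],
                     max_length=200, truncation=True)

tokenized = raw.map(preprocess, batched=True, remove_columns=["en", "fa"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="m2m100_en2fa_finetuned",  # hypothetical output path
        per_device_train_batch_size=16,
        learning_rate=2e-5,
        num_train_epochs=2,
        predict_with_generate=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()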


❗Important Note: Potential for Objectionable Words

⚠️ Caution: This model has been trained on colloquial and informal data so that its translations closely resemble everyday vernacular language. While it generally produces accurate and fluent translations, be aware that it may generate words or phrases that some individuals find objectionable or inappropriate.

Please exercise caution and responsibility when using this model, especially in situations where communication with others is involved. The generated translations should be carefully reviewed and edited to ensure they align with the intended context and comply with social norms and guidelines.

Disclaimer

The developer and contributors of this model cannot be held responsible for any potential misuse or unintended consequences resulting from the use of the model. It is the user's responsibility to review and refine the generated translations to ensure they meet their specific requirements, conform to acceptable standards, and are suitable for the target audience.

We strongly recommend that users exercise due diligence and employ human judgment to validate and modify the model's outputs before disseminating them in public or sensitive settings. It is essential to consider cultural, linguistic, and ethical factors to ensure respectful and appropriate communication.


Contact and Licensing

For any questions, suggestions, or issues related to this model, please feel free to reach out to the model developer at mehdinemati.nmt@gmail.com. This model is released under the cc-by-nc-4.0 license, allowing researchers and developers to use it for their projects and experiments for non-commercial purposes.
