---
language: "ar"
tags:
- translation
- nllb
- fine-tuning
- darija
- moroccan
- transformers
datasets:
- json
library_name: transformers
model_name: tachicart/nllb-ft-darija
---
# NLLB Fine-tuned for Darija to Modern Standard Arabic Translation
This model is a fine-tuned version of `facebook/nllb-200-distilled-600M` for translating Moroccan Darija (ary) to Modern Standard Arabic (ar). The model was fine-tuned on a custom dataset using the Hugging Face `transformers` library.
The model was developed by Tachicart Ridouane and Bouzoubaa Karim.
## Model Details
- **Base Model**: `facebook/nllb-200-distilled-600M`
- **Fine-tuning Library**: Hugging Face `transformers`
- **Languages Supported**: Moroccan Darija (ary), Modern Standard Arabic (ar)
- **Training Dataset**: Custom dataset of Moroccan Darija and Modern Standard Arabic sentence pairs in JSON format (a rough fine-tuning sketch is shown below).
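The card does not include the actual training script. For orientation only, here is a minimal sketch of how such a run could be set up with `Seq2SeqTrainer`, assuming a local `darija_msa.json` file with `darija` and `msa` fields and the standard NLLB FLORES-200 codes (`ary_Arab`, `arb_Arab`); the file name, field names, and hyperparameters are illustrative assumptions, not the authors' configuration.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

base = "facebook/nllb-200-distilled-600M"
# Assumed FLORES-200 codes: Moroccan Darija = "ary_Arab", Modern Standard Arabic = "arb_Arab"
tokenizer = AutoTokenizer.from_pretrained(base, src_lang="ary_Arab", tgt_lang="arb_Arab")
model = AutoModelForSeq2SeqLM.from_pretrained(base)

# Hypothetical JSON file with one {"darija": ..., "msa": ...} record per line
dataset = load_dataset("json", data_files="darija_msa.json")["train"].train_test_split(test_size=0.05)

def preprocess(batch):
    # text_target tokenizes the MSA side as labels with the target-language token
    return tokenizer(batch["darija"], text_target=batch["msa"],
                     max_length=128, truncation=True)

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="nllb-ft-darija",
    learning_rate=2e-5,            # assumed hyperparameters, not the authors'
    per_device_train_batch_size=8,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

Tokenizing with `text_target` lets the NLLB tokenizer insert the correct target-language token into the labels, which is what the model conditions on at generation time.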
## Performance
The model was evaluated on a held-out validation set to check translation quality. It handles common colloquial Moroccan Arabic well, but additional data and continued fine-tuning could further improve coverage and accuracy.
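No metric scores are reported here. As an illustration only, a held-out set could be scored with chrF and BLEU via the `evaluate` library; the sentence pair below is made up for demonstration, and the FLORES-200 codes are assumptions.

```python
import evaluate
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("tachicart/nllb-ft-darija", src_lang="ary_Arab")
model = AutoModelForSeq2SeqLM.from_pretrained("tachicart/nllb-ft-darija")

# Hypothetical held-out pair: Darija source, MSA reference (illustrative only)
sources = ["كيفاش نقدر نربح بزاف ديال الفلوس بالزربة"]
references = [["كيف يمكنني أن أربح الكثير من المال بسرعة"]]

inputs = tokenizer(sources, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model.generate(**inputs,
                             forced_bos_token_id=tokenizer.convert_tokens_to_ids("arb_Arab"))
predictions = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# chrF and BLEU via sacrebleu-backed metrics
chrf = evaluate.load("chrf")
bleu = evaluate.load("sacrebleu")
print(chrf.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=references))
```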
## Limitations
- **Dataset Size**: The custom dataset consists of 21,000 samples, which may limit coverage of diverse expressions and rare terms.
- **Colloquial Variation**: Moroccan Arabic has many regional and dialectal variations, which may not all be covered equally.
## How to Use
You can use the model with the `transformers` library as follows. The example assumes NLLB's standard FLORES-200 language codes, `ary_Arab` for Moroccan Darija and `arb_Arab` for Modern Standard Arabic:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# NLLB tokenizers expect a source language code; Moroccan Darija is "ary_Arab"
tokenizer = AutoTokenizer.from_pretrained("tachicart/nllb-ft-darija", src_lang="ary_Arab")
model = AutoModelForSeq2SeqLM.from_pretrained("tachicart/nllb-ft-darija")

# Example translation (Darija input, roughly: "How can I make a lot of money quickly?")
inputs = tokenizer("كيفاش نقدر نربح بزاف ديال الفلوس بالزربة", return_tensors="pt")

# Force decoding into Modern Standard Arabic ("arb_Arab")
outputs = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("arb_Arab"))
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
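Alternatively, the model can be wrapped in the `translation` pipeline, which handles tokenization, language tags, and decoding in one call. The language codes here are again the assumed FLORES-200 tags:

```python
from transformers import pipeline

# Translation pipeline with explicit source and target language codes
translator = pipeline(
    "translation",
    model="tachicart/nllb-ft-darija",
    src_lang="ary_Arab",
    tgt_lang="arb_Arab",
)
print(translator("كيفاش نقدر نربح بزاف ديال الفلوس بالزربة")[0]["translation_text"])
```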