---
language: "ar"
tags:
- translation
- nllb
- fine-tuning
- darija
- moroccan
- transformers
datasets:
- json
library_name: transformers
model_name: tachicart/nllb-ft-darija
---

# NLLB Fine-tuned for Darija to Modern Standard Arabic Translation

This model is a fine-tuned version of `facebook/nllb-200-distilled-600M` for translating Moroccan Darija (ary) into Modern Standard Arabic (ar). It was fine-tuned on a custom dataset using the Hugging Face `transformers` library.

The model was developed by Tachicart Ridouane and Bouzoubaa Karim.

## Model Details

- **Base Model**: `facebook/nllb-200-distilled-600M`
- **Fine-tuning Library**: Hugging Face `transformers`
- **Languages Supported**: Moroccan Darija (ary), Modern Standard Arabic (ar)
- **Training Dataset**: Custom dataset of Moroccan Darija and Modern Standard Arabic pairs in JSON format.
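
The exact schema of the training file is not documented here. As a rough illustration of what "pairs in JSON format" could look like, the sketch below parses a JSON array of records into (source, target) tuples. The field names `darija` and `msa`, the helper `load_pairs`, and the sample records are all hypothetical and chosen only for this example.

```python
import json


def load_pairs(raw_json, src_key="darija", tgt_key="msa"):
    """Parse a JSON array of records into (source, target) sentence pairs.

    The key names are assumptions for illustration, not the real schema.
    Records with a missing or empty side are skipped.
    """
    pairs = []
    for record in json.loads(raw_json):
        src = (record.get(src_key) or "").strip()
        tgt = (record.get(tgt_key) or "").strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# Illustrative sample data only, not taken from the actual training set.
sample = json.dumps(
    [
        {"darija": "كيفاش نقدر نربح بزاف ديال الفلوس بالزربة",
         "msa": "كيف يمكنني أن أربح الكثير من المال بسرعة"},
        {"darija": "شنو سميتك", "msa": ""},  # incomplete pair, skipped
    ],
    ensure_ascii=False,
)

pairs = load_pairs(sample)
print(f"{len(pairs)} usable pair(s)")
```

Dropping incomplete records before training is a common precaution with small parallel corpora, since an empty target side gives the model nothing to learn from.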

## Performance

The model was evaluated on a validation set to check translation quality. It handles colloquial Moroccan Arabic well, and further fine-tuning on additional data is expected to improve it.

## Limitations

- **Dataset Size**: The custom dataset consists of 21,000 samples, which may limit coverage of diverse expressions and rare terms.
- **Colloquial Variations**: Moroccan Arabic has many dialectal variations, which might not all be covered equally.

## How to Use

You can use the model with the `transformers` library as follows:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("tachicart/nllb-ft-darija")
model = AutoModelForSeq2SeqLM.from_pretrained("tachicart/nllb-ft-darija")

# Example translation: tokenize a Darija sentence, generate, and decode the output
inputs = tokenizer("كيفاش نقدر نربح بزاف ديال الفلوس بالزربة", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
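
The base NLLB-200 checkpoints expect the source and target languages to be identified by FLORES-200 codes: `ary_Arab` for Moroccan Arabic and `arb_Arab` for Modern Standard Arabic. Whether this fine-tuned checkpoint needs them depends on how it was trained, which this card does not state. The helper below is therefore only a sketch under that assumption. `translate` is a hypothetical name, and `max_new_tokens=128` is an arbitrary illustrative limit.

```python
def translate(text, tokenizer, model,
              src_lang="ary_Arab", tgt_lang="arb_Arab", max_new_tokens=128):
    """Translate one sentence, setting NLLB language codes explicitly.

    Assumption: the checkpoint follows the base NLLB convention of a
    source-language tag on the input and a forced target-language token
    at the start of generation.
    """
    # Tell the NLLB tokenizer which language tag to attach to the input.
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt")

    # Force the first generated token to be the target-language tag.
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
```

With `tokenizer` and `model` loaded via `AutoTokenizer.from_pretrained` and `AutoModelForSeq2SeqLM.from_pretrained`, a call looks like `translate("كيفاش نقدر نربح بزاف ديال الفلوس بالزربة", tokenizer, model)`. If the output is worse than with the plain `model.generate(**inputs)` call, the checkpoint was probably fine-tuned without explicit language codes, and the simpler call is the better choice.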