	---
	language: "ar"
	tags:
	- translation
	- nllb
	- fine-tuning
	- darija
	- moroccan
	- transformers
	datasets:
	- json
	library_name: transformers
	model_name: tachicart/nllb-ft-darija
	---

	# NLLB Fine-tuned for Darija to Modern Standard Arabic Translation

	This model is a fine-tuned version of `facebook/nllb-200-distilled-600M` for translating Moroccan Darija (ary) to Modern Standard Arabic (ar). The model was fine-tuned on a custom dataset using the Hugging Face `transformers` library.
	The model was developed by Tachicart Ridouane and Bouzoubaa Karim.
	## Model Details

	- Base Model: `facebook/nllb-200-distilled-600M`
	- Fine-tuning Library: Hugging Face `transformers`
	- Languages Supported: Moroccan Darija (ary), Modern Standard Arabic (ar)
	- Training Dataset: Custom dataset of Moroccan Darija and Modern Standard Arabic pairs in JSON format.
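To make the "pairs in JSON format" description concrete, here is a minimal sketch of how such a file could be parsed into source/target pairs. The field names `darija` and `msa`, the helper `load_pairs`, and the sample record (including its Modern Standard Arabic rendering) are assumptions made for illustration; the actual schema and contents of the training data are not documented here.

```python
import json
from typing import List, Tuple

# Hypothetical record layout: the keys "darija" and "msa" are assumed for
# illustration and are not the documented schema of the training data.
RAW = """
[
  {"darija": "كيفاش نقدر نربح بزاف ديال الفلوس بالزربة",
   "msa": "كيف يمكنني أن أربح الكثير من المال بسرعة"}
]
"""


def load_pairs(text: str) -> List[Tuple[str, str]]:
    """Parse a JSON array of records into (source, target) string pairs."""
    records = json.loads(text)
    pairs = []
    for record in records:
        source = record["darija"].strip()
        target = record["msa"].strip()
        # Skip records where either side is empty after trimming whitespace.
        if source and target:
            pairs.append((source, target))
    return pairs


pairs = load_pairs(RAW)
print(len(pairs))
```

Keeping each pair as a plain tuple makes it easy to hand the data to any tokenization or batching step later on.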
	## Performance
	The model was evaluated on a validation set to check translation quality. It handles colloquial Moroccan Arabic well, though further improvements and additional training data should enhance its performance.
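The metric used for the validation run is not stated, so the following is only a sketch of how a validation pass could be scored. It uses whitespace-token F1 overlap as a simple stand-in metric, not the authors' evaluation method; `unigram_f1`, `evaluate`, and the `translate` callable are hypothetical names introduced for this example.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Whitespace-token F1 overlap between one hypothesis and one reference."""
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    if not hyp_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: a shared token counts at most as often as it
    # appears on both sides.
    overlap = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(
    pairs: Iterable[Tuple[str, str]],
    translate: Callable[[str], str],
) -> float:
    """Average unigram F1 of translate(source) against each reference."""
    scores = [unigram_f1(translate(src), ref) for src, ref in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

In practice `translate` would wrap the tokenizer and `model.generate` calls shown under How to Use, and an established metric such as BLEU or chrF would normally replace the stand-in.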

	## Limitations
	- Dataset Size: The custom dataset consists of 21,000 samples, which may limit coverage of diverse expressions and rare terms.
	- Colloquial Variations: Moroccan Arabic has many dialectal variations, which might not all be covered equally.

	## How to Use

	You can use the model with the `transformers` library as follows:

	```python
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

	tokenizer = AutoTokenizer.from_pretrained("tachicart/nllb-ft-darija")
	model = AutoModelForSeq2SeqLM.from_pretrained("tachicart/nllb-ft-darija")

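	# Example translation. The Darija input means roughly
	# "How can I earn a lot of money quickly?"
	inputs = tokenizer("كيفاش نقدر نربح بزاف ديال الفلوس بالزربة", return_tensors="pt")
	# Generate the Modern Standard Arabic translation as token IDs
	outputs = model.generate(**inputs)
	# Decode the first (and only) output sequence, dropping special tokens
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))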