
dv-muril

This is an experiment in transfer learning: inserting Dhivehi word and word-piece tokens into Google's MuRIL model.

This BERT-based model currently outperforms dv-wave (ELECTRA) on the Maldivian News Classification task (https://github.com/Sofwath/DhivehiDatasets).
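
As a quick illustration, here is a minimal fine-tuning sketch with Hugging Face Transformers. The Hub repo id and the label count below are assumptions, not facts from this card:

```python
# Hedged sketch: fine-tuning dv-muril on a text-classification task with
# Hugging Face Transformers. The Hub repo id and the label count are
# assumptions, not facts from this card.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "monsoon-nlp/dv-muril"  # assumed repo id; substitute the real one
NUM_LABELS = 8                     # assumed number of news categories

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=NUM_LABELS
)

# Tokenize one headline (placeholder text) and run a forward pass.
inputs = tokenizer("...", return_tensors="pt", truncation=True, max_length=128)
logits = model(**inputs).logits  # shape: (1, NUM_LABELS)
```

From here, standard Trainer fine-tuning on the news dataset applies.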

Training

  • Start with MuRIL (similar to mBERT), which has no Thaana vocabulary
  • Based on PanLex dictionaries, attach 1,100 Dhivehi words to the embeddings of their Malayalam or English translations (see the sketch after this list)
  • Add the remaining words and word-pieces from a BertWordPieceTokenizer vocab.txt
  • Continue BERT pretraining
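
A rough sketch of the token-insertion steps above, assuming Hugging Face Transformers. The file path, the PanLex dictionary format, and the mean-embedding fallback are assumptions, and `add_tokens` is a simplification of true vocab.txt surgery:

```python
# Hedged sketch of the token-insertion steps above; file names, the PanLex
# dictionary format, and the fallback initialization are assumptions.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModelForMaskedLM.from_pretrained("google/muril-base-cased")
muril_vocab = tokenizer.get_vocab()

# Dhivehi tokens from a BertWordPieceTokenizer-trained vocab.txt
# (hypothetical path), keeping only tokens MuRIL does not already have.
with open("vocab.txt", encoding="utf-8") as f:
    dv_tokens = [t for t in (line.strip() for line in f)
                 if t and t not in muril_vocab]

# PanLex-derived map: Dhivehi word -> Malayalam/English translation that is
# already in MuRIL's vocabulary. The empty dict here is a placeholder.
panlex = {}  # e.g. {"<dhivehi word>": "<malayalam/english word>"}

num_added = tokenizer.add_tokens(dv_tokens)  # simplification of vocab surgery
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings().weight
with torch.no_grad():
    mean_emb = emb[: len(tokenizer) - num_added].mean(dim=0)
    for tok in dv_tokens:
        new_id = tokenizer.convert_tokens_to_ids(tok)
        src = panlex.get(tok)
        if src in muril_vocab:
            # Attach the Dhivehi word to its translation's embedding.
            emb[new_id] = emb[muril_vocab[src]]
        else:
            # Unmatched pieces: fall back to the mean of existing embeddings.
            emb[new_id] = mean_emb

# Continue masked-LM pretraining on Dhivehi text from here, e.g. with the
# Trainer API or transformers' run_mlm.py example script.
```

Seeding new rows from translation embeddings gives the Dhivehi tokens a better starting point than random initialization before the continued-pretraining step.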

Performance

Results on the Maldivian News Classification task:

  • mBERT: 52%
  • dv-wave (ELECTRA, 30k vocab): 89%
  • dv-muril (10k vocab) before BERT pretraining step: 89.8%
  • previous dv-muril (30k vocab): 90.7%
  • dv-muril (10k vocab): 91.6%

Colab notebook: https://colab.research.google.com/drive/113o6vkLZRkm6OwhTHrvE0x6QPpavj0fn
