monsoon-nlp
/

dv-labse

Inference Endpoints

Model card Files Files and versions Community

Edit model card

dv-labse

This is an experiment in cross-lingual transfer learning, to insert Dhivehi word and word-piece tokens into Google's LaBSE model.

Original model weights: https://huggingface.co/setu4993/LaBSE
Original model announcement: https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html

This currently outperforms dv-wave and dv-MuRIL (a similar transfer learning model) on the Maldivian News Classification task https://github.com/Sofwath/DhivehiDatasets

mBERT: 52%
dv-wave (ELECTRA): 89%
dv-muril: 90.7%
dv-labse: 91.3-91.5% (may continue training)

Training

Start with LaBSE (similar to mBERT) with no Thaana vocabulary
Based on PanLex dictionaries, attach 1,100 Dhivehi words to Sinhalese or English embeddings
Add remaining words and word-pieces from dv-wave's vocabulary to vocab.txt
Continue BERT pretraining on Dhivehi text

CoLab notebook: https://colab.research.google.com/drive/1CUn44M2fb4Qbat2pAvjYqsPvWLt1Novi

Downloads last month: 7

Inference API

This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.