zuBERTa

zuBERTa is a RoBERTa-style transformer language model trained on Zulu text.

Intended uses & limitations

The model can be used to produce embeddings for a downstream task such as question answering.
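
The card does not say how embeddings should be extracted, so the following is only a minimal sketch under stated assumptions: it loads the checkpoint with `AutoModel` and averages the final-layer token vectors over non-padding positions (mean pooling). The helper names `mean_pool` and `embed_sentences` are hypothetical, and mean pooling is one common choice rather than the author's documented method.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, ignoring padding positions.

    Mean pooling is an assumption here; the model card does not prescribe
    a pooling strategy.
    """
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the hidden size.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Guard against division by zero for an all-padding row.
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts


def embed_sentences(sentences, model_name="MoseliMotsoehli/zuBERTa"):
    """Return one mean-pooled embedding per sentence (downloads the model)."""
    # Imported here so that mean_pool can be used without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)  # encoder only, no LM head
    model.eval()

    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    return mean_pool(outputs.last_hidden_state, batch["attention_mask"])
```

Calling `embed_sentences(["Abafika eNkandla bafika sebeholwa khona uMpongo kaZingelwayo."])` should return a tensor with one row per input sentence, which can then be fed to a task-specific classifier or retrieval component.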

How to use

>>> from transformers import pipeline
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM

>>> # Load the tokenizer and the model with its masked-language-modelling head
>>> tokenizer = AutoTokenizer.from_pretrained("MoseliMotsoehli/zuBERTa")
>>> model = AutoModelForMaskedLM.from_pretrained("MoseliMotsoehli/zuBERTa")
>>> # Build a fill-mask pipeline and predict the token hidden behind <mask>
>>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
>>> unmasker("Abafika eNkandla bafika sebeholwa <mask> uMpongo kaZingelwayo.")

[
  {
    "sequence": "<s>Abafika eNkandla bafika sebeholwa khona uMpongo kaZingelwayo.</s>",
    "score": 0.050459690392017365,
    "token": 555,
    "token_str": "Ġkhona"
  },
  {
    "sequence": "<s>Abafika eNkandla bafika sebeholwa inkosi uMpongo kaZingelwayo.</s>",
    "score": 0.03668094798922539,
    "token": 2321,
    "token_str": "Ġinkosi"
  },
  {
    "sequence": "<s>Abafika eNkandla bafika sebeholwa ubukhosi uMpongo kaZingelwayo.</s>",
    "score": 0.028774697333574295,
    "token": 5101,
    "token_str": "Ġubukhosi"
  }
]

Training data

  1. 30k sentences from the 2018 Zulu corpus of the Leipzig Corpora Collection, collected from news articles and creative writing.
  2. ~7,500 articles of human-generated translations scraped from the Zulu Wikipedia.

BibTeX entry and citation info

@inproceedings{motsoehli2020,
  author = {Moseli Motsoehli},
  title = {Towards transformation of Southern African language models through transformers.},
  year = {2020}
}