Question about tokenizing

#1
by lagoinst - opened

When I run the following program, it returns the result that 'var' is NOT in the vocabulary.
What could be the cause?
I would appreciate it if you could enlighten me.

from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("language-ml-lab/postagger-azb")
model = AutoModelForTokenClassification.from_pretrained("language-ml-lab/postagger-azb")

word = "var"
if word in tokenizer.get_vocab():
    print(f"'{word}' is in the vocabulary.")
else:
    print(f"'{word}' is NOT in the vocabulary.")
DH and NLP Lab org

The issue you encountered stems from a difference in writing systems. The model postagger-azb is trained on Iranian Azerbaijani, which uses the Perso-Arabic script rather than the Latin script. As a result, the model's vocabulary contains words in the Perso-Arabic script, not their Latin-script equivalents.

In this case, 'var' is in the Latin script, but the equivalent in Iranian Azerbaijani is 'وار'. This Perso-Arabic word is present in the model's vocabulary, while the Latin script 'var' is not.
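You can verify this directly by looking up both forms. The snippet below is a minimal sketch that loads the same checkpoint and checks each spelling against the vocabulary:

from transformers import AutoTokenizer

# Load the same tokenizer as above and compare the Latin spelling
# 'var' with its Perso-Arabic equivalent 'وار'.
tokenizer = AutoTokenizer.from_pretrained("language-ml-lab/postagger-azb")
vocab = tokenizer.get_vocab()

for word in ["var", "وار"]:
    status = "is" if word in vocab else "is NOT"
    print(f"'{word}' {status} in the vocabulary.")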

For more details on this model and Iranian Azerbaijani language processing, please refer to our paper.

Thank you for your prompt and detailed reply.

I understand that the text to be used should be in Perso-Arabic script, not Latin script.

This is outside the scope of my original question, but do you know of any tools that can perform part-of-speech analysis on Azerbaijani written in Latin script?

lagoinst changed discussion status to closed
