---
license: mit
language:
- vi
pipeline_tag: token-classification
tags:
- vietnamese
- accents inserter
metrics:
- accuracy
---

# A Transformer model for inserting Vietnamese accent marks

This model is fine-tuned from XLM-Roberta Large.

Example input: `Nhin nhung mua thu di`

Target output: `Nhìn những mùa thu đi`

## Model training

This problem was modelled as a token classification problem. For each input token, the goal is to assign a "tag" that will transform it into the accented token. For more details on the training process, please refer to this blog post.

## How to use this model

There are just a few steps:

- Step 1: Load the model as a token classification model (*AutoModelForTokenClassification*).
- Step 2: Run the input through the model to obtain the tag index for each input token.
- Step 3: Use the tag indices to retrieve the actual tags from the file *selected_tags_names.txt*. Then, apply the conversion indicated by each tag to its token to obtain the accented tokens.

### Step 1: Load model

Note: install the *transformers*, *torch*, and *numpy* packages first.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import numpy as np

def load_trained_transformer_model():
    model_path = "peterhung/transformer-vnaccent-marker"
    tokenizer = AutoTokenizer.from_pretrained(model_path, add_prefix_space=True)
    model = AutoModelForTokenClassification.from_pretrained(model_path)
    return model, tokenizer

model, tokenizer = load_trained_transformer_model()
```

### Step 2: Run input text through the model

```python
# only needed if it's run on GPU
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

# set to eval mode
model.eval()

def insert_accents(text, model, tokenizer):
    our_tokens = text.strip().split()

    # the tokenizer may further split our tokens
    inputs = tokenizer(our_tokens,
                       is_split_into_words=True,
                       truncation=True,
                       padding=True,
                       return_tensors="pt"
                       )
    input_ids = inputs['input_ids']
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
    tokens = tokens[1:-1]

    with torch.no_grad():
        inputs.to(device)
        outputs = model(**inputs)

    predictions = outputs["logits"].cpu().numpy()
    predictions = np.argmax(predictions, axis=2)

    # exclude output at index 0 and the last index, which correspond to '<s>' and '</s>'
    predictions = predictions[0][1:-1]

    assert len(tokens) == len(predictions)

    return tokens, predictions

text = "Nhin nhung mua thu di, em nghe sau len trong nang."
tokens, predictions = insert_accents(text, model, tokenizer)
```

### Step 3: Obtain the accented words

3.1 Download the tags set file (*selected_tags_names.txt*) from this repo. Then load it:

```python
def _load_tags_set(fpath):
    labels = []
    with open(fpath, 'r') as f:
        for line in f:
            line = line.strip()
            if line:
                labels.append(line)
    return labels

label_list = _load_tags_set("/content/training_data/vnaccent/corpus-title.train.selected_tags_names.txt")
assert len(label_list) == 528, f"Expected 528 tags, got {len(label_list)}"
```

3.2 Print out `tokens` and `predictions` obtained above to see what we get:

```python
print(tokens)
print(list(f"{pred} ({label_list[pred]})" for pred in predictions))
```

Obtained:

```python
['▁Nhi', 'n', '▁nhu', 'ng', '▁mua', '▁thu', '▁di', ',', '▁em', '▁nghe', '▁sau', '▁len', '▁trong', '▁nang', '.']
['217 (i-ì)', '217 (i-ì)', '388 (u-ữ)', '388 (u-ữ)', '407 (ua-ùa)', '378 (u-u)', '120 (di-đi)', '0 (-)', '185 (e-e)', '185 (e-e)', '41 (au-âu)', '188 (e-ê)', '302 (o-o)', '14 (a-ắ)', '0 (-)']
```
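3.3 Merge the subword tokens back into words and apply the conversion indicated by each tag. The snippet below is only a minimal sketch of this final step: it assumes that each tag has the form `source-target` (e.g. `ua-ùa` means "replace `ua` with `ùa` in the word"), that a bare `-` means "leave the word unchanged", and that the tag predicted for a word's first subword applies to the whole word. The helper name `merge_tokens_and_apply_tags` is illustrative; the exact logic described in the blog post may differ.

```python
def merge_tokens_and_apply_tags(tokens, predictions, label_list):
    words, word_tags = [], []
    for token, pred in zip(tokens, predictions):
        if token.startswith("▁") or not words:
            # '▁' marks the start of a new word in the SentencePiece vocabulary
            words.append(token.lstrip("▁"))
            word_tags.append(label_list[pred])
        else:
            # continuation subword (or punctuation glued to the previous word)
            words[-1] += token

    accented_words = []
    for word, tag in zip(words, word_tags):
        # a tag looks like "source-target" (e.g. "ua-ùa"); a bare "-" means "keep as is"
        src, _, dst = tag.partition("-")
        accented_words.append(word.replace(src, dst, 1) if src else word)
    return " ".join(accented_words)

print(merge_tokens_and_apply_tags(tokens, predictions, label_list))
# with the predictions shown above: "Nhìn những mùa thu đi, em nghe sâu lên trong nắng."
```

Taking the first subword's tag for the whole word is a simplification; a more careful implementation could aggregate the tags of all subwords belonging to the same word.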
## Limitations

- This model will accept a maximum of 512 tokens, which is a limitation inherited from the base pretrained XLM-Roberta model. Longer inputs will be truncated by the tokenizer call shown above; one way to handle them is sketched below.
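The following helper, `insert_accents_long`, and its chunk size of 400 words are illustrative assumptions rather than part of the model card: it simply splits a long input into chunks of whitespace-separated words and runs `insert_accents` on each chunk; 400 words per chunk is a rough guess aimed at keeping each chunk under the 512-token limit.

```python
def insert_accents_long(text, model, tokenizer, chunk_size=400):
    # process fixed-size chunks of words so the subword count per chunk
    # stays below the 512-token limit (chunk_size is a rough guess)
    words = text.strip().split()
    all_tokens, all_predictions = [], []
    for i in range(0, len(words), chunk_size):
        chunk = " ".join(words[i:i + chunk_size])
        tokens, predictions = insert_accents(chunk, model, tokenizer)
        all_tokens.extend(tokens)
        all_predictions.extend(predictions)
    return all_tokens, all_predictions
```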