---
widget:
- text: teve alguma doença na última semana
  example_title: Example 1
- text: sofre da doença de parkinson
  example_title: Example 2
- text: é preciso fazer análises ao sangue à urina e aos ouvidos
  example_title: Example 3
language:
- pt
---

# Portuguese punctuation and capitalisation restoration model

## Details of the model

This is a reduced version of the Portuguese capitalisation and punctuation restoration model developed by [VÓCALI](https://www.vocali.net) as part of the SANIVERT project.

You can try the model in the following [SPACE](https://huggingface.co/spaces/VOCALINLP/punctuation_and_capitalization_restoration_sanivert).

## Details of the dataset

This is a neuralmind/bert-base-portuguese-cased model fine-tuned for punctuation restoration using the following data distribution.

| Language   | Number of text samples | Number of tokens |
| ---------- | ---------------------- | ---------------- |
| Portuguese | 2,974,058              | 49,720,263       |

## Evaluation Metrics

The metrics used for the evaluation of the model are the macro and the weighted F1 scores (a minimal sketch of how such scores can be computed is included at the end of this card).

## Funding

This work was funded by the Spanish Government and the Spanish Ministry of Economy and Digital Transformation through the "Recovery, Transformation and Resilience Plan", and also funded by the European Union NextGenerationEU/PRTR through the research project 2021/C005/0015007.

## How to use the model

```py
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer


def get_result_text_es_pt(list_entity, text, lang):
    """Rebuild the restored text from the token-classification pipeline output."""
    result_words = []
    tmp_word = ""

    # Inverted punctuation marks only exist in Spanish
    if lang == "es":
        punc_tags = ['¿', '?', '¡', '!', ',', '.', ':']
    else:
        punc_tags = ['?', '!', ',', '.', ':']

    for idx, entity in enumerate(list_entity):
        tag = entity["entity"]
        word = entity["word"]
        start = entity["start"]
        end = entity["end"]

        # Punctuation mark encoded in the tag, if any
        punc_in = next((p for p in punc_tags if p in tag), "")

        # Merge WordPiece subwords (tokens starting with "#") back into full words
        subword = False
        if word[0] == "#":
            subword = True
            if tmp_word == "":
                p_s = list_entity[idx - 1]["start"]
                p_e = list_entity[idx - 1]["end"]
                tmp_word = text[p_s:p_e] + text[start:end]
            else:
                tmp_word = tmp_word + text[start:end]
            word = tmp_word
        else:
            tmp_word = ""
            word = text[start:end]

        # "l" = lower case, "u" = capitalised; tags ending in "l"/"u" may also carry punctuation
        if tag == "l":
            pass
        elif tag == "u":
            word = word.capitalize()
        elif tag[-1] == "l":
            word = (punc_in + word) if punc_in in ["¿", "¡"] else (word + punc_in)
        elif tag[-1] == "u":
            word = (punc_in + word.capitalize()) if punc_in in ["¿", "¡"] else (word.capitalize() + punc_in)

        if subword:
            result_words[-1] = word
        else:
            result_words.append(word)

    return " ".join(result_words)


lang = "pt"
model_path = "VOCALINLP/portuguese_capitalization_punctuation_restoration_sanivert"
model = AutoModelForTokenClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

pipe = pipeline("token-classification", model=model, tokenizer=tokenizer)

text = "é preciso fazer análises ao sangue à urina e aos ouvidos"
result = pipe(text)
print("Source text: " + text)

result_text = get_result_text_es_pt(result, text, lang)
print("Restored text: " + result_text)
```

> Created by [VOCALI SISTEMAS INTELIGENTES S.L.](https://www.vocali.net)
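The macro and weighted F1 scores mentioned in the Evaluation Metrics section can be reproduced with standard tooling. The snippet below is a minimal sketch, not the authors' evaluation script: it assumes you already have flattened gold and predicted token-level tag sequences (the tag values shown are illustrative and may not match the model's actual label set) and uses scikit-learn's `f1_score`.

```py
# Minimal sketch of macro vs weighted F1 with scikit-learn; the tag sequences
# below are hypothetical examples, not real model output.
from sklearn.metrics import f1_score

y_true = ["u", "l", "l", ".l", "u", "l", "?l"]  # gold token-level tags
y_pred = ["u", "l", ".l", "l", "u", "l", "?l"]  # predicted token-level tags

macro_f1 = f1_score(y_true, y_pred, average="macro")        # unweighted mean of per-class F1
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # per-class F1 weighted by support

print(f"Macro F1: {macro_f1:.3f}  Weighted F1: {weighted_f1:.3f}")
```

The macro score treats every tag class equally, while the weighted score reflects the class imbalance typical of punctuation data, where plain lower-case tokens dominate.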