VOCALINLP/portuguese_capitalization_punctuation_restoration_sanivert


jcg00v committed on Mar 4

Commit 2be935d • 1 Parent(s): 8ced499

Update README.md

Files changed (1)

README.md +79 -1


@@ -8,4 +8,82 @@ widget:
     example_title: Example 3
 language:
   - pt
----

+---
+# Portuguese punctuation and capitalisation restoration model
+## Details of the model
+This is a reduced version of the Portuguese capitalisation and punctuation restoration model developed by [VÓCALI](https://www.vocali.net) as part of the SANIVERT project.
+You can try the model in the following [SPACE](https://huggingface.co/spaces/VOCALINLP/punctuation_and_capitalization_restoration_sanivert).
+## Details of the dataset
+## Evaluation Metrics
+## Funding
+This work was funded by the Spanish Government, through the Spanish Ministry of Economy and Digital Transformation under the "Recovery, Transformation and Resilience Plan", and by the European Union NextGenerationEU/PRTR, through the research project 2021/C005/0015007.
+## How to use the model
+```py
+from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
+import torch
+def get_result_text_es_pt(list_entity, text, lang):
+    """Rebuild the capitalised, punctuated text from the pipeline's token predictions."""
+    result_words = []
+    # Spanish also has the opening marks "¿" and "¡", which go before the word
+    if lang == "es":
+        punc_tags = ['¿', '?', '¡', '!', ',', '.', ':']
+    else:
+        punc_tags = ['?', '!', ',', '.', ':']
+    for entity in list_entity:
+        tag = entity["entity"]
+        word = entity["word"]
+        start = entity["start"]
+        end = entity["end"]
+        # punctuation mark contained in the tag, or "" if there is none
+        punc_in = next((p for p in punc_tags if p in tag), "")
+        # subword pieces are marked with a leading "#" and merged into the previous word
+        subword = word.startswith("#")
+        if subword:
+            if punc_in != "":
+                # drop the mark from the previous piece; it is re-added after the full word
+                word = result_words[-1].replace(punc_in, "") + text[start:end]
+            else:
+                word = result_words[-1] + text[start:end]
+        if tag == "u":
+            word = word.capitalize()
+        elif tag != "l":
+            # case with punctuation: the last character of the tag is the case label
+            if tag[-1] == "l":
+                word = (punc_in + word) if punc_in in ["¿", "¡"] else (word + punc_in)
+            elif tag[-1] == "u":
+                word = (punc_in + word.capitalize()) if punc_in in ["¿", "¡"] else (word.capitalize() + punc_in)
+        if subword:
+            result_words[-1] = word
+        else:
+            result_words.append(word)
+    return " ".join(result_words)
+lang = "pt"
+model_path = "VOCALINLP/portuguese_capitalization_punctuation_restoration_sanivert"
+model = AutoModelForTokenClassification.from_pretrained(model_path)
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+pipe = pipeline("token-classification", model=model, tokenizer=tokenizer)
+text = "é preciso fazer análises ao sangue à urina e aos ouvidos"
+result = pipe(text)
+print("Source text: " + text)
+result_text = get_result_text_es_pt(result, text, lang)
+print("Restored text: " + result_text)
+```
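The tag handling in the snippet above can be explored without downloading the model. The following is a minimal sketch on hand-written predictions: `apply_tag` and `mock_entities` are hypothetical names, and the tag spellings (`u`, `l`, `,l`, `.l`) are assumptions inferred from how `get_result_text_es_pt` reads the tags, not values taken from the model's configuration. Subword merging is left out for brevity.

```python
# Hand-written stand-ins for pipeline output; these are NOT real model predictions.
# Assumed tag scheme: "l" = keep case, "u" = capitalise, optional leading punctuation mark.
PUNCTUATION = ["?", "!", ",", ".", ":"]


def apply_tag(word, tag):
    """Apply one predicted tag to one whole word (no subword handling)."""
    # punctuation mark contained in the tag, or "" if there is none
    punct = next((p for p in PUNCTUATION if p in tag), "")
    # the last character of the tag is the case label
    if tag.endswith("u"):
        word = word.capitalize()
    return word + punct


mock_entities = [
    {"word": "é", "entity": "u"},
    {"word": "preciso", "entity": "l"},
    {"word": "fazer", "entity": "l"},
    {"word": "análises", "entity": "l"},
    {"word": "ao", "entity": "l"},
    {"word": "sangue", "entity": ",l"},
    {"word": "à", "entity": "l"},
    {"word": "urina", "entity": "l"},
    {"word": "e", "entity": "l"},
    {"word": "aos", "entity": "l"},
    {"word": "ouvidos", "entity": ".l"},
]

restored = " ".join(apply_tag(e["word"], e["entity"]) for e in mock_entities)
print(restored)  # É preciso fazer análises ao sangue, à urina e aos ouvidos.
```

The labels the model actually emits can be checked with `model.config.id2label` once the model is loaded.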
+> Created by [VOCALI SISTEMAS INTELIGENTES S.L.](https://www.vocali.net)