---
widget:
- text: has tingut alguna malaltia a la darrera setmana
  example_title: Example 1
- text: pateix la malaltia de parkinson
  example_title: Example 2
- text: Cal fer anàlisis de sang de visió i d'oïda
  example_title: Example 3
language:
- ca
---
# Catalan punctuation and capitalisation restoration model
## Details of the model
This is a reduced version of the Catalan capitalisation and punctuation restoration model developed by [VÓCALI](https://www.vocali.net) as part of the SANIVERT project.
You can try the model in this [Space](https://huggingface.co/spaces/VOCALINLP/punctuation_and_capitalization_restoration_sanivert).
## Details of the dataset
## Evaluation Metrics
## Funding
This work was funded by the Spanish Government, the Spanish Ministry of Economy and Digital Transformation through the "Recovery, Transformation and Resilience Plan", and by the European Union NextGenerationEU/PRTR through the research project 2021/C005/0015007.
## How to use the model
```py
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
import torch


def get_result_text_es_pt(list_entity, text, lang):
    """Rebuild punctuated, capitalised text from token-classification output."""
    result_words = []
    # Spanish also uses opening question and exclamation marks.
    if lang == "es":
        punc_tags = ['¿', '?', '¡', '!', ',', '.', ':']
    else:
        punc_tags = ['?', '!', ',', '.', ':']

    for entity in list_entity:
        tag = entity["entity"]
        word = entity["word"]
        start = entity["start"]
        end = entity["end"]

        # check punctuation: first punctuation mark contained in the tag, if any
        punc_in = next((p for p in punc_tags if p in tag), "")

        subword = False
        # check subwords: a "#" prefix marks a continuation of the previous word
        if word[0] == "#":
            subword = True
            if punc_in != "":
                # drop punctuation already attached to the previous piece
                word = result_words[-1].replace(punc_in, "") + text[start:end]
            else:
                word = result_words[-1] + text[start:end]

        if tag == "l":
            pass  # lowercase, no punctuation: keep the word as is
        elif tag == "u":
            word = word.capitalize()
        # case with punctuation
        else:
            if tag[-1] == "l":
                word = (punc_in + word) if punc_in in ["¿", "¡"] else (word + punc_in)
            elif tag[-1] == "u":
                word = (punc_in + word.capitalize()) if punc_in in ["¿", "¡"] else (word.capitalize() + punc_in)

        if subword:
            # merge the completed word back into the previous slot
            result_words[-1] = word
        else:
            result_words.append(word)

    return " ".join(result_words)


# NOTE: this snippet loads the Spanish model with lang = "es". To run the Catalan
# model instead, point model_path at this repository and use another lang value
# (assumption: the Catalan model uses the same tag scheme).
lang = "es"
model_path = "VOCALINLP/spanish_capitalization_punctuation_restoration_sanivert"

model = AutoModelForTokenClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
pipe = pipeline("token-classification", model=model, tokenizer=tokenizer)

text = "el paciente presenta los siguientes síntomas náuseas vértigo disnea fiebre y dolor abdominal"
result = pipe(text)

print("Source text: " + text)
result_text = get_result_text_es_pt(result, text, lang)
print("Restored text: " + result_text)
```
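To make the tag scheme easier to follow without downloading the model, here is a minimal sketch of how a single tag is applied to a single word. It is a simplified, hypothetical helper (`apply_tag` is not part of the model or of `transformers`) that mirrors the per-word branch of the function above and ignores subword merging. The tag values are inferred from that function: `l` keeps the word lowercase, `u` capitalises it, and an optional punctuation mark precedes the case letter. The `(word, tag)` pairs are hand-written for illustration, not real model output.

```python
# Hypothetical helper: applies one predicted tag to one word.
# Simplified from the decoding function above; subwords are not handled.
OPENING = ("¿", "¡")  # marks that go before the word
PUNC_TAGS = ["¿", "?", "¡", "!", ",", ".", ":"]


def apply_tag(word, tag):
    # First punctuation mark contained in the tag, or "" if there is none.
    punc = next((p for p in PUNC_TAGS if p in tag), "")
    # The last character of the tag carries the case: "u" means capitalise.
    if tag[-1] == "u":
        word = word.capitalize()
    # Opening marks are prefixed, every other mark is appended.
    return punc + word if punc in OPENING else word + punc


# Hand-written (word, tag) pairs, assumed for illustration only.
tagged = [
    ("pateix", "u"),
    ("la", "l"),
    ("malaltia", "l"),
    ("de", "l"),
    ("parkinson", ".u"),
]
restored = " ".join(apply_tag(w, t) for w, t in tagged)
print(restored)  # Pateix la malaltia de Parkinson.
```

Combining case and punctuation in one label lets a single token-classification head predict both at once, which is why the decoder only needs to look at the tag's last character and at any punctuation mark it contains.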
> Created by [VOCALI SISTEMAS INTELIGENTES/@VOCALINLP](https://twitter.com/vocalinet)