metadata

widget:
  - text: has tingut alguna malaltia a la darrera setmana
    example_title: Example 1
  - text: pateix la malaltia de parkinson
    example_title: Example 2
  - text: Cal fer anàlisis de sang de visió i d'oïda
    example_title: Example 3
language:
  - ca

Catalan punctuation and capitalisation restoration model

Details of the model

This is a reduced version of the Catalan capitalisation and punctuation restoration model developed by VÓCALI as part of the SANIVERT project.

You can try the model in the following Space.

Details of the dataset

This is a PlanTL-GOB-ES/roberta-base-ca model fine-tuned for punctuation restoration using the following data distribution.

Language	Number of text samples	Number of tokens
Catalan	57,543	2,299,616

Evaluation Metrics

The model is evaluated with the macro and weighted F1 scores.
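The two averages differ in how they treat rare labels, which matters here because punctuation tags are much less frequent than plain lowercase tokens. The following is a minimal sketch in plain Python, not the evaluation script used for this model; the label names and the toy data are assumptions made for illustration only.

```python
from collections import Counter

def f1_per_label(y_true, y_pred):
    """Return {label: (f1, support)} for every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = (2 * tp / denom if denom else 0.0, support[label])
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    scores = f1_per_label(y_true, y_pred)
    return sum(f1 for f1, _ in scores.values()) / len(scores)

def weighted_f1(y_true, y_pred):
    """Mean of the per-label F1 scores, weighted by each label's support."""
    scores = f1_per_label(y_true, y_pred)
    total = sum(n for _, n in scores.values())
    return sum(f1 * n for f1, n in scores.values()) / total

# Toy token-level labels (hypothetical, not taken from the model's test set)
y_true = ["l", "l", "l", "l", "u", ".l"]
y_pred = ["l", "l", "l", "u", "u", "l"]

print(f"macro F1:    {macro_f1(y_true, y_pred):.4f}")
print(f"weighted F1: {weighted_f1(y_true, y_pred):.4f}")
```

In this toy case the missed ".l" label pulls the macro score down more than the weighted one, because every label counts equally in the macro average regardless of how often it occurs.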

Funding

This work was funded by the Spanish Government (Spanish Ministry of Economy and Digital Transformation) through the "Recovery, Transformation and Resilience Plan" and by the European Union NextGenerationEU/PRTR through the research project 2021/C005/0015007.

How to use the model

from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
import torch

def get_result_text_ca(list_entity, text):
    """Rebuild `text`, applying the casing and punctuation tag predicted for each token."""
    result_words = []
    punc_tags = ['?', '!', ',', '.', ':']
    tmp_word = ""
    for idx, entity in enumerate(list_entity):
        start = entity["start"]
        end = entity["end"]
        tag = entity["entity"]
        word = entity["word"]

        # Punctuation mark carried by the tag, if any
        punc_in = next((p for p in punc_tags if p in tag), "")

        # A token without the "Ġ" prefix continues the previous word.
        # The first token has no predecessor, so it always starts a word.
        subword = idx > 0 and word[0] != "Ġ"
        if subword:
            if tmp_word == "":
                # Second piece of a word: start from the previous token's text
                p_s = list_entity[idx-1]["start"]
                p_e = list_entity[idx-1]["end"]
                tmp_word = text[p_s:p_e] + text[start:end]
            else:
                tmp_word = tmp_word + text[start:end]
            word = tmp_word
        else:
            tmp_word = ""
            word = text[start:end]

        if tag == "u":
            word = word.capitalize()
        elif tag != "l":
            # Tag with punctuation: its last character gives the casing
            if tag[-1] == "l":
                word = word + punc_in
            elif tag[-1] == "u":
                word = word.capitalize() + punc_in

        if subword:
            # Overwrite the partial word emitted for the previous piece
            result_words[-1] = word
        else:
            result_words.append(word)

    return " ".join(result_words)


lang = "ca"
model_path = "VOCALINLP/catalan_capitalization_punctuation_restoration_sanivert"

model = AutoModelForTokenClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

pipe = pipeline("token-classification", model=model, tokenizer=tokenizer)

text = "el pacient presenta els símptomes següents febre dispnea nàusees i vòmits"
result = pipe(text)

print("Source text: " + text)
result_text = get_result_text_ca(result, text)
print("Restored text: " + result_text)
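The tag scheme that get_result_text_ca decodes can be tried without downloading the model. The sketch below assumes, from the checks in that function, that a tag is either a bare casing flag ("l" or "u") or a punctuation mark followed by that flag; `apply_tag` is a hypothetical helper, and the tags are written by hand rather than predicted by the model.

```python
PUNC_TAGS = ["?", "!", ",", ".", ":"]

def apply_tag(word, tag):
    """Apply one tag to one whole word (hypothetical helper, subwords not handled)."""
    # Punctuation mark carried by the tag, if any
    punc = next((p for p in PUNC_TAGS if p in tag), "")
    # A trailing "u" means the word should be capitalised
    if tag.endswith("u"):
        word = word.capitalize()
    return word + punc

words = ["el", "pacient", "presenta", "febre", "dispnea", "i", "vòmits"]
tags = ["u", "l", "l", ",l", "l", "l", ".l"]  # hand-written, assumed tag format

restored = " ".join(apply_tag(w, t) for w, t in zip(words, tags))
print(restored)
```

Note that `str.capitalize()` also lowercases the rest of the word, so acronyms in already-cased input would be flattened; for lowercase input, such as speech recognition output, this has no effect.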

Created by VOCALI SISTEMAS INTELIGENTES S.L.