---
widget:
  - text: has tingut alguna malaltia a la darrera setmana
    example_title: Example 1
  - text: pateix la malaltia de parkinson
    example_title: Example 2
  - text: Cal fer anàlisis de sang de visió i d'oïda
    example_title: Example 3
language:
  - ca
---
# Catalan punctuation and capitalisation restoration model
## Details of the model
This is a reduced version of the Catalan capitalisation and punctuation restoration model developed by [VÓCALI](https://www.vocali.net) as part of the SANIVERT project.

You can try the model in the following [Space](https://huggingface.co/spaces/VOCALINLP/punctuation_and_capitalization_restoration_sanivert).
## Details of the dataset


## Evaluation Metrics 

## Funding 
This work was funded by the Spanish Government, the Spanish Ministry of Economy and Digital Transformation through the "Recovery, Transformation and Resilience Plan", and also by the European Union NextGenerationEU/PRTR through the research project 2021/C005/0015007.

## How to use the model

```py
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
import torch

def get_result_text_ca(list_entity, text):
    result_words = []
    punc_tags = ['?', '!', ',', '.', ':']

    for entity in list_entity:
        start = entity["start"]
        end = entity["end"]
        tag = entity["entity"]
        word = entity["word"]

        # check punctuation
        punc_in = next((p for p in punc_tags if p in tag), "")

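        subword = False
        # A token without the "Ġ" prefix continues the previous word.
        # The result_words guard avoids an IndexError when the very first
        # token has no prefix and there is no previous word to extend.
        if word[0] != "Ġ" and result_words:
            subword = True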
            if punc_in != "":
                word = result_words[-1].replace(punc_in, "") + text[start:end]
            else:
                word = result_words[-1] + text[start:end]
        else:
            word = text[start:end]

        if tag == "l":
            pass  # lower-case tag without punctuation: keep the word unchanged
        elif tag == "u":
            word = word.capitalize()
        # case with punctuation
        else:
            if tag[-1] == "l":
                word = (punc_in + word) if punc_in in ["¿", "¡"] else (word + punc_in)
            elif tag[-1] == "u":
                word = (punc_in + word.capitalize()) if punc_in in ["¿", "¡"] else (word.capitalize() + punc_in)

        if subword:
            result_words[-1] = word
        else:
            result_words.append(word)

    return " ".join(result_words)


lang = "ca"
model_path = "VOCALINLP/catalan_capitalization_punctuation_restoration_sanivert"

model = AutoModelForTokenClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

pipe = pipeline("token-classification", model=model, tokenizer=tokenizer)

text = "el pacient presenta els símptomes següents febre dispnea nàusees i vòmits"
result = pipe(text)

print("Source text: " + text)
result_text = get_result_text_ca(result, text)
print("Restored text: " + result_text)
```
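
The decoding step above can be tried without downloading the model. The following is a minimal sketch, assuming the labels combine an optional punctuation mark with a final `l` (keep lower case) or `u` (capitalise), as the checks in `get_result_text_ca` suggest. The helper `apply_tag` and the hand-written tags are hypothetical and are not real model output.

```py
# Punctuation marks looked up in each label, as in get_result_text_ca.
PUNCT = ["?", "!", ",", ".", ":"]


def apply_tag(word, tag):
    """Apply one label to one whole word (hypothetical helper).

    Assumption: a label is an optional punctuation mark followed by
    "l" (lower case) or "u" (capitalised), e.g. "l", "u", ",l", ".u".
    """
    punct = next((p for p in PUNCT if p in tag), "")
    if tag.endswith("u"):
        word = word.capitalize()
    return word + punct


# Hand-written labels standing in for model predictions.
words = ["el", "pacient", "presenta", "febre", "dispnea", "i", "vòmits"]
tags = ["u", "l", "l", ",l", "l", "l", ".l"]

restored = " ".join(apply_tag(w, t) for w, t in zip(words, tags))
print(restored)
```

Unlike `get_result_text_ca`, this sketch works on whole words and does not merge subword tokens, so it only illustrates how a label turns into casing and punctuation.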
 
> Created by [VOCALI SISTEMAS INTELIGENTES S.L.](https://www.vocali.net)