---
widget:
  - text: el paciente presenta los siguientes síntomas náuseas vértigo disnea fiebre y dolor abdominal
    example_title: Example 1
  - text: has tenido alguna enfermedad en la última semana
    example_title: Example 2
  - text: sufre la enfermedad de parkinson
    example_title: Example 3
  - text: es necesario realizar análisis de sangre de visión y de oído
    example_title: Example 4
language:
  - es
---

# Spanish punctuation and capitalisation restoration model
## Details of the model
This is a reduced version of the Spanish capitalisation and punctuation restoration model developed by [VÓCALI](https://www.vocali.net) as part of the SANIVERT project.

You can try the model in the following [SPACE](https://huggingface.co/spaces/VOCALINLP/punctuation_and_capitalization_restoration_sanivert).
## Details of the dataset
This is a dccuchile/bert-base-spanish-wwm-uncased model fine-tuned for punctuation restoration using the following data distribution.

| Language | Number of text samples | Number of tokens |
| -------- | ---------------------- | ---------------- |
| Spanish  | 2,153,296              | 51,049,602       |

## Evaluation Metrics 
The model is evaluated with the Macro and Weighted F1 scores.
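
As an illustration of how the two scores differ, the following minimal sketch computes both on a small set of made-up labels. The label names (`l`, `u`, `,l`, `.l`) and the toy data are assumptions chosen for the example; they are not the model's evaluation data or results.

```python
from collections import Counter

def f1_per_class(y_true, y_pred):
    """Return a dict mapping each label to its F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def weighted_f1(y_true, y_pred):
    """Mean of the per-class F1 scores, weighted by class support in y_true."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)

# Toy labels (hypothetical): one comma is missed by the prediction.
y_true = ["l", "l", "l", "l", ",l", ".l", "u", "l"]
y_pred = ["l", "l", "l", "l", "l", ".l", "u", "l"]

print(f"Macro F1:    {macro_f1(y_true, y_pred):.3f}")
print(f"Weighted F1: {weighted_f1(y_true, y_pred):.3f}")
```

On this toy data the missed comma lowers the Macro F1 more than the Weighted F1, because the macro average gives the rare `,l` class the same weight as the frequent `l` class.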

## Funding 
This work was funded by the Spanish Government and the Spanish Ministry of Economy and Digital Transformation through the "Recovery, Transformation and Resilience Plan", and by the European Union NextGenerationEU/PRTR, through the research project 2021/C005/0015007.

## How to use the model

```py
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
import torch

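def get_result_text_es_pt(list_entity, text, lang):
    # Rebuild the text from the pipeline output: merge subword tokens back
    # into words, then apply the predicted casing and punctuation tag.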
    result_words = []
    tmp_word = ""
    if lang == "es":
        punc_tags = ['¿', '?', '¡', '!', ',', '.', ':']
    else:
        punc_tags = ['?', '!', ',', '.', ':']
    
    for idx, entity in enumerate(list_entity): 
        tag = entity["entity"]
        word = entity["word"]
        start = entity["start"]
        end = entity["end"]
        
        # check punctuation
        punc_in = next((p for p in punc_tags if p in tag), "")
                
        subword = False
        # check subwords (WordPiece continuation tokens start with "##")
        if word.startswith("##"):
            subword = True
            if tmp_word == "":
                p_s = list_entity[idx-1]["start"]
                p_e = list_entity[idx-1]["end"]
                tmp_word = text[p_s:p_e] + text[start:end]
            else: 
                tmp_word = tmp_word + text[start:end]
            word = tmp_word
        else:
            tmp_word = ""
            word = text[start:end]
            
        # "l" keeps the word as is; "u" capitalizes its first letter
        if tag == "l":
            pass
        elif tag == "u":
            word = word.capitalize()
        # case with punctuation
        else:
            if tag[-1] == "l":
                word = (punc_in + word) if punc_in in ["¿", "¡"] else (word + punc_in)
            elif tag[-1] == "u":
                word = (punc_in + word.capitalize()) if punc_in in ["¿", "¡"] else (word.capitalize() + punc_in)     

        if subword:
            result_words[-1] = word
        else:
            result_words.append(word)

    return " ".join(result_words)

lang = "es"
model_path = "VOCALINLP/spanish_capitalization_punctuation_restoration_sanivert"

model = AutoModelForTokenClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

pipe = pipeline("token-classification", model=model, tokenizer=tokenizer)
text = "el paciente presenta los siguientes síntomas náuseas vértigo disnea fiebre y dolor abdominal"
result = pipe(text)

print(f"Source text: {text}")
result_text = get_result_text_es_pt(result, text, lang)
print(f"Restored text: {result_text}")
```
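
The tag handling in the function above can be isolated into a small helper that runs without downloading the model. This is a sketch under assumptions: the tag format (an optional punctuation mark followed by `l` or `u`) is inferred from the code above, and `apply_tag` is a hypothetical name, not part of the released code.

```python
OPENING_MARKS = ("¿", "¡")
PUNCTUATION = ("¿", "?", "¡", "!", ",", ".", ":")

def apply_tag(word, tag):
    """Apply a casing/punctuation tag such as "l", "u", ",l" or "¿u" to a word."""
    # Find the punctuation mark encoded in the tag, if any.
    punct = next((p for p in PUNCTUATION if p in tag), "")
    # A trailing "u" means the word starts with an uppercase letter.
    if tag.endswith("u"):
        word = word.capitalize()
    # Opening marks go before the word; all other marks go after it.
    if punct in OPENING_MARKS:
        return punct + word
    return word + punct

print(apply_tag("el", "u"))
print(apply_tag("síntomas", ":l"))
print(apply_tag("has", "¿u"))
```

Keeping this step separate from the subword merging makes it easier to check the casing and punctuation logic with plain strings.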
 
> Created by [VOCALI SISTEMAS INTELIGENTES S.L.](https://www.vocali.net)