	---
	widget:
	- text: has tingut alguna malaltia a la darrera setmana
	  example_title: Example 1
	- text: pateix la malaltia de parkinson
	  example_title: Example 2
	- text: Cal fer anàlisis de sang de visió i d'oïda
	  example_title: Example 3
	language:
	- ca
	---
	# Catalan punctuation and capitalisation restoration model
	## Details of the model
	This is a reduced version of the Catalan capitalisation and punctuation restoration model developed by [VÓCALI](https://www.vocali.net) as part of the SANIVERT project.

	You can try the model in this [Space](https://huggingface.co/spaces/VOCALINLP/punctuation_and_capitalization_restoration_sanivert).
	## Details of the dataset

	## Evaluation Metrics
	The model is evaluated with the macro and weighted F1 scores.
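As an illustration of how the two averages differ, here is a minimal pure-Python sketch that computes both from token-level labels. The label names (`u`, `l`, `,l`, `.l`) and the toy sequences are assumptions made up for the example, not the model's actual label set or evaluation data.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro F1, weighted F1) for two equally long label sequences."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of gold occurrences per label
    per_label = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_label[label] = 2 * tp / denom if denom else 0.0
    # Macro: every label counts equally, however rare it is.
    macro = sum(per_label.values()) / len(labels)
    # Weighted: each label's F1 is weighted by its gold frequency.
    weighted = sum(per_label[l] * support[l] for l in labels) / len(y_true)
    return macro, weighted


# Hypothetical gold and predicted tags for six tokens (one missed comma).
y_true = ["u", "l", "l", ",l", "l", ".l"]
y_pred = ["u", "l", "l", "l", "l", ".l"]

macro, weighted = f1_scores(y_true, y_pred)
print(f"macro F1: {macro:.3f}, weighted F1: {weighted:.3f}")
# macro F1: 0.714, weighted F1: 0.762
```

The macro score drops further than the weighted one because the missed comma label is rare, which is why reporting both is useful when punctuation classes are much less frequent than plain lowercase tokens.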

	## Funding
	This work was funded by the Spanish Government, the Spanish Ministry of Economy and Digital Transformation through the "Recovery, Transformation and Resilience Plan", and also by the European Union NextGenerationEU/PRTR through the research project 2021/C005/0015007.

	## How to use the model

	```py
	from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer


	def get_result_text_ca(list_entity, text):
	    """Rebuild `text` with the capitalisation and punctuation predicted for each token."""
	    result_words = []
	    punc_tags = ['?', '!', ',', '.', ':']
	    tmp_word = ""
	    for idx, entity in enumerate(list_entity):
	        start = entity["start"]
	        end = entity["end"]
	        tag = entity["entity"]
	        word = entity["word"]

	        # Punctuation mark encoded in the tag, if any
	        punc_in = next((p for p in punc_tags if p in tag), "")

	        # A token without the "Ġ" prefix continues the previous word.
	        # The first token always starts a word, so idx - 1 never wraps around.
	        subword = idx > 0 and word[0] != "Ġ"
	        if subword:
	            if tmp_word == "":
	                p_s = list_entity[idx - 1]["start"]
	                p_e = list_entity[idx - 1]["end"]
	                tmp_word = text[p_s:p_e] + text[start:end]
	            else:
	                tmp_word = tmp_word + text[start:end]
	            word = tmp_word
	        else:
	            tmp_word = ""
	            word = text[start:end]

	        # "l" keeps the word as is, "u" capitalises it
	        if tag == "u":
	            word = word.capitalize()
	        elif tag != "l":
	            # Tag with punctuation: opening marks go before the word, the rest after it
	            if tag[-1] == "l":
	                word = (punc_in + word) if punc_in in ["¿", "¡"] else (word + punc_in)
	            elif tag[-1] == "u":
	                word = (punc_in + word.capitalize()) if punc_in in ["¿", "¡"] else (word.capitalize() + punc_in)

	        # A subword replaces the partial word already stored
	        if subword:
	            result_words[-1] = word
	        else:
	            result_words.append(word)

	    return " ".join(result_words)


	model_path = "VOCALINLP/catalan_capitalization_punctuation_restoration_sanivert"

	model = AutoModelForTokenClassification.from_pretrained(model_path)
	tokenizer = AutoTokenizer.from_pretrained(model_path)

	pipe = pipeline("token-classification", model=model, tokenizer=tokenizer)

	text = "el pacient presenta els símptomes següents febre dispnea nàusees i vòmits"
	result = pipe(text)

	print("Source text: " + text)
	result_text = get_result_text_ca(result, text)
	print("Restored text: " + result_text)
	```
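To make the tag handling easier to follow without downloading the model, here is a minimal sketch of how a single tag might be applied to a whole word. It assumes, based only on the function above, that a tag ends in `u` (capitalise) or `l` (leave as is) and may also contain one punctuation mark. The helper `apply_tag` and the word/tag pairs are hypothetical; the real label set is defined in the model's configuration.

```python
PUNCTUATION = ("?", "!", ",", ".", ":")


def apply_tag(word: str, tag: str) -> str:
    """Apply one predicted tag to one whole word (hypothetical helper)."""
    # Punctuation mark carried by the tag, if any
    punct = next((p for p in PUNCTUATION if p in tag), "")
    # A tag ending in "u" asks for a capitalised word
    if tag.endswith("u"):
        word = word.capitalize()
    return word + punct


# Made-up (word, tag) pairs; these are not real model predictions.
tagged = [
    ("el", "u"),
    ("pacient", "l"),
    ("presenta", "l"),
    ("febre", ",l"),
    ("dispnea", "l"),
    ("i", "l"),
    ("vòmits", ".l"),
]

restored = " ".join(apply_tag(word, tag) for word, tag in tagged)
print(restored)
# El pacient presenta febre, dispnea i vòmits.
```

This leaves out the subword merging that `get_result_text_ca` performs, so it only illustrates the tag scheme and is not a replacement for that function.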

	> Created by [VOCALI SISTEMAS INTELIGENTES S.L.](https://www.vocali.net)