	---
	license: mit
	language:
	- fi
	metrics:
	- f1
	- accuracy
	library_name: transformers
	pipeline_tag: token-classification
	---

	## Finnish named entity recognition WORK IN PROGRESS

	The model performs named entity recognition from text input in Finnish.
	It was trained by fine-tuning [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1),
	using 10 named entity categories. Training data contains the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
	as well as an annotated dataset consisting of Finnish document data from the 1970s onwards, digitized by the National Archives of Finland.
	Because the latter dataset also contains sensitive data, it has not been made publicly available.


	## Intended uses & limitations

	The model has been trained to recognize the following named entities from a text in Finnish:

	- PERSON (person names)
	- ORG (organizations)
	- LOC (locations)
	- GPE (geopolitical locations)
	- PRODUCT (products)
	- EVENT (events)
	- DATE (dates)
	- JON (Finnish journal numbers (diaarinumero))
	- FIBC (Finnish business identity codes (y-tunnus))
	- NORP (nationality, religious and political groups)

	Some entities, like EVENT, LOC and JON, are less common in the training data than the others, which means that
	recognition accuracy for these entities also tends to be lower.

	The training data is relatively recent, so the model may struggle when the input
	contains, for example, old names or older writing styles.

	## How to use

	The easiest way to use the model is through the Transformers pipeline for token classification:

	```python
	from transformers import pipeline

	model_checkpoint = "Kansallisarkisto/finbert-ner"
	token_classifier = pipeline(
	    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
	)
	# Returns a list of dicts, one per recognized entity span
	predictions = token_classifier("Helsingistä tuli Suomen suuriruhtinaskunnan pääkaupunki vuonna 1812.")
	print(predictions)
	```
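With `aggregation_strategy="simple"`, the pipeline returns a list of dictionaries, one per detected entity span, with the keys `entity_group`, `score`, `word`, `start` and `end`. The following is a minimal sketch of one way to post-process that output. The helper name `group_entities`, the `min_score` threshold and the sample values (including the labels and scores) are illustrative assumptions, not actual model output.

```python
from collections import defaultdict


def group_entities(results, min_score=0.0):
    """Group pipeline output by entity label, dropping low-confidence spans."""
    grouped = defaultdict(list)
    for item in results:
        if item["score"] >= min_score:
            grouped[item["entity_group"]].append(item["word"])
    return dict(grouped)


# Hand-written sample in the pipeline's output format.
# Labels and scores are made up for illustration; they are not real predictions.
sample = [
    {"entity_group": "GPE", "score": 0.99, "word": "Helsingistä", "start": 0, "end": 11},
    {"entity_group": "GPE", "score": 0.95, "word": "Suomen suuriruhtinaskunnan", "start": 17, "end": 43},
    {"entity_group": "DATE", "score": 0.40, "word": "vuonna 1812", "start": 56, "end": 67},
]

print(group_entities(sample, min_score=0.5))
```

In real use, `sample` would be replaced by the list returned from `token_classifier(...)`. A suitable threshold depends on the application and would need to be chosen on held-out data.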

	## Training data

	Some of the entities (for instance WORK_OF_ART, LAW, MONEY) that have been annotated in the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
	dataset were filtered out from the dataset used for training the model.

	In addition to this corpus, OCR'd and annotated content of
	digitized documents from Finnish public administration was used for model training.
	The numbers of entities in each
	entity class in the training, validation and test datasets are listed below:

	### Number of entities per type in the data
	Dataset|PERSON|ORG|LOC|GPE|PRODUCT|EVENT|DATE|JON|FIBC|NORP
	-|-|-|-|-|-|-|-|-|-|-
	Train|11691|30026|868|12999|7473|1184|14918|1360|1879|2068
	Val|1542|4042|108|1654|879|160|1858|177|257|299
	Test|1267|3698|86|1713|901|137|1843|174|233|260
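The class imbalance in the training set can be made explicit by computing each entity type's share of all training entities. The short sketch below uses the Train row of the table above (with the JON count read as 1360); the variable names are illustrative only.

```python
# Training-set entity counts copied from the Train row of the table.
train_counts = {
    "PERSON": 11691,
    "ORG": 30026,
    "LOC": 868,
    "GPE": 12999,
    "PRODUCT": 7473,
    "EVENT": 1184,
    "DATE": 14918,
    "JON": 1360,
    "FIBC": 1879,
    "NORP": 2068,
}

total = sum(train_counts.values())
shares = {label: count / total for label, count in train_counts.items()}

# Labels sorted from rarest to most frequent.
by_frequency = sorted(train_counts, key=train_counts.get)
rarest = by_frequency[:3]

for label in by_frequency:
    print(f"{label:8s}{train_counts[label]:>7d}{shares[label]:>8.1%}")
```

The three rarest classes by this count are LOC, EVENT and JON, each below 2 % of the training entities, which is consistent with the lower recognition accuracy noted for them under "Intended uses & limitations".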

	The annotation of the data was performed as a cooperation between the National Archives of Finland
	and the [FIN-CLARIAH](https://www.kielipankki.fi/organization/fin-clariah/) research infrastructure
	for Social Sciences and Humanities.

	## Training procedure

	This model was trained using an NVIDIA RTX A6000 GPU with the following hyperparameters:

	- learning rate: 2e-05
	- train batch size: 16
	- epochs: 10
	- optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
	- scheduler: linear scheduler with num_warmup_steps=round(len(train_dataloader)/5) and num_training_steps=len(train_dataloader)*epochs
	- maximum length of data sequence: 512
	- patience: 2 epochs
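To illustrate the scheduler settings listed above, the sketch below computes the learning rate at a given step, assuming that "linear scheduler" means linear warmup followed by linear decay to zero (the behaviour of `get_linear_schedule_with_warmup` in Transformers). This is a dependency-free illustration, not the training code itself; the function name and the value of `steps_per_epoch` are hypothetical, since the real value of `len(train_dataloader)` is not stated here.

```python
def linear_warmup_decay_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < num_warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, num_warmup_steps)
    # Decay: ramp linearly from base_lr down to 0 at the last step.
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)


steps_per_epoch = 1000  # hypothetical stand-in for len(train_dataloader)
epochs = 10
base_lr = 2e-5

num_warmup_steps = round(steps_per_epoch / 5)   # one fifth of an epoch
num_training_steps = steps_per_epoch * epochs

for step in (0, 100, num_warmup_steps, 5100, num_training_steps):
    lr = linear_warmup_decay_lr(step, base_lr, num_warmup_steps, num_training_steps)
    print(f"step {step:>6d}: lr = {lr:.2e}")
```

Under these assumptions the learning rate peaks at 2e-05 after the first fifth of the first epoch and then falls linearly for the rest of training. With a patience of 2 epochs, training may stop before the schedule reaches zero.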

	In the preprocessing stage, the input texts were split into chunks with a maximum length of 300 tokens,
	in order to avoid the tokenized chunks exceeding the maximum length of 512. Tokenization was performed
	using the tokenizer for the [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1)
	model.
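The chunking step could look roughly like the sketch below. It assumes that the 300-token limit refers to whitespace-separated words counted before subword tokenization, which would explain why the limit is well below 512; the actual preprocessing code may differ. The function name `split_into_chunks` is hypothetical.

```python
def split_into_chunks(text, max_tokens=300):
    """Split text into consecutive chunks of at most `max_tokens` whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


# Example: a 650-word input yields chunks of 300, 300 and 50 words.
long_text = " ".join(["sana"] * 650)
chunks = split_into_chunks(long_text)
print([len(chunk.split()) for chunk in chunks])
```

Each chunk would then be passed to the tokenizer separately. Because a Finnish word is often split into several subword pieces, the 300-word limit leaves headroom below the 512-token maximum, although this sketch does not itself guarantee that the limit is never exceeded.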

	The training code with instructions will be available soon [here](https://github.com/DALAI-hanke/BERT_NER).