MikkoLipsanen committed
Commit 0a5858c
1 Parent(s): bded733

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -15,7 +15,7 @@ pipeline_tag: token-classification
 The model performs named entity recognition from text input in Finnish.
 It was trained by fine-tuning [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1),
 using 10 named entity categories. Training data contains the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
-as well as an annotated dataset consisting of Finnish document daa from the 1970s onwards, digitized by the National Archives of Finland.
+as well as an annotated dataset consisting of Finnish document data from the 1970s onwards, digitized by the National Archives of Finland.
 Since the latter dataset contains also sensitive data, it has not been made publicly available.


@@ -84,7 +84,7 @@ This model was trained using a NVIDIA RTX A6000 GPU with the following hyperpara
 - maximum length of data sequence: 512
 - patience: 2 epochs

-In the prerocessing stage, the input texts were split into chunks with a maximum length of 300 tokens,
+In the preprocessing stage, the input texts were split into chunks with a maximum length of 300 tokens,
 in order to avoid the tokenized chunks exceeding the maximum length of 512. Tokenization was performed
 using the tokenizer for the [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1)
 model.
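
The chunking step described in the preprocessing paragraph above can be illustrated with a short sketch. This is a minimal, hypothetical example rather than the authors' actual preprocessing code: it assumes the 300-token limit refers to whitespace-separated words, uses the publicly available TurkuNLP/bert-base-finnish-cased-v1 tokenizer named in the README, and the sample sentence is made up.

```python
# Minimal sketch of the described preprocessing (an assumption, not the authors' code):
# split the input text into chunks of at most 300 words so that the tokenized
# sequences stay within the 512-token model limit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")

def split_into_chunks(text: str, max_words: int = 300) -> list[str]:
    """Split text into consecutive chunks of at most max_words whitespace tokens."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

text = "Kansallisarkisto on digitoinut asiakirjoja 1970-luvulta alkaen."  # made-up sample
chunks = split_into_chunks(text)

# Tokenize each chunk; truncation guards against chunks that still exceed 512 tokens.
encoded = tokenizer(chunks, truncation=True, max_length=512, padding=True, return_tensors="pt")
print(encoded["input_ids"].shape)
```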