MikkoLipsanen committed
Commit 0a5858c
1 Parent(s): bded733

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -15,7 +15,7 @@ pipeline_tag: token-classification
 The model performs named entity recognition from text input in Finnish.
 It was trained by fine-tuning [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1),
 using 10 named entity categories. Training data contains the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
-as well as an annotated dataset consisting of Finnish document daa from the 1970s onwards, digitized by the National Archives of Finland.
+as well as an annotated dataset consisting of Finnish document data from the 1970s onwards, digitized by the National Archives of Finland.
 Since the latter dataset contains also sensitive data, it has not been made publicly available.


@@ -84,7 +84,7 @@ This model was trained using a NVIDIA RTX A6000 GPU with the following hyperpara
 - maximum length of data sequence: 512
 - patience: 2 epochs

-In the prerocessing stage, the input texts were split into chunks with a maximum length of 300 tokens,
+In the preprocessing stage, the input texts were split into chunks with a maximum length of 300 tokens,
 in order to avoid the tokenized chunks exceeding the maximum length of 512. Tokenization was performed
 using the tokenizer for the [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1)
 model.
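
The chunking step described in the preprocessing paragraph above can be illustrated with a short sketch. This is a minimal, hypothetical example rather than the authors' actual preprocessing code: it assumes the 300-token limit refers to whitespace-separated words, uses the publicly available TurkuNLP/bert-base-finnish-cased-v1 tokenizer named in the README, and the sample sentence is made up.

```python
# Minimal sketch of the described preprocessing (an assumption, not the authors' code):
# split the input text into chunks of at most 300 words so that the tokenized
# sequences stay within the 512-token model limit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")

def split_into_chunks(text: str, max_words: int = 300) -> list[str]:
    """Split text into consecutive chunks of at most max_words whitespace tokens."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

text = "Kansallisarkisto on digitoinut asiakirjoja 1970-luvulta alkaen."  # made-up sample
chunks = split_into_chunks(text)

# Tokenize each chunk; truncation guards against chunks that still exceed 512 tokens.
encoded = tokenizer(chunks, truncation=True, max_length=512, padding=True, return_tensors="pt")
print(encoded["input_ids"].shape)
```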