cneud committed
Commit 0601732
Parent: 6815816

Update README.md

Files changed (1)
  1. README.md +22 -23
README.md CHANGED
@@ -18,9 +18,8 @@ license: apache-2.0
  # Model Card for sbb_ner

  <!-- Provide a quick summary of what the model is/does. [Optional] -->
- A BERT model trained on three German corpora containing contemporary and historical texts for named entity recognition tasks. It predicts the classes PER, LOC and ORG.
- The model was developed by the Berlin State Library (SBB) in the [QURATOR](https://staatsbibliothek-berlin.de/die-staatsbibliothek/projekte/project-id-1060-2018)
- and [Mensch.Maschine.Kultur]( https://mmk.sbb.berlin/?lang=en) projects.
+ A BERT model trained on three German corpora containing contemporary and historical texts for named entity recognition tasks. It predicts the classes `PER`, `LOC` and `ORG`.
+ The model was developed by the Berlin State Library (SBB) in the [QURATOR](https://staatsbibliothek-berlin.de/die-staatsbibliothek/projekte/project-id-1060-2018) project.



@@ -68,9 +67,9 @@ and [Mensch.Maschine.Kultur]( https://mmk.sbb.berlin/?lang=en) projects.

  <!-- Provide a longer summary of what this model is/does. -->
  A BERT model trained on three German corpora containing contemporary and historical texts for named entity recognition tasks.
- It predicts the classes PER, LOC and ORG.
+ It predicts the classes `PER`, `LOC` and `ORG`.

- - **Developed by:** [Kai Labusch](kai.labusch@sbb.spk-berlin.de), [Clemens Neudecker](clemens.neudecker@sbb.spk-berlin.de), David Zellhöfer
+ - **Developed by:** [Kai Labusch](https://huggingface.co/labusch), [Clemens Neudecker](https://huggingface.co/cneud), David Zellhöfer
  - **Shared by [Optional]:** [Staatsbibliothek zu Berlin / Berlin State Library](https://huggingface.co/SBB)
  - **Model type:** Language model
  - **Language(s) (NLP):** de
@@ -87,8 +86,8 @@ It predicts the classes PER, LOC and ORG.

  ## Direct Use

- The model can directly be used to perform NER on historical German texts obtained by OCR from digitized documents.
- Supported entity types are PER, LOC and ORG.
+ The model can directly be used to perform NER on historical German texts obtained by Optical Character Recognition (OCR) from digitized documents.
+ Supported entity types are `PER`, `LOC` and `ORG`.

  <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
  <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
@@ -98,7 +97,7 @@ Supported entity types are PER, LOC and ORG.
  <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
  <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->

- The model has been pre-trained on 2.300.000 pages of OCR-text of the digitized collections of Berlin State Library.
+ The model has been pre-trained on 2,333,647 pages of OCR-text of the digitized collections of the Berlin State Library.
  Therefore it is adapted to OCR-error prone historical German texts and might be used for particular applications that involve such text material.


@@ -146,7 +145,7 @@ The BERT model is trained directly with respect to the NER by implementation of

  ### Preprocessing

- The model was pre-trained on 2.300.000 pages of German texts from the digitized collections of the Berlin State Library.
+ The model was pre-trained on 2,333,647 pages of German texts from the digitized collections of the Berlin State Library.
  The texts have been obtained by OCR from the page scans of the documents.

  ### Speeds, Sizes, Times
@@ -221,7 +220,7 @@ See above.

  ### Software

- See published code on [GithHub]( https://github.com/qurator-spk/sbb_ner).
+ See published code on [GitHub](https://github.com/qurator-spk/sbb_ner).

  # Citation

@@ -229,15 +228,15 @@ See published code on [GithHub]( https://github.com/qurator-spk/sbb_ner).

  **BibTeX:**

- @article{labusch_bert_2019,
- title = {{BERT} for {Named} {Entity} {Recognition} in {Contemporary} and {Historical} {German}},
- volume = {Conference on Natural Language Processing},
- url = {https://konvens.org/proceedings/2019/papers/KONVENS2019_paper_4.pdf},
- abstract = {We apply a pre-trained transformer based representational language model, i.e. BERT (Devlin et al., 2018), to named entity recognition (NER) in contemporary and historical German text and observe state of the art performance for both text categories. We further improve the recognition performance for historical German by unsupervised pre-training on a large corpus of historical German texts of the Berlin State Library and show that best performance for historical German is obtained by unsupervised pre-training on historical German plus supervised pre-training with contemporary NER ground-truth.},
- language = {en},
- author = {Labusch, Kai and Neudecker, Clemens and Zellhöfer, David},
- year = {2019},
- pages = {9},
+ @article{labusch_bert_2019,
+ title = {{BERT} for {Named} {Entity} {Recognition} in {Contemporary} and {Historical} {German}},
+ volume = {Conference on Natural Language Processing},
+ url = {https://konvens.org/proceedings/2019/papers/KONVENS2019_paper_4.pdf},
+ abstract = {We apply a pre-trained transformer based representational language model, i.e. BERT (Devlin et al., 2018), to named entity recognition (NER) in contemporary and historical German text and observe state of the art performance for both text categories. We further improve the recognition performance for historical German by unsupervised pre-training on a large corpus of historical German texts of the Berlin State Library and show that best performance for historical German is obtained by unsupervised pre-training on historical German plus supervised pre-training with contemporary NER ground-truth.},
+ language = {en},
+ author = {Labusch, Kai and Neudecker, Clemens and Zellhöfer, David},
+ year = {2019},
+ pages = {9},
  }

  **APA:**
@@ -254,13 +253,13 @@ More information needed.

  In addition to what has been documented above, it should be noted that there are two NER Ground Truth datasets available:

- 1) [Data provided for the 2020 HIPE campaign on named entity processing]( https://impresso.github.io/CLEF-HIPE-2020/)
- 2) [Data providided for the 2022 HIPE shared task on named entity processing]( https://hipe-eval.github.io/HIPE-2022/)
+ 1) [Data provided for the 2020 HIPE campaign on named entity processing](https://impresso.github.io/CLEF-HIPE-2020/)
+ 2) [Data provided for the 2022 HIPE shared task on named entity processing](https://hipe-eval.github.io/HIPE-2022/)

  Furthermore, two papers have been published on NER/NED, using BERT:

- 1) [Entity Linking in Multilingual Newspapers and Classical Commentaries with BERT]( http://ceur-ws.org/Vol-3180/paper-85.pdf)
- 2) [Named Entity Disambiguation and Linking Historic Newspaper OCR with BERT]( http://ceur-ws.org/Vol-2696/paper_163.pdf)
+ 1) [Entity Linking in Multilingual Newspapers and Classical Commentaries with BERT](http://ceur-ws.org/Vol-3180/paper-85.pdf)
+ 2) [Named Entity Disambiguation and Linking Historic Newspaper OCR with BERT](http://ceur-ws.org/Vol-2696/paper_163.pdf)


  # Model Card Authors [optional]
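
The "Direct Use" passage in the diff above states that the model can be applied as-is for NER on OCR'ed historical German text, predicting `PER`, `LOC` and `ORG`. Below is a minimal sketch of what that could look like with the Hugging Face `transformers` token-classification pipeline; it assumes the checkpoint `SBB/sbb_ner` can be loaded through the standard `AutoModelForTokenClassification` interface, which the model card itself does not confirm.

```python
# Minimal sketch, not part of the commit above.
# Assumption: the "SBB/sbb_ner" checkpoint ships a tokenizer and weights
# compatible with the standard token-classification pipeline.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="SBB/sbb_ner",
    aggregation_strategy="simple",  # merge word pieces into whole entity spans
)

text = "Die Staatsbibliothek zu Berlin liegt Unter den Linden in Berlin."
for entity in ner(text):
    # Each result carries the predicted class (PER, LOC or ORG), the text span and a score.
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```

If the checkpoint instead requires the custom inference code from the linked qurator-spk/sbb_ner GitHub repository, the pipeline call above would not apply and that published code should be used directly.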