bert-base-dutch-cased / README.md

wietsedv

Update README.md

484ff5c about 3 years ago

preview code

raw

history blame

No virus

6.6 kB

	---
	language: nl
	thumbnail: "https://raw.githubusercontent.com/wietsedv/bertje/master/bertje.png"
	tags:
	- BERTje
	---

	# BERTje: A Dutch BERT model
	[Wietse de Vries](https://www.semanticscholar.org/author/Wietse-de-Vries/144611157) •
	[Andreas van Cranenburgh](https://www.semanticscholar.org/author/Andreas-van-Cranenburgh/2791585) •
	[Arianna Bisazza](https://www.semanticscholar.org/author/Arianna-Bisazza/3242253) •
	[Tommaso Caselli](https://www.semanticscholar.org/author/Tommaso-Caselli/1864635) •
	[Gertjan van Noord](https://www.semanticscholar.org/author/Gertjan-van-Noord/143715131) •
	[Malvina Nissim](https://www.semanticscholar.org/author/M.-Nissim/2742475)

	## Model description

	BERTje is a Dutch pre-trained BERT model developed at the University of Groningen.

	<img src="https://raw.githubusercontent.com/wietsedv/bertje/master/bertje.png" height="250">

	For details, check out our paper on [arXiv](https://arxiv.org/abs/1912.09582), the code on [Github](https://github.com/wietsedv/bertje) and related work on [Semantic Scholar](https://www.semanticscholar.org/paper/BERTje%3A-A-Dutch-BERT-Model-Vries-Cranenburgh/a4d5e425cac0bf84c86c0c9f720b6339d6288ffa).

	The paper and Github page mention fine-tuned models that are available [here](https://huggingface.co/wietsedv).

	## How to use

	```python
	from transformers import AutoTokenizer, AutoModel, TFAutoModel

	tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
	model = AutoModel.from_pretrained("GroNLP/bert-base-dutch-cased") # PyTorch
	model = TFAutoModel.from_pretrained("GroNLP/bert-base-dutch-cased") # Tensorflow
	```

	WARNING: The vocabulary size of BERTje has changed in 2021. If you use an older fine-tuned model and experience problems with the `GroNLP/bert-base-dutch-cased` tokenizer, use use the following tokenizer:

	```python
	tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased", revision="v1") # v1 is the old vocabulary
	```

	## Benchmarks

	The arXiv paper lists benchmarks. Here are a couple of comparisons between BERTje, multilingual BERT, BERT-NL and RobBERT that were done after writing the paper. Unlike some other comparisons, the fine-tuning procedures for these benchmarks are identical for each pre-trained model. You may be able to achieve higher scores for individual models by optimizing fine-tuning procedures.

	More experimental results will be added to this page when they are finished. Technical details about how a fine-tuned these models will be published later as well as downloadable fine-tuned checkpoints.

	All of the tested models are base sized (12) layers with cased tokenization.

	Headers in the tables below link to original data sources. Scores link to the model pages that corresponds to that specific fine-tuned model. These tables will be updated when more simple fine-tuned models are made available.


	### Named Entity Recognition


	\| Model \| [CoNLL-2002](https://www.clips.uantwerpen.be/conll2002/ner/) \| [SoNaR-1](https://ivdnt.org/downloads/taalmaterialen/tstc-sonar-corpus) \| spaCy UD LassySmall \|
	\| ---------------------------------------------------------------------------- \| --------------------------------------------------------------------------------------------- \| ----------------------------------------------------------------------------------------- \| ----------------------------------------------------------------------------------------------- \|
	\| BERTje \| [90.24](https://huggingface.co/wietsedv/bert-base-dutch-cased-finetuned-conll2002-ner) \| [84.93](https://huggingface.co/wietsedv/bert-base-dutch-cased-finetuned-sonar-ner) \| [86.10](https://huggingface.co/wietsedv/bert-base-dutch-cased-finetuned-udlassy-ner) \|
	\| [mBERT](https://github.com/google-research/bert/blob/master/multilingual.md) \| [88.61](https://huggingface.co/wietsedv/bert-base-multilingual-cased-finetuned-conll2002-ner) \| [84.19](https://huggingface.co/wietsedv/bert-base-multilingual-cased-finetuned-sonar-ner) \| [86.77](https://huggingface.co/wietsedv/bert-base-multilingual-cased-finetuned-udlassy-ner) \|
	\| [BERT-NL](http://textdata.nl) \| 85.05 \| 80.45 \| 81.62 \|
	\| [RobBERT](https://github.com/iPieter/RobBERT) \| 84.72 \| 81.98 \| 79.84 \|

	### Part-of-speech tagging

	\| Model \| [UDv2.5 LassySmall](https://universaldependencies.org/treebanks/nl_lassysmall/index.html) \|
	\| ---------------------------------------------------------------------------- \| ----------------------------------------------------------------------------------------- \|
	\| BERTje \| 96.48 \|
	\| [mBERT](https://github.com/google-research/bert/blob/master/multilingual.md) \| 96.20 \|
	\| [BERT-NL](http://textdata.nl) \| 96.10 \|
	\| [RobBERT](https://github.com/iPieter/RobBERT) \| 95.91 \|



	### BibTeX entry and citation info

	```bibtex
	@misc{devries2019bertje,
	\ttitle = {{BERTje}: {A} {Dutch} {BERT} {Model}},
	\tshorttitle = {{BERTje}},
	\tauthor = {de Vries, Wietse and van Cranenburgh, Andreas and Bisazza, Arianna and Caselli, Tommaso and Noord, Gertjan van and Nissim, Malvina},
	\tyear = {2019},
	\tmonth = dec,
	\thowpublished = {arXiv:1912.09582},
	\turl = {http://arxiv.org/abs/1912.09582},
	}
	```