	---
	language:
	- fr
	pipeline_tag: token-classification
	tags:
	- medical
	- ner
	- nlp
	- pseudonymisation
	license: bsd-3-clause
	library_name: edsnlp
	model-index:
	- name: AP-HP/eds-pseudo-public
	results:
	- task:
	type: token-classification
	dataset:
	name: AP-HP Pseudo Test
	type: private
	metrics:
	- type: precision
	name: Token Scores / ADRESSE / Precision
	value: 0.981694715087097
	- type: recall
	name: Token Scores / ADRESSE / Recall
	value: 0.9693877551020401
	- type: f1
	name: Token Scores / ADRESSE / F1
	value: 0.975502420419539
	- type: recall
	name: Token Scores / ADRESSE / Redact
	value: 0.9763848396501451
	- type: accuracy
	name: Token Scores / ADRESSE / Redact Full
	value: 0.9665697674418601
	- type: precision
	name: Token Scores / DATE / Precision
	value: 0.9899177066870131
	- type: recall
	name: Token Scores / DATE / Recall
	value: 0.984285249810339
	- type: f1
	name: Token Scores / DATE / F1
	value: 0.9870934434692821
	- type: recall
	name: Token Scores / DATE / Redact
	value: 0.9884035981359051
	- type: accuracy
	name: Token Scores / DATE / Redact Full
	value: 0.859011627906976
	- type: precision
	name: Token Scores / DATE_NAISSANCE / Precision
	value: 0.9753867791842471
	- type: recall
	name: Token Scores / DATE_NAISSANCE / Recall
	value: 0.968913726859937
	- type: f1
	name: Token Scores / DATE_NAISSANCE / F1
	value: 0.972139477834238
	- type: recall
	name: Token Scores / DATE_NAISSANCE / Redact
	value: 0.9933636046105481
	- type: accuracy
	name: Token Scores / DATE_NAISSANCE / Redact Full
	value: 0.9941860465116271
	- type: precision
	name: Token Scores / IPP / Precision
	value: 0.918987341772151
	- type: recall
	name: Token Scores / IPP / Recall
	value: 0.9075000000000001
	- type: f1
	name: Token Scores / IPP / F1
	value: 0.9132075471698111
	- type: recall
	name: Token Scores / IPP / Redact
	value: 0.985
	- type: accuracy
	name: Token Scores / IPP / Redact Full
	value: 0.9927325581395341
	- type: precision
	name: Token Scores / MAIL / Precision
	value: 0.9609144542772861
	- type: recall
	name: Token Scores / MAIL / Recall
	value: 0.9977029096477791
	- type: f1
	name: Token Scores / MAIL / F1
	value: 0.978963185574755
	- type: recall
	name: Token Scores / MAIL / Redact
	value: 0.9977029096477791
	- type: accuracy
	name: Token Scores / MAIL / Redact Full
	value: 0.9970930232558141
	- type: precision
	name: Token Scores / NDA / Precision
	value: 0.921428571428571
	- type: recall
	name: Token Scores / NDA / Recall
	value: 0.834951456310679
	- type: f1
	name: Token Scores / NDA / F1
	value: 0.8760611205432931
	- type: recall
	name: Token Scores / NDA / Redact
	value: 0.87378640776699
	- type: accuracy
	name: Token Scores / NDA / Redact Full
	value: 0.9723837209302321
	- type: precision
	name: Token Scores / NOM / Precision
	value: 0.9439770896724531
	- type: recall
	name: Token Scores / NOM / Recall
	value: 0.9525013545241101
	- type: f1
	name: Token Scores / NOM / F1
	value: 0.948220064724919
	- type: recall
	name: Token Scores / NOM / Redact
	value: 0.981578472096803
	- type: accuracy
	name: Token Scores / NOM / Redact Full
	value: 0.895348837209302
	- type: precision
	name: Token Scores / PRENOM / Precision
	value: 0.9348837209302321
	- type: recall
	name: Token Scores / PRENOM / Recall
	value: 0.9663461538461531
	- type: f1
	name: Token Scores / PRENOM / F1
	value: 0.950354609929078
	- type: recall
	name: Token Scores / PRENOM / Redact
	value: 0.99002849002849
	- type: accuracy
	name: Token Scores / PRENOM / Redact Full
	value: 0.9316860465116271
	- type: precision
	name: Token Scores / SECU / Precision
	value: 0.882838283828382
	- type: recall
	name: Token Scores / SECU / Recall
	value: 1
	- type: f1
	name: Token Scores / SECU / F1
	value: 0.9377738825591581
	- type: recall
	name: Token Scores / SECU / Redact
	value: 1
	- type: accuracy
	name: Token Scores / SECU / Redact Full
	value: 1.0
	- type: precision
	name: Token Scores / TEL / Precision
	value: 0.9746407438715131
	- type: recall
	name: Token Scores / TEL / Recall
	value: 0.9993932564791541
	- type: f1
	name: Token Scores / TEL / F1
	value: 0.9868618136688491
	- type: recall
	name: Token Scores / TEL / Redact
	value: 0.999479934124989
	- type: accuracy
	name: Token Scores / TEL / Redact Full
	value: 0.99563953488372
	- type: precision
	name: Token Scores / VILLE / Precision
	value: 0.96684350132626
	- type: recall
	name: Token Scores / VILLE / Recall
	value: 0.9376205787781351
	- type: f1
	name: Token Scores / VILLE / F1
	value: 0.9520078354554351
	- type: recall
	name: Token Scores / VILLE / Redact
	value: 0.9511254019292601
	- type: accuracy
	name: Token Scores / VILLE / Redact Full
	value: 0.9113372093023251
	- type: precision
	name: Token Scores / ZIP / Precision
	value: 0.9675036927621861
	- type: recall
	name: Token Scores / ZIP / Recall
	value: 1
	- type: f1
	name: Token Scores / ZIP / F1
	value: 0.983483483483483
	- type: recall
	name: Token Scores / ZIP / Redact
	value: 1
	- type: accuracy
	name: Token Scores / ZIP / Redact Full
	value: 1.0
	- type: precision
	name: Token Scores / micro / Precision
	value: 0.970393736698084
	- type: recall
	name: Token Scores / micro / Recall
	value: 0.9783320880510371
	- type: f1
	name: Token Scores / micro / F1
	value: 0.9743467434960551
	- type: recall
	name: Token Scores / micro / Redact
	value: 0.9884667701208881
	- type: accuracy
	name: Token Scores / micro / Redact Full
	value: 0.6308139534883721
	extra_gated_fields:
	Organisation: text
	Intended use of the model:
	type: select
	options:
	- NLP Research
	- Education
	- Commercial Product
	- Clinical Data Warehouse
	- label: Other
	value: other
	---
	<div>

	[<img style="display: inline" src="https://img.shields.io/github/actions/workflow/status/aphp/eds-pseudo/tests.yml?branch=main&label=tests&style=flat-square" alt="Tests">]()
	[<img style="display: inline" src="https://img.shields.io/github/actions/workflow/status/aphp/eds-pseudo/documentation.yml?branch=main&label=docs&style=flat-square" alt="Documentation">](https://aphp.github.io/eds-pseudo/latest/)
	[<img style="display: inline" src="https://img.shields.io/codecov/c/github/aphp/eds-pseudo?logo=codecov&style=flat-square" alt="Codecov">](https://codecov.io/gh/aphp/eds-pseudo)
	[<img style="display: inline" src="https://img.shields.io/badge/repro-poetry-blue?style=flat-square" alt="Poetry">](https://python-poetry.org)
	[<img style="display: inline" src="https://img.shields.io/badge/repro-dvc-blue?style=flat-square" alt="DVC">](https://dvc.org)
	[<img style="display: inline" src="https://img.shields.io/badge/demo%20%F0%9F%9A%80-streamlit-purple?style=flat-square" alt="Demo">](https://eds-pseudo-public.streamlit.app/)

	</div>

	# EDS-Pseudo

	This project aims at detecting identifying entities in documents, and was primarily tested
	on clinical reports at AP-HP's Clinical Data Warehouse (EDS).

	The model is built on top of [edsnlp](https://github.com/aphp/edsnlp), and consists of a
	hybrid model (rule-based + deep learning) for which we provide
	rules ([`eds_pseudo/pipes`](https://github.com/aphp/eds-pseudo/tree/main/eds_pseudo/pipes))
	and a training recipe ([`train.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/train.py)).

	We also provide some fictitious
	templates ([`templates.txt`](https://github.com/aphp/eds-pseudo/blob/main/data/templates.txt)) and a script to
	generate a synthetic
	dataset [`generate_dataset.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/generate_dataset.py).

	The entities that are detected are listed below.

	| Label            | Description                                                   |
	|------------------|---------------------------------------------------------------|
	| `ADRESSE`        | Street address, e.g. `33 boulevard de Picpus`                 |
	| `DATE`           | Any absolute date other than a birthdate                      |
	| `DATE_NAISSANCE` | Birthdate                                                     |
	| `HOPITAL`        | Hospital name, e.g. `Hôpital Rothschild`                      |
	| `IPP`            | Internal AP-HP identifier for patients, displayed as a number |
	| `MAIL`           | Email address                                                 |
	| `NDA`            | Internal AP-HP identifier for visits, displayed as a number   |
	| `NOM`            | Any last name (patients, doctors, third parties)              |
	| `PRENOM`         | Any first name (patients, doctors, etc.)                      |
	| `SECU`           | Social security number                                        |
	| `TEL`            | Any phone number                                              |
	| `VILLE`          | Any city                                                      |
	| `ZIP`            | Any zip code                                                  |

	## Downloading the public pre-trained model

	The public pre-trained model is available on the Hugging Face model hub at
	[AP-HP/eds-pseudo-public](https://hf.co/AP-HP/eds-pseudo-public) and was trained on synthetic data
	(see [`generate_dataset.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/generate_dataset.py)). You can also
	test it directly on the [demo](https://eds-pseudo-public.streamlit.app/).

	1. Install the latest version of edsnlp

	```shell
	pip install "edsnlp[ml]" -U
	```

	2. Get access to the model at [AP-HP/eds-pseudo-public](https://hf.co/AP-HP/eds-pseudo-public)
	3. Create and copy a Hugging Face token with "READ" permission at https://huggingface.co/settings/tokens?new_token=true
	4. Register the token (only once) on your machine

	```python
	import huggingface_hub

	# YOUR_TOKEN is a placeholder: replace it with the token string created in step 3
	huggingface_hub.login(token=YOUR_TOKEN, new_session=False, add_to_git_credential=True)
	```

	5. Load the model

	```python
	import edsnlp

	nlp = edsnlp.load("AP-HP/eds-pseudo-public", auto_update=True)
	doc = nlp(
	    "En 2015, M. Charles-François-Bienvenu "
	    "Myriel était évêque de Digne. C’était un vieillard "
	    "d’environ soixante-quinze ans ; il occupait le "
	    "siège de Digne depuis 2006."
	)

	# Print each detected entity with its label and its parsed date (if any)
	for ent in doc.ents:
	    print(ent, ent.label_, str(ent._.date))
	```

	To apply the model on many documents using one or more GPUs, refer to the documentation
	of [edsnlp](https://aphp.github.io/edsnlp/latest/tutorials/multiple-texts/).
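Once entities have been detected, pseudonymisation usually comes down to replacing each detected span in the text. The sketch below shows one simple way to do that. It is not part of eds-pseudo or edsnlp: `redact` is a hypothetical helper, and it assumes each entity is given as a `(start_char, end_char, label)` tuple, which is the information spaCy-style spans expose through `ent.start_char`, `ent.end_char` and `ent.label_`.

```python
def redact(text, spans):
    """Replace each (start_char, end_char, label) span with a <LABEL> placeholder.

    Hypothetical helper for illustration only; overlapping spans are skipped.
    """
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # This span overlaps the previous one: keep the first, skip this one
            continue
        out.append(text[cursor:start])  # untouched text before the entity
        out.append(f"<{label}>")        # placeholder instead of the entity
        cursor = end
    out.append(text[cursor:])           # remaining text after the last entity
    return "".join(out)


text = "En 2015, M. Myriel était évêque de Digne."
# Spans are built by hand here; with a loaded pipeline they would come from doc.ents
spans = [
    (text.index("2015"), text.index("2015") + len("2015"), "DATE"),
    (text.index("Myriel"), text.index("Myriel") + len("Myriel"), "NOM"),
]
print(redact(text, spans))
# En <DATE>, M. <NOM> était évêque de Digne.
```

With a loaded pipeline, the spans could be built as `[(e.start_char, e.end_char, e.label_) for e in doc.ents]`.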

	## Metrics

	| AP-HP Pseudo Test Token Scores (%) | Precision | Recall |   F1 | Redact | Redact Full |
	|:-----------------------------------|----------:|-------:|-----:|-------:|------------:|
	| ADRESSE                            |      98.2 |   96.9 | 97.6 |   97.6 |        96.7 |
	| DATE                               |      99.0 |   98.4 | 98.7 |   98.8 |        85.9 |
	| DATE_NAISSANCE                     |      97.5 |   96.9 | 97.2 |   99.3 |        99.4 |
	| IPP                                |      91.9 |   90.8 | 91.3 |   98.5 |        99.3 |
	| MAIL                               |      96.1 |   99.8 | 97.9 |   99.8 |        99.7 |
	| NDA                                |      92.1 |   83.5 | 87.6 |   87.4 |        97.2 |
	| NOM                                |      94.4 |   95.3 | 94.8 |   98.2 |        89.5 |
	| PRENOM                             |      93.5 |   96.6 | 95.0 |   99.0 |        93.2 |
	| SECU                               |      88.3 |    100 | 93.8 |    100 |         100 |
	| TEL                                |      97.5 |   99.9 | 98.7 |   99.9 |        99.6 |
	| VILLE                              |      96.7 |   93.8 | 95.2 |   95.1 |        91.1 |
	| ZIP                                |      96.8 |    100 | 98.3 |    100 |         100 |
	| micro                              |      97.0 |   97.8 | 97.4 |   98.8 |        63.1 |
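
For readers unfamiliar with the first three columns: precision, recall and F1 are standard classification scores, reported here at the token level. The sketch below shows the usual definitions on sets of `(token_index, label)` pairs. It is an illustration under that assumption, not the evaluation code used to produce the table (`token_prf` is a hypothetical helper), and it does not cover the Redact and Redact Full columns.

```python
def token_prf(gold, pred):
    """Token-level precision, recall and F1 from (token_index, label) pairs.

    Hypothetical helper: a token counts as correct only if both its
    position and its label match the gold annotation.
    """
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # true positives: predicted and annotated
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy annotations: token 5 gets the wrong label, token 12 is a false alarm
gold = {(1, "DATE"), (4, "NOM"), (5, "PRENOM"), (9, "VILLE")}
pred = {(1, "DATE"), (4, "NOM"), (5, "NOM"), (9, "VILLE"), (12, "DATE")}

p, r, f1 = token_prf(gold, pred)
print(f"{p:.2f} {r:.2f} {f1:.2f}")
# 0.60 0.75 0.67
```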

	## Installation to reproduce

	If you'd like to reproduce eds-pseudo's training or contribute to its development, you should first clone it:

	```shell
	git clone https://github.com/aphp/eds-pseudo.git
	cd eds-pseudo
	```

	Then install the dependencies. We recommend pinning the library version in your projects, or using a strict package manager
	like [Poetry](https://python-poetry.org/).

	```shell
	poetry install
	```

	## How to use without machine learning

	```python
	import edsnlp

	nlp = edsnlp.blank("eds")

	# Some text cleaning
	nlp.add_pipe("eds.normalizer")

	# Various simple rules
	nlp.add_pipe(
	    "eds_pseudo.simple_rules",
	    config={"pattern_keys": ["TEL", "MAIL", "SECU", "PERSON"]},
	)

	# Address detection
	nlp.add_pipe("eds_pseudo.addresses")

	# Date detection
	nlp.add_pipe("eds_pseudo.dates")

	# Contextual rules (requires a dict of info about the patient)
	nlp.add_pipe("eds_pseudo.context")

	# Apply it to a text
	doc = nlp(
	    "En 2015, M. Charles-François-Bienvenu "
	    "Myriel était évêque de Digne. C’était un vieillard "
	    "d’environ soixante-quinze ans ; il occupait le "
	    "siège de Digne depuis 2006."
	)

	for ent in doc.ents:
	    print(ent, ent.label_)

	# 2015 DATE
	# Charles-François-Bienvenu NOM
	# Myriel PRENOM
	# 2006 DATE
	```
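
To give an intuition of what a rule-based detector does, here is a toy sketch using regular expressions from Python's standard library. The patterns and the `find_entities` helper are illustrative assumptions, much simpler than the rules shipped in eds-pseudo, and they only cover French-style 10-digit phone numbers and basic email addresses.

```python
import re

# Illustrative patterns only, not the ones used by eds-pseudo
PATTERNS = {
    # 10 digits starting with 0, optionally separated by spaces, dots or dashes
    "TEL": re.compile(r"\b0[1-9](?:[ .-]?\d{2}){4}\b"),
    # Very permissive email pattern: local part, "@", dotted domain
    "MAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def find_entities(text):
    """Return (start_char, end_char, label) tuples sorted by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), label))
    return sorted(found)


text = "Contact : 01 23 45 67 89 ou jean.dupont@example.org"
for start, end, label in find_entities(text):
    print(text[start:end], label)

# 01 23 45 67 89 TEL
# jean.dupont@example.org MAIL
```

Real-world rules need to handle many more formats (international prefixes, obfuscated emails, OCR noise), which is one reason the project combines rules with a trained model.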

	## How to train

	Before training a model, you should update the
	[configs/config.cfg](https://github.com/aphp/eds-pseudo/blob/main/configs/config.cfg) and
	[pyproject.toml](https://github.com/aphp/eds-pseudo/blob/main/pyproject.toml) files to
	fit your needs.

	Put your data in the `data/dataset` folder (or edit the paths in the `configs/config.cfg` file to point
	to `data/gen_dataset/train.jsonl`).

	Then, run the training script:

	```shell
	python scripts/train.py --config configs/config.cfg --seed 43
	```

	This will train a model and save it in `artifacts/model-last`. You can evaluate it on the test set (defaults
	to `data/dataset/test.jsonl`) with:

	```shell
	python scripts/evaluate.py --config configs/config.cfg
	```

	To package it, run:

	```shell
	python scripts/package.py
	```

	This will create a `dist/eds-pseudo-aphp-*.whl` file that you can install with `pip install dist/eds-pseudo-aphp-*.whl`.

	You can use it in your code:

	```python
	import edsnlp

	# Either from the model path directly
	nlp = edsnlp.load("artifacts/model-last")

	# Or from the wheel file
	import eds_pseudo_aphp

	nlp = eds_pseudo_aphp.load()
	```

	## Documentation

	Visit the [documentation](https://aphp.github.io/eds-pseudo/) for more information!

	## Publication

	Please find our publication at the following link: https://doi.org/mkfv.

	If you use EDS-Pseudo, please cite us as below:

	```
	@article{eds_pseudo,
	  title={Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse},
	  author={Tannier, Xavier and Wajsb{\"u}rt, Perceval and Calliger, Alice and Dura, Basile and Mouchet, Alexandre and Hilka, Martin and Bey, Romain},
	  journal={Methods of Information in Medicine},
	  year={2024},
	  publisher={Georg Thieme Verlag KG}
	}
	```

	## Acknowledgement

	We would like to thank [Assistance Publique – Hôpitaux de Paris](https://www.aphp.fr/)
	and the [AP-HP Foundation](https://fondationrechercheaphp.fr/) for funding this project.