File size: 3,032 Bytes
cb4040c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
21ad1c4
cb4040c
45313d4
 
 
 
 
 
cb4040c
21ad1c4
cb4040c
 
 
 
 
 
 
21ad1c4
cb4040c
 
 
 
 
 
 
 
 
 
 
 
 
21ad1c4
 
cb4040c
379b115
 
 
056767f
379b115
 
 
11619c3
 
cb4040c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
11619c3
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
---
license: apache-2.0
datasets:
- rigonsallauka/english_ner_dataset
language:
- en
metrics:
- f1
- precision
- recall
- confusion_matrix
base_model:
- google-bert/bert-base-cased
pipeline_tag: token-classification
tags:
- NER
- medical
- symptom
- extraction
- english
---

# English Medical NER

## Acknowledgement

This model had been created as part of joint research of HUMADEX research group (https://www.linkedin.com/company/101563689/) and has received funding by the European Union Horizon Europe Research and Innovation Program project SMILE (grant number 101080923) and Marie Skłodowska-Curie Actions (MSCA) Doctoral Networks, project BosomShield ((rant number 101073222). Responsibility for the information and views expressed herein lies entirely with the authors.
Authors:
dr. Izidor Mlakar, Rigon Sallauka, dr. Umut Arioz, dr. Matej Rojc

## Use
- **Primary Use Case**: This model is designed to extract medical entities such as symptoms, diagnostic tests, and treatments from clinical text in the English language.
- **Applications**: Suitable for healthcare professionals, clinical data analysis, and research into medical text processing.
- **Supported Entity Types**:
  - `PROBLEM`: Diseases, symptoms, and medical conditions.
  - `TEST`: Diagnostic procedures and laboratory tests.
  - `TREATMENT`: Medications, therapies, and other medical interventions.

## Training Data
- **Data Sources**: Annotated datasets, including clinical data in English.
- **Data Augmentation**: The training dataset underwent data augmentation techniques to improve the model's ability to generalize to different text structures.
- **Dataset Split**:
  - **Training Set**: 80%
  - **Validation Set**: 10%
  - **Test Set**: 10%

## Model Training
- **Training Configuration**:
  - **Optimizer**: AdamW
  - **Learning Rate**: 3e-5
  - **Batch Size**: 64
  - **Epochs**: 200
  - **Loss Function**: Focal Loss to handle class imbalance
- **Frameworks
**: PyTorch, Hugging Face Transformers, SimpleTransformers

## Evaluation metrics
- eval_loss = 0.24279939405748557
- f1_score = 0.8006730836297691

- precision = 0.8084832904884319
- recall = 0.7930123311802701

Visit [HUMADEX/Weekly-Supervised-NER-pipline](https://github.com/HUMADEX/Weekly-Supervised-NER-pipline) for more info.

## How to Use
You can easily use this model with the Hugging Face `transformers` library. Here's an example of how to load and use the model for inference:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "rigonsallauka/english_medical_ner"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Sample text for inference
text = "The patient complained of severe headaches and nausea that had persisted for two days. To alleviate the symptoms, he was prescribed paracetamol and advised to rest and drink plenty of fluids."

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt")