metadata

license: apache-2.0
datasets:
  - rigonsallauka/portugese_ner_dataset
language:
  - pt
metrics:
  - f1
  - precision
  - recall
  - confusion_matrix
base_model:
  - google-bert/bert-base-cased
pipeline_tag: token-classification
tags:
  - NER
  - medical
  - symptoms
  - extraction
  - portugese

Portuguese Medical NER

Use

Primary Use Case: This model extracts medical entities such as symptoms, diagnostic tests, and treatments from clinical text written in Portuguese.
Applications: Suitable for healthcare professionals, clinical data analysis, and research into medical text processing.
Supported Entity Types:
- PROBLEM: Diseases, symptoms, and medical conditions.
- TEST: Diagnostic procedures and laboratory tests.
- TREATMENT: Medications, therapies, and other medical interventions.
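
To illustrate how these three entity types might surface in token-level output, here is a minimal sketch that merges per-token labels into entity spans. It assumes a BIO tagging scheme (labels such as `B-PROBLEM`, `I-PROBLEM`, and `O`), which this card does not confirm; the helper name `group_entities` and the sample tokens and labels are hypothetical.

```python
def group_entities(tokens, labels):
    """Merge BIO-tagged tokens into (entity_type, text) spans.

    Assumes labels look like "B-PROBLEM", "I-PROBLEM", or "O";
    the model's actual label set may differ.
    """
    entities = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        prefix, _, entity_type = label.partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # Start a new span (also covers an I- tag with no matching B-).
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = entity_type, [token]
        elif prefix == "I":
            # Continue the span that is already open.
            current_tokens.append(token)
        else:
            # "O" closes any open span.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Hypothetical word-level predictions for part of a sentence.
tokens = ["fortes", "dores", "de", "cabeça", "e", "náusea", "paracetamol"]
labels = ["B-PROBLEM", "I-PROBLEM", "I-PROBLEM", "I-PROBLEM", "O",
          "B-PROBLEM", "B-TREATMENT"]
print(group_entities(tokens, labels))
```

Real model output is produced per subword token, so in practice the subword pieces would need to be merged back into words before or during this grouping step.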

Training Data

Data Sources: Annotated datasets, including clinical data and translations of English medical text into Portuguese.
Data Augmentation: The training dataset was augmented to improve the model's ability to generalize to different text structures.
Dataset Split:
- Training Set: 80%
- Validation Set: 10%
- Test Set: 10%
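
The 80/10/10 proportions above can be sketched as follows. This is only an illustration: the card does not say whether the split was random, stratified, or done at the document level, so the shuffling, the seed, and the helper name `split_dataset` are all assumptions.

```python
import random


def split_dataset(examples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle and split examples into train/validation/test lists.

    A plain random split is assumed here; the seed is arbitrary.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```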

Model Training

Training Configuration:
- Optimizer: AdamW
- Learning Rate: 3e-5
- Batch Size: 64
- Epochs: 200
- Loss Function: Focal Loss to handle class imbalance
Frameworks: PyTorch, Hugging Face Transformers, SimpleTransformers
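
For readers unfamiliar with focal loss, the sketch below shows the per-token formula, FL(p_t) = -(1 - p_t)^gamma * log(p_t), in plain Python. The focusing parameter `gamma=2.0` is an assumed value, since the card does not state the one used in training, and the function name `focal_loss` is hypothetical rather than taken from the training code.

```python
import math


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one token: -(1 - p_t)^gamma * log(p_t).

    probs  -- predicted class probabilities for the token
    target -- index of the true class
    gamma  -- focusing parameter (assumed value); gamma=0 reduces
              this to ordinary cross-entropy
    """
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = focal_loss([0.9, 0.05, 0.05], target=0)  # confident and correct
hard = focal_loss([0.2, 0.7, 0.1], target=0)    # true class scored low
print(easy < hard)  # True
```

Because well-classified tokens contribute very little, the loss concentrates on hard examples, which is why it is a common choice when one label (typically the non-entity label) dominates the data.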

How to Use

You can use this model with the Hugging Face transformers library. Here is an example of how to load the model and use it for inference:

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "rigonsallauka/portugese_medical_ner"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Sample text for inference
text = "O paciente reclamou de fortes dores de cabeça e náusea que persistiram por dois dias. Para aliviar os sintomas, foi prescrito paracetamol e recomendado descansar e beber bastante líquidos."

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt")