
Me-LLaMA

Model Overview

The Me-LLaMA family consists of two foundation models, Me-LLaMA 13B and Me-LLaMA 70B, along with their chat-enhanced counterparts, Me-LLaMA 13B-chat and Me-LLaMA 70B-chat, which are designed for stronger chat and instruction-following capabilities. Me-LLaMA 13B and 70B were continually pretrained from the base LLaMA 2 13B and 70B models on additional biomedical, clinical, and general-domain data. The chat versions were further instruction-tuned on comprehensive medical instruction-tuning data.

Pretraining and Data

Me-LLaMA was developed through continual pretraining and instruction tuning of LLaMA 2, incorporating 129B tokens and 214K instruction-tuning samples from the general, biomedical, and clinical domains. The pretraining data consist of biomedical literature, clinical notes, and general-domain text mixed at a 15:1:4 ratio (a mixing sketch follows the list below), sourced from:

  • Biomedical: PubMed Central and PubMed Abstracts (Pile dataset)
  • Clinical: De-identified free-text clinical notes from MIMIC III, MIMIC-IV, and MIMIC-CXR
  • General Domain: Subset from the RedPajama dataset
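
The exact mixing pipeline is not released in this card; the sketch below only turns the stated 15:1:4 ratio and 129B-token budget into per-source sampling probabilities and implied token counts, assuming the ratio refers to token counts. The source labels are illustrative, not dataset identifiers.

```python
# Minimal sketch: express the stated 15:1:4 pretraining mix as sampling probabilities.
# Only the ratio and the 129B-token budget come from the model card; the rest is illustrative.

mix_ratio = {
    "biomedical": 15,  # PubMed Central + PubMed Abstracts (Pile)
    "clinical": 1,     # de-identified MIMIC-III / MIMIC-IV / MIMIC-CXR notes
    "general": 4,      # RedPajama subset
}

total_parts = sum(mix_ratio.values())
token_budget = 129e9  # total pretraining tokens reported in the model card

for source, parts in mix_ratio.items():
    prob = parts / total_parts
    print(f"{source:>10}: sampling prob {prob:.2f}, ~{prob * token_budget / 1e9:.1f}B tokens")
```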

The instruction tuning dataset includes:

  • General Domain: Alpaca, Dolly, and ShareGPT datasets
  • Biomedical: HealthCareMagic, Icliniq, MedInstruct, Medical Flash Cards, MEDIQA, MedicationQA, LiveQA, WikiDocPatient, Guideline QA, PubMed Central, PubMed, UMLS Knowledge Graph
  • Clinical: MIMIC-III and MIMIC-IV
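
The card does not specify the exact prompt template used to serialize these instruction-tuning samples; the sketch below uses a generic Alpaca-style template as an assumed stand-in, and the example content is illustrative rather than drawn from any of the datasets above.

```python
# Minimal sketch of rendering one instruction-tuning sample as a training string.
# The Alpaca-style template is an assumption, not the confirmed Me-LLaMA format.

def format_sample(instruction: str, context: str, response: str) -> str:
    """Render an (instruction, input, response) triple into a single prompt string."""
    return (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{context}\n\n"
        f"### Response:\n{response}"
    )

# Illustrative biomedical-style example.
print(format_sample(
    instruction="Explain the term to a patient in one sentence.",
    context="hypertension",
    response="Hypertension means your blood pressure stays higher than the healthy range.",
))
```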

Evaluation

Me-LLaMA was evaluated on 12 datasets spanning six task types:

  • QA: PubMedQA, MedQA, MedMCQA, EmrQA
  • NER: 2010 i2b2
  • Relation Extraction: 2013 DDI
  • Classification: HoC, MTSample
  • Text Summarization: PubMed, MIMIC-CXR
  • NLI: BioNLI, MedNLI
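
The evaluation prompts and protocol are described in the paper rather than in this card; the sketch below shows one plausible way a zero-shot multiple-choice QA item (MedQA/MedMCQA-style) could be rendered into a prompt. Both the template and the example item are illustrative assumptions.

```python
# Minimal sketch: build a zero-shot prompt for a multiple-choice medical QA item.
# The template and the example item are assumptions for illustration only.

def build_mcq_prompt(question: str, options: dict) -> str:
    lines = [question, ""]
    for letter in sorted(options):
        lines.append(f"{letter}. {options[letter]}")
    lines += ["", "Answer with the letter of the single best option."]
    return "\n".join(lines)

item = {
    "question": "Which imaging study is typically ordered first for suspected pneumonia?",
    "options": {
        "A": "Chest X-ray",
        "B": "Brain MRI",
        "C": "Abdominal ultrasound",
        "D": "Bone scan",
    },
}
print(build_mcq_prompt(item["question"], item["options"]))
```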

Performance

  • Me-LLaMA 13B: Surpassed PMC-LLaMA 13B on 11/12 datasets and LLaMA2 13B on 10/12 datasets, with competitive performance against larger models like LLaMA2 70B and Meditron 70B on 8/12 datasets.
  • Me-LLaMA 70B: Outperformed LLaMA2 70B and Meditron 70B on 9/12 datasets.
  • Zero-shot setting: Outperformed ChatGPT on 5/8 datasets and GPT-4 on 1/8, while avoiding the privacy concerns of sending data to closed commercial models.
  • Task-specific instruction tuning: Surpassed ChatGPT on 7/8 and GPT-4 on 5/8 datasets.

Despite having far fewer parameters (13B/70B vs. 175B+ for ChatGPT and GPT-4), the Me-LLaMA models deliver strong supervised and in-context learning performance across a wide range of medical tasks.

Model Details

Included in this repository are four models:

  1. Me-LLaMA 13B: Continually pretrained from LLaMA 2 13B.
  2. Me-LLaMA 70B: Continually pretrained from LLaMA 2 70B.
  3. Me-LLaMA 13B-chat: Further instruction-tuned from Me-LLaMA 13B using a variety of general, biomedical, and clinical datasets.
  4. Me-LLaMA 70B-chat: Further instruction-tuned from Me-LLaMA 70B using a variety of general, biomedical, and clinical datasets.

Each model directory contains the standard files used by the Hugging Face transformers library:

  • config.json: Model architecture and configuration
  • model-x-of-y.safetensors: Sharded model weights
  • generation_config.json: Default text-generation settings
  • special_tokens_map.json: Special tokens used by the tokenizer
  • tokenizer.json: Serialized tokenizer vocabulary and token-to-index mapping
  • tokenizer_config.json: Tokenizer configuration
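
As a quick check that a download is intact, the files above can be read with the standard transformers auto classes; the local directory name below is a placeholder, since the weights are distributed through PhysioNet rather than the Hugging Face Hub.

```python
# Minimal sketch: inspect a locally downloaded Me-LLaMA checkpoint directory.
# Replace the placeholder path with wherever the PhysioNet files were saved.
from transformers import AutoConfig, AutoTokenizer

model_dir = "./me-llama-13b"  # assumed local directory containing the files listed above

config = AutoConfig.from_pretrained(model_dir)        # parses config.json
tokenizer = AutoTokenizer.from_pretrained(model_dir)  # parses tokenizer.json / tokenizer_config.json

print(config.model_type, config.hidden_size, config.num_hidden_layers)
print(tokenizer.tokenize("Chest X-ray shows no acute cardiopulmonary abnormality."))
```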

Usage

To access the models, please visit the Me-LLaMA repository on PhysioNet.

For technical details, please see our paper on arXiv.
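
Once the weights are obtained from PhysioNet, the models load like any other LLaMA-style causal language model in transformers. The snippet below is a minimal sketch, assuming a local path to the 13B-chat checkpoint, a GPU with enough memory for fp16, and the accelerate package for device_map; the prompt is illustrative.

```python
# Minimal generation sketch for a locally downloaded Me-LLaMA chat checkpoint.
# Path, dtype, and prompt are illustrative assumptions, not prescribed settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./me-llama-13b-chat"  # assumed local path to the PhysioNet download

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.float16,  # half precision to fit on a single large GPU
    device_map="auto",          # requires the `accelerate` package
)

prompt = "Summarize the key risk factors for type 2 diabetes in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```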
