arxiv:2406.19280

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Published on Jun 27
· Submitted by jymcc on Jul 1
#2 Paper of the day
Authors:
Ke Ji, et al.
Abstract

The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.
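As a rough illustration of the "unblinded" reformatting step described in the abstract, the sketch below shows how a vision-capable LLM could be given both the figure and its noisy PubMed caption and asked to emit a cleaned VQA pair. This is only a minimal sketch of the idea: the prompt wording, model name, and JSON output format are assumptions for illustration, not the paper's actual pipeline.

```python
# Minimal sketch of "unblinded" caption-to-VQA reformatting with a vision LLM.
# The prompt text, model name, and output format are illustrative assumptions,
# not the exact pipeline used to build PubMedVision.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REFORMAT_PROMPT = (
    "You are given a medical figure and its original (possibly noisy) caption.\n"
    "Looking at the image, rewrite the content as one question-answer pair that a\n"
    "medical VQA dataset could use. Return JSON with keys 'question' and 'answer'."
)

def caption_to_vqa(image_url: str, noisy_caption: str) -> dict:
    """Ask a vision LLM to denoise/reformat one image-caption pair into a VQA sample."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model; the paper used GPT-4V
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"{REFORMAT_PROMPT}\n\nOriginal caption: {noisy_caption}"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```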

Community

Paper author · Paper submitter

This study introduces PubMedVision, a dataset of 1.3 million high-quality medical image-text samples, built to overcome the challenges multimodal large language models (MLLMs) face in medical scenarios. We refined image-text pairs from PubMed papers and used a GPT-4V-powered reformatting method to enhance the data. Experiments demonstrate that: (1) PubMedVision can significantly improve the medical multimodal capabilities of MLLMs, enabling models like LLaVA-v1.5-LLaMA-3-8B to outperform other open-source MLLMs in medical multimodal scenarios; (2) manual checks by medical experts validate the superior data quality of PubMedVision. Based on PubMedVision, we construct our medical multimodal models, HuatuoGPT-Vision. We open-source our dataset and models.
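For anyone who wants to browse the released data, here is a minimal loading sketch with the Hugging Face `datasets` library. The repo id, subset name, and record fields are assumptions based on the release announcement; check the dataset card on the Hub for the exact identifiers.

```python
# Minimal sketch: inspect a few PubMedVision samples with the `datasets` library.
# The repo id and subset name below are assumptions; consult the dataset card
# for the exact configuration before running.
from datasets import load_dataset

ds = load_dataset("FreedomIntelligence/PubMedVision",  # assumed repo id
                  "PubMedVision_Alignment_VQA",        # assumed subset name
                  split="train")

print(ds[0])  # inspect one record to see the image reference and the VQA conversation
```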

Paper author · Paper submitter

[Screenshots of figures from the paper]

That's cool! I don't see performance against GPT-4o using PubMedVision, though. It could be quite interesting to see the initial performance of GPT-4o, then use PubMedVision for few-shot learning (maybe many shots, as in https://arxiv.org/abs/2405.09798) and see how competitive open-source models are against GPT-4o enhanced with PubMedVision.
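One rough way to run the experiment suggested above would be to sample labeled PubMedVision examples and pack them as many-shot demonstrations ahead of the test question in a GPT-4o request. The message layout below is only a sketch of that idea; the record field names (`image_url`, `question`, `answer`) are illustrative assumptions, not the dataset's actual schema.

```python
# Sketch of many-shot prompting for GPT-4o with PubMedVision-style demonstrations.
# Field names ('image_url', 'question', 'answer') are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def many_shot_messages(demos: list[dict], test_image_url: str, test_question: str) -> list[dict]:
    """Interleave image+question/answer demonstrations before the test query."""
    messages = []
    for demo in demos:
        messages.append({
            "role": "user",
            "content": [
                {"type": "text", "text": demo["question"]},
                {"type": "image_url", "image_url": {"url": demo["image_url"]}},
            ],
        })
        messages.append({"role": "assistant", "content": demo["answer"]})
    messages.append({
        "role": "user",
        "content": [
            {"type": "text", "text": test_question},
            {"type": "image_url", "image_url": {"url": test_image_url}},
        ],
    })
    return messages

# Usage (illustrative):
# reply = client.chat.completions.create(
#     model="gpt-4o",
#     messages=many_shot_messages(demos, test_image_url, test_question),
# )
```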

HuatuoGPT-Vision is very interesting! Will it also be integrated into hospital medical systems soon, like HuatuoGPT?

Paper author

Thank you for your interest! Integration into hospital systems is currently being planned.

Paper author

Thank you very much for sharing. It's a great summary!

Awesome! That work is really interesting and valuable!


Models citing this paper: 2
Datasets citing this paper: 1
Spaces citing this paper: 0
Collections including this paper: 12