arxiv:2408.07246

Seeing and Understanding: Bridging Vision with Chemical Knowledge Via ChemVLM

Published on Aug 14

Submitted by qq8933 on Aug 15

Authors: Junxian Li, Di Zhang, Yuqiang Li, Dongzhan Zhou

Abstract

In this technical report, we propose ChemVLM, the first open-source multimodal large language model dedicated to the field of chemistry, designed to address the incompatibility between chemical image understanding and text analysis. ChemVLM is built on the ViT-MLP-LLM architecture. We use ChemLLM-20B as the foundational large language model, which gives our model robust capabilities in understanding and applying chemical text knowledge, and we employ InternViT-6B as the image encoder. We have curated high-quality data from the chemical domain, including molecules, reaction formulas, and chemistry examination data, and compiled these into a bilingual multimodal question-answering dataset. We evaluate our model on multiple open-source benchmarks and three custom evaluation sets. Experimental results show that it achieves state-of-the-art results in five of the six tasks evaluated. Our model can be found at https://huggingface.co/AI4Chem/ChemVLM-26B.
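
The ViT-MLP-LLM data flow named in the abstract can be illustrated with a toy sketch: a vision encoder emits one feature vector per image patch, an MLP projects each vector into the language model's embedding space, and the projected image tokens are placed alongside the text token embeddings before the language model runs. Everything below is an assumption made for illustration, not the ChemVLM implementation: the function names, the toy dimensions, the ReLU activation, the choice to prepend image tokens, and the random vectors standing in for InternViT-6B outputs and ChemLLM-20B embeddings.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Return (weights, bias) for a toy dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(vec, layer):
    """Apply a dense layer to a single vector."""
    weights, bias = layer
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def relu(vec):
    """Element-wise ReLU (assumed activation; the real projector may differ)."""
    return [max(0.0, x) for x in vec]


def mlp_project(patch_features, layer1, layer2):
    """Map each vision feature vector into the LLM embedding space."""
    return [linear(relu(linear(p, layer1)), layer2) for p in patch_features]


def build_llm_input(patch_features, text_embeddings, layer1, layer2):
    """Prepend projected image tokens to the text token embeddings."""
    image_tokens = mlp_project(patch_features, layer1, layer2)
    return image_tokens + text_embeddings


rng = random.Random(0)

# Toy sizes chosen for readability; they are not the real model's dimensions.
VISION_DIM, HIDDEN_DIM, LLM_DIM = 8, 16, 12
NUM_PATCHES, NUM_TEXT_TOKENS = 4, 3

# Random stand-ins for the image encoder output and the text embeddings.
patch_features = [[rng.uniform(-1, 1) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.uniform(-1, 1) for _ in range(LLM_DIM)] for _ in range(NUM_TEXT_TOKENS)]

layer1 = make_linear(VISION_DIM, HIDDEN_DIM, rng)
layer2 = make_linear(HIDDEN_DIM, LLM_DIM, rng)

sequence = build_llm_input(patch_features, text_embeddings, layer1, layer2)
print(len(sequence), len(sequence[0]))  # 7 tokens, each of width 12
```

The point of the sketch is the shape contract: after projection, image tokens and text tokens share one embedding width, so the language model can treat them as a single sequence.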

Community

qq8933 (paper author, paper submitter), Aug 15

🚀 Introducing ChemVLM, the first open-source multimodal large language model dedicated to chemistry!
🌟 Performance comparable to commercial models and specialized OCR models, plus dialogue capabilities!
✨ 2B/26B models are available here: https://huggingface.co/AI4Chem/ChemVLM-26B