arXiv:2409.20566

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Published on Sep 30
· Submitted by haotiz on Oct 1
#1 Paper of the day
Abstract

We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture, MM1.5 adopts a data-centric approach to model training, systematically exploring the impact of diverse data mixtures across the entire model training lifecycle. This includes high-quality OCR data and synthetic captions for continual pre-training, as well as an optimized visual instruction-tuning data mixture for supervised fine-tuning. Our models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants, and demonstrate that careful data curation and training strategies can yield strong performance even at small scales (1B and 3B). Additionally, we introduce two specialized variants: MM1.5-Video, designed for video understanding, and MM1.5-UI, tailored for mobile UI understanding. Through extensive empirical studies and ablations, we provide detailed insights into the training processes and decisions that inform our final designs, offering valuable guidance for future research in MLLM development.

Community


TL;DR: MM1.5 is a significant upgrade of MM1. With a single set of weights, MM1.5 excels at (1) reading charts, tables, and other text-rich images, (2) understanding visual prompts such as points and boxes and providing grounded outputs, and (3) multi-image reasoning. Please find the detailed recipes in the paper.


Hi @haotiz, congrats on your work!

It would be great to link the models to the paper page by including https://huggingface.co/papers/2409.20566 in the respective model cards.

