---
task_categories:
- question-answering
- visual-question-answering
- multiple-choice
size_categories:
- 10K
---

# MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

[📖 Project] [📄 Paper] [💻 Code] [📝 Dataset] [🤖 Evaluation Model] [🏆 Leaderboard] [🌟 Overview] [🔧 Metric Details] [🚩 Citation]

[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/license/mit)

---

## 🌟 Overview

We present **MMIE**, a **M**assive **M**ultimodal **I**nterleaved understanding **E**valuation benchmark designed specifically for **Large Vision-Language Models (LVLMs)**. MMIE offers a robust, automated evaluation metric, powered by a fine-tuned **InternVL-2-4B**, to assess interleaved comprehension and generation capabilities across diverse fields.

This **automated evaluation metric** scores LVLMs on their performance in multimodal reasoning tasks. It is tailored to handle interleaved inputs and outputs and is intended to give unbiased, consistent evaluation results.

🎯 **Key Features of the MMIE Evaluation Metric:**

- **Automated Scoring System**: A fine-tuned **InternVL-2-4B** serves as the foundation of the scoring system, offering high performance and support for multi-image input.
- **Bias Mitigation**: The model is fine-tuned to minimize biases and provide fair, objective scoring across all models tested.
- **Multimodal Focus**: The metric handles **interleaved multimodal inputs and outputs**, so models are judged on their ability to integrate and reason with both text and images.
- **Human-like Evaluation**: The metric correlates highly with human annotations, surpassing alternative automated metrics such as GPT-4o, especially in nuanced multimodal tasks.
- **Scalable and Consistent**: The metric is built to handle large-scale datasets and gives consistent, reproducible scores, which makes it well suited to model benchmarking and comparison.

---

## 🔧 Metric Details

### Pipeline

To ensure a comprehensive and unbiased evaluation of various **LVLMs**, we propose an **automated evaluation metric** powered by **InternVL-2-4B**. This model was selected for its **strong performance in multimodal reasoning tasks** and its ability to support **multi-image inputs**. Furthermore, we fine-tuned the model to mitigate potential biases and provide accurate, consistent scoring.

The evaluation pipeline leverages the **internally fine-tuned LVLM** to assess models based on key dimensions such as **text quality**, **image quality**, **text-image coherence**, and **stylistic consistency**. This ensures models are rigorously tested on their multimodal reasoning capabilities.

### Results

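The agreement statistics reported in this section (Pearson correlation, cosine similarity, MSE, and MAE) compare an automated metric's scores with human scores for the same samples. The sketch below shows one way to compute them from two paired lists. It is illustrative only: the function name `agreement_stats` and the sample scores are assumptions made for this example and are not taken from the MMIE codebase or its results.

```python
import math


def agreement_stats(metric_scores, human_scores):
    """Compare automated scores with human scores for the same samples.

    Hypothetical helper for illustration; not part of the MMIE repository.
    """
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two equal-length score lists with at least 2 items")

    n = len(metric_scores)
    pairs = list(zip(metric_scores, human_scores))
    mean_m = sum(metric_scores) / n
    mean_h = sum(human_scores) / n

    # Pearson: covariance normalized by both standard deviations.
    cov = sum((m - mean_m) * (h - mean_h) for m, h in pairs)
    var_m = sum((m - mean_m) ** 2 for m in metric_scores)
    var_h = sum((h - mean_h) ** 2 for h in human_scores)
    denom = math.sqrt(var_m * var_h)
    pearson = cov / denom if denom > 0 else float("nan")

    # Cosine similarity: angle between the two raw score vectors.
    dot = sum(m * h for m, h in pairs)
    norm = math.sqrt(sum(m * m for m in metric_scores)) * math.sqrt(
        sum(h * h for h in human_scores)
    )
    cosine = dot / norm if norm > 0 else float("nan")

    # Error measures: lower means closer agreement with human scores.
    mse = sum((m - h) ** 2 for m, h in pairs) / n
    mae = sum(abs(m - h) for m, h in pairs) / n

    return {"pearson": pearson, "cosine": cosine, "mse": mse, "mae": mae}


if __name__ == "__main__":
    # Made-up scores, used only to show the call.
    metric = [4.0, 2.5, 5.0, 1.0]
    human = [4.5, 2.0, 5.0, 1.5]
    print(agreement_stats(metric, human))
```

Pearson correlation ignores a constant offset between the two score scales, while MSE and MAE penalize it, which is one reason to report both kinds of statistic together.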
*Note: In the image, higher values indicate better performance for Pearson and Cosine Similarity, while lower values are better for MSE and MAE.*

The MMIE evaluation metric demonstrates superior performance in scoring, achieving the highest correlation with **human annotations** in all aspects of multimodal comprehension and generation. It consistently outperforms GPT-4o and other standard evaluation metrics, which supports its reliability for large-scale model benchmarking.

---

## Installation

To use our benchmark and evaluation metric, please refer to our GitHub repo.

---

## 🚩 Citation

If you find our benchmark useful in your research, please consider citing us:

```bibtex
@article{xia2024mmie,
  title={MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models},
  author={Xia, Peng and Han, Siwei and Qiu, Shi and Zhou, Yiyang and Wang, Zhaoyang and Zheng, Wenhao and Chen, Zhaorun and Cui, Chenhang and Ding, Mingyu and Li, Linjie and Wang, Lijuan and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2410.10139},
  year={2024}
}
```