Papers
arxiv:2409.03420

mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

Published on Sep 5
· Submitted by xhyandwyy on Sep 6
Authors:
,
,
,
,
,

Abstract

Multimodel Large Language Models(MLLMs) have achieved promising OCR-free Document Understanding performance by increasing the supported resolution of document images. However, this comes at the cost of generating thousands of visual tokens for a single document image, leading to excessive GPU memory and slower inference times, particularly in multi-page document comprehension. In this work, to address these challenges, we propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens, guided by low-resolution global visual features. With this compression module, to strengthen multi-page document comprehension ability and balance both token efficiency and question-answering performance, we develop the DocOwl2 under a three-stage training framework: Single-image Pretraining, Multi-image Continue-pretraining, and Multi-task Finetuning. DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first token latency by more than 50%, demonstrating advanced capabilities in multi-page questioning answering, explanation with evidence pages, and cross-page structure understanding. Additionally, compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens. Our codes, models, and data are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl2.

Community

Paper author Paper submitter

In this work, to address these challenges, we propose a High-resolution
DocCompressor module to compress each high-resolution document image into
324 tokens, guided by low-resolution global visual features. With this compression module, to strengthen multi-page document comprehension ability and
balance both token efficiency and question-answering performance, we develop
the DocOwl2 under a three-stage training framework: Single-image Pretraining,
Multi-image Continue-pretraining, and Multi-task Finetuning. DocOwl2 sets a
new state-of-the-art across multi-page document understanding benchmarks and
reduces first token latency by more than 50%, demonstrating advanced capabilities in multi-page questioning answering, explanation with evidence pages,
and cross-page structure understanding. A

@xhyandwyy
is it intentional that the links to the mPlug-DocOwl2 model on the github page are empty? If yes, when can we expect the release? Looks like great work

·
Paper author

our models will be released around 14th Sep~

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2409.03420 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2409.03420 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2409.03420 in a Space README.md to link it from this page.

Collections including this paper 4