arxiv:2406.13923

PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents

Published on Jun 20 · Submitted by zhangysk on Jun 21
Authors: Jie Fu and others (full author list on arXiv)

Abstract

Recent advancements in Large Multimodal Models (LMMs) have leveraged extensive multimodal datasets to enhance capabilities in complex knowledge-driven tasks. However, persistent challenges in perceptual and reasoning errors limit their efficacy, particularly in interpreting intricate visual data and deducing multimodal relationships. Addressing these issues, we introduce a novel dataset format, PIN (Paired and INterleaved multimodal documents), designed to significantly improve both the depth and breadth of multimodal training. The PIN format is built on three foundational principles: knowledge intensity, scalability, and support for diverse training modalities. This innovative format combines markdown files and comprehensive images to enrich training data with a dense knowledge structure and versatile training strategies. We present PIN-14M, an open-source dataset comprising 14 million samples derived from a diverse range of Chinese and English sources, tailored to include complex web and scientific content. This dataset is constructed meticulously to ensure data quality and ethical integrity, aiming to facilitate advanced training strategies and improve model robustness against common multimodal training pitfalls. Our initial results, forming the basis of this technical report, suggest significant potential for the PIN format in refining LMM performance, with plans for future expansions and detailed evaluations of its impact on model capabilities.
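
The page itself does not spell out the on-disk schema, but the abstract describes each sample as a markdown file paired with a comprehensive (overall) image, with further images interleaved inside the markdown body. A minimal sketch of what such a paired-and-interleaved record might look like is shown below; the JSON-style layout and the field names (`markdown`, `overall_image`) are illustrative assumptions, not taken from the PIN-14M dataset card.

```python
# Hypothetical sketch of a "paired and interleaved" sample, based only on the
# abstract's description (markdown file + overall image + interleaved images).
# Field names are illustrative assumptions, NOT the actual PIN-14M schema.
import json
import re

sample = {
    "id": "doc-000001",
    "markdown": (
        "# Example article\n\n"
        "Some knowledge-dense text ...\n\n"
        "![figure 1](images/doc-000001_fig1.png)\n\n"
        "More text interleaved with the figure above."
    ),
    # Rendering of the whole document, paired with the markdown file.
    "overall_image": "images/doc-000001_page.png",
}

def interleaved_image_refs(markdown: str) -> list[str]:
    """Extract the image paths interleaved in the markdown body."""
    return re.findall(r"!\[[^\]]*\]\(([^)]+)\)", markdown)

print(interleaved_image_refs(sample["markdown"]))  # ['images/doc-000001_fig1.png']
print(json.dumps({k: sample[k] for k in ("id", "overall_image")}, indent=2))
```

Under this reading, the markdown/overall-image pair supports image-text pair training while the interleaved image references support interleaved-document training, which would match the abstract's stated goal of supporting diverse training modalities.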

Community

Paper author and submitter:

This is a preview version; the larger PIN-100M dataset and LMMs trained with the data will be released in the near future.

Models citing this paper 0

Datasets citing this paper 1

Spaces citing this paper 0

Collections including this paper 4