iVideoGPT: Interactive VideoGPTs are Scalable World Models Paper • 2405.15223 • Published May 24, 2024 • 11
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models Paper • 2405.15574 • Published May 24, 2024 • 52
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities Paper • 2405.18669 • Published May 29, 2024 • 11
MotionLLM: Understanding Human Behaviors from Human Motions and Videos Paper • 2405.20340 • Published May 30, 2024 • 19
PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM Paper • 2406.02884 • Published Jun 5, 2024 • 13
What If We Recaption Billions of Web Images with LLaMA-3? Paper • 2406.08478 • Published Jun 2024 • 38
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs Paper • 2406.07476 • Published Jun 2024 • 30
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos Paper • 2406.08407 • Published Jun 2024 • 24
AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation Paper • 2406.07686 • Published Jun 2024 • 13
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models Paper • 2406.09403 • Published Jun 2024 • 18
Explore the Limits of Omni-modal Pretraining at Scale Paper • 2406.09412 • Published Jun 2024 • 10
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities Paper • 2406.09406 • Published Jun 2024 • 12
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation Paper • 2406.09961 • Published Jun 2024 • 54
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text Paper • 2406.08418 • Published Jun 2024 • 28
mDPO: Conditional Preference Optimization for Multimodal Large Language Models Paper • 2406.11839 • Published Jun 2024 • 36
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs Paper • 2406.14544 • Published Jun 2024 • 33
PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents Paper • 2406.13923 • Published Jun 2024 • 21
Improving Visual Commonsense in Language Models via Multiple Image Generation Paper • 2406.13621 • Published Jun 2024 • 13
Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report Paper • 2406.11403 • Published Jun 2024 • 4
Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters Paper • 2406.16758 • Published Jun 2024 • 18
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs Paper • 2406.16860 • Published Jun 2024 • 50
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models Paper • 2406.15704 • Published Jun 2024 • 5
HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale Paper • 2406.19280 • Published Jun 2024 • 55
Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity Paper • 2406.17720 • Published Jun 2024 • 7
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents Paper • 2407.00114 • Published Jul 2024 • 12
Understanding Alignment in Multimodal LLMs: A Comprehensive Study Paper • 2407.02477 • Published Jul 2024 • 18
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output Paper • 2407.03320 • Published Jul 2024 • 84
TokenPacker: Efficient Visual Projector for Multimodal LLM Paper • 2407.02392 • Published Jul 2024 • 18