iVideoGPT: Interactive VideoGPTs are Scalable World Models Paper • 2405.15223 • Published May 24, 2024 • 11
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models Paper • 2405.15574 • Published May 24, 2024 • 52
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities Paper • 2405.18669 • Published May 29, 2024 • 11
MotionLLM: Understanding Human Behaviors from Human Motions and Videos Paper • 2405.20340 • Published May 30, 2024 • 19
PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM Paper • 2406.02884 • Published Jun 5, 2024 • 13
What If We Recaption Billions of Web Images with LLaMA-3? Paper • 2406.08478 • Published Jun 2024 • 38
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs Paper • 2406.07476 • Published Jun 2024 • 30
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos Paper • 2406.08407 • Published Jun 2024 • 24
AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation Paper • 2406.07686 • Published Jun 2024 • 13
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models Paper • 2406.09403 • Published Jun 2024 • 18
Explore the Limits of Omni-modal Pretraining at Scale Paper • 2406.09412 • Published Jun 2024 • 10
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities Paper • 2406.09406 • Published Jun 2024 • 12
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation Paper • 2406.09961 • Published Jun 2024 • 54
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text Paper • 2406.08418 • Published Jun 2024 • 28
mDPO: Conditional Preference Optimization for Multimodal Large Language Models Paper • 2406.11839 • Published Jun 2024 • 36
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs Paper • 2406.14544 • Published Jun 2024 • 33
PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents Paper • 2406.13923 • Published Jun 2024 • 21
Improving Visual Commonsense in Language Models via Multiple Image Generation Paper • 2406.13621 • Published Jun 2024 • 13
Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report Paper • 2406.11403 • Published Jun 2024 • 4
Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters Paper • 2406.16758 • Published Jun 2024 • 18
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs Paper • 2406.16860 • Published Jun 2024 • 50
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models Paper • 2406.15704 • Published Jun 2024 • 5
HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale Paper • 2406.19280 • Published Jun 2024 • 55
Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity Paper • 2406.17720 • Published Jun 2024 • 7
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents Paper • 2407.00114 • Published Jul 2024 • 12
Understanding Alignment in Multimodal LLMs: A Comprehensive Study Paper • 2407.02477 • Published Jul 2024 • 18
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output Paper • 2407.03320 • Published Jul 2024 • 84
TokenPacker: Efficient Visual Projector for Multimodal LLM Paper • 2407.02392 • Published Jul 2024 • 18