HEMM: Holistic Evaluation of Multimodal Foundation Models Paper • 2407.03418 • Published 6 days ago
On scalable oversight with weak LLMs judging strong LLMs Paper • 2407.04622 • Published 4 days ago
Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks Paper • 2407.02855 • Published 6 days ago
ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild Paper • 2407.04172 • Published 5 days ago
AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents Paper • 2407.04363 • Published 4 days ago
Searching for Best Practices in Retrieval-Augmented Generation Paper • 2407.01219 • Published 8 days ago
Revealing Fine-Grained Values and Opinions in Large Language Models Paper • 2406.19238 • Published 12 days ago
Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems Paper • 2407.01370 • Published 8 days ago
UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI Paper • 2407.00106 • Published 12 days ago
Chain-of-Knowledge: Integrating Knowledge Reasoning into Large Language Models by Learning from Knowledge Graphs Paper • 2407.00653 • Published 9 days ago
Wavelets Are All You Need for Autoregressive Image Generation Paper • 2406.19997 • Published 11 days ago
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents Paper • 2407.00114 • Published 12 days ago
ColPali: Efficient Document Retrieval with Vision Language Models Paper • 2407.01449 • Published 12 days ago
Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning Paper • 2407.00782 • Published 9 days ago
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation Paper • 2407.00468 • Published 10 days ago
ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning Paper • 2406.19741 • Published 11 days ago
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? Paper • 2407.01284 • Published 8 days ago
HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale Paper • 2406.19280 • Published 12 days ago
Scaling Synthetic Data Creation with 1,000,000,000 Personas Paper • 2406.20094 • Published 11 days ago
AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models Paper • 2406.10900 • Published 23 days ago
Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding Paper • 2406.19263 • Published 12 days ago
Aligning Teacher with Student Preferences for Tailored Training Data Generation Paper • 2406.19227 • Published 12 days ago
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding Paper • 2406.19389 • Published 12 days ago
Investigating Data Contamination in Modern Benchmarks for Large Language Models Paper • 2311.09783 • Published Nov 16, 2023
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models Paper • 2406.17294 • Published 14 days ago
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs Paper • 2406.18521 • Published 13 days ago
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale Paper • 2406.17557 • Published 14 days ago
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning Paper • 2406.17770 • Published 14 days ago
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation Paper • 2406.16855 • Published 15 days ago
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models Paper • 2406.16338 • Published 15 days ago
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges Paper • 2406.12624 • Published 21 days ago
4M Models Collection • Multimodal models from https://4m.epfl.ch/ • 14 items • Updated 24 days ago
Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities Paper • 2406.14562 • Published 19 days ago
REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark Paper • 2406.11927 • Published 22 days ago
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning Paper • 2406.11896 • Published 25 days ago
PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents Paper • 2406.13923 • Published 19 days ago
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding Paper • 2406.14515 • Published 19 days ago
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs Paper • 2406.14544 • Published 19 days ago
Probabilistic Conceptual Explainers: Trustworthy Conceptual Explanations for Vision Foundation Models Paper • 2406.12649 • Published 21 days ago
Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models Paper • 2406.11230 • Published 22 days ago
Long Code Arena: a Set of Benchmarks for Long-Context Code Models Paper • 2406.11612 • Published 22 days ago
AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology Paper • 2406.11912 • Published 23 days ago
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence Paper • 2406.11931 • Published 22 days ago
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning Paper • 2406.12742 • Published 21 days ago
VoCo-LLaMA: Towards Vision Compression with Large Language Models Paper • 2406.12275 • Published 21 days ago
DataComp-LM: In search of the next generation of training sets for language models Paper • 2406.11794 • Published 22 days ago
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences Paper • 2406.11069 • Published 23 days ago
VideoGUI: A Benchmark for GUI Automation from Instructional Videos Paper • 2406.10227 • Published 25 days ago
SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages Paper • 2406.10118 • Published 25 days ago
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices Paper • 2406.08451 • Published 27 days ago
Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering Paper • 2406.10208 • Published 25 days ago
BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack Paper • 2406.10149 • Published 25 days ago
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation Paper • 2406.09961 • Published 25 days ago
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning Paper • 2406.08973 • Published 26 days ago