Scaling Retrieval-Based Language Models with a Trillion-Token Datastore Paper • 2407.12854 • Published 13 days ago • 25
Shape of Motion: 4D Reconstruction from a Single Video Paper • 2407.13764 • Published 3 days ago • 14
Click-Gaussian: Interactive Segmentation to Any 3D Gaussians Paper • 2407.11793 • Published 5 days ago • 3
Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes Paper • 2407.10957 • Published 6 days ago • 23
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients Paper • 2407.08296 • Published 11 days ago • 28
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation Paper • 2407.06135 • Published 13 days ago • 19
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs Paper • 2407.04051 • Published 17 days ago • 33
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams Paper • 2406.08085 • Published Jun 12 • 11
Revealing Fine-Grained Values and Opinions in Large Language Models Paper • 2406.19238 • Published 24 days ago • 13
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention Paper • 2407.02490 • Published 19 days ago • 23
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation Paper • 2407.02371 • Published 19 days ago • 47
GaussianDreamerPro: Text to Manipulable 3D Gaussians with Highly Enhanced Quality Paper • 2406.18462 • Published 25 days ago • 11
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding Paper • 2406.19389 • Published 24 days ago • 51
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs Paper • 2406.18521 • Published 25 days ago • 25
A Closer Look into Mixture-of-Experts in Large Language Models Paper • 2406.18219 • Published 26 days ago • 14
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs Paper • 2406.16860 • Published 27 days ago • 52
nabla^2DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials Paper • 2406.14347 • Published Jun 20 • 99
The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing Paper • 2406.10601 • Published Jun 15 • 65
Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts Paper • 2406.12034 • Published Jun 17 • 12
Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models Paper • 2406.09416 • Published Jun 13 • 28
3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination Paper • 2406.05132 • Published Jun 7 • 27
Simplified and Generalized Masked Diffusion for Discrete Data Paper • 2406.04329 • Published Jun 6 • 4
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs Paper • 2406.07476 • Published Jun 11 • 30
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation Paper • 2406.06525 • Published Jun 10 • 62
Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning Paper • 2406.06469 • Published Jun 10 • 22
WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild Paper • 2406.04770 • Published Jun 7 • 24
BitsFusion: 1.99 bits Weight Quantization of Diffusion Model Paper • 2406.04333 • Published Jun 6 • 36
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions Paper • 2406.04325 • Published Jun 6 • 69
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration Paper • 2406.01014 • Published Jun 3 • 29
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models Paper • 2406.02430 • Published Jun 4 • 27
μLO: Compute-Efficient Meta-Generalization of Learned Optimizers Paper • 2406.00153 • Published May 31 • 9
ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation Paper • 2406.00908 • Published Jun 3 • 11
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark Paper • 2406.01574 • Published Jun 3 • 42
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality Paper • 2405.21060 • Published May 31 • 61
MotionLLM: Understanding Human Behaviors from Human Motions and Videos Paper • 2405.20340 • Published May 30 • 19
Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts Paper • 2405.19893 • Published May 30 • 26
MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series Paper • 2405.19327 • Published May 29 • 43
Vidu4D: Single Generated Video to High-Fidelity 4D Reconstruction with Dynamic Gaussian Surfels Paper • 2405.16822 • Published May 27 • 11
Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer Paper • 2405.17405 • Published May 27 • 14
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models Paper • 2405.15738 • Published May 24 • 43
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization Paper • 2405.15071 • Published May 23 • 33
Aya 23: Open Weight Releases to Further Multilingual Progress Paper • 2405.15032 • Published May 23 • 21
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models Paper • 2405.15574 • Published May 24 • 52
CraftsMan: High-fidelity Mesh Generation with 3D Native Generation and Interactive Geometry Refiner Paper • 2405.14979 • Published May 23 • 14
iVideoGPT: Interactive VideoGPTs are Scalable World Models Paper • 2405.15223 • Published May 24 • 11
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework Paper • 2405.11143 • Published May 20 • 33
FIFO-Diffusion: Generating Infinite Videos from Text without Training Paper • 2405.11473 • Published May 19 • 53