- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 21
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 9
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 33
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 19
Collections
Collections including paper arxiv:2406.04325
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
  Paper • 2406.04325 • Published • 69
- SF-V: Single Forward Video Generation Model
  Paper • 2406.04324 • Published • 22
- VideoTetris: Towards Compositional Text-to-Video Generation
  Paper • 2406.04277 • Published • 21
- Vript: A Video Is Worth Thousands of Words
  Paper • 2406.06040 • Published • 19
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
  Paper • 2406.04325 • Published • 69
- MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
  Paper • 2406.11833 • Published • 61
- Depth Anything V2
  Paper • 2406.09414 • Published • 88
- Instruction Pre-Training: Language Models are Supervised Multitask Learners
  Paper • 2406.14491 • Published • 76
- Vript: A Video Is Worth Thousands of Words
  Paper • 2406.06040 • Published • 19
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
  Paper • 2406.04325 • Published • 69
- MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
  Paper • 2406.01574 • Published • 42
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
  Paper • 2405.21075 • Published • 15
- Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning
  Paper • 2406.06469 • Published • 22
- Mixture-of-Agents Enhances Large Language Model Capabilities
  Paper • 2406.04692 • Published • 50
- CRAG -- Comprehensive RAG Benchmark
  Paper • 2406.04744 • Published • 38
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
  Paper • 2406.04325 • Published • 69
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
  Paper • 2406.04325 • Published • 69
- SF-V: Single Forward Video Generation Model
  Paper • 2406.04324 • Published • 22
- I4VGen: Image as Stepping Stone for Text-to-Video Generation
  Paper • 2406.02230 • Published • 15
- VideoTetris: Towards Compositional Text-to-Video Generation
  Paper • 2406.04277 • Published • 21
- SF-V: Single Forward Video Generation Model
  Paper • 2406.04324 • Published • 22
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
  Paper • 2406.04325 • Published • 69
- LanguageBind/MoE-LLaVA-Phi2-2.7B-4e
  Text Generation • Updated • 676 • 37
- LanguageBind/LanguageBind_Video_FT
  Zero-Shot Image Classification • Updated • 166k • 3
- stabilityai/stable-video-diffusion-img2vid-xt
  Image-to-Video • Updated • 200k • 2.38k
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
  Paper • 2406.04325 • Published • 69