Depth Pro: Sharp Monocular Metric Depth in Less Than a Second Paper • 2410.02073 • Published 13 days ago • 37
Loong: Generating Minute-level Long Videos with Autoregressive Language Models Paper • 2410.02757 • Published 12 days ago • 35
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models Paper • 2410.02740 • Published 12 days ago • 51
LLaVA-OneVision Collection a model good at arbitrary types of visual input • 15 items • Updated 10 days ago • 20
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models Paper • 2407.12772 • Published Jul 17 • 33
LLaVA-Video Collection Models focused on video understanding (previously known as LLaVA-NeXT-Video). • 6 items • Updated 10 days ago • 48
Octopus: Embodied Vision-Language Programmer from Environmental Feedback Paper • 2310.08588 • Published Oct 12, 2023 • 34