-
From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
Paper • 2401.01885 • Published • 27 -
Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance
Paper • 2401.15687 • Published • 21 -
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Paper • 2312.17172 • Published • 26 -
MouSi: Poly-Visual-Expert Vision-Language Models
Paper • 2401.17221 • Published • 7
Collections
Discover the best community collections!
Collections including paper arxiv:2310.12404
-
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
Paper • 2211.06687 • Published • 3 -
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning
Paper • 2401.17690 • Published • 5 -
Amphion: An Open-Source Audio, Music and Speech Generation Toolkit
Paper • 2312.09911 • Published • 52 -
Audiobox: Unified Audio Generation with Natural Language Prompts
Paper • 2312.15821 • Published • 12
-
Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing
Paper • 2310.12404 • Published • 15 -
MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models
Paper • 2310.11954 • Published • 24 -
A Survey of AI Music Generation Tools and Models
Paper • 2308.12982 • Published • 1 -
UniAudio: An Audio Foundation Model Toward Universal Audio Generation
Paper • 2310.00704 • Published • 19
-
Woodpecker: Hallucination Correction for Multimodal Large Language Models
Paper • 2310.16045 • Published • 14 -
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
Paper • 2310.14566 • Published • 25 -
SILC: Improving Vision Language Pretraining with Self-Distillation
Paper • 2310.13355 • Published • 6 -
Conditional Diffusion Distillation
Paper • 2310.01407 • Published • 20
-
Large Language Models as Optimizers
Paper • 2309.03409 • Published • 75 -
Natural Language Supervision for General-Purpose Audio Representations
Paper • 2309.05767 • Published • 9 -
Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers
Paper • 2309.08532 • Published • 52 -
AudioSR: Versatile Audio Super-resolution at Scale
Paper • 2309.07314 • Published • 24