Zhongpai Gao

gaozhongpai

AI & ML interests

3D computer vision

gaozhongpai's activity

upvoted 2 papers 2 days ago

MedVisionLlama: Leveraging Pre-Trained Large Language Model Layers to Enhance Medical Image Segmentation

Paper • 2410.02458 • Published 3 days ago • 9

MVGS: Multi-view-regulated Gaussian Splatting for Novel View Synthesis

Paper • 2410.02103 • Published 4 days ago • 8

upvoted a paper 10 days ago

DreamWaltz-G: Expressive 3D Gaussian Avatars from Skeleton-Guided 2D Diffusion

Paper • 2409.17145 • Published 11 days ago • 11

upvoted a paper 12 days ago

Phantom of Latent for Large Language and Vision Models

Paper • 2409.14713 • Published 14 days ago • 27

upvoted a paper 16 days ago

3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt

Paper • 2409.12892 • Published 17 days ago • 5

upvoted a paper 17 days ago

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Paper • 2409.12191 • Published 18 days ago • 69

upvoted 2 papers 18 days ago

Single-Layer Learnable Activation for Implicit Neural Representation (SL^{2}A-INR)

Paper • 2409.10836 • Published 20 days ago • 4

SplatFields: Neural Gaussian Splats for Sparse 3D and 4D Reconstruction

Paper • 2409.11211 • Published 19 days ago • 7

upvoted a paper 20 days ago

Robust Dual Gaussian Splatting for Immersive Human-centric Volumetric Videos

Paper • 2409.08353 • Published 24 days ago • 10

upvoted 6 papers about 1 month ago

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

Paper • 2409.02889 • Published Sep 4 • 54

VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges

Paper • 2409.01071 • Published Sep 2 • 26

3D Reconstruction with Spatial Memory

Paper • 2408.16061 • Published Aug 28 • 11

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

Paper • 2408.15998 • Published Aug 28 • 83

FLoD: Integrating Flexible Level of Detail into 3D Gaussian Splatting for Customizable Rendering

Paper • 2408.12894 • Published Aug 23 • 3

Sapiens: Foundation for Human Vision Models

Paper • 2408.12569 • Published Aug 22 • 86

upvoted 12 papers about 2 months ago

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

Paper • 2408.11039 • Published Aug 20 • 56

SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views

Paper • 2408.10195 • Published Aug 19 • 12

Segment Anything with Multiple Modalities

Paper • 2408.09085 • Published Aug 17 • 20

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Paper • 2408.10188 • Published Aug 19 • 51

Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning

Paper • 2408.07931 • Published Aug 15 • 18

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Paper • 2408.08872 • Published Aug 16 • 96

Towards flexible perception with visual memory

Paper • 2408.08172 • Published Aug 15 • 19

Imagen 3

Paper • 2408.07009 • Published Aug 13 • 60

ControlNeXt: Powerful and Efficient Control for Image and Video Generation

Paper • 2408.06070 • Published Aug 12 • 52

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI

Paper • 2408.03361 • Published Aug 6 • 85

Compact 3D Gaussian Splatting for Static and Dynamic Radiance Fields

Paper • 2408.03822 • Published Aug 7 • 9

RayGauss: Volumetric Gaussian-Based Ray Casting for Photorealistic Novel View Synthesis

Paper • 2408.03356 • Published Aug 6 • 8

upvoted 14 papers 2 months ago

MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine

Paper • 2408.02900 • Published Aug 6 • 25

LLaVA-OneVision: Easy Visual Task Transfer

Paper • 2408.03326 • Published Aug 6 • 59

Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models

Paper • 2408.02085 • Published Aug 4 • 17

Expressive Whole-Body 3D Gaussian Avatar

Paper • 2407.21686 • Published Jul 31 • 7

MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts

Paper • 2407.21770 • Published Jul 31 • 22

Tora: Trajectory-oriented Diffusion Transformer for Video Generation

Paper • 2407.21705 • Published Jul 31 • 25

Matting by Generation

Paper • 2407.21017 • Published Jul 30 • 22

Theia: Distilling Diverse Vision Foundation Models for Robot Learning

Paper • 2407.20179 • Published Jul 29 • 45

VSSD: Vision Mamba with Non-Causal State Space Duality

Paper • 2407.18559 • Published Jul 26 • 16

SHIC: Shape-Image Correspondences with no Keypoint Supervision

Paper • 2407.18907 • Published Jul 26 • 39

BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation

Paper • 2407.17952 • Published Jul 25 • 27

VILA^2: VILA Augmented VILA

Paper • 2407.17453 • Published Jul 24 • 38

OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person

Paper • 2407.16224 • Published Jul 23 • 23

INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

Paper • 2407.16198 • Published Jul 23 • 13

upvoted a collection 2 months ago

LMMs-Eval

Dataset Collection of LMMs-Eval • 36 items • Updated 3 days ago • 24

upvoted 15 papers 3 months ago

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

Paper • 2407.15754 • Published Jul 22 • 19

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

Paper • 2407.15841 • Published Jul 22 • 39

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

Paper • 2407.12772 • Published Jul 17 • 33

MambaVision: A Hybrid Mamba-Transformer Vision Backbone

Paper • 2407.08083 • Published Jul 10 • 27

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Paper • 2407.07895 • Published Jul 10 • 40

Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions

Paper • 2407.06723 • Published Jul 9 • 10

VIMI: Grounding Video Generation through Multi-modal Instruction

Paper • 2407.06304 • Published Jul 8 • 9

MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions

Paper • 2407.06358 • Published Jul 8 • 17

RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models

Paper • 2407.06938 • Published Jul 9 • 21

Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

Paper • 2407.01392 • Published Jul 1 • 39

Consistency Flow Matching: Defining Straight Flows with Velocity Consistency

Paper • 2407.02398 • Published Jul 2 • 14

Wavelets Are All You Need for Autoregressive Image Generation

Paper • 2406.19997 • Published Jun 28 • 28

RaTEScore: A Metric for Radiology Report Generation

Paper • 2406.16845 • Published Jun 24 • 4

EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model

Paper • 2406.20076 • Published Jun 28 • 8

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Paper • 2406.19280 • Published Jun 27 • 59

upvoted 3 papers 4 months ago

Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps

Paper • 2406.14539 • Published Jun 20 • 26

MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

Paper • 2406.14515 • Published Jun 20 • 32

HumanSplat: Generalizable Single-Image Human Gaussian Splatting with Structure Priors

Paper • 2406.12459 • Published Jun 18 • 11