CCMat's Collections: toread
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training (arXiv:2311.17049)
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (arXiv:2405.04434)
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision (arXiv:2303.17376)
Sigmoid Loss for Language Image Pre-Training (arXiv:2303.15343)
Better & Faster Large Language Models via Multi-token Prediction (arXiv:2404.19737)
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads (arXiv:2401.10774)
InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation (arXiv:2404.19427)
CogVLM: Visual Expert for Pretrained Language Models (arXiv:2311.03079)
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD (arXiv:2404.06512)
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model (arXiv:2401.16420)
InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation (arXiv:2404.02733)
Demonstration-Regularized RL (arXiv:2310.17303)
Vision Transformers Need Registers (arXiv:2309.16588)
StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation (arXiv:2405.01434)
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation (arXiv:2404.19752)
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models (arXiv:2405.01535)
LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report (arXiv:2405.00732)
RLHF Workflow: From Reward Modeling to Online RLHF (arXiv:2405.07863)
What matters when building vision-language models? (arXiv:2405.02246)
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding (arXiv:2405.08748)
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection (arXiv:2405.10300)
Many-Shot In-Context Learning in Multimodal Foundation Models (arXiv:2405.09798)
CAT3D: Create Anything in 3D with Multi-View Diffusion Models (arXiv:2405.10314)
LoRA Learns Less and Forgets Less (arXiv:2405.09673)
Chameleon: Mixed-Modal Early-Fusion Foundation Models (arXiv:2405.09818)
Layer-Condensed KV Cache for Efficient Inference of Large Language Models (arXiv:2405.10637)
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework (arXiv:2405.11143)
MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning (arXiv:2405.12130)
FIFO-Diffusion: Generating Infinite Videos from Text without Training (arXiv:2405.11473)
Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control (arXiv:2405.12970)
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention (arXiv:2405.12981)
Diffusion for World Modeling: Visual Details Matter in Atari (arXiv:2405.12399)
Your Transformer is Secretly Linear (arXiv:2405.12250)
ReVideo: Remake a Video with Motion and Content Control (arXiv:2405.13865)
Matryoshka Multimodal Models (arXiv:2405.17430)
An Introduction to Vision-Language Modeling (arXiv:2405.17247)
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models (arXiv:2405.15738)
Improving the Training of Rectified Flows (arXiv:2405.20320)
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (arXiv:2403.03206)
BitsFusion: 1.99 bits Weight Quantization of Diffusion Model (arXiv:2406.04333)
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions (arXiv:2406.04325)
Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step (arXiv:2406.04314)
Block Transformer: Global-to-Local Language Modeling for Fast Inference (arXiv:2406.02657)
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution (arXiv:2307.06304)
CRAG -- Comprehensive RAG Benchmark (arXiv:2406.04744)
OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework (arXiv:2404.14619)
Multi-Head Mixture-of-Experts (arXiv:2404.15045)
Pegasus-v1 Technical Report (arXiv:2404.14687)
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models (arXiv:2405.15574)
Untitled entry (arXiv:2405.18407)
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (arXiv:2405.21060)
Untitled entry (arXiv:2406.09414)
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels (arXiv:2406.09415)
DiTFastAttn: Attention Compression for Diffusion Transformer Models (arXiv:2406.08552)
Depth Anywhere: Enhancing 360 Monocular Depth Estimation via Perspective Distillation and Unlabeled Data Augmentation (arXiv:2406.12849)
The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing (arXiv:2406.10601)
EvTexture: Event-driven Texture Enhancement for Video Super-Resolution (arXiv:2406.13457)
Towards Modular LLMs by Building and Reusing a Library of LoRAs (arXiv:2405.11157)
Adam-mini: Use Fewer Learning Rates To Gain More (arXiv:2406.16793)
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation (arXiv:2406.16855)
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output (arXiv:2407.03320)