- FAX: Scalable and Differentiable Federated Primitives in JAX
  Paper • 2403.07128 • Published • 11
- mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
  Paper • 2403.12895 • Published • 29
- Measuring Style Similarity in Diffusion Models
  Paper • 2404.01292 • Published • 15
- JetMoE: Reaching Llama2 Performance with 0.1M Dollars
  Paper • 2404.07413 • Published • 35

Collections including paper arxiv:2404.07413

- Non-asymptotic oracle inequalities for the Lasso in high-dimensional mixture of experts
  Paper • 2009.10622 • Published • 1
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
  Paper • 2401.15947 • Published • 48
- MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
  Paper • 2401.04081 • Published • 69
- MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving
  Paper • 2401.14361 • Published • 2

- Nemotron-4 15B Technical Report
  Paper • 2402.16819 • Published • 42
- InternLM2 Technical Report
  Paper • 2403.17297 • Published • 28
- Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model
  Paper • 2404.04167 • Published • 11
- MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
  Paper • 2402.14905 • Published • 106

- Linear Transformers with Learnable Kernel Functions are Better In-Context Models
  Paper • 2402.10644 • Published • 78
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
  Paper • 2305.13245 • Published • 5
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
  Paper • 2402.15220 • Published • 19
- Sequence Parallelism: Long Sequence Training from System Perspective
  Paper • 2105.13120 • Published • 5

- Multilingual Instruction Tuning With Just a Pinch of Multilinguality
  Paper • 2401.01854 • Published • 10
- LLaMA Beyond English: An Empirical Study on Language Capability Transfer
  Paper • 2401.01055 • Published • 53
- LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
  Paper • 2401.01325 • Published • 26
- Improving Text Embeddings with Large Language Models
  Paper • 2401.00368 • Published • 79

- Scaling MLPs: A Tale of Inductive Bias
  Paper • 2306.13575 • Published • 14
- Trap of Feature Diversity in the Learning of MLPs
  Paper • 2112.00980 • Published • 1
- Understanding the Spectral Bias of Coordinate Based MLPs Via Training Dynamics
  Paper • 2301.05816 • Published • 1
- RaftMLP: How Much Can Be Done Without Attention and with Less Spatial Locality?
  Paper • 2108.04384 • Published • 1

- QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
  Paper • 2310.16795 • Published • 26
- Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
  Paper • 2308.12066 • Published • 4
- Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
  Paper • 2303.06182 • Published • 1
- EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate
  Paper • 2112.14397 • Published • 1

- Efficient Memory Management for Large Language Model Serving with PagedAttention
  Paper • 2309.06180 • Published • 25
- LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models
  Paper • 2308.16137 • Published • 39
- Scaling Transformer to 1M tokens and beyond with RMT
  Paper • 2304.11062 • Published • 2
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
  Paper • 2309.14509 • Published • 17

- Table-GPT: Table-tuned GPT for Diverse Table Tasks
  Paper • 2310.09263 • Published • 39
- A Zero-Shot Language Agent for Computer Control with Structured Reflection
  Paper • 2310.08740 • Published • 14
- The Consensus Game: Language Model Generation via Equilibrium Search
  Paper • 2310.09139 • Published • 12
- PaLI-3 Vision Language Models: Smaller, Faster, Stronger
  Paper • 2310.09199 • Published • 24