kaizuberbuehler's Collections: LM Inference
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits • arXiv:2402.17764 • 627 upvotes
BitNet: Scaling 1-bit Transformers for Large Language Models • arXiv:2310.11453 • 106 upvotes
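Both BitNet entries above rest on the same core operation: constraining weights to a tiny set of values with a single per-tensor scale. A minimal NumPy sketch of the absmean ternary quantizer described in the 1.58-bit paper; the function name and eps value are illustrative choices, not the papers' released code:

```python
import numpy as np

def absmean_ternary(w: np.ndarray, eps: float = 1e-8):
    """Absmean weight quantization from the 1.58-bit BitNet paper:
    scale by the mean absolute weight, then round and clip to {-1, 0, +1}.
    """
    gamma = np.abs(w).mean()                            # per-tensor scale
    w_q = np.clip(np.round(w / (gamma + eps)), -1, 1)   # RoundClip(., -1, 1)
    return w_q.astype(np.int8), gamma                   # w is approximated by w_q * gamma
```

Because every quantized weight is -1, 0, or +1, the matrix multiply against w_q reduces to additions and sign flips, which is where the reported inference savings come from.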
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models • arXiv:2404.02258 • 107 upvotes
TransformerFAM: Feedback attention is working memory • arXiv:2404.09173 • 43 upvotes
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length • arXiv:2404.08801 • 66 upvotes
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention • arXiv:2404.07143 • 111 upvotes
Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence • arXiv:2404.05892 • 40 upvotes
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding • arXiv:2404.05726 • 23 upvotes
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens • arXiv:2402.13753 • 116 upvotes
How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study • arXiv:2404.14047 • 45 upvotes
SnapKV: LLM Knows What You are Looking for Before Generation • arXiv:2404.14469 • 27 upvotes
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding • arXiv:2404.16710 • 80 upvotes
Octopus v4: Graph of language models • arXiv:2404.19296 • 118 upvotes
Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting • arXiv:2404.18911 • 30 upvotes
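LayerSkip and Kangaroo above are both self-speculative: a cheap early-exit sub-network drafts tokens that the full model then verifies. A schematic of the generic greedy draft-and-verify loop such methods build on; `target` and `draft` are stand-in callables (token ids in, greedy next-token id out), and real implementations verify all drafted tokens in one batched forward pass rather than a Python loop:

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # assumed interface: sequence -> greedy next token

def draft_and_verify(target: NextToken, draft: NextToken,
                     prompt: List[int], n_draft: int = 4,
                     max_new: int = 32) -> List[int]:
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1. The cheap draft model proposes a short continuation.
        drafted: List[int] = []
        for _ in range(n_draft):
            drafted.append(draft(seq + drafted))
        # 2. The full model verifies; keep the longest agreeing prefix.
        accepted = 0
        for i in range(n_draft):
            if target(seq + drafted[:i]) == drafted[i]:
                accepted += 1
            else:
                break
        seq += drafted[:accepted]
        # 3. On the first mismatch, fall back to the target's own token,
        #    so the output matches plain greedy decoding of the target.
        if accepted < n_draft:
            seq.append(target(seq))
    return seq[:len(prompt) + max_new]
```

The speedup comes from step 2 amortizing one expensive target pass over several cheap draft tokens; the output is unchanged, which is why these methods can call themselves lossless.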
LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report • arXiv:2405.00732 • 122 upvotes
Imp: Highly Capable Large Multimodal Models for Mobile Devices • arXiv:2405.12107 • 29 upvotes
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality • arXiv:2405.21060 • 68 upvotes
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling • arXiv:2406.07522 • 40 upvotes
Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B • arXiv:2406.07394 • 29 upvotes
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling • arXiv:2407.21787 • 13 upvotes
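The repeated-sampling result above is usually reported as coverage, i.e. pass@k: the chance that at least one of k sampled attempts solves a problem. A sketch of the standard unbiased estimator computed from n total samples of which c were correct (this estimator predates the paper; treating it as the relevant metric here is my reading):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts (c correct), is correct."""
    if n - c < k:       # too few failures to fill k slots: success is certain
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. only 3 correct out of 100 samples still gives pass_at_k(100, 3, 50) ~= 0.88
```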
ThinK: Thinner Key Cache by Query-Driven Pruning • arXiv:2407.21018 • 32 upvotes
A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B • arXiv:2409.11055 • 17 upvotes
Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction • arXiv:2409.17422 • 25 upvotes
Thinking LLMs: General Instruction Following with Thought Generation • arXiv:2410.10630 • 20 upvotes
VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models • arXiv:2409.17066 • 28 upvotes
Efficiently Serving LLM Reasoning Programs with Certaindex • arXiv:2412.20993 • 36 upvotes
Token-Budget-Aware LLM Reasoning • arXiv:2412.18547 • 46 upvotes
Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability • arXiv:2411.19943 • 62 upvotes
Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens • arXiv:2411.17691 • 13 upvotes
Star Attention: Efficient LLM Inference over Long Sequences • arXiv:2411.17116 • 53 upvotes
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration • arXiv:2411.10958 • 57 upvotes
BitNet a4.8: 4-bit Activations for 1-bit LLMs • arXiv:2411.04965 • 69 upvotes
1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs • arXiv:2410.16144 • 5 upvotes
FlatQuant: Flatness Matters for LLM Quantization • arXiv:2410.09426 • 15 upvotes
PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs • arXiv:2410.05265 • 33 upvotes
Tensor Product Attention Is All You Need • arXiv:2501.06425 • 90 upvotes
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback • arXiv:2501.12895 • 61 upvotes
Qwen2.5-1M Technical Report • arXiv:2501.15383 • 72 upvotes
Reward-Guided Speculative Decoding for Efficient LLM Reasoning • arXiv:2501.19324 • 39 upvotes
DeepRAG: Thinking to Retrieval Step by Step for Large Language Models • arXiv:2502.01142 • 24 upvotes
QuEST: Stable Training of LLMs with 1-Bit Weights and Activations • arXiv:2502.05003 • 44 upvotes
CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference • arXiv:2502.04416 • 12 upvotes
[title missing] • arXiv:2502.06786 • 32 upvotes
Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding • arXiv:2502.05609 • 18 upvotes
TransMLA: Multi-head Latent Attention Is All You Need • arXiv:2502.07864 • 57 upvotes
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU • arXiv:2502.08910 • 148 upvotes
Diverse Inference and Verification for Advanced Reasoning • arXiv:2502.09955 • 18 upvotes
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention • arXiv:2502.11089 • 168 upvotes
LightThinker: Thinking Step-by-Step Compression • arXiv:2502.15589 • 31 upvotes
MoBA: Mixture of Block Attention for Long-Context LLMs • arXiv:2502.13189 • 17 upvotes
SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference • arXiv:2502.18137 • 60 upvotes
SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models • arXiv:2503.07605 • 66 upvotes
TTRL: Test-Time Reinforcement Learning • arXiv:2504.16084 • 120 upvotes
φ-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation • arXiv:2503.13288 • 51 upvotes
Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models • arXiv:2503.16257 • 27 upvotes
FFN Fusion: Rethinking Sequential Computation in Large Language Models • arXiv:2503.18908 • 19 upvotes
LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation • arXiv:2503.19950 • 12 upvotes
AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation • arXiv:2503.19693 • 76 upvotes
Efficient Inference for Large Reasoning Models: A Survey • arXiv:2503.23077 • 46 upvotes
A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond • arXiv:2503.21614 • 43 upvotes
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention • arXiv:2504.06261 • 110 upvotes
C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing • arXiv:2504.07964 • 62 upvotes
Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models • arXiv:2504.04823 • 31 upvotes
HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference • arXiv:2504.05897 • 21 upvotes
Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence • arXiv:2503.20533 • 12 upvotes
PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters • arXiv:2504.08791 • 139 upvotes
BitNet b1.58 2B4T Technical Report • arXiv:2504.12285 • 83 upvotes
70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float • arXiv:2504.11651 • 31 upvotes
Efficient Reasoning Models: A Survey • arXiv:2504.10903 • 21 upvotes
Sleep-time Compute: Beyond Inference Scaling at Test-time • arXiv:2504.13171 • 15 upvotes
SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning • arXiv:2504.07891 • 5 upvotes
EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models • arXiv:2504.15133 • 26 upvotes