- [[A Simple and Effective Norm-Based Strategy for KV Cache Compression]]
- Are Sixteen Heads Really Better than One?
- I3D: Transformer architectures with input-dependent dynamic depth for speech recognition
- Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- The case for 4-bit precision: k-bit Inference Scaling Laws
Themes
- Pruning
- Early exit
- Sparsity
- Distillation
- State Space Models
Surveys and Reviews
Resources
- ⚠ Large Transformer Model Inference Optimization by Lilian Weng
- Research Proposal: Resource-efficient Foundation Models for Automatic Translation (A10) (submitted to FBK in May 2024)
- Designing efficient and modular neural networks - Simone Scardapane (talk)
- Efficient Transformers - Łukasz Kaiser (talk)