- [[A Simple and Effective Norm-Based Strategy for KV Cache Compression]] (see the eviction sketch after this list)
- Are Sixteen Heads Really Better than One?
- I3D: Transformer architectures with input-dependent dynamic depth for speech recognition
- Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- The case for 4-bit precision: k-bit Inference Scaling Laws
- Efficient Memory Management for Large Language Model Serving with PagedAttention - the vLLM paper (usage sketch after this list)
- Efficiently Scaling Transformer Inference
- NVIDIA/FasterTransformer - Transformer-related optimizations, including BERT and GPT
- NVIDIA/TensorRT-LLM - TensorRT-LLM provides an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations for efficient inference on NVIDIA GPUs. It also includes components for building Python and C++ runtimes that orchestrate inference execution in a performant way (API sketch after this list).
TODO: Add cellular batching [16] and iteration-level scheduling [60] to the list above (from the vLLM paper, §2.3)
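As a companion to the norm-based KV cache compression entry, below is a minimal PyTorch sketch of the general idea, not the paper's implementation: each cached position is scored by the L2 norm of its key, and only a fixed budget of positions per head is kept. The function name, the cache layout (batch, heads, seq, head_dim), and the choice to keep the lowest-norm keys (the paper reports that low key norm correlates with high attention) are assumptions to verify against the paper.

```python
import torch

def compress_kv_by_key_norm(keys, values, budget):
    """Keep `budget` cached positions per head, ranked by the L2 norm of their keys.

    keys, values: (batch, heads, seq_len, head_dim)
    Assumption: low key norm correlates with high attention, so the
    lowest-norm positions are retained; check the criterion against the paper.
    """
    norms = keys.norm(dim=-1)                                  # (batch, heads, seq_len)
    keep = norms.topk(budget, dim=-1, largest=False).indices   # lowest-norm positions
    keep = keep.sort(dim=-1).values                            # restore original token order
    idx = keep.unsqueeze(-1).expand(-1, -1, -1, keys.size(-1))
    return keys.gather(2, idx), values.gather(2, idx)

# Toy usage: shrink a 128-token cache to 32 positions per head.
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)
k_small, v_small = compress_kv_by_key_norm(k, v, budget=32)
print(k_small.shape)  # torch.Size([1, 8, 32, 64])
```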
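Since the PagedAttention entry is really the vLLM system paper, here is a minimal offline-inference sketch with vLLM's Python API; the prompts and model name are placeholders, and the snippet assumes a recent vLLM release where `LLM` and `SamplingParams` are the top-level entry points.

```python
from vllm import LLM, SamplingParams

# PagedAttention works behind the scenes: vLLM stores the KV cache of these
# requests in fixed-size blocks so many sequences can be batched together.
prompts = [
    "The capital of France is",
    "KV cache compression is useful because",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # example model; any supported HF causal LM works
outputs = llm.generate(prompts, sampling_params)

for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```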
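The TensorRT-LLM entry advertises a similar Python API; the sketch below follows the high-level `LLM` API from recent releases and deliberately mirrors the vLLM snippet above. Treat the import path, the model name, and the availability of this API as assumptions that depend on the installed TensorRT-LLM version.

```python
from tensorrt_llm import LLM, SamplingParams  # assumes a release that ships the high-level LLM API

# Engine building happens inside LLM(...) on first use; a supported NVIDIA GPU
# and a compatible TensorRT-LLM install are required.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # example model
outputs = llm.generate(
    ["Efficient inference on GPUs means"],
    SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```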
Themes
- Pruning
- Early exit
- Sparsity
- Distillation
- State Space Models
Surveys and Reviews
Resources
- Large Transformer Model Inference Optimization by Lilian Weng
- Research Proposal: Resource-efficient Foundation Models for Automatic Translation (A10) (submitted to FBK in May 2024)
- Designing efficient and modular neural networks - Simone Scardapane - talk
- Efficient Transformers - Łukasz Kaiser - talk