- [[A Simple and Effective Norm-Based Strategy for KV Cache Compression]] (see the eviction sketch after this list)
- Are Sixteen Heads Really Better than One?
- I3D: Transformer architectures with input-dependent dynamic depth for speech recognition
- Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- The case for 4-bit precision: k-bit Inference Scaling Laws
- Efficient Memory Management for Large Language Model Serving with PagedAttention - the vLLM paper (usage sketch after this list)
- Efficiently Scaling Transformer Inference
- NVIDIA/FasterTransformer - Transformer-related optimizations, including BERT and GPT
- NVIDIA/TensorRT-LLM - TensorRT-LLM provides an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations for efficient inference on NVIDIA GPUs. It also includes components for building Python and C++ runtimes that orchestrate inference execution in a performant way (API sketch after this list).
TODO: Add cellular batching [16] and iteration-level scheduling [60] to the list above (from the vLLM paper, §2.3)
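As a companion to the norm-based KV cache compression entry, below is a minimal PyTorch sketch of the general idea, not the paper's implementation: each cached position is scored by the L2 norm of its key, and only a fixed budget of positions per head is kept. The function name, the cache layout (batch, heads, seq, head_dim), and the choice to keep the lowest-norm keys (the paper reports that low key norm correlates with high attention) are assumptions to verify against the paper.

```python
import torch

def compress_kv_by_key_norm(keys, values, budget):
    """Keep `budget` cached positions per head, ranked by the L2 norm of their keys.

    keys, values: (batch, heads, seq_len, head_dim)
    Assumption: low key norm correlates with high attention, so the
    lowest-norm positions are retained; check the criterion against the paper.
    """
    norms = keys.norm(dim=-1)                                  # (batch, heads, seq_len)
    keep = norms.topk(budget, dim=-1, largest=False).indices   # lowest-norm positions
    keep = keep.sort(dim=-1).values                            # restore original token order
    idx = keep.unsqueeze(-1).expand(-1, -1, -1, keys.size(-1))
    return keys.gather(2, idx), values.gather(2, idx)

# Toy usage: shrink a 128-token cache to 32 positions per head.
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)
k_small, v_small = compress_kv_by_key_norm(k, v, budget=32)
print(k_small.shape)  # torch.Size([1, 8, 32, 64])
```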
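Since the PagedAttention entry is really the vLLM system paper, here is a minimal offline-inference sketch with vLLM's Python API; the prompts and model name are placeholders, and the snippet assumes a recent vLLM release where `LLM` and `SamplingParams` are the top-level entry points.

```python
from vllm import LLM, SamplingParams

# PagedAttention works behind the scenes: vLLM stores the KV cache of these
# requests in fixed-size blocks so many sequences can be batched together.
prompts = [
    "The capital of France is",
    "KV cache compression is useful because",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # example model; any supported HF causal LM works
outputs = llm.generate(prompts, sampling_params)

for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```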
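The TensorRT-LLM entry advertises a similar Python API; the sketch below follows the high-level `LLM` API from recent releases and deliberately mirrors the vLLM snippet above. Treat the import path, the model name, and the availability of this API as assumptions that depend on the installed TensorRT-LLM version.

```python
from tensorrt_llm import LLM, SamplingParams  # assumes a release that ships the high-level LLM API

# Engine building happens inside LLM(...) on first use; a supported NVIDIA GPU
# and a compatible TensorRT-LLM install are required.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # example model
outputs = llm.generate(
    ["Efficient inference on GPUs means"],
    SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```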
Themes
- Pruning
- Early exit
- Sparsity
- Distillation
- State Space Models
Surveys and Reviews
Resources
- Large Transformer Model Inference Optimization by Lilian Weng
- Research Proposal: Resource-efficient Foundation Models for Automatic Translation (A10) (submitted to FBK in May 2024)
- Designing efficient and modular neural networks - Simone Scardapane - talk
- Efficient Transformers - Łukasz Kaiser - talk