- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
- Matryoshka Quantization
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
- The Hardware Lottery
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- Efficient softmax approximation for GPUs (2016-09-14)
Includes papers on quantisation
Resources
- How to Scale Your Model (clipped) - A Systems View of LLMs on TPUs, from the JAX Team
- How is LLaMa.cpp possible?
- Understanding GPU Memory 1: Visualizing All Allocations over Time
- Mixed Precision Training
- AI Accelerators - Part II: Transistors and Pizza (or Why Do We Need Accelerators?)
- see also the other parts in this series by Adi Fuchs
- GPU Architecture Explained (Cherry Servers) - mainly useful for its section "A brief history of Nvidia GPU Architecture"
- AI and Memory Wall
- AI and Memory Wall - write-up of the paper above, published before the IEEE paper
- they summarise it very well in the closing paragraph: "The computational cost of training recent SOTA Transformer models in NLP has been scaling at a rate of 750x/2yrs, and the model parameter size has been scaling at 400x/2yrs. In contrast, the peak hardware FLOPS is scaling at a rate of 3x/2yrs, while both the DRAM and interconnect bandwidth have been increasingly falling behind, with a scaling rate of 1.6x/2yrs and 1.4x/2yrs, respectively. To put these numbers into perspective, peak hardware FLOPS has increased by 60,000x over the past 20 years, while DRAM/interconnect bandwidth has only scaled by a factor of 100x/30x over the same time period. With these trends, memory, in particular intra/inter-chip memory transfer, will soon become the main limiting factor in training large AI models." A quick arithmetic check of these rates is sketched after this list.
- The Best GPUs for Deep Learning in 2023 - An In-depth Analysis
- Understanding DRAM (Tech Talk, Simms International) - quick DRAM vs SRAM overview
- NVIDIA A100 Tensor Core GPU Architecture (Ampere architecture whitepaper)
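The per-2-year rates and the 20-year totals in the quote above hang together arithmetically: compounding 3x every 2 years over 20 years gives 3^10 ≈ 59,000x (quoted as ~60,000x), while 1.6^10 ≈ 110x and 1.4^10 ≈ 29x match the quoted ~100x/30x for DRAM and interconnect bandwidth. A minimal sketch in Python, using only the rates from the quote:

```python
# Compound each per-2-year growth rate over 20 years (ten 2-year periods)
# and compare against the 20-year totals quoted from "AI and Memory Wall".
periods = 20 // 2  # ten 2-year periods in 20 years

rates = [
    ("peak hardware FLOPS",    3.0, 60_000),
    ("DRAM bandwidth",         1.6,    100),
    ("interconnect bandwidth", 1.4,     30),
]

for name, per_2yrs, quoted_total in rates:
    compounded = per_2yrs ** periods
    print(f"{name}: {per_2yrs}x/2yrs -> ~{compounded:,.0f}x over 20 years "
          f"(quoted: ~{quoted_total:,}x)")
```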
Computer Architecture
Over at Computer Architecture