- Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
- Matryoshka Quantization
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
- The Hardware Lottery
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- Efficient softmax approximation for GPUs (2016-09-14)
Includes papers on quantisation
Resources
- How to Scale Your Model (clipped) - A Systems View of LLMs on TPUs, from the JAX Team
- How is LLaMa.cpp possible?
- Understanding GPU Memory 1: Visualizing All Allocations over Time
- Mixed Precision Training
- AI Accelerators - Part II: Transistors and Pizza (or: Why Do We Need Accelerators?)
- see also the other parts in this series by Adi Fuchs
- GPU Architecture Explained (Cherry Servers) - mainly useful for its section "A brief history of Nvidia GPU Architecture"
- AI and Memory Wall
- AI and Memory Wall - blog write-up of the above, published before the IEEE paper
- they summarise it very well in the closing paragraph: "The computational cost of training recent SOTA Transformer models in NLP has been scaling at a rate of 750x/2yrs, and the model parameter size has been scaling at 400x/2yrs. In contrast, the peak hardware FLOPS is scaling at a rate of 3x/2yrs, while both the DRAM and interconnect bandwidth have been increasingly falling behind, with a scaling rate of 1.6x/2yrs and 1.4x/2yrs, respectively. To put these numbers into perspective, peak hardware FLOPS has increased by 60,000x over the past 20 years, while DRAM/Interconnect bandwidth has only scaled by a factor of 100x/30x over the same time period. With these trends, memory, in particular intra/inter-chip memory transfer, will soon become the main limiting factor in training large AI models."
- the per-2-year rates compound into the quoted 20-year totals; see the sanity check just after this list
- The Best GPUs for Deep Learning in 2023 - An In-depth Analysis
- Understanding DRAM (Tech Talk, Simms International) - quick DRAM vs SRAM overview
- NVIDIA A100 Tensor Core GPU Architecture (Ampere Architecture White paper)
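The per-2-year rates and the 20-year totals quoted in the AI and Memory Wall closing paragraph are mutually consistent, since 20 years is ten 2-year periods; a quick sanity check:

```python
# Compound the per-2-year growth factors from the quote over ten
# 2-year periods (= 20 years) and compare with the quoted totals.
rates_per_2yrs = {
    "peak hardware FLOPS": 3.0,     # quoted 20-year total: ~60,000x
    "DRAM bandwidth": 1.6,          # quoted 20-year total: ~100x
    "interconnect bandwidth": 1.4,  # quoted 20-year total: ~30x
}
periods = 20 // 2  # ten 2-year periods

for name, rate in rates_per_2yrs.items():
    print(f"{name}: {rate}x/2yrs -> ~{rate ** periods:,.0f}x over 20 years")

# peak hardware FLOPS: 3.0x/2yrs -> ~59,049x over 20 years
# DRAM bandwidth: 1.6x/2yrs -> ~110x over 20 years
# interconnect bandwidth: 1.4x/2yrs -> ~29x over 20 years
```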
Tools
nvidia-smi topo
- incl. nvidia-smi topo -m
- "Display topology information about the system. Use 'nvidia-smi topo -h' for more information. Linux only. Shows all GPUs NVML is able to detect but CPU and NUMA node affinity information will only be shown for GPUs with Kepler or newer architectures. Note: GPU enumeration is the same as NVML." (from the entry for topo under man nvidia-smi; nvidia-smi 550.107, 2024/7/24; nvidia-smi --version gave: NVIDIA-SMI version 550.107.02, NVML version 550.107, DRIVER version 550.107.02, CUDA Version 12.4)
- pyNVML - Python bindings to the NVIDIA Management Library - per PyPI: pip install nvidia-ml-py
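A minimal pyNVML sketch (assuming nvidia-ml-py is installed and an NVIDIA driver is present; exact return types vary slightly across package versions), querying roughly the headline numbers nvidia-smi prints:

```python
# Query driver version and per-GPU memory use via NVML.
import pynvml

pynvml.nvmlInit()
try:
    print("driver:", pynvml.nvmlSystemGetDriverVersion())
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)  # str in recent nvidia-ml-py, bytes in older releases
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)  # .total / .used / .free, in bytes
        print(f"GPU {i}: {name}: {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB used")
finally:
    pynvml.nvmlShutdown()
```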
Triton
- Triton - NVIDIA Triton Inference Server - docs
- Triton Inference Server
- srush/Triton-Puzzles - Puzzles for learning Triton (OpenAI's Triton GPU programming language, not the NVIDIA inference server above); see the kernel sketch below
- Triton-Viz: A Visualization Toolkit for programming with Triton: Deep-Learning-Profiling-Tools/triton-viz
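For orientation before the puzzles: a minimal vector-add kernel in the style of the official Triton tutorial (names and the block size are illustrative, and a CUDA-capable GPU is assumed). The pattern it shows (program id, block offsets, masked loads/stores) is what the puzzles exercise:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the final, partially-full block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)  # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

# e.g. add(torch.rand(10_000, device="cuda"), torch.rand(10_000, device="cuda"))
```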
Related / See Also
Over at Computer Architecture