- Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
- Matryoshka Quantization
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
- The Hardware Lottery
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- Efficient softmax approximation for GPUs (2016-09-14)
Includes papers on quantisation
Resources
- How to Scale Your Model (clipped) - A Systems View of LLMs on TPUs, from the JAX Team
- How is LLaMa.cpp possible?
- Understanding GPU Memory 1: Visualizing All Allocations over Time
- Mixed Precision Training
- AI Accelerators - Part II: Transistors and Pizza (or: Why Do We Need Accelerators?)
- see also the other parts in this series by Adi Fuchs
- GPU Architecture Explained (Cherry Servers) - mainly useful for its section "A brief history of Nvidia GPU Architecture"
- AI and Memory Wall
- AI and Memory Wall - blog write-up of the above, published before the IEEE paper
- they summarise it very well in the closing paragraph: "The computational cost of training recent SOTA Transformer models in NLP has been scaling at a rate of 750x/2yrs, and the model parameter size has been scaling at 400x/2yrs. In contrast, the peak hardware FLOPS is scaling at a rate of 3x/2yrs, while both the DRAM and interconnect bandwidth have been increasingly falling behind, with a scaling rate of 1.6x/2yrs and 1.4x/2yrs, respectively. To put these numbers into perspective, peak hardware FLOPS has increased by 60,000x over the past 20 years, while DRAM/Interconnect bandwidth has only scaled by a factor of 100x/30x over the same time period. With these trends, memory, in particular intra/inter-chip memory transfer, will soon become the main limiting factor in training large AI models."
- the per-2-year rates compound into the quoted 20-year totals; see the sanity check just after this list
- The Best GPUs for Deep Learning in 2023 - An In-depth Analysis
- Understanding DRAM (Tech Talk, Simms International) - quick DRAM vs SRAM overview
- NVIDIA A100 Tensor Core GPU Architecture (Ampere Architecture White paper)
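The per-2-year rates and the 20-year totals quoted in the AI and Memory Wall closing paragraph are mutually consistent, since 20 years is ten 2-year periods; a quick sanity check:

```python
# Compound the per-2-year growth factors from the quote over ten
# 2-year periods (= 20 years) and compare with the quoted totals.
rates_per_2yrs = {
    "peak hardware FLOPS": 3.0,     # quoted 20-year total: ~60,000x
    "DRAM bandwidth": 1.6,          # quoted 20-year total: ~100x
    "interconnect bandwidth": 1.4,  # quoted 20-year total: ~30x
}
periods = 20 // 2  # ten 2-year periods

for name, rate in rates_per_2yrs.items():
    print(f"{name}: {rate}x/2yrs -> ~{rate ** periods:,.0f}x over 20 years")

# peak hardware FLOPS: 3.0x/2yrs -> ~59,049x over 20 years
# DRAM bandwidth: 1.6x/2yrs -> ~110x over 20 years
# interconnect bandwidth: 1.4x/2yrs -> ~29x over 20 years
```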
Tools
nvidia-smi topo
- incl. nvidia-smi topo -m
- "Display topology information about the system. Use 'nvidia-smi topo -h' for more information. Linux only. Shows all GPUs NVML is able to detect but CPU and NUMA node affinity information will only be shown for GPUs with Kepler or newer architectures. Note: GPU enumeration is the same as NVML." (from the entry for topo under man nvidia-smi; nvidia-smi 550.107, 2024/7/24; nvidia-smi --version gave: NVIDIA-SMI version 550.107.02, NVML version 550.107, DRIVER version 550.107.02, CUDA Version 12.4)
- pyNVML - Python bindings to the NVIDIA Management Library - per PyPI: pip install nvidia-ml-py
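A minimal pyNVML sketch (assuming nvidia-ml-py is installed and an NVIDIA driver is present; exact return types vary slightly across package versions), querying roughly the headline numbers nvidia-smi prints:

```python
# Query driver version and per-GPU memory use via NVML.
import pynvml

pynvml.nvmlInit()
try:
    print("driver:", pynvml.nvmlSystemGetDriverVersion())
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)  # str in recent nvidia-ml-py, bytes in older releases
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)  # .total / .used / .free, in bytes
        print(f"GPU {i}: {name}: {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB used")
finally:
    pynvml.nvmlShutdown()
```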
Triton
- Triton - NVIDIA Triton Inference Server - docs
- Triton Inference Server
- srush/Triton-Puzzles - Puzzles for learning Triton (OpenAI's Triton GPU programming language, not the NVIDIA inference server above); see the kernel sketch below
- Triton-Viz: A Visualization Toolkit for programming with Triton: Deep-Learning-Profiling-Tools/triton-viz
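For orientation before the puzzles: a minimal vector-add kernel in the style of the official Triton tutorial (names and the block size are illustrative, and a CUDA-capable GPU is assumed). The pattern it shows (program id, block offsets, masked loads/stores) is what the puzzles exercise:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the final, partially-full block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)  # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

# e.g. add(torch.rand(10_000, device="cuda"), torch.rand(10_000, device="cuda"))
```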
Related / See Also
Over at Computer Architecture