PyTorch Distributed and Multi-GPU Training
- ✨ Writing Distributed Applications with PyTorch — PyTorch Tutorials 2.6.0+cu124 documentation - tutorial by Séb Arnold
- Distributed communication package - torch.distributed — PyTorch 2.6 documentation
- Multi node PyTorch Distributed Training Guide For People In A Hurry - amazing + very condensed
- PyTorch Distributed Training - Lei Mao
- PyTorch Distributed Evaluation - Lei Mao
- pytorch/torch/distributed/ as of torch v2.0.1 bugfix
PyTorch Parallel and Distributed Training Tutorials
The Parallel and Distributed Training section of the PyTorch Tutorials contains the following resources (as of 2025-04-08):
- Distributed and Parallel Training Tutorials
- PyTorch Distributed Overview
- Distributed Data Parallel in PyTorch - Video Tutorials
- Single-Machine Model Parallel Best Practices
- Getting Started with Distributed Data Parallel
- Writing Distributed Applications with PyTorch
- Getting Started with Fully Sharded Data Parallel (FSDP)
- Advanced Model Training with Fully Sharded Data Parallel (FSDP)
- Introduction to Libuv TCPStore Backend
- Large Scale Transformer model training with Tensor Parallel (TP)
- Introduction to Distributed Pipeline Parallelism
- Customize Process Group Backends Using Cpp Extensions
- Getting Started with Distributed RPC Framework
- Implementing a Parameter Server Using Distributed RPC Framework
- Implementing Batch RPC Processing Using Asynchronous Executions
- Combining Distributed DataParallel with Distributed RPC Framework
- Distributed Training with Uneven Inputs Using the Join Context Manager
Notes - PyTorch Distributed and Multi-GPU
- class torch.distributed.ReduceOp
- Distributed Key-Value Store - The distributed package comes with a distributed key-value store, which can be used to share information between processes in the group as well as to initialize the distributed package in torch.distributed.init_process_group() (by explicitly creating the store as an alternative to specifying init_method). There are 3 choices for Key-Value Stores: TCPStore, FileStore, and HashStore. A minimal initialization sketch follows below.
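The following is a minimal sketch of that flow, not taken from the linked docs: the host, port, and gloo backend are placeholder assumptions. Rank 0 hosts a TCPStore, every rank passes it to torch.distributed.init_process_group() instead of an init_method, the same store is then reused as a small key-value channel, and a ReduceOp.SUM all-reduce ties in the ReduceOp entry above.

```python
# Hedged sketch: explicit TCPStore instead of init_method; host/port are placeholders.
from datetime import timedelta

import torch
import torch.distributed as dist


def init_with_tcpstore(rank: int, world_size: int) -> None:
    # Rank 0 hosts the store; the other ranks connect to it.
    store = dist.TCPStore(
        host_name="127.0.0.1",          # placeholder master address
        port=29500,                     # placeholder port
        world_size=world_size,
        is_master=(rank == 0),
        timeout=timedelta(seconds=60),
    )

    # Passing store= is the alternative to specifying init_method.
    dist.init_process_group(
        backend="gloo",                 # CPU-friendly backend for the sketch
        store=store,
        rank=rank,
        world_size=world_size,
    )

    # The same store doubles as a small key-value channel between ranks.
    if rank == 0:
        store.set("status", "ready")
    print(rank, store.get("status"))    # every rank reads b"ready"

    # ReduceOp in action: sum each rank's id across the group.
    t = torch.tensor([float(rank)])
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()
```

Each rank would call init_with_tcpstore(rank, world_size), e.g. one process per terminal or via torch.multiprocessing.spawn.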
Distributed Backends
- NVIDIA Collective Communications Library (NCCL)
- The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and Networking. NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, reduce-scatter as well as point-to-point send and receive that are optimized to achieve high bandwidth and low latency over PCIe and NVLink high-speed interconnects within a node and over NVIDIA Mellanox Network across nodes. A minimal all-reduce sketch using one of these collectives follows this list.
- NCCL documentation
- nvidia/nccl - Optimized primitives for collective multi-GPU communication
- Gloo
- Gloo documentation
- Overview
- Rendezvous - creating a gloo::Context
- Algorithms - index of collective algorithms and their semantics and complexity
- Transport details - the transport API and its implementations
- CUDA integration - integration of CUDA aware Gloo algorithms with existing CUDA code
- Latency optimization - a number of tips and tricks to improve performance
- Message Passing Interface - Wikipedia - a portable message-passing standard designed to function on parallel computing architectures
- Message Passing Interface High Performance Computing - nice explainer from New Mexico State University
- MPI Forum - the standardization forum for the Message Passing Interface (MPI); hosts the standard documents and information about the Forum's activities
- Open MPI: Open Source High Performance Computing - A High Performance Message Passing Library
- MPICH is a high performance and widely portable implementation of the Message Passing Interface (MPI) standard
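To make the backend choice concrete, here is a hedged sketch (not from the sources above) of the usual pattern: use nccl when each rank owns a GPU, fall back to gloo otherwise, and run one of the collectives NCCL lists (all-reduce). It assumes a torchrun launch, which sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT in the environment.

```python
# Hedged sketch: backend selection plus a single all-reduce collective.
# Assumes launch via `torchrun --nproc_per_node=N script.py`.
import os

import torch
import torch.distributed as dist


def main() -> None:
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])

    # NCCL needs one GPU per process; Gloo works on CPU and is the usual fallback.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)  # env:// init from MASTER_ADDR/PORT

    if backend == "nccl":
        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)
    else:
        device = torch.device("cpu")

    # One of the collectives described above: every rank contributes its rank id,
    # and all ranks end up with the sum 0 + 1 + ... + (world_size - 1).
    t = torch.tensor([float(rank)], device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

With, say, torchrun --nproc_per_node=4, every rank should print the same sum (0 + 1 + 2 + 3 = 6.0); switching between nccl and gloo changes only the backend string and the device placement.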