PyTorch Distributed and Multi-GPU Training
- ✨ Writing Distributed Applications with PyTorch — PyTorch Tutorials 2.6.0+cu124 documentation - tutorial by Séb Arnold
- Distributed communication package - torch.distributed — PyTorch 2.6 documentation
- Multi node PyTorch Distributed Training Guide For People In A Hurry - amazing + very condensed
- PyTorch Distributed Training - Lei Mao
- PyTorch Distributed Evaluation - Lei Mao
- pytorch/torch/distributed/ as of torch v2.0.1 bugfix
PyTorch Parallel and Distributed Training Tutorials
The Parallel and Distributed Training section of the PyTorch Tutorials contains the following resources (as of 2025-04-08):
- Distributed and Parallel Training Tutorials
- PyTorch Distributed Overview
- Distributed Data Parallel in PyTorch - Video Tutorials
- Single-Machine Model Parallel Best Practices
- Getting Started with Distributed Data Parallel
- Writing Distributed Applications with PyTorch
- Getting Started with Fully Sharded Data Parallel (FSDP)
- Advanced Model Training with Fully Sharded Data Parallel (FSDP)
- Introduction to Libuv TCPStore Backend
- Large Scale Transformer model training with Tensor Parallel (TP)
- Introduction to Distributed Pipeline Parallelism
- Customize Process Group Backends Using Cpp Extensions
- Getting Started with Distributed RPC Framework
- Implementing a Parameter Server Using Distributed RPC Framework
- Implementing Batch RPC Processing Using Asynchronous Executions
- Combining Distributed DataParallel with Distributed RPC Framework
- Distributed Training with Uneven Inputs Using the Join Context Manager
Notes - PyTorch Distributed and Multi-GPU
- class torch.distributed.ReduceOp
- Distributed Key-Value Store - The distributed package comes with a distributed key-value store, which can be used to share information between processes in the group as well as to initialize the distributed package in torch.distributed.init_process_group() (by explicitly creating the store as an alternative to specifying init_method). There are 3 choices for Key-Value Stores: TCPStore, FileStore, and HashStore. A minimal initialization sketch follows below.
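The following is a minimal sketch of that flow, not taken from the linked docs: the host, port, and gloo backend are placeholder assumptions. Rank 0 hosts a TCPStore, every rank passes it to torch.distributed.init_process_group() instead of an init_method, the same store is then reused as a small key-value channel, and a ReduceOp.SUM all-reduce ties in the ReduceOp entry above.

```python
# Hedged sketch: explicit TCPStore instead of init_method; host/port are placeholders.
from datetime import timedelta

import torch
import torch.distributed as dist


def init_with_tcpstore(rank: int, world_size: int) -> None:
    # Rank 0 hosts the store; the other ranks connect to it.
    store = dist.TCPStore(
        host_name="127.0.0.1",          # placeholder master address
        port=29500,                     # placeholder port
        world_size=world_size,
        is_master=(rank == 0),
        timeout=timedelta(seconds=60),
    )

    # Passing store= is the alternative to specifying init_method.
    dist.init_process_group(
        backend="gloo",                 # CPU-friendly backend for the sketch
        store=store,
        rank=rank,
        world_size=world_size,
    )

    # The same store doubles as a small key-value channel between ranks.
    if rank == 0:
        store.set("status", "ready")
    print(rank, store.get("status"))    # every rank reads b"ready"

    # ReduceOp in action: sum each rank's id across the group.
    t = torch.tensor([float(rank)])
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()
```

Each rank would call init_with_tcpstore(rank, world_size), e.g. one process per terminal or via torch.multiprocessing.spawn.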
Distributed Backends
- NVIDIA Collective Communications Library (NCCL)
- The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and Networking. NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, reduce-scatter as well as point-to-point send and receive that are optimized to achieve high bandwidth and low latency over PCIe and NVLink high-speed interconnects within a node and over NVIDIA Mellanox Network across nodes. A minimal all-reduce sketch using one of these collectives follows this list.
- NCCL documentation
- nvidia/nccl - Optimized primitives for collective multi-GPU communication
- Gloo
- Gloo documentation
- Overview
- Rendezvous - creating a gloo::Context
- Algorithms - index of collective algorithms and their semantics and complexity
- Transport details - the transport API and its implementations
- CUDA integration - integration of CUDA aware Gloo algorithms with existing CUDA code
- Latency optimization - a number of tips and tricks to improve performance
- Message Passing Interface - Wikipedia - a portable message-passing standard designed to function on parallel computing architectures
- Message Passing Interface High Performance Computing - nice explainer from New Mexico State University
- MPI Forum - the standardization forum for the Message Passing Interface (MPI); hosts the standard documents and information about the Forum's activities
- Open MPI: Open Source High Performance Computing - A High Performance Message Passing Library
- MPICH is a high performance and widely portable implementation of the Message Passing Interface (MPI) standard
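To make the backend choice concrete, here is a hedged sketch (not from the sources above) of the usual pattern: use nccl when each rank owns a GPU, fall back to gloo otherwise, and run one of the collectives NCCL lists (all-reduce). It assumes a torchrun launch, which sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT in the environment.

```python
# Hedged sketch: backend selection plus a single all-reduce collective.
# Assumes launch via `torchrun --nproc_per_node=N script.py`.
import os

import torch
import torch.distributed as dist


def main() -> None:
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])

    # NCCL needs one GPU per process; Gloo works on CPU and is the usual fallback.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)  # env:// init from MASTER_ADDR/PORT

    if backend == "nccl":
        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)
    else:
        device = torch.device("cpu")

    # One of the collectives described above: every rank contributes its rank id,
    # and all ranks end up with the sum 0 + 1 + ... + (world_size - 1).
    t = torch.tensor([float(rank)], device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

With, say, torchrun --nproc_per_node=4, every rank should print the same sum (0 + 1 + 2 + 3 = 6.0); switching between nccl and gloo changes only the backend string and the device placement.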