Probably want to split this out into separate docs and have a PyTorch/ subdirectory in CS/
See also CUDA
GPU Memory
- Understanding CUDA Memory Usage — PyTorch 2.5 documentation
- pytorch.org/memory_viz
- see also the PyTorch blog series on memory profiling: Understanding GPU Memory 1: Visualizing All Allocations over Time - great, very helpful post; just remember the stack trace is one call per line and gets truncated; useful to ask Grok/GPT (with reasoning) for an explanation (and read the reasoning too, which is what I did on 2025-03-11)
- Understanding GPU Memory 2 Finding and Removing Reference Cycles
The Memory Snapshot and the Memory Profiler are available in the v2.1 release of PyTorch as experimental features.
- Memory Snapshot can be found in the PyTorch Memory docs here
- Memory Profiler can be found in the PyTorch Profiler docs here
Chunked Cross-entropy calculation - for bf16 models; loss calculated at fp32
This is from the torchtune docs on CEWithChunkedOutputLoss — torchtune 0.3 documentation
Cross-entropy with chunked outputs that saves memory by only upcasting one chunk at a time.
Whenever the model is trained with bf16, before running CE, we have to upcast it to fp32 for better accuracy and stability. When upcasting happens, the memory usage doubles. Models like llama3 have large vocabulary size and, therefore, have a large output tensor of shape (bsz, num_tokens, vocab_size). If we chunk on the token level, you can still compute the cross entropy normally, but upcasting only one chunk at a time saves considerable memory. The CE and upcasting have to be compiled together for better performance. When using this class, we recommend using torch.compile() only on the method compute_cross_entropy. The gains from chunking won’t be realized if you compile the entire class. For more details, please refer to: https://github.com/pytorch/torchtune/pull/1390
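A minimal sketch of the idea (not the torchtune implementation; the function name, chunk count, and ignore_index are illustrative):
import torch
import torch.nn.functional as F

def chunked_cross_entropy(logits, labels, num_chunks=8, ignore_index=-100):
    # logits: (bsz, num_tokens, vocab_size) in bf16; labels: (bsz, num_tokens) long
    # flatten batch and token dims, then chunk on the token level
    logits_chunks = logits.reshape(-1, logits.size(-1)).chunk(num_chunks, dim=0)
    labels_chunks = labels.reshape(-1).chunk(num_chunks, dim=0)
    total_loss = 0.0
    total_tokens = 0
    for logit_chunk, label_chunk in zip(logits_chunks, labels_chunks):
        # upcast only this chunk to fp32 before computing CE
        total_loss = total_loss + F.cross_entropy(
            logit_chunk.float(), label_chunk,
            ignore_index=ignore_index, reduction="sum",
        )
        total_tokens += (label_chunk != ignore_index).sum()
    return total_loss / total_tokens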
Profiling in PyTorch
Distributed Data Parallel
DistributedDataParallel (DDP) implements data parallelism at the module level which can run across multiple machines. Applications using DDP should spawn multiple processes and create a single DDP instance per process. DDP uses collective communications in the torch.distributed package to synchronize gradients and buffers. More specifically, DDP registers an autograd hook for each parameter given by model.parameters() and the hook will fire when the corresponding gradient is computed in the backward pass. Then DDP uses that signal to trigger gradient synchronization across processes. Please refer to DDP design note for more details.
See Getting Started with Distributed Data Parallel.
See also:
- torch.distributed.init_process_group: Initializes the default distributed process group, and this will also initialize the distributed package. There are 2 main ways to initialize a process group: (1) Specify store, rank, and world_size explicitly. (2) Specify init_method (a URL string) which indicates where/how to discover peers. Optionally specify rank and world_size, or encode all required parameters in the URL and omit them. (A minimal initialization sketch follows this list.)
  - backend - valid values include mpi, gloo, nccl, and ucc
  - world_size - number of processes participating in the job
  - rank - rank of the current process (should be an integer between 0 and world_size - 1)
  - store (Store, optional) – Key/value store accessible to all workers, used to exchange connection/address information. Mutually exclusive with init_method.
  - timeout (timedelta, optional) – Timeout for operations executed against the process group. Default value equals 30 minutes.
  - (group_name - argument is deprecated - Group name)
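A minimal initialization sketch using environment variables (assumes a launcher such as torchrun sets MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE; the function names here are illustrative):
import os
import torch
import torch.distributed as dist

def setup():
    # init_method="env://" reads MASTER_ADDR/MASTER_PORT from the environment
    dist.init_process_group(
        backend="nccl" if torch.cuda.is_available() else "gloo",
        init_method="env://",
        rank=int(os.environ["RANK"]),
        world_size=int(os.environ["WORLD_SIZE"]),
    )

def cleanup():
    dist.destroy_process_group()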
Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel
TL;DR: Use DistributedDataParallel even for multi-GPU single-node training, because DDP handles the multiprocessing and data handling for you, instead of you potentially making a mistake (resulting in slow code or a bug).
Most use cases involving batched inputs and multiple GPUs should default to using [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel) to utilize more than one GPU.
There are significant caveats to using CUDA models with [multiprocessing](https://pytorch.org/docs/stable/multiprocessing.html#module-torch.multiprocessing); unless care is taken to meet the data handling requirements exactly, it is likely that your program will have incorrect or undefined behavior.
It is recommended to use [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel), instead of [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html#torch.nn.DataParallel), to do multi-GPU training, even if there is only a single node.
The difference between DistributedDataParallel and DataParallel is: DistributedDataParallel uses multiprocessing, where a process is created for each GPU, while DataParallel uses multithreading. By using multiprocessing [instead of multithreading], each GPU has its dedicated process; this avoids the performance overhead caused by the GIL of the Python interpreter.
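A minimal single-node sketch of wrapping a model in DDP, one process per GPU (assumes launch via torchrun so LOCAL_RANK is set; the toy model, loss, and data are illustrative):
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    device = torch.device(f"cuda:{local_rank}")
    model = torch.nn.Linear(10, 10).to(device)
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

    inputs = torch.randn(32, 10, device=device)
    targets = torch.randn(32, 10, device=device)
    loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
    loss.backward()   # gradients are all-reduced across processes here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # e.g. launched with: torchrun --nproc_per_node=NUM_GPUS this_script.py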
DistributedSampler from torch.utils.data.distributed
How does the DistributedSampler (together with ddp) split the dataset to different gpus? I know it will split the dataset to num_gpus chunks and each chunk will go to one of the gpus. Is it randomly sampled or sequentially?
First, it checks if the dataset size is divisible by num_replicas. If not, extra samples are added.
If shuffle is turned on, it performs random permutation before subsampling. You should use the set_epoch function to modify the random seed for that.
Then the DistributedSampler simply subsamples the data among the whole dataset: https://github.com/pytorch/pytorch/blob/master/torch/utils/data/distributed.py#L68
# subsample
indices = indices[self.rank:self.total_size:self.num_replicas]
Note that adding extra data could cause issues at evaluation time due to the duplicated data. I personally use a custom sampler (DistributedEvalSampler) when testing my models.
See the source code for torch.utils.data.distributed.DistributedSampler and the to-the-point documentation, also included here:
torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=None, rank=None, shuffle=True, seed=0, drop_last=False)
- num_replicas (int, optional) – Number of processes participating in distributed training. By default, world_size is retrieved from the current distributed group.
- rank (int, optional) – Rank of the current process within num_replicas. By default, rank is retrieved from the current distributed group.
- shuffle (bool, optional) – If True (default), sampler will shuffle the indices.
- seed (int, optional) – random seed used to shuffle the sampler if shuffle=True. This number should be identical across all processes in the distributed group. Default: 0.
- drop_last (bool, optional) – if True, then the sampler will drop the tail of the data to make it evenly divisible across the number of replicas. If False, the sampler will add extra indices to make the data evenly divisible across the replicas. Default: False.
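A minimal usage sketch with a DataLoader under DDP (assumes the process group is already initialized so num_replicas/rank can be inferred; the dataset and batch size are illustrative - the set_epoch call is the important detail):
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
# num_replicas/rank default to the current process group when torch.distributed is initialized
sampler = DistributedSampler(dataset, shuffle=True, drop_last=False)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(3):
    # reseeds the shuffle so each epoch uses a different permutation, consistent across ranks
    sampler.set_epoch(epoch)
    for x, y in loader:
        pass  # training step here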
Useful functions / utilities
Use torch.utils.data.get_worker_info() to return the information about the current [DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) iterator worker process.
In particular, this is useful when using distributed data parallel (or multiprocessing in isolation, if you’ve rolled your own multiprocessing script), specifically to get the random seed and the datasets that reside in each worker process.
When called in a worker, this returns an object guaranteed to have the following attributes:
- id: the current worker id.
- num_workers: the total number of workers.
- seed: the random seed set for the current worker. This value is determined by main process RNG and the worker id. See [DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)’s documentation for more details.
- dataset: the copy of the dataset object in this process. Note that this will be a different object in a different process than the one in the main process.
When called in the main process, this returns None.
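A quick sketch of inspecting worker info from a worker_init_fn (the dataset and worker count are illustrative):
import torch
from torch.utils.data import DataLoader, TensorDataset, get_worker_info

def worker_init_fn(worker_id):
    info = get_worker_info()
    # info is None in the main process; in a worker it carries id, num_workers,
    # seed, and the worker's own copy of the dataset
    print(f"worker {info.id}/{info.num_workers}, seed={info.seed}")

dataset = TensorDataset(torch.arange(100).float())
loader = DataLoader(dataset, batch_size=10, num_workers=2,
                    worker_init_fn=worker_init_fn)
for batch in loader:
    pass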
Data and Data Loading
Single- vs Multi-process data loading - num_workers
See Single-process data loading (default at the time of writing) and Multi-process data loading.
Setting the argument num_workers as a positive integer will turn on multi-process data loading with the specified number of loader worker processes.
In this mode, each time an iterator of a DataLoader is created (e.g., when you call enumerate(dataloader)), num_workers worker processes are created. At this point, the dataset, collate_fn, and worker_init_fn are passed to each worker, where they are used to initialize, and fetch data. This means that dataset access together with its internal IO, transforms (including collate_fn) runs in the worker process.
[torch.utils.data.get_worker_info()](https://pytorch.org/docs/stable/data.html#torch.utils.data.get_worker_info) returns various useful information in a worker process (including the worker id, dataset replica, initial seed, etc.), and returns None in the main process. Users may use this function in dataset code and/or worker_init_fn to individually configure each dataset replica, and to determine whether the code is running in a worker process. For example, this can be particularly helpful in sharding the dataset.
For map-style datasets, the main process generates the indices using sampler and sends them to the workers. So any shuffle randomization is done in the main process, which guides loading by assigning indices to load.
For iterable-style datasets, since each worker process gets a replica of the dataset object, naive multi-process loading will often result in duplicated data. Using [torch.utils.data.get_worker_info()](https://pytorch.org/docs/stable/data.html#torch.utils.data.get_worker_info) and/or worker_init_fn, users may configure each replica independently. (See [IterableDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) documentation for how to achieve this.) For similar reasons, in multi-process loading, the drop_last argument drops the last non-full batch of each worker’s iterable-style dataset replica.
Workers are shut down once the end of the iteration is reached, or when the iterator becomes garbage collected.
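As mentioned above for iterable-style datasets, here is a sketch of sharding an IterableDataset across loader workers with get_worker_info so each worker yields a disjoint slice instead of duplicating the data (the class name, range, and striding scheme are illustrative):
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class RangeStream(IterableDataset):
    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # single-process loading: yield everything
            yield from range(self.start, self.end)
        else:
            # multi-process loading: each worker takes a strided shard
            yield from range(self.start + info.id, self.end, info.num_workers)

loader = DataLoader(RangeStream(0, 20), batch_size=4, num_workers=2)
# each element appears exactly once across all batches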
If I have my whole dataset loaded into memory in PyTorch, should I set num_workers in my DataLoader to 0 or a positive integer?
If you have your entire dataset loaded into memory, you can still benefit from setting num_workers in your DataLoader to a positive integer. The num_workers parameter determines the number of worker subprocesses used for data loading. When num_workers is greater than 0, the DataLoader will use multiple subprocesses to parallelize data loading, which can improve the speed of loading and preparing data batches.
Even when the dataset is in memory, there might be additional time-consuming operations, such as data augmentation, custom collation, and formatting, that can be parallelized using multiple workers. By setting num_workers to a positive integer, you can offload these operations to separate subprocesses, reducing the time your main training loop spends waiting for data.
However, it’s essential to find the right balance when setting num_workers. Using too many workers can lead to increased memory usage and contention for resources, which might negatively impact training speed. It is generally recommended to start with a small number of workers (e.g., 2 or 4) and experiment with different values to find the optimal setting for your specific use case.
If your dataset is small and the data-loading operations are not computationally expensive, you may not see significant benefits from using multiple workers, and setting num_workers to 0 might be sufficient. In this case, the DataLoader will use the main process for data loading without spawning any additional subprocesses.
torch.utils.data
Dataset Classes
PyTorch’s data utilities modules implement several dataset types. The two detailed at the top of the docs are the map-style and iterable dataset types, but several other variants are convenient (quick usage sketches of some of the less well-known ones follow the list below):
- Dataset: map-style - should implement the __getitem__ and __len__ data model methods
- IterableDataset: the iterable - should implement the __iter__ protocol (data model method). Appropriate when random reads from the dataset are expensive or not used, and where the batch size depends on the fetched data
Useful but not well-known:
- TensorDataset: Dataset wrapping tensors. Each sample will be retrieved by indexing tensors along the first dimension.
- ConcatDataset: Dataset as a concatenation of multiple datasets. This class is useful to assemble different existing datasets.
- ChainDataset: Dataset for chaining multiple [IterableDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset)s. This class is useful to assemble different existing dataset streams. The chaining operation is done on-the-fly, so concatenating large-scale datasets with this class will be efficient. Note: “chain” as in itertools.chain …making an iterator that returns elements from the first iterable until it is exhausted, then proceeding to the next iterable, until all of the iterables are exhausted. Used for treating consecutive sequences as a single sequence.
- Subset: Subset of a dataset at specified indices. Used like Subset(dataset, indices)
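A few quick usage sketches of the less well-known classes (the tensors, sizes, and split indices are illustrative):
import torch
from torch.utils.data import TensorDataset, ConcatDataset, Subset

features = torch.randn(100, 8)
labels = torch.randint(0, 2, (100,))

# TensorDataset: index all tensors along the first dimension together
ds = TensorDataset(features, labels)
x0, y0 = ds[0]

# ConcatDataset: treat several map-style datasets as one
combined = ConcatDataset([ds, ds])
assert len(combined) == 200

# Subset: view of a dataset at specified indices
val_split = Subset(ds, indices=list(range(80, 100)))
assert len(val_split) == 20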
Sampler Classes
Also check out the many Sampler classes in torch.utils.data.
Cases when batch size depends on the fetched data (IterableDataset)
Examples of when the batch size would depend on the fetched data include:
- Variable-length sequences: In natural language processing tasks, you may have sequences of varying lengths (e.g., sentences or paragraphs). In such cases, you could group sequences of similar lengths together to minimize padding, which would result in different batch sizes for different groups of sequences.
- Data streams: If you’re processing a continuous data stream, such as logs or sensor data, the amount of data available at a given time might be variable. In this case, you could create batches based on the data available during each fetch operation, leading to varying batch sizes.
- Online learning: In an online learning scenario where the model is trained on-the-fly with incoming data, the amount of new data that becomes available between training steps might not be constant. The batch size would depend on the amount of new data available at each step.
- Adaptive batching: In some cases, you might want to adjust the batch size based on runtime considerations, such as GPU memory usage or computational efficiency. For example, if your model can handle larger batch sizes without running out of memory, you could increase the batch size for more efficient training.
- Data filtering: If you apply a filtering operation on the dataset during the fetching process, the number of samples that pass the filter might not be constant. In this case, the batch size would depend on the number of samples that meet the filtering criteria at each step.
In these scenarios, using an IterableDataset allows you to dynamically adjust the batch size based on the fetched data, making it more suitable for handling such variable batch sizes.
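A sketch of an IterableDataset that yields batches whose size depends on the data, e.g. grouping variable-length sequences up to a rough token budget (the class name, budget, and data source are illustrative):
import torch
from torch.utils.data import IterableDataset, DataLoader

class TokenBudgetBatches(IterableDataset):
    def __init__(self, sequences, max_tokens=64):
        self.sequences = sequences       # list of 1-D tensors of varying length
        self.max_tokens = max_tokens     # rough per-batch token budget

    def __iter__(self):
        batch, tokens = [], 0
        for seq in self.sequences:
            if batch and tokens + len(seq) > self.max_tokens:
                yield batch              # batch size varies with sequence lengths
                batch, tokens = [], 0
            batch.append(seq)
            tokens += len(seq)
        if batch:
            yield batch

sequences = [torch.randint(0, 100, (int(torch.randint(5, 30, (1,))),)) for _ in range(50)]
# batch_size=None disables automatic batching; the dataset itself emits the batches
loader = DataLoader(TokenBudgetBatches(sequences), batch_size=None)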
Memory Pinning in Data Loading
See Memory Pinning.
Host to GPU copies are much faster when they originate from pinned (page-locked) memory. See Use pinned memory buffers for more details on when and how to use pinned memory generally.
For data loading, passing pin_memory=True to a [DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) will automatically put the fetched data Tensors in pinned memory, and thus enables faster data transfer to CUDA enabled GPUs.
Using pinned memory buffers - pin_memory and non_blocking=True
Host to GPU copies are much faster when they originate from pinned (page-locked) memory. CPU tensors and storages expose a [pin_memory()](https://pytorch.org/docs/stable/generated/torch.Tensor.pin_memory.html#torch.Tensor.pin_memory) method, that returns a copy of the object, with data put in a pinned region.
Also, once you pin a tensor or storage, you can use asynchronous GPU copies. Just pass an additional non_blocking=True argument to a [to()](https://pytorch.org/docs/stable/generated/torch.Tensor.to.html#torch.Tensor.to) or a [cuda()](https://pytorch.org/docs/stable/generated/torch.Tensor.cuda.html#torch.Tensor.cuda) call. This can be used to overlap data transfers with computation.
You can make the [DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) return batches placed in pinned memory by passing pin_memory=True to its constructor.
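A sketch combining the two: a DataLoader with pin_memory=True plus non_blocking copies in the training loop (assumes a CUDA device is available; the model, shapes, and loss are illustrative):
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 128), torch.randn(1024, 1))
loader = DataLoader(dataset, batch_size=64, pin_memory=True, num_workers=2)

device = torch.device("cuda")
model = torch.nn.Linear(128, 1).to(device)

for x, y in loader:
    # batches come back in pinned memory, so these host-to-GPU copies can be asynchronous
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    loss = torch.nn.functional.mse_loss(model(x), y)  # runs once the copies complete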
See also How to Optimize Data Transfers in CUDA C/C++ | NVIDIA Technical Blog
Custom batch types and Memory pinning
The default memory pinning logic only recognizes Tensors and maps and iterables containing Tensors. By default, if the pinning logic sees a batch that is a custom type (which will occur if you have a collate_fn that returns a custom batch type), or if each element of your batch is a custom type, the pinning logic will not recognize them, and it will return that batch (or those elements) without pinning the memory. To enable memory pinning for custom batch or data type(s), define a pin_memory() method on your custom type(s).
import torch
from torch.utils.data import DataLoader, TensorDataset

class SimpleCustomBatch:
    def __init__(self, data):
        transposed_data = list(zip(*data))
        self.inp = torch.stack(transposed_data[0], 0)
        self.tgt = torch.stack(transposed_data[1], 0)

    # custom memory pinning method on custom type
    def pin_memory(self):
        self.inp = self.inp.pin_memory()
        self.tgt = self.tgt.pin_memory()
        return self

def collate_wrapper(batch):
    return SimpleCustomBatch(batch)

inps = torch.arange(10 * 5, dtype=torch.float32).view(10, 5)
tgts = torch.arange(10 * 5, dtype=torch.float32).view(10, 5)
dataset = TensorDataset(inps, tgts)

loader = DataLoader(dataset, batch_size=2, collate_fn=collate_wrapper,
                    pin_memory=True)

for batch_ndx, sample in enumerate(loader):
    print(sample.inp.is_pinned())
    print(sample.tgt.is_pinned())
The DataLoader loader is created using the dataset, a batch size of 2, the collate_wrapper function as the collate_fn, and setting pin_memory to True. This DataLoader will be responsible for providing batches of data to the training loop.
- The collate_wrapper function will be called for each batch, creating a SimpleCustomBatch object with the data.
- Since pin_memory is set to True, the DataLoader will call the pin_memory method of the SimpleCustomBatch object for each batch. This is where the custom pin_memory method is used.
In the loop iterating over the DataLoader (for batch_ndx, sample in enumerate(loader)), the is_pinned() method is called on both sample.inp and sample.tgt to check if the input and target tensors are in pinned (page-locked) host memory. The loop prints the result, indicating whether the tensors are pinned or not.
In summary, this code demonstrates how to create a custom batch object with a DataLoader for handling input and target data in a specific way. The pin_memory method of the custom batch object is called by the DataLoader when pin_memory is set to True. The custom batch object is used in the training loop to access the input and target tensors.
How does the DataLoader object know to use the pin_memory method of the batch object? Where is this behaviour implemented?
When pin_memory is set to True, the DataLoader will try to pin the memory of the output batch objects. This behavior is implemented in the _utils/pin_memory.py file of the PyTorch source code. Specifically, the pin_memory function in this file is responsible for pinning the memory of the batch object.
Here’s how it works:
- When a batch is fetched from the dataset, the collate_fn (in this case, collate_wrapper) is called to create a batch object (a SimpleCustomBatch instance).
- After the batch object is created, the DataLoader checks if the pin_memory argument is set to True. If it is, the DataLoader calls the pin_memory function from _utils/pin_memory.py with the batch object as its argument.
- The pin_memory function checks if the input object (the batch object) has a pin_memory method. If it does, it calls the object’s pin_memory method. In this case, it will call the pin_memory method of the SimpleCustomBatch object.
This way, the DataLoader is aware of the custom pin_memory method in the SimpleCustomBatch class and uses it when needed. You can find the implementation of the pin_memory function in the PyTorch GitHub repository:
https://github.com/pytorch/pytorch/blob/master/torch/utils/data/_utils/pin_memory.py
Pinning Memory - pin_memory from torch.utils.data
How pin_memory is implemented, for example in DataLoaders. Taken from torch/utils/data/_utils/pin_memory.py.
(This is also a nice example of recursion.)
(This is also a nice example of recursion.)
def pin_memory(data, device=None):
    if isinstance(data, torch.Tensor):
        return data.pin_memory(device)
    elif isinstance(data, (str, bytes)):
        return data
    elif isinstance(data, collections.abc.Mapping):
        try:
            return type(data)({k: pin_memory(sample, device) for k, sample in data.items()})  # type: ignore[call-arg]
        except TypeError:
            # The mapping type may not support `__init__(iterable)`.
            return {k: pin_memory(sample, device) for k, sample in data.items()}
    elif isinstance(data, tuple) and hasattr(data, '_fields'):  # namedtuple
        return type(data)(*(pin_memory(sample, device) for sample in data))
    elif isinstance(data, tuple):
        return [pin_memory(sample, device) for sample in data]  # Backwards compatibility.
    elif isinstance(data, collections.abc.Sequence):
        try:
            return type(data)([pin_memory(sample, device) for sample in data])  # type: ignore[call-arg]
        except TypeError:
            # The sequence type may not support `__init__(iterable)` (e.g., `range`).
            return [pin_memory(sample, device) for sample in data]
    elif hasattr(data, "pin_memory"):
        return data.pin_memory()
    else:
        return data
Disabling Gradient Computation: torch.no_grad vs torch.inference_mode
Lots of info:
- PyTorch Dev Podcast episode on Inference Mode
- Why are there two different flags to disable gradient computation in PyTorch?
- Great answer: inference mode additionally removes overhead arising from Version control of tensors & View Tracking of Tensors and passes on that “PyTorch dev team says they have seen a bump of 5-10% while deploying models in production at Facebook.”
- The legendary Piotr Bialecki says: “you can depend on runtime errors and as long as no errors are raised, your code should be fine” and that “you are not allowed to set the requires_grad attribute on tensors from an inference_mode context”
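A quick sketch of the behavioural difference (the error message in the comment is paraphrased, not the exact text):
import torch

x = torch.ones(3, requires_grad=True)

with torch.no_grad():
    y = x * 2           # y.requires_grad is False, but y is an ordinary tensor
y.requires_grad_(True)  # allowed: y can re-enter autograd later

with torch.inference_mode():
    z = x * 2           # z is an "inference tensor": no version counter, no view tracking
# z.requires_grad_(True)  # would raise a RuntimeError: inference tensors cannot be
#                         # given requires_grad outside inference mode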
Torch JIT (Just-in-time Compiler)
Steps
- Convert to Torch Script via one of (a) Tracing or (b) Annotation - remember to wrap any unsupported methods in the @torch.jit.ignore decorator
  - (a) Tracing:
    # An example with *tracing*
    import torch
    import torchvision

    # An instance of your model.
    model = torchvision.models.resnet18()

    # An example input you would normally provide to your model's forward() method.
    example = torch.rand(1, 3, 224, 224)

    # Use torch.jit.trace to generate a torch.jit.ScriptModule via tracing.
    traced_script_module = torch.jit.trace(model, example)

    # The traced ScriptModule can now be evaluated identically to a regular PyTorch module:
    output = traced_script_module(torch.ones(1, 3, 224, 224))
    output[0, :5]
    # Out[2]: tensor([-0.2698, -0.0381, 0.4023, -0.3010, -0.0448], grad_fn=<SliceBackward>)
  - (b) Annotation: scripting with torch.jit.script (see the TorchScript section below)
- Serialise the script module:
    traced_script_module.save("traced_resnet_model.pt")
- Load into C++ in your production environment
  - Minimal C++ script / application. The following snippet will do, called example-app.cpp:
    // ***example-app.cpp***
    #include <torch/script.h> // One-stop header.

    #include <iostream>
    #include <memory>

    int main(int argc, const char* argv[]) {
      if (argc != 2) {
        std::cerr << "usage: example-app <path-to-exported-script-module>\n";
        return -1;
      }

      torch::jit::script::Module module;
      try {
        // Deserialize the ScriptModule from a file using torch::jit::load().
        module = torch::jit::load(argv[1]);
      } catch (const c10::Error& e) {
        std::cerr << "error loading the model\n";
        return -1;
      }

      std::cout << "ok\n";
    }
  - Depend on LibTorch and build the application, e.g. using CMake. CMakeLists.txt:
    cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
    project(custom_ops)

    find_package(Torch REQUIRED)

    add_executable(example-app example-app.cpp)
    target_link_libraries(example-app "${TORCH_LIBRARIES}")
    set_property(TARGET example-app PROPERTY CXX_STANDARD 14)
  - Depend on LibTorch:
    - The lib/ folder contains the shared libraries you must link against
    - The include/ folder contains header files your program will need to include
    - The share/ folder contains the necessary CMake configuration to enable the simple find_package(Torch) command above
  - Directory structure for your application:
    example-app/
      CMakeLists.txt
      example-app.cpp
  - Build the application:
    mkdir build
    cd build
    cmake -DCMAKE_PREFIX_PATH=/path/to/libtorch ..
    cmake --build . --config Release
- Execute the Script Module in the C++ application
    // Create a vector of inputs.
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::ones({1, 3, 224, 224}));

    // Execute the model and turn its output into a tensor.
    at::Tensor output = module.forward(inputs).toTensor();
    std::cout << output.slice(/*dim=*/1, /*start=*/0, /*end=*/5) << '\n';
  - The first two lines set up the inputs to our model.
    - We create a vector of torch::jit::IValue (a type-erased value type script::Module methods accept and return) and add a single input.
    - To create the input tensor, we use torch::ones(), the equivalent to torch.ones in the C++ API.
  - We then run the script::Module’s forward method, passing it the input vector we created.
  - In return we get a new IValue, which we convert to a tensor by calling toTensor().
  - In the last line, we print the first five entries of the output.
  - Since we supplied the same input to our model in Python earlier in this tutorial, we should ideally see the same output. Let’s try it out by re-compiling our application and running it with the same serialized model:
    root@4b5a67132e81:/example-app/build# make
    Scanning dependencies of target example-app
    [ 50%] Building CXX object CMakeFiles/example-app.dir/example-app.cpp.o
    [100%] Linking CXX executable example-app
    [100%] Built target example-app
    root@4b5a67132e81:/example-app/build# ./example-app traced_resnet_model.pt
    -0.2698 -0.0381  0.4023 -0.3010 -0.0448
    [ Variable[CPUFloatType]{1,5} ]
Loading a TorchScript Model in C++ — PyTorch Tutorials 2.0.1+cu117 documentation
Introduction to TorchScript — PyTorch Tutorials 2.0.1+cu117 documentation
TorchScript — PyTorch 2.0 documentation
Torch C++ API
To learn more about functions like torch::ones and the PyTorch C++ API in general, refer to its documentation at https://pytorch.org/cppdocs. The PyTorch C++ API provides near feature parity with the Python API, allowing you to further manipulate and process tensors just like in Python.
PyTorch C++ API — PyTorch main documentation
PyTorch’s Compiler
PyTorch Compile
TorchDynamo
TorchDynamo was validated using 7k+ PyTorch GitHub projects
We took a data-driven approach to validate its effectiveness on Graph Capture. We used 7,000+ Github projects written in PyTorch as our validation set. While TorchScript and others struggled to even acquire the graph 50% of the time, often with a big overhead, TorchDynamo acquired the graph 99% of the time, correctly, safely and with negligible overhead – without needing any changes to the original code.
TorchDynamo Update 8: TorchDynamo passed correctness check on 7k+ github models
TorchScript
What is TorchScript? Phind
TorchScript is an intermediate representation of a PyTorch model (a subclass of nn.Module) that can be run in a high-performance environment such as C++. It is a statically typed subset of Python and is designed to be optimizable, serializable, and portable. TorchScript can be created from PyTorch code, making it possible to transition a model from a pure Python program to a TorchScript program. Once a model is converted to TorchScript, it can be saved from a Python process and loaded in a process where there is no Python dependency [Source 1, Source 0, Source 2].
TorchScript provides two main features:
- Tracing: This is a mechanism in which the structure of the model is captured by evaluating it once using example inputs and recording the flow of those inputs through the model. This is suitable for models that make limited use of control flow [Source 5, Source 8].
- Scripting: This method uses annotations in your model that inform the TorchScript compiler that it may directly parse and compile your model code, subject to the constraints imposed by the TorchScript language. Scripting is useful when the model makes use of control flow (like loops and if statements) [Source 0, Source 5, Source 8].
Here is an example of how to convert a PyTorch model to TorchScript using scripting:
import torch

class MyDecisionGate(torch.nn.Module):
    def forward(self, x):
        if x.sum() > 0:
            return x
        else:
            return -x

class MyCell(torch.nn.Module):
    def __init__(self, dg):
        super(MyCell, self).__init__()
        self.dg = dg
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x, h):
        new_h = torch.tanh(self.dg(self.linear(x)) + h)
        return new_h, new_h

my_cell = MyCell(MyDecisionGate())
x, h = torch.rand(3, 4), torch.rand(3, 4)

my_cell = torch.jit.script(my_cell)
print(my_cell.code)
[Source 0]
Once a model is converted to TorchScript, you can save it to a file and load it back:
# Save model
my_cell.save("my_model.pt")
# Load model
loaded_model = torch.jit.load("my_model.pt")
[Source 6]
In conclusion, TorchScript allows you to export PyTorch models for deployment in a high-performance, possibly non-Python, environment. It provides a way to serialize PyTorch models, as well as a way to optimize them for runtime efficiency.
RNNs
- Why do we need “flatten_parameters” when using RNN - original question is in the context of DataParallel
Implementation details of PyTorch LSTM
Recall the LSTM equations (as in the PyTorch nn.LSTM docs):
$$
\begin{aligned}
i_t &= \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}) \\
f_t &= \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) \\
g_t &= \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}) \\
o_t &= \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$
- In the above, the first four equations (the gates $i_t$, $f_t$, $o_t$ and the candidate $g_t$) are simply sums of affine transformations of the current input $x_t$ and the previous hidden representation $h_{t-1}$ respectively
- Everything with a $\sigma$ in front of it can be considered a weighting (since we’ll get values in $(0, 1)$ for the elements of the vector outputs of those equations)
- All the learnable parameters are the $W_{i*}$ and $W_{h*}$ weight matrices and bias vectors
- Only the hidden state $h_t$ and the cell state $c_t$ which composes it (along with the previous time’s cell state $c_{t-1}$) are significant quantities in terms of representation [reading this simply]
  - they are all in the range $(-1, 1)$ - or, as elements of e.g. $c_t$, tend towards it
Main takeaway on implementation:
- PyTorch concatenates all the matrices used to do the input-to-hidden and hidden-to-hidden linear transformations (i.e. $W_{ii}, W_{if}, W_{ig}, W_{io}$ stacked into weight_ih, and $W_{hi}, W_{hf}, W_{hg}, W_{ho}$ stacked into weight_hh), in order to optimize performance by passing a single matrix into the lower-level linear algebra code (presumably occurring inside the CUDA kernel)
- Inspecting the Parameter tensors that constitute the weights, you’ll see just lstm.weight_ih_l0 and lstm.weight_hh_l0 - so 2 matrices despite the above equations having a total of 8 matrices
  - The matrices are of course decomposable along the row axis (first dimension), so we can consider each of the four gate matrices as the respective row blocks of the overall weight_ih and weight_hh matrices PyTorch keeps in the LSTM implementation
# Example from the Tacotron implementation (encoder's LSTM) which uses
# - input dimension of 512; and
# - hidden state dim of 512 // 2 = 256
# weight matrices
encoder.lstm.weight_ih_l0 -> torch.Size([1024, 512]) # input -> hidden
encoder.lstm.weight_hh_l0 -> torch.Size([1024, 256]) # hidden -> hidden
# => can consider this pre-multiplication because second ("column") dimension has
# dimension of {input, hidden state} respectively
# biases
encoder.lstm.bias_ih_l0 -> torch.Size([1024])
encoder.lstm.bias_hh_l0 -> torch.Size([1024])
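A quick sketch showing how to recover the per-gate matrices from the concatenated parameters (the layer sizes are illustrative; PyTorch stacks the row blocks in the order i, f, g, o):
import torch

lstm = torch.nn.LSTM(input_size=512, hidden_size=256)
print(lstm.weight_ih_l0.shape)  # torch.Size([1024, 512]) = (4 * 256, 512)

# split the stacked weight matrix back into the four gate matrices
W_ii, W_if, W_ig, W_io = lstm.weight_ih_l0.chunk(4, dim=0)
print(W_ii.shape)  # torch.Size([256, 512])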
- Additions:
  - If you pass bidirectional=True you’ll get extra Parameters set up with the suffix _reverse - e.g. encoder.lstm.weight_ih_l0_reverse or encoder.lstm.weight_hh_l0_reverse for the weights, or encoder.lstm.bias_ih_l0_reverse for a bias vector
  - If you pass num_layers=4 or some other integer, you’ll get additional matrices set up to transform between representations
    - Question: Would you get at most one set of additional matrices set up? There needs to be at least one more matrix set up (unless the input dimension is equal to the hidden dimension), since the matrix we used before would now have its input come from the “previous” (lower) layer LSTM - as opposed to whatever time-varying features we’re feeding into the first LSTM layer (lowest LSTM)
    - [cont.] …but could PyTorch just re-use that same matrix to perform the calculation across layers?
      - → Presumably not, since we require the output of the matrix multiplication from one layer in order to compute the multiplications in subsequent layers
      - Is there a clever way of engineering this?
      - What about at the CUDA level?
    - Answer: No, PyTorch sets up one additional set of matrices for each layer of the LSTM you specify.
    - This is easily demonstrable with the code snippet shown below (PyTorch version 2.0.1):
from pprint import pprint
import torch

pprint({_ for _ in dir(torch.nn.LSTM(8, 16, num_layers=4, bidirectional=True)) if 'weight' in _})
# you get out something like the following - ignore the first few lines, considering only weight_*
{'_all_weights',
 '_flat_weight_refs',
 '_flat_weights',
 '_flat_weights_names',
 '_init_flat_weights',
 '_weights_have_changed',
 'all_weights',
 'weight_hh_l0',
 'weight_hh_l0_reverse',
 'weight_hh_l1',
 'weight_hh_l1_reverse',
 'weight_hh_l2',
 'weight_hh_l2_reverse',
 'weight_hh_l3',
 'weight_hh_l3_reverse',
 'weight_ih_l0',
 'weight_ih_l0_reverse',
 'weight_ih_l1',
 'weight_ih_l1_reverse',
 'weight_ih_l2',
 'weight_ih_l2_reverse',
 'weight_ih_l3',
 'weight_ih_l3_reverse'}
PyTorch Source Code
Convolution 1D
- Declaration in the C++ header file: pytorch/torch/csrc/api/include/torch/nn/modules/conv.h
- CPU Implementation: pytorch/torch/csrc/api/src/nn/modules/conv.cpp
- CUDA Implementation: pytorch/aten/src/ATen/native/Convolution.cpp
PyTorch Convolution 1D API:
torch.nn.functional.conv1d(input,
weight,
bias=None,
stride=1,
padding=0,
dilation=1,
groups=1
)
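A quick usage sketch of the functional API (shapes follow the (batch, channels, length) convention; the sizes and weights are illustrative):
import torch
import torch.nn.functional as F

x = torch.randn(8, 16, 100)        # (batch, in_channels, length)
weight = torch.randn(32, 16, 3)    # (out_channels, in_channels, kernel_size)
bias = torch.zeros(32)

out = F.conv1d(x, weight, bias, stride=1, padding=1)
print(out.shape)  # torch.Size([8, 32, 100]) - same length thanks to padding=1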
Legacy PyTorch
What is Tensor.data useful for in modern PyTorch code?
In modern PyTorch what use is a line like this:
text_lengths, mel_lengths = text_lengths.data, mel_lengths.data # TODO why?
In modern versions of PyTorch, this line of code:
text_lengths, mel_lengths = text_lengths.data, mel_lengths.data # TODO why?
assigns the .data attribute of the text_lengths and mel_lengths tensors to the variables text_lengths and mel_lengths, respectively.
In previous versions of PyTorch (before version 0.4), accessing the .data attribute of a tensor was a common way to detach the tensor from its computational graph. This operation returned a new tensor that shared the same underlying storage but had no history of how it was computed. This detachment was useful when you wanted to perform operations on the tensor without backpropagation or if you wanted to use the tensor in a context where autograd was not supported (e.g., some legacy operations or external libraries).
However, in modern PyTorch versions (0.4 and later), the preferred way to detach tensors and create new tensors without a computational graph is to use the detach() method. The detach() method returns a new tensor that shares the same storage but does not have a computational history. It is generally recommended to use detach() instead of .data for detaching tensors.
Therefore, if you are using a modern version of PyTorch, the line of code you provided might be outdated and can be rewritten as:
text_lengths = text_lengths.detach()
mel_lengths = mel_lengths.detach()
This ensures that text_lengths and mel_lengths are detached from the computational graph while following the recommended practice for tensor detachment.