If you think you need to spend $2,000 on a 180-day program to become a data scientist, then listen to me for a minute.
I understand that learning data science can be really challenging, especially when you’re just starting out, because you don’t know what you need to know.
But it doesn’t have to be this way.
That’s why I spent weeks creating the perfect roadmap to help you land your first data science job.
Here’s what it contains:
- A 42-week roadmap with study resources
- 30+ practice problems for each topic
- A discord community
- A resources hub that contains:
  - Free-to-read books
  - YouTube channels for data scientists
  - Free courses
  - Top GitHub repositories
  - Free APIs
  - List of data science communities to join
  - Project ideas
- And much more…
If this sounds exciting, you can grab it right now by clicking here.
Now let’s get back to the blog:
1. Introduction
“Optimization is not just a process; it’s an art.”
If you’ve ever trained a neural network, you’ve likely encountered the famous optimizer.step() in PyTorch. It’s a line of code that seems simple — almost too simple.
But as with many things in deep learning, there’s a lot going on under the hood.
Understanding this function is critical if you’re aiming to debug training anomalies, tweak optimization strategies, or build custom optimizers for cutting-edge research.
Here’s the deal: most tutorials introduce optimizer.step() as part of the training loop and quickly move on. But if you’re here, you’re not looking for a surface-level explanation.
This guide dives deep into how optimizer.step() works, why it’s essential, and how you can leverage its mechanics for advanced workflows.
Whether you’re troubleshooting vanishing gradients, experimenting with new optimization techniques, or fine-tuning your model for production, mastering this one line of code can make all the difference.
This guide assumes you’ve worked with PyTorch before and are familiar with basic training loops. Together, we’ll go beyond the basics and uncover what happens when optimizer.step() gets called — and why that matters.
2. Optimizer Basics Recap (Advanced Context)
“To optimize something, you first need to know what you’re optimizing.”
Before we dissect optimizer.step(), let’s revisit the optimizer’s role in training — but from an advanced perspective.
The optimizer’s job is straightforward yet powerful: it updates your model’s parameters based on the gradients computed during backpropagation.
In practical terms, this means moving your model’s parameters closer to a configuration that minimizes the loss function.
Here’s the critical part: optimizer.step() is where this parameter update happens.
It’s the function that takes the gradients stored in param.grad and applies them to your model parameters according to the chosen optimization algorithm (e.g., SGD, Adam, RMSProp). While this might sound simple, understanding its mechanics is key for advanced debugging and customization.
Let’s place optimizer.step() in context with a minimal training loop:
Code Example: Minimal Training Loop
import torch
import torch.nn as nn
import torch.optim as optim
# Define a simple model and data
model = nn.Linear(10, 1) # Single-layer linear model
criterion = nn.MSELoss() # Mean Squared Error Loss
optimizer = optim.SGD(model.parameters(), lr=0.01) # Stochastic Gradient Descent
# Generate dummy data
inputs = torch.randn(32, 10) # 32 samples, 10 features
targets = torch.randn(32, 1) # 32 target values
# Forward pass
outputs = model(inputs)
loss = criterion(outputs, targets) # Compute the loss
# Backward pass
loss.backward() # Compute gradients for all parameters
# Update parameters
optimizer.step() # Apply the gradients to update model weights
optimizer.zero_grad() # Clear gradients for the next iteration
Here’s the thing: while this code gets the job done, it hides all the complexity. What happens in optimizer.step()?
How does it apply updates to the parameters? And why do we clear gradients afterward? These are the questions we’ll answer as we progress.
3. What Does optimizer.step() Do?
“So, what’s really happening when you call optimizer.step()?”
At a high level, this function is where your model’s parameters are adjusted. It takes the gradients computed during backpropagation and applies them to the parameters according to the optimizer’s update rules.
For example, in the case of SGD, this involves subtracting a scaled version of the gradient from the parameter value.
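In symbols, that vanilla update is simply: new_param = old_param - learning_rate * gradient. Every optimizer covered below is a refinement of this rule.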
Under the Hood: Breaking It Down
Let’s use Stochastic Gradient Descent (SGD) as an example to understand the steps involved:
- Retrieve Parameters: The optimizer loops through all parameters registered with it.
- Apply Updates: Each parameter is updated using its gradient and the learning rate. Additional terms like momentum or weight decay might also factor into this step, depending on the optimizer.
- Maintain Internal State: Optimizers like Adam or RMSProp keep per-parameter state (e.g., moving averages) that is updated as part of this step.
Let’s translate this into code to see what happens manually:
Code Example: Manual SGD Update
# Loop through each parameter in the model
for param in model.parameters():
    if param.grad is not None:  # Ensure the parameter has a gradient
        # Update parameter using gradient and learning rate
        param.data -= 0.01 * param.grad  # Gradient step: param = param - lr * grad
Notice how straightforward this looks for vanilla SGD? However, when you use advanced optimizers like Adam, additional calculations such as momentum, adaptive learning rates, and bias correction come into play.
These complexities make optimizer.step() both powerful and opaque, which is why we’ll explore them in later sections.
Key Takeaway:
Understanding how parameters are updated, even for something as “basic” as SGD, is foundational for debugging training issues and creating custom optimizers. Stay tuned as we go deeper into optimizer internals and how they manage these updates.
By the end of these sections, your foundation in optimizer mechanics will be rock solid. Ready to dig into more advanced details? Let’s keep going!
4. Key Components Affected by optimizer.step()
“Every tiny detail matters when optimizing a neural network — gradients, parameters, and even hidden states.”
You might think optimizer.step() just updates your model’s parameters and moves on, but here’s the deal: this single function is intricately tied to three key components that shape your training process. Let’s break them down.
1. Gradients (param.grad)
Gradients are the lifeblood of optimization. When you call .backward(), PyTorch computes the gradients for each parameter and stores them in param.grad. These gradients represent the direction and magnitude of the change needed to minimize the loss.
When optimizer.step() runs, it uses these gradients to update the model’s parameters. Without them, the optimizer has no information on how to adjust the weights.
Example: Inspecting Gradients Before and After optimizer.step()
# Inspect gradients before step
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"Before step -> Parameter: {name}, Gradient norm: {param.grad.norm()}")

optimizer.step()  # Update parameters

# Inspect gradients after step (shouldn't change unless you call .backward() again)
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"After step -> Parameter: {name}, Gradient norm: {param.grad.norm()}")
Takeaway: Gradients flow in during backpropagation, and optimizer.step() uses them to compute the update, but it does not clear them. Forgetting to zero them with optimizer.zero_grad() after the step leads to gradient accumulation, a common pitfall.
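You can see this with a quick sanity check (a minimal sketch reusing the model, criterion, inputs, and targets from the earlier training loop): calling .backward() a second time without zeroing adds the new gradients on top of the old ones.

# Gradients accumulate if you skip zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
print(f"After one backward pass: {model.weight.grad.norm().item():.4f}")

loss = criterion(model(inputs), targets)
loss.backward()  # New gradients are ADDED to the existing ones
print(f"After two backward passes: {model.weight.grad.norm().item():.4f}")  # Roughly double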
2. Parameter Groups
Here’s something you might not know: optimizers can manage multiple groups of parameters, each with its own hyperparameters. This feature is invaluable when fine-tuning models or applying layer-specific learning rates.
Code Example: Parameter Groups in Action
optimizer = torch.optim.SGD([
    {'params': model.layer1.parameters(), 'lr': 0.01},   # Layer 1 with a custom learning rate
    {'params': model.layer2.parameters(), 'lr': 0.001},  # Layer 2 with a smaller learning rate
])
print(optimizer.state_dict()) # Shows groups and their settings
Why it matters: Parameter grouping allows you to tune specific layers differently, a technique often used in transfer learning. optimizer.step() ensures the correct hyperparameters are applied to each group.
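Because optimizer.step() reads the hyperparameters from each group at call time, you can also tweak them between steps. Here is a minimal sketch of manually decaying one group's learning rate (it assumes group index 1 holds layer2's parameters, as in the snippet above):

# param_groups is a plain list of dicts, so hyperparameters can be inspected and edited directly
for group in optimizer.param_groups:
    print(f"Group learning rate: {group['lr']}")

optimizer.param_groups[1]['lr'] *= 0.1  # Decay layer2's learning rate; applied on the next step()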
3. State Dictionary (state_dict)
This might surprise you: PyTorch optimizers maintain internal states for advanced optimization techniques. These states include moving averages for gradients (e.g., Adam), momentum buffers (e.g., SGD with momentum), and more.
Code Example: Inspecting state_dict
# Check optimizer's internal state
print("Optimizer State Dict:")
print(optimizer.state_dict())
You might see entries like:
- exp_avg: Exponential moving average of gradients.
- momentum_buffer: Buffers used for momentum updates.
These states are critical for reproducibility and resuming training from checkpoints. If you ever save and reload your optimizer, the state_dict is what gets stored and retrieved.
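In practice, that means a checkpoint should include the optimizer's state_dict alongside the model's, otherwise momentum buffers and moving averages are silently reset on resume. A minimal sketch (the 'checkpoint.pt' path is just a placeholder):

# Save model and optimizer together so internal optimizer state survives a restart
torch.save({
    'model_state': model.state_dict(),
    'optimizer_state': optimizer.state_dict(),
}, 'checkpoint.pt')

# Later: restore both before resuming training
checkpoint = torch.load('checkpoint.pt')
model.load_state_dict(checkpoint['model_state'])
optimizer.load_state_dict(checkpoint['optimizer_state'])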
5. Diving Deeper: Implementation for Popular Optimizers
“Not all optimizers are created equal.”
While the general goal of all optimizers is to update parameters, their internal mechanisms vary significantly. Let’s peek under the hood of three popular optimizers: SGD with momentum, Adam, and RMSProp.
1. SGD with Momentum
Momentum adds a fraction of the previous update to the current update, enabling smoother and faster convergence.
Pseudocode Implementation
lr = 0.01       # Learning rate
momentum = 0.9  # Momentum coefficient

for param in model.parameters():
    if param.grad is not None:
        # Initialize momentum buffer if it doesn't exist
        if 'momentum_buffer' not in optimizer.state[param]:
            optimizer.state[param]['momentum_buffer'] = torch.zeros_like(param.grad)
        # Update momentum buffer: buf = momentum * buf + grad
        momentum_buffer = optimizer.state[param]['momentum_buffer']
        momentum_buffer.mul_(momentum).add_(param.grad)
        # Apply update
        param.data -= lr * momentum_buffer
Momentum acts like rolling a ball down a hill — it builds speed and avoids getting stuck in small valleys.
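For comparison, the built-in optimizer implements the same logic (plus extras like weight decay and dampening) in a single constructor call:

# Built-in equivalent of the manual momentum update sketched above
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)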
2. Adam
Adam (Adaptive Moment Estimation) combines momentum and adaptive learning rates, making it one of the most widely used optimizers.
Pseudocode Implementation
lr, beta1, beta2, epsilon = 0.001, 0.9, 0.999, 1e-8  # Typical Adam hyperparameters

for param in model.parameters():
    if param.grad is not None:
        state = optimizer.state[param]
        # Initialize state
        if 'step' not in state:
            state['step'] = 0
            state['exp_avg'] = torch.zeros_like(param.grad)     # Exponential moving average of gradients
            state['exp_avg_sq'] = torch.zeros_like(param.grad)  # Exponential moving average of squared gradients
        # Update state
        state['step'] += 1
        state['exp_avg'].mul_(beta1).add_(param.grad, alpha=1 - beta1)                     # exp_avg = beta1 * exp_avg + (1 - beta1) * grad
        state['exp_avg_sq'].mul_(beta2).addcmul_(param.grad, param.grad, value=1 - beta2)  # exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad^2
        # Bias correction
        bias_correction1 = 1 - beta1 ** state['step']
        bias_correction2 = 1 - beta2 ** state['step']
        exp_avg_corr = state['exp_avg'] / bias_correction1
        exp_avg_sq_corr = state['exp_avg_sq'] / bias_correction2
        # Parameter update
        param.data -= lr * exp_avg_corr / (exp_avg_sq_corr.sqrt() + epsilon)
Why Adam matters: It adapts the learning rate for each parameter, making it robust to noisy gradients.
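For reference, the built-in version exposes these same quantities as constructor arguments:

# Built-in Adam with the hyperparameters used in the sketch above
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)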
3. RMSProp
RMSProp scales the learning rate by a moving average of squared gradients, which makes it well suited to noisy and non-stationary objectives.
Pseudocode Implementation
lr, alpha, epsilon = 0.01, 0.99, 1e-8  # Typical RMSProp hyperparameters

for param in model.parameters():
    if param.grad is not None:
        state = optimizer.state[param]
        # Initialize state
        if 'square_avg' not in state:
            state['square_avg'] = torch.zeros_like(param.grad)
        # Update moving average of squared gradients: avg = alpha * avg + (1 - alpha) * grad^2
        state['square_avg'].mul_(alpha).addcmul_(param.grad, param.grad, value=1 - alpha)
        # Update parameter
        param.data -= lr * param.grad / (state['square_avg'].sqrt() + epsilon)
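And the built-in counterpart, for comparison:

# Built-in RMSProp with the hyperparameters used in the sketch above
optimizer = optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99, eps=1e-8)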
6. Common Pitfalls and Debugging Techniques
“A small oversight can derail your training.”
Even experienced data scientists encounter issues with optimizers. Let’s explore a few common pitfalls and how to debug them.
1. Forgetting to Clear Gradients
If you don’t call optimizer.zero_grad(), gradients from the previous step accumulate, leading to incorrect updates.
How to Debug:
# Check gradients before clearing
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"Before zero_grad -> Parameter: {name}, Gradient norm: {param.grad.norm()}")

optimizer.zero_grad()  # Clear gradients

# Verify they are cleared (with set_to_none=True, the default in recent PyTorch, grads become None and nothing prints here)
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"After zero_grad -> Parameter: {name}, Gradient norm: {param.grad.norm()}")
2. Unintended Gradient Scaling
Sometimes, gradients can be excessively large or small, causing instability. Use gradient clipping or normalization to mitigate this.
Example: Gradient Clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
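Placement matters: clipping has to happen after .backward() (so the gradients exist) and before optimizer.step() (so the clipped values are what actually gets applied). A minimal sketch of where it sits in the loop:

loss.backward()                                           # 1. Compute gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # 2. Clip them in place
optimizer.step()                                          # 3. Apply the clipped gradients
optimizer.zero_grad()                                     # 4. Reset for the next batch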
3. Tracking Parameter Updates
If your model isn’t converging, check whether parameters are being updated correctly.
Debugging Parameter Updates
for name, param in model.named_parameters():
    if param.grad is not None:  # Guard: parameters without gradients would crash .norm()
        print(f"Parameter: {name}, Value norm: {param.data.norm()}, Gradient norm: {param.grad.norm()}")
Mastering these aspects will save you countless hours of debugging and elevate your understanding of PyTorch optimization workflows. Let’s dive deeper into the next sections!
7. Extending optimizer.step(): Custom Optimizers
“Sometimes, the best tools are the ones you build yourself.”
While PyTorch provides a solid arsenal of built-in optimizers, there are times when you’ll need to go beyond what’s available.
Whether it’s introducing custom gradient clipping, experimenting with new learning rate schedules, or implementing a research paper’s novel optimizer, extending the optimizer.step() function is your way to create something tailored to your needs.
Here’s the deal: creating a custom optimizer is surprisingly simple in PyTorch. The Optimizer class provides the scaffolding — you just need to fill in the logic for how parameters should be updated. Let me show you step-by-step.
Custom Optimizer: A Minimal Example
Let’s start with a basic custom optimizer. Suppose you want to apply a simple gradient descent update, but with the added twist of gradient clipping.
Code Example: Custom Optimizer with Gradient Clipping
from torch.optim.optimizer import Optimizer
class CustomOptimizer(Optimizer):
    def __init__(self, params, lr=0.01, clip_value=1.0):
        defaults = {'lr': lr, 'clip_value': clip_value}
        super().__init__(params, defaults)

    def step(self, closure=None):
        # Optionally re-evaluate the loss, as the base Optimizer API allows
        loss = closure() if closure is not None else None
        # Iterate through each parameter group
        for group in self.param_groups:
            clip_value = group['clip_value']  # Retrieve gradient clipping value
            for param in group['params']:
                if param.grad is not None:
                    # Apply gradient clipping
                    param.grad.data.clamp_(-clip_value, clip_value)
                    # Standard gradient descent update
                    param.data -= group['lr'] * param.grad
        return loss
How it works:
- Initialization (__init__): You define the optimizer’s hyperparameters, like learning rate (lr) and clipping value (clip_value).
- Parameter Update (step): For each parameter, gradients are clipped to a specific range before the parameter is updated.
Usage Example:
# Define model, loss, and optimizer
model = torch.nn.Linear(10, 1)
optimizer = CustomOptimizer(model.parameters(), lr=0.01, clip_value=0.5)
# Dummy data
inputs = torch.randn(32, 10)
targets = torch.randn(32, 1)
# Training loop
outputs = model(inputs)
loss = torch.nn.MSELoss()(outputs, targets)
loss.backward() # Compute gradients
optimizer.step() # Custom step applies updates with gradient clipping
optimizer.zero_grad() # Clear gradients
This might surprise you: with just a few lines of code, you now have a working custom optimizer that clips gradients during the update step.
Going Deeper: Adaptive Learning Rates
Let’s make things more interesting. Imagine you want an optimizer that adjusts its learning rate dynamically based on the magnitude of the gradient.
Code Example: Custom Optimizer with Adaptive Learning Rates
class AdaptiveLearningRateOptimizer(Optimizer):
    def __init__(self, params, base_lr=0.01):
        defaults = {'base_lr': base_lr}
        super().__init__(params, defaults)

    def step(self, closure=None):
        for group in self.param_groups:
            base_lr = group['base_lr']
            for param in group['params']:
                if param.grad is not None:
                    # Scale learning rate by gradient magnitude
                    grad_norm = param.grad.norm() + 1e-8  # Avoid division by zero
                    adaptive_lr = base_lr / grad_norm
                    # Update parameter
                    param.data -= adaptive_lr * param.grad
Key Idea: Gradients with smaller magnitudes receive larger updates, ensuring the optimizer adjusts adaptively to the parameter’s sensitivity.
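Usage looks exactly like any built-in optimizer, since the class follows the standard Optimizer interface. A quick sketch with dummy data:

model = torch.nn.Linear(10, 1)
optimizer = AdaptiveLearningRateOptimizer(model.parameters(), base_lr=0.01)

outputs = model(torch.randn(32, 10))
loss = torch.nn.MSELoss()(outputs, torch.randn(32, 1))
loss.backward()
optimizer.step()       # Applies the gradient-magnitude-scaled update
optimizer.zero_grad()  # Inherited from the base Optimizer class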
8. Real-World Use Cases and Best Practices
“Knowing when to use a tool is just as important as knowing how to build it.”
In practice, understanding optimizer.step() extends far beyond custom implementations. Let’s explore scenarios where this knowledge truly shines.
1. Multiple Optimizers for Different Parameter Groups
In some models, especially multi-part architectures like GANs or multi-task learning setups, you’ll often use different optimizers for different parts of the model.
Example: Using Multiple Optimizers
# Define a model with multiple layers
class CustomModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(10, 20)
        self.layer2 = torch.nn.Linear(20, 1)

    def forward(self, x):  # Forward pass is required for model(inputs) to work
        return self.layer2(torch.relu(self.layer1(x)))

model = CustomModel()

# Separate optimizers for each layer
optimizer1 = torch.optim.Adam(model.layer1.parameters(), lr=0.001)
optimizer2 = torch.optim.SGD(model.layer2.parameters(), lr=0.01)

# Training loop (inputs and targets as in the earlier dummy-data example)
outputs = model(inputs)
loss = torch.nn.MSELoss()(outputs, targets)
loss.backward()

# Update each optimizer independently, then clear both sets of gradients
optimizer1.step()  # Update layer1 parameters
optimizer2.step()  # Update layer2 parameters
optimizer1.zero_grad()
optimizer2.zero_grad()
When to use this:
- Fine-tuning pre-trained models where certain layers require slower updates.
- Complex models like GANs, where generator and discriminator require separate optimizers.
2. Lookahead Optimizers
Lookahead optimizers maintain a “fast” optimizer that makes frequent updates and a “slow” optimizer that periodically updates the parameters based on the fast optimizer’s progress.
Example Implementation: Lookahead Wrapper
class LookaheadOptimizer:
    """Minimal Lookahead sketch: wraps a 'fast' optimizer and keeps a copy of 'slow' weights."""
    def __init__(self, optimizer, k=5, alpha=0.5):
        self.optimizer = optimizer
        self.k = k
        self.alpha = alpha
        # One list of slow-weight copies per parameter group
        self.slow_params = [[p.clone().detach() for p in group['params']]
                            for group in optimizer.param_groups]
        self.step_count = 0

    def step(self):
        self.optimizer.step()  # Perform a fast optimizer step
        self.step_count += 1
        if self.step_count % self.k == 0:
            for slow_group, group in zip(self.slow_params, self.optimizer.param_groups):
                for slow_p, p in zip(slow_group, group['params']):
                    # Slow update: move the slow weights toward the fast weights...
                    slow_p += self.alpha * (p.data - slow_p)
                    # ...then reset the fast weights to the updated slow weights
                    p.data.copy_(slow_p)

    def zero_grad(self):
        self.optimizer.zero_grad()
Why it’s useful: Lookahead helps smooth out noisy updates from fast optimizers like Adam or SGD.
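Wiring it up is just a matter of wrapping an existing optimizer (a sketch using the wrapper defined above):

base_optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
optimizer = LookaheadOptimizer(base_optimizer, k=5, alpha=0.5)

# Inside the training loop, use it like any other optimizer
loss.backward()
optimizer.step()       # Fast Adam step; every k-th call also syncs the slow weights
optimizer.zero_grad()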
3. Gradient Accumulation
For memory-constrained environments, you might accumulate gradients over multiple batches before applying optimizer.step().
Code Example: Gradient Accumulation
accumulation_steps = 4
for i, (inputs, targets) in enumerate(dataloader):
outputs = model(inputs)
loss = criterion(outputs, targets)
loss = loss / accumulation_steps # Scale loss
loss.backward() # Accumulate gradients
if (i + 1) % accumulation_steps == 0:
optimizer.step() # Apply gradients
optimizer.zero_grad() # Clear accumulated gradients
9. Summary
We’ve covered a lot, but here’s what you should take away:
- Extending optimizer.step() unlocks endless possibilities for custom optimization strategies.
- Use cases like gradient clipping, adaptive learning rates, and advanced techniques like Lookahead optimizers showcase how impactful this function is.
- Debugging and customizing training workflows often revolve around a solid understanding of what optimizer.step() does and how to control it.
Next Steps: If you’re ready to dive deeper, explore PyTorch’s source code for optimizers like Adam or RMSProp — it’s a treasure trove of insights.
And remember, whether you’re building from scratch or tweaking existing code, understanding the internals of optimizer.step() can take your training workflows to the next level.