
If you think you need to spend $2,000 on a 180-day program to become a data scientist, then listen to me for a minute.

I understand that learning data science can be really challenging, especially when you’re just starting out, because you don’t know what you need to know.

But it doesn’t have to be this way.

That’s why I spent weeks creating the perfect roadmap to help you land your first data science job.

Here’s what it contains:

  1. A 42-week roadmap with study resources
  2. 30+ practice problems for each topic
  3. A Discord community
  4. A resources hub that contains:
  • Free-to-read books
  • YouTube channels for data scientists
  • Free courses
  • Top GitHub repositories
  • Free APIs
  • List of data science communities to join
  • Project ideas
  • And much more…

If this sounds exciting, you can grab it right now by clicking here.

Now let’s get back to the blog:

1. Introduction

“Optimization is not just a process; it’s an art.”

If you’ve ever trained a neural network, you’ve likely encountered the famous optimizer.step() in PyTorch. It’s a line of code that seems simple — almost too simple.

But as with many things in deep learning, there’s a lot going on under the hood.

Understanding this function is critical if you’re aiming to debug training anomalies, tweak optimization strategies, or build custom optimizers for cutting-edge research.

Here’s the deal: most tutorials introduce optimizer.step() as part of the training loop and quickly move on. But if you’re here, you’re not looking for a surface-level explanation.

This guide dives deep into how optimizer.step() works, why it’s essential, and how you can leverage its mechanics for advanced workflows.

Whether you’re troubleshooting vanishing gradients, experimenting with new optimization techniques, or fine-tuning your model for production, mastering this one line of code can make all the difference.

This guide assumes you’ve worked with PyTorch before and are familiar with basic training loops. Together, we’ll go beyond the basics and uncover what happens when optimizer.step() gets called — and why that matters.

2. Optimizer Basics Recap (Advanced Context)

“To optimize something, you first need to know what you’re optimizing.”

Before we dissect optimizer.step(), let’s revisit the optimizer’s role in training — but from an advanced perspective.

The optimizer’s job is straightforward yet powerful: it updates your model’s parameters based on the gradients computed during backpropagation.

In practical terms, this means moving your model’s parameters closer to a configuration that minimizes the loss function.

Here’s the critical part: optimizer.step() is where this parameter update happens.

It’s the function that takes the gradients stored in param.grad and applies them to your model parameters according to the chosen optimization algorithm (e.g., SGD, Adam, RMSProp). While this might sound simple, understanding its mechanics is key for advanced debugging and customization.

Let’s place optimizer.step() in context with a minimal training loop:

Code Example: Minimal Training Loop

import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple model and data
model = nn.Linear(10, 1)  # Single-layer linear model
criterion = nn.MSELoss()  # Mean Squared Error Loss
optimizer = optim.SGD(model.parameters(), lr=0.01)  # Stochastic Gradient Descent

# Generate dummy data
inputs = torch.randn(32, 10)  # 32 samples, 10 features
targets = torch.randn(32, 1)  # 32 target values

# Forward pass
outputs = model(inputs)
loss = criterion(outputs, targets)  # Compute the loss

# Backward pass
loss.backward()  # Compute gradients for all parameters

# Update parameters
optimizer.step()  # Apply the gradients to update model weights
optimizer.zero_grad()  # Clear gradients for the next iteration

Here’s the thing: while this code gets the job done, it hides all the complexity. What happens in optimizer.step()?

How does it apply updates to the parameters? And why do we clear gradients afterward? These are the questions we’ll answer as we progress.

3. What Does optimizer.step() Do?

“So, what’s really happening when you call optimizer.step()?”

At a high level, this function is where your model’s parameters are adjusted. It takes the gradients computed during backpropagation and applies them to the parameters according to the optimizer’s update rules.

For example, in the case of SGD, this involves subtracting a scaled version of the gradient from the parameter value.

Under the Hood: Breaking It Down

Let’s use Stochastic Gradient Descent (SGD) as an example to understand the steps involved:

  1. Retrieve Parameters: The optimizer loops through all parameters registered with it.
  2. Apply Updates: Each parameter is updated using its gradient and the learning rate. Additional terms like momentum or weight decay might also factor into this step, depending on the optimizer.
  3. Maintain Internal State: Optimizers like Adam or RMSProp keep per-parameter state (e.g., moving averages of gradients) that is updated during this step.

Let’s translate this into code to see what happens manually:

Code Example: Manual SGD Update

# Loop through each parameter in the model
for param in model.parameters():
    if param.grad is not None:  # Ensure the parameter has a gradient
        # Update parameter using gradient and learning rate
        param.data -= 0.01 * param.grad  # Gradient step: param = param - lr * grad

Notice how straightforward this looks for vanilla SGD? However, when you use advanced optimizers like Adam, additional calculations such as momentum, adaptive learning rates, and bias correction come into play.

These complexities make optimizer.step() both powerful and opaque, which is why we’ll explore them in later sections.

Key Takeaway:

Understanding how parameters are updated, even for something as “basic” as SGD, is foundational for debugging training issues and creating custom optimizers. Stay tuned as we go deeper into optimizer internals and how they manage these updates.

By the end of these sections, your foundation in optimizer mechanics will be rock solid. Ready to dig into more advanced details? Let’s keep going!

4. Key Components Affected by optimizer.step()

“Every tiny detail matters when optimizing a neural network — gradients, parameters, and even hidden states.”

You might think optimizer.step() just updates your model’s parameters and moves on, but here’s the deal: this single function is intricately tied to three key components that shape your training process. Let’s break them down.

1. Gradients (param.grad)

Gradients are the lifeblood of optimization. When you call .backward(), PyTorch computes the gradients for each parameter and stores them in param.grad. These gradients represent the direction and magnitude of the change needed to minimize the loss.

When optimizer.step() runs, it uses these gradients to update the model’s parameters. Without them, the optimizer has no information on how to adjust the weights.

Example: Inspecting Gradients Before and After optimizer.step()

# Inspect gradients before step
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"Before step -> Parameter: {name}, Gradient norm: {param.grad.norm()}")

optimizer.step()  # Update parameters

# Inspect gradients after step (shouldn't change unless you call .backward() again)
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"After step -> Parameter: {name}, Gradient norm: {param.grad.norm()}")

Takeaway: Gradients flow in during backpropagation, and optimizer.step() reads them to compute the update, but it does not clear them. Forgetting to zero them with optimizer.zero_grad() after the step causes gradients to accumulate across iterations, a common pitfall.

2. Parameter Groups

Here’s something you might not know: optimizers can manage multiple groups of parameters, each with its own hyperparameters. This feature is invaluable when fine-tuning models or applying layer-specific learning rates.

Code Example: Parameter Groups in Action

# Assumes a model that exposes two submodules, e.g. model.layer1 and model.layer2
optimizer = torch.optim.SGD([
    {'params': model.layer1.parameters(), 'lr': 0.01},  # Layer 1 with custom learning rate
    {'params': model.layer2.parameters(), 'lr': 0.001}  # Layer 2 with a smaller learning rate
])

print(optimizer.state_dict())  # Shows groups and their settings

Why it matters: Parameter grouping allows you to tune specific layers differently, a technique often used in transfer learning. optimizer.step() ensures the correct hyperparameters are applied to each group.
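
As a quick illustration (a minimal sketch, assuming the two-group optimizer defined above), you can also read or modify each group's hyperparameters at runtime, which is essentially what learning rate schedulers do under the hood:

for i, group in enumerate(optimizer.param_groups):
    print(f"Group {i} learning rate: {group['lr']}")
    group['lr'] *= 0.5  # Halve this group's learning rate, e.g. after a validation plateau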

3. State Dictionary (state_dict)

This might surprise you: PyTorch optimizers maintain internal states for advanced optimization techniques. These states include moving averages for gradients (e.g., Adam), momentum buffers (e.g., SGD with momentum), and more.

Code Example: Inspecting state_dict

# Check optimizer's internal state
print("Optimizer State Dict:")
print(optimizer.state_dict())

You might see entries like:

  • exp_avg: Exponential moving average of gradients.
  • momentum_buffer: Buffers used for momentum updates.

These states are critical for reproducibility and resuming training from checkpoints. If you ever save and reload your optimizer, the state_dict is what gets stored and retrieved.
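
Here is a minimal checkpointing sketch (the file name checkpoint.pth and the dictionary keys are just illustrative choices) that stores and restores both the model and the optimizer state:

# Save model and optimizer state together in one checkpoint file
torch.save({
    'model_state': model.state_dict(),
    'optimizer_state': optimizer.state_dict(),
}, 'checkpoint.pth')

# ...later, restore both so momentum buffers and moving averages survive the restart
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state'])
optimizer.load_state_dict(checkpoint['optimizer_state'])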

5. How Popular Optimizers Implement Their Updates

“Not all optimizers are created equal.”

While the general goal of all optimizers is to update parameters, their internal mechanisms vary significantly. Let’s peek under the hood of three popular optimizers: SGD with momentum, Adam, and RMSProp.

1. SGD with Momentum

Momentum adds a fraction of the previous update to the current update, enabling smoother and faster convergence.

Pseudocode Implementation

lr, momentum = 0.01, 0.9  # Example hyperparameter values for this sketch

for param in model.parameters():
    if param.grad is not None:
        # Initialize momentum buffer if it doesn't exist
        if 'momentum_buffer' not in optimizer.state[param]:
            optimizer.state[param]['momentum_buffer'] = torch.zeros_like(param.grad)
        
        # Update momentum buffer
        momentum_buffer = optimizer.state[param]['momentum_buffer']
        momentum_buffer.mul_(momentum).add_(param.grad)
        
        # Apply update
        param.data -= lr * momentum_buffer

Momentum acts like rolling a ball down a hill — it builds speed and avoids getting stuck in small valleys.

2. Adam

Adam (Adaptive Moment Estimation) combines momentum and adaptive learning rates, making it one of the most widely used optimizers.

Pseudocode Implementation

lr, beta1, beta2, epsilon = 0.001, 0.9, 0.999, 1e-8  # Typical Adam hyperparameters

for param in model.parameters():
    if param.grad is not None:
        state = optimizer.state[param]

        # Initialize state
        if 'step' not in state:
            state['step'] = 0
            state['exp_avg'] = torch.zeros_like(param.grad)  # Exponential moving average
            state['exp_avg_sq'] = torch.zeros_like(param.grad)  # Exponential moving average of squared gradients

        # Update state
        state['step'] += 1
        state['exp_avg'].mul_(beta1).add_(param.grad, alpha=1 - beta1)  # exp_avg = beta1 * exp_avg + (1 - beta1) * grad
        state['exp_avg_sq'].mul_(beta2).addcmul_(param.grad, param.grad, value=1 - beta2)  # exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad^2

        # Bias correction
        bias_correction1 = 1 - beta1 ** state['step']
        bias_correction2 = 1 - beta2 ** state['step']
        exp_avg_corr = state['exp_avg'] / bias_correction1
        exp_avg_sq_corr = state['exp_avg_sq'] / bias_correction2

        # Parameter update
        param.data -= lr * exp_avg_corr / (exp_avg_sq_corr.sqrt() + epsilon)

Why Adam matters: It adapts the learning rate for each parameter, making it robust to noisy gradients.

3. RMSProp

RMSProp scales the learning rate by a moving average of squared gradients, which helps keep updates stable on noisy, non-stationary objectives.

Pseudocode Implementation

lr, alpha, epsilon = 0.01, 0.99, 1e-8  # Typical RMSProp hyperparameters

for param in model.parameters():
    if param.grad is not None:
        state = optimizer.state[param]

        # Initialize state
        if 'square_avg' not in state:
            state['square_avg'] = torch.zeros_like(param.grad)

        # Update moving average of squared gradients
        state['square_avg'].mul_(alpha).addcmul_(param.grad, param.grad, value=1 - alpha)

        # Update parameter
        param.data -= lr * param.grad / (state['square_avg'].sqrt() + epsilon)

6. Common Pitfalls and Debugging Techniques

“A small oversight can derail your training.”

Even experienced data scientists encounter issues with optimizers. Let’s explore a few common pitfalls and how to debug them.

1. Forgetting to Clear Gradients

If you don’t call optimizer.zero_grad(), gradients from the previous step accumulate, leading to incorrect updates.

How to Debug:

# Check gradients before clearing
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"Before zero_grad -> Parameter: {name}, Gradient norm: {param.grad.norm()}")

optimizer.zero_grad()  # Clear gradients

# Verify they are cleared
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"After zero_grad -> Parameter: {name}, Gradient norm: {param.grad.norm()}")

2. Unintended Gradient Scaling

Sometimes, gradients can be excessively large or small, causing instability. Use gradient clipping or normalization to mitigate this.

Example: Gradient Clipping

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
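
Placement matters: clipping modifies the gradients already stored in param.grad, so it has to run after loss.backward() and before optimizer.step(). A minimal sketch of where it fits in the loop, reusing the model, criterion, and optimizer from earlier:

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()  # Gradients are now stored in param.grad
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Clip them in place
optimizer.step()  # The update uses the clipped gradients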

3. Tracking Parameter Updates

If your model isn’t converging, check whether parameters are being updated correctly.

Debugging Parameter Updates

for name, param in model.named_parameters():
    if param.grad is not None:  # Skip parameters that did not receive a gradient
        print(f"Parameter: {name}, Value norm: {param.data.norm()}, Gradient norm: {param.grad.norm()}")

Mastering these aspects will save you countless hours of debugging and elevate your understanding of PyTorch optimization workflows. Let’s dive deeper into the next sections!

7. Extending optimizer.step(): Custom Optimizers

“Sometimes, the best tools are the ones you build yourself.”

While PyTorch provides a solid arsenal of built-in optimizers, there are times when you’ll need to go beyond what’s available.

Whether it’s introducing custom gradient clipping, experimenting with new learning rate schedules, or implementing a research paper’s novel optimizer, extending the optimizer.step() function is your way to create something tailored to your needs.

Here’s the deal: creating a custom optimizer is surprisingly simple in PyTorch. The Optimizer class provides the scaffolding — you just need to fill in the logic for how parameters should be updated. Let me show you step-by-step.

Custom Optimizer: A Minimal Example

Let’s start with a basic custom optimizer. Suppose you want to apply a simple gradient descent update, but with the added twist of gradient clipping.

Code Example: Custom Optimizer with Gradient Clipping

from torch.optim.optimizer import Optimizer

class CustomOptimizer(Optimizer):
    def __init__(self, params, lr=0.01, clip_value=1.0):
        defaults = {'lr': lr, 'clip_value': clip_value}
        super().__init__(params, defaults)

    def step(self, closure=None):
        # Iterate through each parameter group
        for group in self.param_groups:
            clip_value = group['clip_value']  # Retrieve gradient clipping value
            for param in group['params']:
                if param.grad is not None:
                    # Apply gradient clipping
                    param.grad.data.clamp_(-clip_value, clip_value)
                    # Standard gradient descent update
                    param.data -= group['lr'] * param.grad

How it works:

  1. Initialization (__init__): You define the optimizer’s hyperparameters, like learning rate (lr) and clipping value (clip_value).
  2. Parameter Update (step): For each parameter, gradients are clipped to a specific range before the parameter is updated.

Usage Example:

# Define model, loss, and optimizer
model = torch.nn.Linear(10, 1)
optimizer = CustomOptimizer(model.parameters(), lr=0.01, clip_value=0.5)

# Dummy data
inputs = torch.randn(32, 10)
targets = torch.randn(32, 1)

# Training loop
outputs = model(inputs)
loss = torch.nn.MSELoss()(outputs, targets)
loss.backward()  # Compute gradients
optimizer.step()  # Custom step applies updates with gradient clipping
optimizer.zero_grad()  # Clear gradients

This might surprise you: with just a few lines of code, you now have a working custom optimizer that clips gradients during the update step.

Going Deeper: Adaptive Learning Rates

Let’s make things more interesting. Imagine you want an optimizer that adjusts its learning rate dynamically based on the magnitude of the gradient.

Code Example: Custom Optimizer with Adaptive Learning Rates

class AdaptiveLearningRateOptimizer(Optimizer):
    def __init__(self, params, base_lr=0.01):
        defaults = {'base_lr': base_lr}
        super().__init__(params, defaults)

    def step(self, closure=None):
        for group in self.param_groups:
            base_lr = group['base_lr']
            for param in group['params']:
                if param.grad is not None:
                    # Scale learning rate by gradient magnitude
                    grad_norm = param.grad.norm() + 1e-8  # Avoid division by zero
                    adaptive_lr = base_lr / grad_norm
                    # Update parameter
                    param.data -= adaptive_lr * param.grad

Key Idea: The effective learning rate scales inversely with the gradient norm, so every update has roughly the same magnitude regardless of how large the raw gradient is. In effect, the gradient is normalized before being applied.
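
Usage mirrors the earlier custom optimizer; here is a quick sketch with the same dummy data:

model = torch.nn.Linear(10, 1)
optimizer = AdaptiveLearningRateOptimizer(model.parameters(), base_lr=0.01)

outputs = model(inputs)
loss = torch.nn.MSELoss()(outputs, targets)
loss.backward()
optimizer.step()       # Each parameter's step is scaled by the inverse of its gradient norm
optimizer.zero_grad()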

8. Real-World Use Cases and Best Practices

“Knowing when to use a tool is just as important as knowing how to build it.”

In practice, understanding optimizer.step() extends far beyond custom implementations. Let’s explore scenarios where this knowledge truly shines.

1. Multiple Optimizers for Different Parameter Groups

In some models, especially multi-part architectures like GANs or multi-task learning setups, you’ll often use different optimizers for different parts of the model.

Example: Using Multiple Optimizers

# Define model with multiple layers
class CustomModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(10, 20)
        self.layer2 = torch.nn.Linear(20, 1)

    def forward(self, x):
        return self.layer2(torch.relu(self.layer1(x)))

model = CustomModel()

# Separate optimizers for each layer
optimizer1 = torch.optim.Adam(model.layer1.parameters(), lr=0.001)
optimizer2 = torch.optim.SGD(model.layer2.parameters(), lr=0.01)

# Training loop
outputs = model(inputs)
loss = torch.nn.MSELoss()(outputs, targets)
loss.backward()

# Update each optimizer independently
optimizer1.step()  # Update layer1 parameters
optimizer2.step()  # Update layer2 parameters

When to use this:

  • Fine-tuning pre-trained models where certain layers require slower updates.
  • Complex models like GANs, where generator and discriminator require separate optimizers.

2. Lookahead Optimizers

Lookahead optimizers maintain a “fast” optimizer that makes frequent updates and a “slow” optimizer that periodically updates the parameters based on the fast optimizer’s progress.

Example Implementation: Lookahead Wrapper

class LookaheadOptimizer:
    """Minimal Lookahead wrapper: a fast inner optimizer plus periodic slow updates."""
    def __init__(self, optimizer, k=5, alpha=0.5):
        self.optimizer = optimizer  # The fast (inner) optimizer, e.g. Adam or SGD
        self.k = k                  # Number of fast steps between slow updates
        self.alpha = alpha          # Interpolation factor for the slow weights
        # Keep a copy of the slow weights, one list per parameter group
        self.slow_params = [[p.clone().detach() for p in group['params']]
                            for group in optimizer.param_groups]
        self.step_count = 0

    def step(self):
        self.optimizer.step()  # Perform fast optimizer step
        self.step_count += 1

        if self.step_count % self.k == 0:
            with torch.no_grad():
                for slow_group, group in zip(self.slow_params, self.optimizer.param_groups):
                    for slow_p, p in zip(slow_group, group['params']):
                        # Slow update: move the slow weights toward the current fast weights
                        slow_p += self.alpha * (p.data - slow_p)
                        # Reset the fast weights to the updated slow weights
                        p.data.copy_(slow_p)

    def zero_grad(self):
        self.optimizer.zero_grad()

Why it’s useful: Lookahead helps smooth out noisy updates from fast optimizers like Adam or SGD.
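
A quick usage sketch with the wrapper above, assuming the same model and dummy data from earlier:

base_optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # The fast inner optimizer
lookahead = LookaheadOptimizer(base_optimizer, k=5, alpha=0.5)

for _ in range(10):  # A few dummy iterations
    lookahead.zero_grad()
    loss = torch.nn.MSELoss()(model(inputs), targets)
    loss.backward()
    lookahead.step()  # Every 5th call also performs the slow (lookahead) update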

3. Gradient Accumulation

For memory-constrained environments, you might accumulate gradients over multiple batches before applying optimizer.step().

Code Example: Gradient Accumulation

accumulation_steps = 4
for i, (inputs, targets) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss = loss / accumulation_steps  # Scale loss
    loss.backward()  # Accumulate gradients

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()  # Apply gradients
        optimizer.zero_grad()  # Clear accumulated gradients

9. Summary

We’ve covered a lot, but here’s what you should take away:

  • Extending optimizer.step() unlocks endless possibilities for custom optimization strategies.
  • Use cases like gradient clipping, adaptive learning rates, and advanced techniques like Lookahead optimizers showcase how impactful this function is.
  • Debugging and customizing training workflows often revolve around a solid understanding of what optimizer.step() does and how to control it.

Next Steps: If you’re ready to dive deeper, explore PyTorch’s source code for optimizers like Adam or RMSProp — it’s a treasure trove of insights.

And remember, whether you’re building from scratch or tweaking existing code, understanding the internals of optimizer.step() can take your training workflows to the next level.