python - AdamW and Adam with weight decay - Stack Overflow

Excerpt

Is there any difference between torch.optim.Adam(weight_decay=0.01) and torch.optim.AdamW(weight_decay=0.01)? Link to the docs: torch.optim.


In Pytorch, the implementations of Adam and AdamW are different. In the Adam source code, weight decay is implemented as

grad = grad.add(param, alpha=weight_decay)  # i.e. grad = grad + weight_decay * param

whereas in the AdamW source code, it is implemented as

param.mul_(1 - lr * weight_decay)  # i.e. param = param * (1 - lr * weight_decay), in place

So in each iteration, Adam adds the decay term (weight_decay times the parameters from the previous iteration) to the gradient, which means the decay flows through the moment estimates. AdamW, on the other hand, shrinks the parameters directly by the decay factor, leaving the gradient untouched. The pseudocode from the documentation shows the difference clearly (the relevant lines are boxed), where lambda is the weight decay; a combined sketch of both update rules follows the pseudocode.

[pseudocode from the torch.optim documentation, with the weight-decay lines boxed]
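
To make the contrast concrete, here is a minimal sketch of one update step in each style, written against plain torch tensors. The function names adam_style_step and adamw_style_step are mine, and details such as amsgrad and maximize are omitted; this is not the actual PyTorch implementation, just the shape of it.

import torch

def adam_style_step(param, grad, exp_avg, exp_avg_sq, step,
                    lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    # Adam: fold the decay into the gradient, so it also feeds the moment estimates
    grad = grad.add(param, alpha=weight_decay)
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    denom = (exp_avg_sq / (1 - beta2 ** step)).sqrt_().add_(eps)
    param.addcdiv_(exp_avg / (1 - beta1 ** step), denom, value=-lr)

def adamw_style_step(param, grad, exp_avg, exp_avg_sq, step,
                     lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    # AdamW: shrink the parameters directly; the decay never touches the gradient
    param.mul_(1 - lr * weight_decay)
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    denom = (exp_avg_sq / (1 - beta2 ** step)).sqrt_().add_(eps)
    param.addcdiv_(exp_avg / (1 - beta1 ** step), denom, value=-lr)

With weight_decay=0 the two functions produce identical updates; with a nonzero value, Adam's decay term gets rescaled by the adaptive denominator while AdamW's does not, which is exactly what the boxed lines in the pseudocode highlight.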


However, in Keras the two optimizers differ only in their defaults: Adam has weight_decay=None while AdamW has weight_decay=0.004 (and in fact it cannot be None). Whenever weight_decay is not None, Adam behaves the same as AdamW. Both are subclasses of optimizer.Optimizer and their source code is almost identical; in particular, the variables updated in each iteration are the same. The only difference is that the handling of Adam’s weight_decay is deferred to the parent class, while AdamW’s weight_decay is defined in the AdamW class itself.
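
As a quick check (a sketch, assuming TensorFlow 2.11+ where weight_decay is an argument of the base Keras optimizer), applying the same gradients with Adam(weight_decay=0.004) and AdamW(weight_decay=0.004) should leave both variables at the same values:

import tensorflow as tf
import numpy as np

# identical starting weights and a fixed gradient for both optimizers
w_adam = tf.Variable([1.0, -2.0, 3.0])
w_adamw = tf.Variable([1.0, -2.0, 3.0])
grad = tf.constant([0.1, 0.2, -0.3])

adam = tf.keras.optimizers.Adam(learning_rate=0.01, weight_decay=0.004)
adamw = tf.keras.optimizers.AdamW(learning_rate=0.01, weight_decay=0.004)

for _ in range(10):
    adam.apply_gradients([(grad, w_adam)])
    adamw.apply_gradients([(grad, w_adamw)])

# with the same weight_decay, the two trajectories should coincide
np.testing.assert_allclose(w_adam.numpy(), w_adamw.numpy(), rtol=1e-6)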