python - AdamW and Adam with weight decay - Stack Overflow
Excerpt
Is there any difference between torch.optim.Adam(weight_decay=0.01) and torch.optim.AdamW(weight_decay=0.01)? Link to the docs: torch.optim.
In Pytorch, the implementations of Adam and AdamW are different. In the Adam source code, weight decay is implemented as
grad = grad.add(param, alpha=weight_decay)   # grad <- grad + weight_decay * param
whereas in the AdamW source code, it is implemented as
param.mul_(1 - lr * weight_decay)   # param <- param * (1 - lr * weight_decay)
So in each iteration, Adam adds weight_decay times the parameters from the previous iteration to the gradient (i.e. weight decay acts as L2 regularization and passes through the adaptive moment estimates), whereas AdamW shrinks the parameters themselves by the factor 1 - lr * weight_decay, decoupled from the gradient-based update. The pseudocode in the PyTorch documentation shows this difference explicitly, where lambda is the weight decay.
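To make the placement concrete, here is a minimal single-tensor sketch of one optimizer step (not the actual PyTorch source; the helper names and the simplified bias correction are my own) showing where each variant applies weight_decay relative to the shared Adam machinery:

import torch

def _adam_update(param, grad, exp_avg, exp_avg_sq, step, lr, beta1, beta2, eps):
    # Shared Adam machinery: exponential moment estimates + bias-corrected step.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)               # first moment
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second moment
    denom = (exp_avg_sq / (1 - beta2 ** step)).sqrt().add_(eps)
    param.addcdiv_(exp_avg, denom, value=-lr / (1 - beta1 ** step))

def adam_style_step(param, grad, m, v, step, lr=1e-3, beta1=0.9,
                    beta2=0.999, eps=1e-8, weight_decay=0.01):
    # Adam: weight decay is folded into the gradient (L2 regularization),
    # so it also flows through the adaptive second-moment scaling.
    grad = grad.add(param, alpha=weight_decay)
    _adam_update(param, grad, m, v, step, lr, beta1, beta2, eps)

def adamw_style_step(param, grad, m, v, step, lr=1e-3, beta1=0.9,
                     beta2=0.999, eps=1e-8, weight_decay=0.01):
    # AdamW: parameters are decayed directly, decoupled from the gradient path.
    param.mul_(1 - lr * weight_decay)
    _adam_update(param, grad, m, v, step, lr, beta1, beta2, eps)

Because Adam's decayed term is rescaled by the adaptive second-moment denominator while AdamW's is not, the two updates are not equivalent in general.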
However, in Keras, even though the defaults differ (Adam has weight_decay=None, while AdamW has weight_decay=0.004 and in fact cannot have None), Adam is the same as AdamW whenever weight_decay is not None. Both are subclassed from optimizer.Optimizer, and in fact their source code is almost identical; in particular, the variables updated in each iteration are the same. The only difference is that Adam's weight_decay is deferred to the parent class, while AdamW's weight_decay is defined in the AdamW class itself.
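For reference, a minimal usage sketch on the Keras side (assuming a recent tf.keras / Keras 3 release in which the base Optimizer accepts weight_decay, per the source comparison above):

import keras

# With weight_decay supplied explicitly, both optimizers apply the same decay
# inside the shared base-class machinery; only the default value (None vs 0.004)
# and where the argument is declared differ.
adam  = keras.optimizers.Adam(learning_rate=1e-3, weight_decay=0.004)
adamw = keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=0.004)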