- Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
- The Warmup Trick for Training Deep Neural Networks
If your data set is highly differentiated, you can suffer from a sort of "early over-fitting". If your shuffled data happens to include a cluster of related, strongly-featured observations, your model's initial training can skew badly toward those features, or worse, toward incidental features that aren't truly related to the topic at all.
Warm-up is a way to reduce the primacy effect of the early training examples. Without it, you may need to run a few extra epochs to get the convergence desired, as the model un-trains those early superstitions.
Many training frameworks offer this as a command-line option. The learning rate is increased linearly over the warm-up period. If the target learning rate is `p` and the warm-up period is `n` iterations, then the first batch iteration uses `1*p/n` for its learning rate, the second uses `2*p/n`, and so on: iteration `i` uses `i*p/n`, until we hit the nominal rate at iteration `n`. This means that the first iteration gets only `1/n` of the primacy effect, which does a reasonable job of balancing that influence.
Note that the ramp-up is commonly on the order of one epoch, but it is occasionally longer for particularly skewed data, or shorter for more homogeneous distributions. You may want to adjust it depending on how functionally extreme your batches can become when the shuffling algorithm is applied to the training set.
Source: "What does 'learning rate warm-up' mean?"