[D] Mixed Precision Training: Difference between BF16 and FP16 : r/MachineLearning
TL;DR: if you have the right hardware, use BF16 :-)
Both consume exactly the same amount of memory, since each encodes a number in 16 bits.
On recent Nvidia GPUs (the Ampere generation, like the A100 and the RTX 3090), tensor cores accelerate both of them. On older ones (like a V100 or a T4), bfloat16 is not supported, so life is easier because you have no choice. Google TPUs have supported BF16 for quite some time.
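If you are not sure what your GPU can do, PyTorch exposes a simple check (a minimal sketch; assumes a reasonably recent PyTorch build with CUDA):

```python
import torch

# BF16 tensor-core support is reported per device; it is True on Ampere
# and newer GPUs (A100, RTX 3090, ...) and False on V100 / T4.
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("BF16 supported:", torch.cuda.is_bf16_supported())
else:
    print("No CUDA device found.")
```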
The difference between them is the number of bits allocated to the exponent and the mantissa (see Wikipedia: https://en.wikipedia.org/wiki/Bfloat16_floating-point_format). FP16 has 5 exponent bits, so it can only encode numbers between roughly -65K and +65K. BF16 has 8 exponent bits like FP32, so it can encode numbers approximately as large as FP32 can.
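You can read these trade-offs straight off the format metadata; here is a small PyTorch sketch (numpy.finfo reports the same numbers for FP16/FP32):

```python
import torch

# FP16 trades range for precision, BF16 does the opposite.
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} max={info.max:.3e}  smallest_normal={info.tiny:.3e}  eps={info.eps:.3e}")

# torch.float16 : max ~6.55e+04 (the ~65K limit),     eps ~9.8e-04
# torch.bfloat16: max ~3.39e+38 (same range as FP32), eps ~7.8e-03 (much coarser)
# torch.float32 : max ~3.40e+38,                      eps ~1.2e-07
```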
During mixed precision training, when values are too large to be encoded in FP16 (above +65K or below -65K), a trick is applied to rescale the gradients. However, it seems that on very large models (the GPT-3 kind), this can make the network unstable.
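The trick is loss scaling: multiply the loss so the gradients stay inside FP16's representable range, then unscale them before the optimizer step. A minimal sketch with PyTorch's GradScaler (the model, batch, and optimizer below are just placeholders):

```python
import torch

model = torch.nn.Linear(512, 512).cuda()             # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()                  # only needed for FP16

for step in range(10):
    x = torch.randn(32, 512, device="cuda")           # placeholder batch
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).square().mean()
    scaler.scale(loss).backward()   # scale up so small gradients don't underflow
    scaler.step(optimizer)          # unscales the grads, skips the step on inf/nan
    scaler.update()                 # adjusts the scale factor for the next step
```

With BF16 on supported hardware you can usually drop the GradScaler entirely and just autocast with dtype=torch.bfloat16, since the range matches FP32.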
BF16 is not perfect either, as it is much less precise than FP32. One bad thing that can happen is that a value very close to 0 can't be encoded and gets rounded to 0 (FP16 has this problem too, and BF16's shorter mantissa makes rounding coarser in general). That's an issue when, for instance, you later plan to divide something by that 0 :-)
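A tiny illustration of the rounding problems (the exact thresholds depend on each format's smallest representable values):

```python
import torch

# Underflow: a tiny value gets flushed to zero in FP16...
tiny = torch.tensor(1e-8)
print(tiny.to(torch.float16))     # tensor(0., dtype=torch.float16)

# ...and dividing by it afterwards blows up.
print(torch.tensor(1.0, dtype=torch.float16) / tiny.to(torch.float16))  # tensor(inf, ...)

# BF16's short mantissa: small relative differences simply disappear.
print(torch.tensor(1.001).to(torch.bfloat16))   # tensor(1., ...), the 0.001 is lost
```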
Another bad thing in real life is that your model may contain large values, so it may require some work if you plan to run inference on hardware that doesn't support BF16. It's still doable: for instance, Google's T5 model is known to need extra work to run in FP16.
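If you are forced down to FP16 at inference time, one pragmatic first step is to scan the checkpoint for values that would overflow FP16 and keep those tensors (or the layers producing them) in FP32, or clamp activations to the FP16 range. A rough, hypothetical sketch (`model` is just a placeholder for your BF16-trained network):

```python
import torch

fp16_max = torch.finfo(torch.float16).max    # ~65504

model = torch.nn.Linear(512, 512)            # placeholder for a BF16-trained checkpoint

# Flag parameters whose magnitude would overflow when cast down to FP16.
for name, param in model.named_parameters():
    biggest = param.detach().abs().max().item()
    if biggest > fp16_max:
        print(f"{name}: max |value| {biggest:.3e} overflows FP16, keep it in FP32")
```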