• One of the important lessons from this release is the importance of multiple seeds when doing RL finetuning. With DPO or SFT, you can largely sweep over hyperparameters like learning rate and get great outcomes. With RL finetuning, you often need multiple seeds for each run. RL training is unstable, but once it finds a useful area to improve upon it really keeps cooking. Last night, we had about 1/2 the gains from RL, the last post-training stage, but woke up to a few extra points by just letting more seeds run longer. – Source: OLMo 2 and building effective teams for training language models
  • Common confusion, overstatement, or gotcha regarding the effect of KV caching: KV caching reduces the per-token complexity to linear in the length of the input sequence (i.e. how many tokens there are in the context), but the overall complexity of the self-attention mechanism over the whole sequence is still quadratic in the sequence length (see the sketch below)
    • see this complexity analysis from Grok and this explanation from ChatGPT in response to my query: “One thing I’ve never had clear in my mind is the claim that KV caching reduces the complexity of the standard attention mechanism to subquadratic. This strikes me as clearly wrong. It is a caching mechanism, which caches the previously computed key and value vectors, thus preventing re-computation of these quantities at subsequent time steps (we’re considering decoder-only models, i.e. autoregressive ones)”
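    • A minimal sketch of why both statements hold, in plain NumPy (all names and sizes are illustrative, not taken from the sources above): with a KV cache, decoding step t only computes the new token’s query/key/value and attends over the t cached key/value vectors, so the per-token cost is O(t), but summing 1 + 2 + … + T over T generated tokens still gives O(T²) total self-attention work
```python
import numpy as np

d = 8    # head dimension (arbitrary for the sketch)
T = 16   # number of decoding steps
rng = np.random.default_rng(0)

k_cache, v_cache = [], []   # KV cache: one key/value vector per past token
dot_products = 0            # count q·k dot products as a proxy for attention cost

for t in range(1, T + 1):
    # New token: compute its query/key/value once; append K and V to the cache
    # instead of recomputing them for every earlier position.
    q = rng.normal(size=d)
    k_cache.append(rng.normal(size=d))
    v_cache.append(rng.normal(size=d))

    # Attention for this step: the new query attends over all t cached keys.
    K = np.stack(k_cache)                # shape (t, d)
    V = np.stack(v_cache)                # shape (t, d)
    scores = K @ q / np.sqrt(d)          # t dot products -> O(t) for this step
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    _ = weights @ V                      # attention output for the new token

    dot_products += t                    # per-token cost is linear in context length

# Total over T tokens: 1 + 2 + ... + T = T(T+1)/2, i.e. still O(T^2).
print(dot_products, T * (T + 1) // 2)    # both print 136 for T = 16
```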