Fixing DPO but I have a dinner reservation … – Kyunghyun Cho
Direct preference optimization (DPO; https://arxiv.org/abs/2305.18290) is all the rage, i heard. i also hear from my students that DPO, which minimizes the following loss, often results in weird behaviours, such as unreasonable preference toward lengthy responses (even when there is no statistical difference in lengths between desirable and undesirable responses). i won’t go into details of these issues, but i feel like there’s a relatively simple reason behind these pathologies based on basic calculus.
$$\ell(x, y_+, y_-; \theta) = -\log \sigma\!\left( \beta \log \frac{\pi_\theta(y_+ \mid x)}{\pi_0(y_+ \mid x)} - \beta \log \frac{\pi_\theta(y_- \mid x)}{\pi_0(y_- \mid x)} \right),$$

where $\pi_0$ is the so-called reference model from which $y_+$ and $y_-$ were drawn independently given $x$.
Let’s consider DPO with a fixed set of query-response triplets drawn from $p(x)$ and $\pi_0$. that is, $D = \{(x^n, y_+^n, y_-^n)\}_{n=1}^N$ with $x^n \sim p(x)$ and $y_+^n, y_-^n \sim \pi_0(y \mid x^n)$. Without loss of generality, i will always say that $y_+^n$ is more preferred than $y_-^n$. the overall loss is then:

$$L_{\mathrm{DPO}}(\theta) = \frac{1}{N} \sum_{n=1}^N \ell(x^n, y_+^n, y_-^n; \theta).$$
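in code, the per-triplet DPO loss being averaged here can be sketched as follows (a toy sketch; the function name, the argument convention, and the assumption that whole-sequence log-probabilities are precomputed are all mine, not from the paper):

```python
import math

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Per-triplet DPO loss:
    -log sigma(beta * ((log pi_theta(y+|x) - log pi_0(y+|x))
                       - (log pi_theta(y-|x) - log pi_0(y-|x))))."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigma(margin)

# when pi_theta agrees with pi_0 on both responses, the margin is zero
# and the loss is exactly log 2
print(dpo_loss(-3.0, -3.0, -3.0, -3.0))  # ≈ 0.6931
```

note that the loss only ever sees the two log-ratios; the absolute likelihoods of $y_+$ and $y_-$ under $\pi_\theta$ never enter, which is the seed of the issue discussed next.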
what’s the issue here? the issue is that updating $\theta$ by minimizing this loss does not necessarily lead to a $\pi_\theta$ from which we draw a good response. that is, there is no reason why $\pi_\theta(y_+^n \mid x^n) \geq \pi_\theta(y \mid x^n)$, where $(x^n, y_+^n, y_-^n) \in D$ and $y$ is an arbitrary sequence.
instead, a proper loss would be the following:

$$L_{\mathrm{proper}}(\theta) = \mathbb{E}_{x \sim p(x)}\, \mathbb{E}_{y_+, y_- \sim \pi_\theta(\cdot \mid x)} \left[ \ell(x, y_+, y_-; \theta) \right].$$
the main difference is that we are not using a fixed set of triplets drawn from $\pi_0$ but we use samples drawn from the latest model $\pi_\theta$. This makes perfect sense, since the responses we care about are those that we are more likely to draw from the trained model $\pi_\theta$. let’s now look at the gradient of this proper loss with respect to $\theta$:

$$\nabla_\theta L_{\mathrm{proper}}(\theta) = \mathbb{E}_{x \sim p(x)}\, \mathbb{E}_{y_+, y_- \sim \pi_\theta(\cdot \mid x)} \Big[ \nabla_\theta \ell + \ell \left( \nabla_\theta \log \pi_\theta(y_+ \mid x) + \nabla_\theta \log \pi_\theta(y_- \mid x) \right) \Big],$$
where we use a couple of tricks for computing the derivative, such as the product rule of differentiation and the log-derivative trick ($\nabla_\theta \pi_\theta(y \mid x) = \pi_\theta(y \mid x) \nabla_\theta \log \pi_\theta(y \mid x)$). we use $\ell$ as a short-hand notation of $\ell(x, y_+, y_-; \theta)$.
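this decomposition can be checked numerically on a toy categorical "policy" over a handful of candidate responses, where every expectation is computable by exact enumeration (all names and the finite-difference setup here are my own sanity check of the calculus, not an implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
K, beta, h = 4, 0.5, 1e-5
theta = rng.normal(size=K)   # logits of the toy pi_theta over K responses
ref = rng.normal(size=K)     # logits of the fixed reference pi_0

def log_softmax(z):
    return z - np.log(np.exp(z).sum())

def loss(th, i, j):
    """DPO loss with y+ = response i and y- = response j."""
    m = beta * ((log_softmax(th)[i] - log_softmax(ref)[i])
                - (log_softmax(th)[j] - log_softmax(ref)[j]))
    return -np.log(1.0 / (1.0 + np.exp(-m)))

def proper(th):
    """L(theta) = E_{y+, y- ~ pi_theta}[loss], by exact enumeration."""
    p = np.exp(log_softmax(th))
    return sum(p[i] * p[j] * loss(th, i, j)
               for i in range(K) for j in range(K))

def fd_grad(f):
    """Central finite-difference gradient of a scalar function at theta."""
    return np.array([(f(theta + h * np.eye(K)[k]) - f(theta - h * np.eye(K)[k]))
                     / (2 * h) for k in range(K)])

# left-hand side: gradient of the proper loss, computed numerically
lhs = fd_grad(proper)

# right-hand side: E[ grad loss + loss * (grad log pi(y+) + grad log pi(y-)) ]
p = np.exp(log_softmax(theta))
def grad_logp(i):
    g = -p.copy()
    g[i] += 1.0   # gradient of log softmax(theta)[i] w.r.t. theta
    return g

rhs = sum(p[i] * p[j] * (fd_grad(lambda th: loss(th, i, j))
                         + loss(theta, i, j) * (grad_logp(i) + grad_logp(j)))
          for i in range(K) for j in range(K))

assert np.allclose(lhs, rhs, atol=1e-6)
```

the two score-function terms in the right-hand side are exactly the likelihood terms discussed below; dropping them (as plain DPO on a fixed dataset implicitly does) leaves a biased picture of the proper gradient.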
what is interesting is that we automatically end up with two types of loss terms. the first one is the usual DPO loss. the second one is the likelihood of both desirable and undesirable responses, weighted by the DPO loss. the second one is extremely important, since it ensures that we are more likely to sample responses for which the first one (DPO) was optimized, after training.
now, this proper DPO loss (perhaps i can call it PDPO, since i was told we must name every single math formula in an obscure way) is not easy to minimize, as we must be able to determine which of an arbitrary pair of responses given the query is more desirable. if $y$ is a molecular description, we would need to synthesize both molecules and experiment with them to tell which is better. in other words, this PDPO loss is more readily usable when we have a ready and cheap way to tell the preference.
we can instead use importance sampling with the fixed set of the preference triplets $D$:

$$\nabla_\theta L_{\mathrm{proper}}(\theta) \approx \frac{1}{N} \sum_{n=1}^N \frac{\pi_\theta(y_+^n \mid x^n)\, \pi_\theta(y_-^n \mid x^n)}{\pi_0(y_+^n \mid x^n)\, \pi_0(y_-^n \mid x^n)} \Big[ \nabla_\theta \ell^n + \ell^n \left( \nabla_\theta \log \pi_\theta(y_+^n \mid x^n) + \nabla_\theta \log \pi_\theta(y_-^n \mid x^n) \right) \Big],$$

where $\ell^n$ is short for $\ell(x^n, y_+^n, y_-^n; \theta)$.
the importance weights, $\frac{\pi_\theta(y_+^n \mid x^n)\, \pi_\theta(y_-^n \mid x^n)}{\pi_0(y_+^n \mid x^n)\, \pi_0(y_-^n \mid x^n)}$, say that we should rely on a pre-collected preference triplet only if it is reasonably likely under the current model $\pi_\theta$. this makes sense, as we care about the examples that are more likely to be drawn from the current model. unfortunately, this approach is not ideal, since the quality of each preference triplet becomes worse as $\pi_\theta$ drifts away from $\pi_0$.
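the reweighting identity behind this estimator is easy to verify on a toy example where both models are small categorical distributions (the setup and names are my own illustration, not from the post):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 5
p_theta = rng.dirichlet(np.ones(K))  # current model over K candidate responses
p_ref = rng.dirichlet(np.ones(K))    # reference model pi_0 that generated the triplets
f = rng.normal(size=(K, K))          # any per-pair quantity, e.g. one gradient component

# importance weight for a pair (y+, y-) drawn independently from pi_0
w = np.outer(p_theta, p_theta) / np.outer(p_ref, p_ref)

# expectation under pi_0 of the reweighted quantity ...
reweighted = (np.outer(p_ref, p_ref) * w * f).sum()
# ... equals the expectation we actually want, under pi_theta
target = (np.outer(p_theta, p_theta) * f).sum()
assert np.isclose(reweighted, target)
```

the identity is exact in expectation; the trouble described next is entirely about the variance of its finite-sample version.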
so, what should we do? we should (1) draw triplets from the current model $\pi_\theta$ as frequently as possible, based on the available resources and constraints, and (2) use importance sampling to correct the gradient rather than updating the parameters with the original gradient as-is. unfortunately, (1) is very difficult because it’s often really costly to measure the preference between two responses. (2) is also difficult, because the variance of this estimator will blow up quite rapidly as $\pi_\theta$ quickly evolves away from $\pi_0$.
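the variance blow-up in (2) can be made concrete with a quick effective-sample-size calculation: as a toy model’s logits drift away from the reference, the fraction of pre-collected samples that still carry useful weight collapses (again my own illustration, with hypothetical names):

```python
import numpy as np

rng = np.random.default_rng(2)
K = 50
ref_logits = rng.normal(size=K)
p_ref = np.exp(ref_logits) / np.exp(ref_logits).sum()

def ess_fraction(p_cur):
    """Normalized effective sample size of weights w = p_cur/p_ref under p_ref.
    Equals 1 / E_ref[w^2], since E_ref[w] = 1; it is 1.0 iff p_cur == p_ref."""
    w = p_cur / p_ref
    return 1.0 / (p_ref * w ** 2).sum()

drift = rng.normal(size=K)
fracs = []
for t in [0.0, 0.5, 1.0, 2.0]:   # pi_theta drifts further from pi_0 as t grows
    logits = ref_logits + t * drift
    p_cur = np.exp(logits) / np.exp(logits).sum()
    fracs.append(ess_fraction(p_cur))

print(fracs)  # starts at 1.0 and shrinks as the model drifts away
```

this is the usual tension in off-policy estimation: the further the sampler is from the target, the fewer samples effectively contribute, and the noisier the gradient estimate becomes.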
do i have any empirical evidence? unfortunately i have a dinner reservation and need to leave for the restaurant shortly.
Acknowledgement: i’d like to thank Richard Pang, who’s a graduating PhD student at NYU, for spending a couple of hours late afternoon on Friday to hear my rant and (then-incorrect) derivation. Also, i thank Weizhe Yuan and Angie Chen for keeping me up-to-date on the mysteries and magic people perform each day finetuning language models, which serves as the constant motivation for me to think about this problem.