Title: Zephyr: Direct Distillation of LM Alignment
Authors: Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, Thomas Wolf
Published: 25th October 2023 (Wednesday) @ 19:25:16
Link: http://arxiv.org/abs/2310.16944v1

Abstract

We aim to produce a smaller language model that is aligned to user intent. Previous research has shown that applying distilled supervised fine-tuning (dSFT) on larger models significantly improves task accuracy; however, these models are unaligned, i.e. they do not respond well to natural prompts. To distill this property, we experiment with the use of preference data from AI Feedback (AIF). Starting from a dataset of outputs ranked by a teacher model, we apply distilled direct preference optimization (dDPO) to learn a chat model with significantly improved intent alignment. The approach requires only a few hours of training without any additional sampling during fine-tuning. The final result, Zephyr-7B, sets the state-of-the-art on chat benchmarks for 7B parameter models, and requires no human annotation. In particular, results on MT-Bench show that Zephyr-7B surpasses Llama2-Chat-70B, the best open-access RLHF-based model. Code, models, data, and tutorials for the system are available at https://github.com/huggingface/alignment-handbook.


Models & Code: https://github.com/huggingface/alignment-handbook.


Quick Notes

In this work, we consider the problem of aligning a small open LLM entirely through distillation. The main step is to utilize AI Feedback (AIF) from an ensemble of teacher models as preference data, and apply distilled direct preference optimization as the learning objective (Rafailov et al., 2023). We refer to this approach as dDPO. Notably, it requires no human annotation and no sampling during fine-tuning, unlike approaches such as proximal policy optimization (PPO) (Schulman et al., 2017). Moreover, by utilizing a small base LM, the resulting chat model can be trained in a matter of hours on 16 A100s (80GB).

To validate this approach, we construct ZEPHYR-7B, an aligned version of Mistral-7B (Jiang et al., 2023).

Steps:

  1. We first use dSFT, based on the UltraChat (Ding et al., 2023) dataset.
  2. Next we use the AI feedback data collected in the UltraFeedback dataset (Cui et al., 2023).
  3. Finally, we apply dDPO based on this feedback data (a rough end-to-end sketch of the recipe follows this list).
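For concreteness, the sketch below walks through the three stages using the Hugging Face trl library. It is a rough illustration, not the paper's exact training setup: it assumes trl ~0.7 (the release contemporary with the paper; newer versions move several of these arguments into SFTConfig/DPOConfig), substitutes tiny toy datasets for UltraChat and UltraFeedback, and uses illustrative output paths and hyperparameters.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer, DPOTrainer

base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token

# Step 1: dSFT on teacher-generated dialogues (toy stand-in for UltraChat).
sft_data = Dataset.from_dict({
    "text": ["<|user|>\nWhat is distillation?\n<|assistant|>\nDistillation trains a student model on a teacher's outputs."]
})
sft_trainer = SFTTrainer(
    model=AutoModelForCausalLM.from_pretrained(base),
    train_dataset=sft_data,
    dataset_text_field="text",
    max_seq_length=2048,
    tokenizer=tokenizer,
    args=TrainingArguments(output_dir="zephyr-sft", num_train_epochs=1),
)
sft_trainer.train()
sft_trainer.save_model("zephyr-sft")

# Steps 2-3: dDPO on AI-feedback preference triples (toy stand-in for UltraFeedback).
pref_data = Dataset.from_dict({
    "prompt": ["What is distillation?"],
    "chosen": ["Distillation trains a student model to imitate a stronger teacher."],
    "rejected": ["I am not sure."],
})
dpo_trainer = DPOTrainer(
    model=AutoModelForCausalLM.from_pretrained("zephyr-sft"),      # initialized from the dSFT checkpoint
    ref_model=AutoModelForCausalLM.from_pretrained("zephyr-sft"),  # frozen dSFT reference
    beta=0.1,
    train_dataset=pref_data,
    tokenizer=tokenizer,
    args=TrainingArguments(output_dir="zephyr-dpo", num_train_epochs=1, remove_unused_columns=False),
)
dpo_trainer.train()
```

The actual training configs used for Zephyr-7B are in the alignment-handbook repository linked above; this sketch only mirrors the shape of the recipe.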

Experiments show that this 7B-parameter model can achieve performance comparable to 70B-parameter chat models aligned with human feedback.

Results show improvements both in terms of standard academic benchmarks as well as benchmarks that take into account conversational capabilities. Analysis shows that the use of preference learning is critical in achieving these results.

We note an important caveat for these results. We are primarily concerned with intent alignment of models for helpfulness. The work does not consider safety aspects of the models, such as whether they produce harmful outputs or provide illegal advice (Bai et al., 2022). Because distillation only works with the outputs of publicly available models, curating that type of synthetic safety data is technically more challenging, and it remains an important subject for future work.

Method

Distilled Supervised Finetuning (dSFT)

They follow the self-instruct protocol (Wang et al., 2023, Self-Instruct: Aligning Language Models with Self-Generated Instructions): seed prompts are passed to a teacher LM, which is used to build a diverse SFT dataset by iteratively refining both the instructions and the responses.
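To make the protocol concrete, here is a minimal sketch of that construction loop. The query_teacher helper is a placeholder for whatever teacher LM is used (e.g. an API call), and the refinement prompt wording is an illustration, not taken from the paper.

```python
def query_teacher(prompt: str) -> str:
    """Placeholder for a call to the teacher LM (e.g. GPT-3.5/GPT-4 behind an API)."""
    raise NotImplementedError

def build_dsft_dataset(seed_prompts, n_rounds: int = 3):
    """Self-instruct-style loop: the teacher answers an instruction, then refines it."""
    dataset = []
    for x in seed_prompts:
        for _ in range(n_rounds):
            y = query_teacher(x)  # teacher response to the current instruction
            dataset.append({"instruction": x, "response": y})
            # Ask the teacher to refine the instruction given the exchange so far.
            x = query_teacher(
                f"Given the instruction:\n{x}\n\nand the response:\n{y}\n\n"
                "write a new, more specific follow-up instruction on the same topic."
            )
    return dataset
```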

AI Feedback through Preferences (AIF; “RLAIF”)

They follow the approach of UltraFeedback (Cui et al., 2023, UltraFeedback: Boosting Language Models with Scaled AI Feedback):

  1. Generate responses to each prompt from various models (Llama, Claude, Falcon).
  2. Rate and rank these responses with GPT-4.
  3. Select triples of (prompt, top-scoring response, random lower-scoring response).

These triples form the preference dataset used in the dDPO step below; no RL training is performed on them (a small sketch of the triple selection follows).
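As a small illustration of the triple construction, the sketch below picks the best-scored completion as the "chosen" response and a random lower-scored one as "rejected". The input layout (a list of (response, GPT-4 score) pairs) is an assumption for illustration, not the UltraFeedback schema.

```python
import random

def select_triple(prompt: str, scored_responses):
    """Build a (prompt, chosen, rejected) preference triple from scored completions.

    `scored_responses`: list of (response_text, gpt4_score) pairs, one per model in the pool.
    """
    ranked = sorted(scored_responses, key=lambda pair: pair[1], reverse=True)
    chosen, _ = ranked[0]                    # top-scoring response -> y_w
    rejected, _ = random.choice(ranked[1:])  # random lower-scoring response -> y_l
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Example:
# select_triple("Explain DPO.", [("DPO optimizes ...", 9.0), ("No idea.", 2.0), ("It is RL.", 4.0)])
```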

Distilled Direct Preference Optimization (dDPO)

Objective/Goal: Train the model to prefer y_w over y_l, using the language model itself (its parameters) as the implicit reward function.

The goal of the final step is to refine the dSFT model π_dSFT by maximizing the likelihood that the preferred response y_w is ranked over y_l in a preference model. The preference model is determined by a reward function r_θ(x, y) which utilizes the student language model π_θ. Past work using AI feedback has primarily focused on using RL methods such as proximal policy optimization (PPO) to optimize θ with respect to this reward. These approaches optimize θ by first training the reward and then sampling from the current policy to compute updates. Direct preference optimization (DPO) uses a simpler approach to directly optimize the preference model from the static data (Rafailov et al., 2023, Direct Preference Optimization: Your Language Model is Secretly a Reward Model). The key observation is to derive the optimal reward function in terms of the optimal LLM policy π_* and the original LLM policy π_dSFT. Under an appropriate choice of preference model they show, for constant β and partition function Z(x), that

$$ r^*(x, y) = \beta \log \frac{\pi_*(y \mid x)}{\pi_{\mathrm{dSFT}}(y \mid x)} + \beta \log Z(x) $$

By plugging this function of the reward into the preference model, the authors show that the objective can be written as

$$ \pi_\theta = \max_{\pi} \; \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \, \log \sigma\!\left( \beta \log \frac{\pi(y_w \mid x)}{\pi_{\mathrm{dSFT}}(y_w \mid x)} - \beta \log \frac{\pi(y_l \mid x)}{\pi_{\mathrm{dSFT}}(y_l \mid x)} \right) $$
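Written out in code, the objective above reduces to a few lines. This is a minimal PyTorch sketch, assuming the sequence-level log-probabilities of the chosen and rejected responses under the policy and under the frozen dSFT reference have already been computed; the default beta is only a typical value, not the paper's setting.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigma(beta * [(log pi(y_w|x) - log pi_dSFT(y_w|x)) - (log pi(y_l|x) - log pi_dSFT(y_l|x))])."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```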

While this term looks complex, we note that it implies a simple training procedure. Starting with the dSFT version of the model, we iterate through each AIF triple (x, y_w, y_l) as follows (a code sketch of the loop follows the list).

  1. Compute the probability for (x, y_w) and (x, y_l) from the dSFT model (forward-only).
  2. Compute the probability for (x, y_w) and (x, y_l) from the dDPO model.
  3. Compute the objective above and backpropagate to update.
  4. Repeat (1-3)
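Putting steps 1-4 together, a minimal training-loop sketch looks as follows. It reuses the dpo_loss sketch above; the sequence_logprob helper, the batch field names, and the optimizer settings are assumptions for illustration (not the alignment-handbook implementation), and `policy`, `ref_model`, and `dataloader` are assumed to be set up elsewhere.

```python
import torch

def sequence_logprob(model, input_ids, attention_mask, response_mask):
    """Sum of log-probabilities of the response tokens of (x, y) under a causal LM."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    logprobs = torch.log_softmax(logits[:, :-1, :], dim=-1)   # predict token t from tokens < t
    targets = input_ids[:, 1:]
    token_logps = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (token_logps * response_mask[:, 1:]).sum(dim=-1)   # mask out prompt tokens

# `policy` is initialized from the dSFT checkpoint; `ref_model` is a frozen copy of it.
# `dataloader` yields tokenized (x, y_w) and (x, y_l) sequences for each preference triple.
optimizer = torch.optim.AdamW(policy.parameters(), lr=5e-7)
for batch in dataloader:
    with torch.no_grad():  # step 1: forward-only pass through the dSFT reference model
        ref_w = sequence_logprob(ref_model, batch["chosen_ids"], batch["chosen_attn"], batch["chosen_resp_mask"])
        ref_l = sequence_logprob(ref_model, batch["rejected_ids"], batch["rejected_attn"], batch["rejected_resp_mask"])
    # step 2: the same quantities under the model being optimized (the dDPO model)
    pol_w = sequence_logprob(policy, batch["chosen_ids"], batch["chosen_attn"], batch["chosen_resp_mask"])
    pol_l = sequence_logprob(policy, batch["rejected_ids"], batch["rejected_attn"], batch["rejected_resp_mask"])
    # step 3: compute the dDPO objective and update the policy
    loss = dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```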