🪴 Anil's Garden


Training language models to follow instructions with human feedback

19 Dec 2025 · 3 min read

lm
rlhf
paper
annotated
instruction-tuning
instructgpt

Title: Training language models to follow instructions with human feedback
Authors: Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe
Published: 4th March 2022 (Friday) @ 07:04:42
Link: http://arxiv.org/abs/2203.02155v1

Abstract

Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

Blog post from OpenAI: Aligning language models to follow instructions

The language modeling objective used for many recent large LMs (predicting the next token on a webpage from the internet) is different from the objective “follow the user’s instructions helpfully and safely” (Radford et al., 2019; Brown et al., 2020; Fedus et al., 2021; Rae et al., 2021; Thoppilan et al., 2022)
In that sense, the language modeling objective is misaligned
LMs should “act” in line with users’ intentions and be:
- honest
- helpful
- harmless
Focus on fine-tuning approaches to make GPT-3 follow a broad class of instructions
- RLHF
- PPO against a reward model trained on human preferences (see: Proximal Policy Optimization Algorithms)
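In RLHF setups of this kind, the quantity PPO maximises is usually the reward model's score minus a penalty for drifting away from the supervised model. Below is a minimal sketch of that per-sample reward. The function name, its inputs (a scalar reward-model score and the summed token log-probabilities of the sampled response under the current policy and the supervised baseline) and the default `beta` are all illustrative assumptions, not the paper's implementation.

```python
def kl_penalised_reward(rm_score, logp_policy, logp_sft, beta=0.02):
    """Reward-model score minus beta * log(pi_RL(y|x) / pi_SFT(y|x)).

    All arguments are hypothetical inputs for illustration:
      rm_score    -- scalar score from a trained reward model
      logp_policy -- log-probability of the response under the policy being trained
      logp_sft    -- log-probability of the same response under the supervised model
      beta        -- penalty coefficient (illustrative default)
    """
    log_ratio = logp_policy - logp_sft
    return rm_score - beta * log_ratio
```

When the policy assigns the response the same probability as the supervised model, the penalty vanishes and the reward is just the reward-model score. The more the policy over-weights a response relative to the supervised model, the more reward is subtracted.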
Method (a team of 40 contractors labels the data; contractors were selected via a screening test)
- Labellers write prompts (used alongside prompts submitted through the OpenAI API)
- (other) labellers write demonstrations of the desired responses
- (also) collect a dataset of human preference data (comparisons between model outputs) on a larger bank of prompts
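The comparison data from the last step is typically turned into a training signal for a reward model through a pairwise ranking loss. Below is a minimal sketch in plain Python. It assumes a hypothetical reward model has already given each of K responses to one prompt a scalar score, and that the scores are listed in the labeller's order, best first. The function name and inputs are illustrative, not the paper's code.

```python
import math
from itertools import combinations


def pairwise_ranking_loss(scores_best_first):
    """Average -log(sigmoid(r_w - r_l)) over all pairs from one ranked list.

    scores_best_first: hypothetical reward-model scores for K responses to
    the same prompt, ordered from most to least preferred by the labeller.
    """
    # combinations() preserves input order, so each pair is (winner, loser).
    pairs = list(combinations(scores_best_first, 2))
    if not pairs:
        raise ValueError("need at least two ranked responses")
    total = 0.0
    for r_w, r_l in pairs:
        d = r_w - r_l
        # -log(sigmoid(d)) == log(1 + exp(-d)); branch to avoid overflow.
        if d >= 0:
            total += math.log1p(math.exp(-d))
        else:
            total += -d + math.log1p(math.exp(d))
    return total / len(pairs)
```

The loss is small when the scores agree with the labeller's ranking and large when they contradict it, so minimising it pushes the reward model to score preferred responses higher.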
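Outcomes:
- Labellers prefer InstructGPT outputs to GPT-3 outputs on the paper's prompt distribution; even the 1.3B InstructGPT model is preferred to the 175B GPT-3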
