Title: Instruction Tuning for Large Language Models: A Survey
Authors: Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, Guoyin Wang
Published: 21st August 2023 (Monday) @ 15:35:16
Link: http://arxiv.org/abs/2308.10792v8

Abstract

This paper surveys research works in the quickly advancing field of instruction tuning (IT), also referred to as supervised fine-tuning (SFT) (in this paper, unless specified otherwise, SFT and IT are used interchangeably), a crucial technique to enhance the capabilities and controllability of large language models (LLMs). Instruction tuning refers to the process of further training LLMs on a dataset consisting of (instruction, output) pairs in a supervised fashion, which bridges the gap between the next-word prediction objective of LLMs and the users' objective of having LLMs adhere to human instructions. In this work, we make a systematic review of the literature, including the general methodology of SFT, the construction of SFT datasets, the training of SFT models, and applications to different modalities, domains, and settings, along with an analysis of aspects that influence the outcome of SFT (e.g., generation of instruction outputs, size of the instruction dataset, etc.). We also review the potential pitfalls of SFT and criticism against it, point out current deficiencies of existing strategies, and suggest some avenues for fruitful research. Project Page: github.com/xiaoya-li/Instruction-Tuning-Survey


  • Instruction tuning (IT) is used interchangeably with supervised fine-tuning (SFT) in this review
    • in reality I'd say these are different: instruction following is user-oriented, whereas SFT can be user-agnostic but still task-oriented, e.g. machine translation
    • task adaptation can still be achieved with IT, e.g. via a system prompt, I guess
  • enhance the capabilities and controllability of large language models
  • further training LLMs using (INSTRUCTION, OUTPUT) pairs (see the serialization sketch after this list), where
    • INSTRUCTION denotes the human instruction for the model, and
    • OUTPUT denotes the desired output that follows the INSTRUCTION
  • benefits of SFT:
    1. bridge gap between the next-word prediction objective and the users’ objectives
    2. more controllable and predictable model behaviour cf standard LLMs - instructions constrain model outputs to align with the desired response characteristics or domain knowledge
      • provides a channel for humans to intervene with the model’s behaviours
    3. SFT is computationally efficient and can help LLMs rapidly adapt to a specific domain without extensive retraining or architectural changes
  • problems:
    • need domain-specific supervised data - obvious
    • reduces output diversity
      • open question: how to interpret this as a change in the LM's distribution (truncation/deformation/other)
    • no fundamental task learning - SFT only impacts “surface-level” behaviour - question: really?
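
Concretely, most SFT pipelines serialize each (INSTRUCTION, OUTPUT) pair into one training sequence and mask the instruction tokens out of the loss, so only the OUTPUT tokens are supervised with the usual next-token objective. A minimal sketch, assuming an Alpaca-style template and the HuggingFace tokenizer API (the template wording and the helper name are mine, not from the survey):

```python
# Hypothetical serialization helper; template is Alpaca-style for illustration.
from transformers import AutoTokenizer

PROMPT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_example(tokenizer, instruction: str, output: str) -> dict:
    prompt_ids = tokenizer(PROMPT.format(instruction=instruction),
                           add_special_tokens=False)["input_ids"]
    output_ids = tokenizer(output + tokenizer.eos_token,
                           add_special_tokens=False)["input_ids"]
    # -100 masks the prompt out of the cross-entropy loss, so the model is
    # supervised only on the desired OUTPUT tokens.
    return {
        "input_ids": prompt_ids + output_ids,
        "labels": [-100] * len(prompt_ids) + output_ids,
    }

tok = AutoTokenizer.from_pretrained("gpt2")  # any causal-LM tokenizer works
ex = build_example(tok, "Translate to French: Hello, world.", "Bonjour, le monde.")
```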

Sections

  1. intro
  2. methodology of IT
  3. constructing IT datasets
  4. important IT’d models
  5. multimodality
  6. IT for domain adaptation
  7. efficiency
  8. evaluation of SFT models

Methodology (§2)

Creating good quality and diverse instructions to cover the intended use case(s) is not trivial.

Instruction Dataset Construction

Approach 1: Existing NLP datasets, e.g. NLI, reformatted from (text, label) pairs into (instruction, output) pairs

Examples: Flan, P3, Natural Instructions (see the human-crafted datasets in §3 below).
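
A hedged sketch of Approach 1: turning an NLI-style (text, label) instance into an (instruction, output) pair with a hand-written template. The template wording and label mapping are illustrative, not from any specific dataset (real collections like P3 crowdsource many such templates per task):

```python
# Illustrative NLI template and label map for converting classification
# instances into instruction-following examples.
NLI_TEMPLATE = (
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? "
    "Answer entailment, neutral, or contradiction."
)
LABELS = {0: "entailment", 1: "neutral", 2: "contradiction"}

def to_instruction_pair(example: dict) -> dict:
    return {
        "instruction": NLI_TEMPLATE.format(**example),
        "output": LABELS[example["label"]],
    }

pair = to_instruction_pair({
    "premise": "A man is playing a guitar.",
    "hypothesis": "A person is making music.",
    "label": 0,
})
```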

Approach 2: Generate outputs with an LLM. Write instructions by hand and/or generate them with an LLM starting from seed instructions (optionally providing a few examples)

Examples: Self-Instruct, Alpaca, Unnatural Instructions (see the synthetic datasets in §3 below).
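
A minimal Self-Instruct-style sketch of Approach 2, assuming a hypothetical `complete()` callable that wraps your LLM API; the seed instructions and the dedup here are illustrative (the real Self-Instruct pipeline also filters near-duplicates with ROUGE-L similarity):

```python
import random

SEEDS = [
    "Write a short poem about autumn.",
    "Explain the difference between a list and a tuple in Python.",
    "Summarize the plot of Hamlet in two sentences.",
]

def bootstrap_instructions(complete, pool, n_new=100, k_shots=3, max_tries=10_000):
    """Show the model a few existing instructions and ask for a new one;
    grow the pool iteratively. `complete` is a hypothetical text-completion
    callable (your LLM API), not a real library function."""
    target = len(pool) + n_new
    for _ in range(max_tries):
        if len(pool) >= target:
            break
        shots = random.sample(pool, k=min(k_shots, len(pool)))
        prompt = ("Come up with a new task instruction.\n"
                  + "".join(f"Instruction: {s}\n" for s in shots)
                  + "Instruction:")
        candidate = complete(prompt).strip()
        if candidate and candidate not in pool:  # crude exact-match dedup
            pool.append(candidate)
    return pool

# pool = bootstrap_instructions(my_llm_complete, list(SEEDS))
```

Outputs for the generated instructions are then produced by the same (or a stronger) LLM, yielding the final (instruction, output) pairs.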

IT/SFT Datasets (§3)

  1. Human crafted
    1. Natural Instructions (Mishra et al., 2021) - 193k instances; 61 NLP tasks - instructions and instances
    2. P3 (Public Pool of Prompts) (Sanh et al., 2021) - 170 English NLP datasets, 2,052 prompts (“task templates”). The authors built PromptSource to crowdsource prompt construction
    3. xP3 (Crosslingual Public Pool of Prompts) (Muennighoff et al., 2022) - multilingual: 46 languages; 16 natural language tasks - inputs (task description) and target (output text)
    4. Flan 2021 (Longpre et al., 2023) - English; 62 widely-used NLP benchmarks (e.g., SST-2, SNLI, AG News, MultiRC); input-target pairs
    5. LIMA (Zhou et al., 2023a) - English; train set: 1K (“instruction”, “response”) pairs; test set: 300 instances.
      1. training set: 75% are sampled from three community Q&A websites (Stack Exchange, wikiHow, and the Pushshift Reddit Dataset (Baumgartner et al., 2020)); 20% are manually written by a set of the authors (referred to as Group A), inspired by their interests; 5% are sampled from the Super-Natural Instructions dataset (Wang et al., 2022d).
      2. validation set: the authors sampled 50 instances from the Group A author-written set
      3. test set: 230 written by another group (Group B) of authors and 70 sampled from the Pushshift Reddit Dataset (Baumgartner et al., 2020), which is a collection of questions & answers within the Reddit community.
    6. Super Natural Instructions (Wang et al., 2022f) - 1,616 NLP tasks and 5M task instances, covering 76 distinct task types (e.g., text classification, information extraction, text rewriting, text composition, etc.) and 55 languages.
    7. Dolly (Conover et al., 2023a) - English; 15,000 human-generated data instances designed to enable LLMs to interact with users akin to ChatGPT. The dataset is designed for simulating a wide range of human behaviours, covering 7 specific types: open Q&A, closed Q&A, extracting information from Wikipedia, summarising information from Wikipedia, brainstorming, classification, and creative writing.
    8. OpenAssistant Conversations (Köpf et al., 2023) - human-crafted multilingual assistant-style conversation corpus consisting of 161,443 messages (i.e., 91,829 user prompts, 69,614 assistant replies) from 66,497 conversation trees in 35 languages, along with 461,292 human-annotated quality ratings. This is the one from Yannic (Kilcher).
  2. synthetic data via distillation
    1. Alpaca: A Strong, Replicable Instruction-Following Model
    2. WizardLM / Evol-Instruct
    3. Orca / Orca-2
    4. Baize
    5. ShareGPT
    6. WildChat
    7. Vicuna
    8. Unnatural Instructions
    9. WizardCoder, Magicoder, WaveCoder - for coding
    10. Phi-1, Phi-1.5 - for writing and reasoning
    11. Nectar - for ranking
  3. synthetic data via self-improvement
    1. Self-Instruct: Aligning Language Models with Self-Generated Instructions
    2. SPIN
    3. Self-Alignment with Instruction Backtranslation

Instruction Tuned LLMs (§4) - mostly refers to the earlier ones

Hotlist as of August 2023

  • InstructGPT
  • BLOOMZ
  • Flan-T5
  • Alpaca
  • Vicuna
  • Claude
  • Ones that are in their list of most important but I don’t know:
    • GPT-4-LLM (terrible name)
    • WizardLM
    • ChatGLM2
    • LIMA

A couple others I wanted to remember

Nous-Hermes (13B) is a large language model trained by fine-tuning LLaMA (13B) (Touvron et al., 2023a) on an instruction dataset containing over 300k instructions, sampled from GPTeacher, CodeAlpaca (Chaudhary, 2023), GPT-4-LLM (Peng et al., 2023), Unnatural Instructions (Honovich et al., 2022), and the Biology/Physics/Chemistry subsets of Camel-AI (Li et al., 2023c). Responses are generated by GPT-4. In evaluations, Nous-Hermes (13B) achieves performance comparable to GPT-3.5-turbo on multiple tasks like the ARC challenge (Clark et al., 2018) and BoolQ (Clark et al., 2019).

Noted this down because I always see Nous Research around on Twitter, mostly due to Teknium (e/λ)

TÜLU (6.7B) is a large language model trained by fine-tuning OPT (6.7B) (Zhang et al., 2022a) on a mixed instruction dataset containing FLAN V2 (Longpre et al., 2023), CoT (Wei et al., 2022), Dolly (Conover et al., 2023a), Open Assistant-1, GPT4-Alpaca, Code-Alpaca (Chaudhary, 2023), and ShareGPT. After fine-tuning, TÜLU (6.7B) reaches on average 83% of ChatGPT's performance and 68% of GPT-4's performance.

TÜLU is from AI2, in: How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources

Table 3: An overview of LLMs tuned on IT datasets

| Instruction fine-tuned LLM | # Params | Base Model | Self-built Trainset | Trainset Name | Trainset Size |
|---|---|---|---|---|---|
| InstructGPT (Ouyang et al., 2022) | 176B | GPT-3 (Brown et al., 2020b) | Yes | - | - |
| BLOOMZ (Muennighoff et al., 2022) | 176B | BLOOM (Scao et al., 2022) | No | xP3 | - |
| FLAN-T5 (Chung et al., 2022) | 11B | T5 (Raffel et al., 2019) | No | FLAN 2021 | - |
| Alpaca (Taori et al., 2023a) | 7B | LLaMA (Touvron et al., 2023a) | Yes | - | 52K |
| Vicuna (Chiang et al., 2023) | 13B | LLaMA (Touvron et al., 2023a) | Yes | - | 70K |
| GPT-4-LLM (Peng et al., 2023) | 7B | LLaMA (Touvron et al., 2023a) | Yes | - | 52K |
| Claude (Bai et al., 2022b) | - | - | Yes | - | - |
| WizardLM (Xu et al., 2023a) | 7B | LLaMA (Touvron et al., 2023a) | Yes | Evol-Instruct | 70K |
| ChatGLM2 (Du et al., 2022) | 6B | GLM (Du et al., 2022) | Yes | - | 1.1T tokens |
| LIMA (Zhou et al., 2023a) | 65B | LLaMA (Touvron et al., 2023a) | Yes | - | 1K |
| OPT-IML (Iyer et al., 2022) | 175B | OPT (Zhang et al., 2022a) | No | - | - |
| Dolly 2.0 (Conover et al., 2023a) | 12B | Pythia (Biderman et al., 2023) | No | - | 15K |
| Falcon-Instruct (Almazrouei et al., 2023a) | 40B | Falcon (Almazrouei et al., 2023b) | No | - | - |
| Guanaco (JosephusCheung, 2021) | 7B | LLaMA (Touvron et al., 2023a) | Yes | - | 586K |
| Minotaur (Collective, 2023) | 15B | Starcoder Plus (Li et al., 2023f) | No | - | - |
| Nous-Hermes (NousResearch, 2023) | 13B | LLaMA (Touvron et al., 2023a) | No | - | - |
| TÜLU (Wang et al., 2023d) | 6.7B | OPT (Zhang et al., 2022a) | No | Mixed | - |
| YuLan-Chat (YuLan-Chat-Team, 2023) | 13B | LLaMA (Touvron et al., 2023a) | Yes | - | 250K |
| MOSS (Tianxiang and Xipeng, 2023) | 16B | - | Yes | - | - |
| Airoboros (Durbin, 2023) | 13B | LLaMA (Touvron et al., 2023a) | Yes | - | - |
| UltraLM (Ding et al., 2023a) | 13B | LLaMA (Touvron et al., 2023a) | Yes | - | - |

Code or Weights Access to IT Models

  1. https://github.com/allenai/unifiedqa
  2. https://github.com/LAION-AI/Open-Instruction-Generalist
  3. https://github.com/hkunlp/unifiedskg
  4. https://github.com/allenai/natural-instructions-v1
  5. https://github.com/allenai/natural-instructions
  6. https://huggingface.co/datasets/bigscience/P3
  7. https://github.com/bigscience-workshop/xmtf
  8. https://github.com/google-research/FLAN
  9. https://github.com/BAAI-Zlab/COIG
  10. https://github.com/orhonovich/unnatural-instructions
  11. https://github.com/yizhongw/self-instruct
  12. https://github.com/XueFuzhao/InstructionWild
  13. https://github.com/nlpxucan/evol-instruct
  14. https://github.com/tatsu-lab/stanford_alpaca
  15. https://github.com/csitfun/LogiCoT
  16. https://huggingface.co/datasets/databricks/databricks-dolly-15k
  17. https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM
  18. https://huggingface.co/datasets/GAIR/lima
  19. https://huggingface.co/datasets/JosephusCheung/GuanacoDataset
  20. https://github.com/LAION-AI/Open-Assistant
  21. https://github.com/project-baize/baize-chatbot
  22. https://github.com/thunlp/UltraChat

Note on Claude

According to the survey, Claude was trained by:

  • IT-ing on instructions paired with responses generated by GPT-4
  • doing RL with comparative ratings supplied by GPT-4 - comparisons of responses from multiple LLMs incl. GPT-3
    • NB: this doesn't match Anthropic's own account (Bai et al., 2022b), which relies on human preference ratings, so I'd take the survey's description with a grain of salt

Claude is a language model trained by fine-tuning a pre-trained language model on an instruction dataset, aiming to generate helpful and harmless responses. The fine-tuning process consists of two stages: (1) supervised fine-tuning on the instruction dataset; the authors created this dataset by collecting 52K different instructions, paired with responses generated by GPT-4. The fine-tuning takes approximately eight hours on an 8-card 80GB A100 machine with mixed precision and fully sharded data parallelism. (2) Optimizing the stage-1 model with the proximal policy optimization (Schulman et al., 2017) method.

The authors first built a comparison dataset by collecting responses from multiple large language models (e.g., GPT-3 (Brown et al., 2020b)) to the given collection of instructions and then asking GPT-4 (OpenAI, 2023) to rate each response.

Using the ratings, a reward model is trained. Then, the fine-tuned model from Step 1 is optimized using the reward model with the proximal policy optimization method.
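
A minimal sketch (mine, not from the survey) of the pairwise loss such reward models are typically trained with; the Bradley-Terry form below is standard in RLHF pipelines, and `rm` stands for a hypothetical reward model that scores (prompt, response) pairs:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Maximise the log-probability that the response rated higher (here, by
    # GPT-4) also receives the higher scalar reward.
    # r_chosen, r_rejected: (batch,) reward-model scores.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# loss = pairwise_reward_loss(rm(prompt, better_resp), rm(prompt, worse_resp))
```

The trained reward model then supplies the scalar reward that PPO maximises in stage 2.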

Multimodal Instruction Tuning

Open-source / Open Access

  1. https://github.com/timothybrooks/instruct-pix2pix
  2. https://github.com/haotian-liu/LLaVA
  3. https://github.com/DAMO-NLP-SG/Video-LLaMA
  4. https://github.com/salesforce/LAVIS/tree/main/projects/instructblip
  5. https://github.com/Luodian/Otter
  6. https://github.com/open-mmlab/Multimodal-GPT

Multimodal Instruction Tuning Datasets

  • MULTIINSTRUCT (Xu et al., 2022) - multimodal seq2seq
  • PMC-VQA (Zhang et al., 2023c) - VQA
  • LAMM (Yin et al., 2023) - 2D image & 3D point cloud understanding
    • LAMM-Dataset includes data pairs for commonsense knowledge question answering by incorporating a hierarchical knowledge graph label system from the Bamboo (Zhang et al., 2022b) dataset and the corresponding Wikipedia description
  • Vision-Flan (Xu et al., 2024)
    • the largest publicly available human-annotated visual instruction tuning dataset, consisting of 1,664,261 instances and 200+ diverse vision-language tasks derived from 101 open-source computer vision datasets
  • ALLaVA (Chen et al., 2024a)
    • an open-source, extensive dataset tailored for fine-tuning visual question-answering models, featuring 1.4M entries that include detailed captions, intricate instructions, and comprehensive answers produced by GPT-4V
  • ShareGPT4V

Multimodal Instruction Tuned Models (§5)

  • InstructPix2Pix (983M) (Brooks et al., 2022) - a conditional diffusion model trained by fine-tuning Stable Diffusion (983M) (Rombach et al., 2022) on a constructed multi-modal dataset that contains more than 450K text editing instructions and corresponding images before and after the edit.
  • LLaVA (13B) (Liu et al., 2023b) - a large multimodal model developed by connecting the visual encoder of CLIP (400M) (Radford et al., 2021) with the language decoder LLaMA (7B) (Touvron et al., 2023a) (see the projector sketch after this list). LLaVA is fine-tuned on a generated instructional vision-language dataset consisting of 158K unique language-image instruction-following samples.
  • Video-LLaMA (Zhang et al., 2023b) - a multimodal framework that enhances large language models with the ability to understand both visual and auditory content in videos. The architecture of Video-LLaMA consists of two branched encoders:
    • the Vision-Language (VL) Branch
    • the Audio-Language (AL) Branch
    and a language decoder (Vicuna (7B/13B) (Chiang et al., 2023), LLaMA (7B) (Touvron et al., 2023a), etc.)
  • InstructBLIP (1.2B) (Dai et al., 2023) - a vision-language instruction tuning framework initialized with a pre-trained BLIP-2 (Li et al., 2023d) model consisting of an image encoder, an LLM (FlanT5 (3B/11B) (Chung et al., 2022) or Vicuna (7B/13B) (Chiang et al., 2023)), and a Query Transformer (Q-Former) to bridge the two.
  • Otter (Li et al., 2023b) - a multi-modal model trained by fine-tuning OpenFlamingo (9B) (Awadalla et al., 2023), with the language and vision encoders frozen and only fine-tuning the Perceiver resampler module, cross-attention layers, and input/output embeddings.
  • MultiModal-GPT (Gong et al., 2023) - a multimodal instruction tuning model trained by fine-tuning OpenFlamingo (9B) (Awadalla et al., 2023) on visual instruction data created from open datasets, covering VQA, image captioning, visual reasoning, text OCR, and visual dialogue. It can follow diverse instructions, e.g. generating detailed captions, counting specific objects, and addressing general inquiries.
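
A minimal sketch of the LLaVA-style vision-language bridge mentioned above: a learned projection maps frozen CLIP patch features into the LLM's token-embedding space, so image "tokens" can be prepended to the text embeddings. The class name and dimensions are illustrative (LLaVA v1 does use a single linear layer like this):

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Hypothetical module name: projects CLIP features into the LLM
    embedding space so they can be consumed as input 'tokens'."""
    def __init__(self, clip_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(clip_dim, llm_dim)

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        # clip_features: (batch, num_patches, clip_dim) from a frozen encoder
        return self.proj(clip_features)  # (batch, num_patches, llm_dim)
```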

Domain-specific Instruction Tuning (§6)