Title: Finetuned Language Models Are Zero-Shot Learners
Authors: Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le
Published: 3rd September 2021 (Friday) @ 17:55:52
Link: http://arxiv.org/abs/2109.01652v5
Abstract
This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning – finetuning language models on a collection of tasks described via instructions – substantially improves zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
Notes
- Overall idea: instruction fine-tuning improves zero-shot generalisation to new tasks
- In other words, fine-tuning on a collection of tasks with instruction-response templates improves performance on other downstream tasks without explicitly instruction-fine-tuning (IFT-ing) on them
- Tasks are grouped ("clustered") into task clusters, e.g. BoolQ is in the Reading Comprehension cluster with MultiRC and OBQA
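A minimal sketch of the cluster grouping as a lookup table. Only the Reading Comprehension members are taken from the note above; the other cluster names and memberships are illustrative placeholders, not the paper's full task list.

```python
# Hypothetical sketch of FLAN-style task clusters. Only the Reading
# Comprehension entries come from the notes; the rest are placeholders.
TASK_CLUSTERS = {
    "reading_comprehension": ["BoolQ", "MultiRC", "OBQA"],
    "natural_language_inference": ["ANLI", "RTE"],
    "closed_book_qa": ["NaturalQuestions", "TriviaQA"],
}

def cluster_of(task: str) -> str:
    """Return the cluster a task belongs to (raises KeyError if unknown)."""
    for cluster, tasks in TASK_CLUSTERS.items():
        if task in tasks:
            return cluster
    raise KeyError(task)
```

The cluster split matters because evaluation is always on task *types* (clusters) never seen during instruction tuning, not just unseen datasets.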
- Ablation 1:
- Hold out commonsense reasoning, closed-book QA and NLI clusters of specific tasks and IFT with other tasks
- Add clusters of tasks sequentially
- Observed Result: Performance on the held-out tasks goes up with the added tasks, i.e. more tasks help generalisation
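The ablation above can be sketched as a loop that grows the finetuning mixture one cluster at a time and re-evaluates on the held-out clusters after each addition. The cluster names and the `train_fn`/`eval_fn` callables are hypothetical stand-ins, not the paper's code.

```python
# Sketch of the leave-clusters-out ablation: hold out three clusters,
# then add the remaining clusters to the finetuning mixture sequentially
# and measure held-out (zero-shot) performance after each addition.
HELD_OUT = ["commonsense_reasoning", "closed_book_qa", "nli"]
TRAIN_CLUSTERS = ["sentiment", "paraphrase", "reading_comprehension",
                  "translation", "summarization", "struct_to_text", "misc"]

def ablation_curve(train_fn, eval_fn):
    """Return held-out scores as training clusters are added one by one."""
    scores = []
    for k in range(1, len(TRAIN_CLUSTERS) + 1):
        model = train_fn(TRAIN_CLUSTERS[:k])      # instruction-tune on k clusters
        scores.append(eval_fn(model, HELD_OUT))   # zero-shot eval on held-out
    return scores
```

The observed result corresponds to this curve trending upwards as `k` grows.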
- Ablation 2:
- "Using the same cluster split as in the previous ablation study, we evaluate the effect of instruction tuning on models of size 422M, 2B, 8B, 68B, and 137B parameters."
- "Whereas instruction tuning helps large models generalize to new tasks, for small models it actually hurts generalization to unseen tasks, potentially because all model capacity is used to learn the mixture of instruction tuning tasks." - caption from Figure 7
- Hurts generalisation up to 8B; helps from 68B onwards
- Takeaway: Model scale is crucial for IFT to work
- Hypothesis: IFT with lots of tasks makes small models use all their capacity to learn the different tasks - splitting their capacity
- Ablation 3 (stupid): testing whether the instructions themselves matter, or whether multi-task fine-tuning alone is responsible for the gains in zero-shot generalisation
- "In a no template setup, only inputs and outputs were given to the model (e.g., for translation the input would be 'The dog runs.' and the output would be 'Le chien court.'). In a dataset name setup, each input is prepended with the name of the task and dataset (e.g., for translation to French, the input would be '[Translation: WMT'14 to French] The dog runs.')"
- Not using instructions in a natural language template doesn't work as well
- e.g. for translation, the no-template setup gives the model only the raw source sentence as input, with no indication of what task to perform
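The three input formats from Ablation 3 can be sketched as simple formatting functions. The no-template and dataset-name examples follow the quoted passage; the instruction wording and the function names are my own illustrative choices, not the paper's exact templates.

```python
# Sketch of the three input formats compared in Ablation 3.

def no_template(src: str) -> str:
    # Raw input only: the model must infer the task from the data.
    return src

def dataset_name(src: str) -> str:
    # Task/dataset name prepended, but no natural-language instruction.
    return f"[Translation: WMT'14 to French] {src}"

def instruction_template(src: str) -> str:
    # Natural-language instruction, as in FLAN-style instruction tuning
    # (wording illustrative).
    return f"Translate this sentence to French: {src}"

for fmt in (no_template, dataset_name, instruction_template):
    print(fmt("The dog runs."))
```

Only the third format transfers well zero-shot, supporting the claim that natural language instructions (not multi-task training alone) drive the gains.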
- Instruction fine-tuning improves the model's responsiveness to soft prompts:
- "Instruction-tuned models respond better to continuous inputs from prompt tuning. When prompt tuning on a given dataset, no tasks from the same cluster as that dataset were seen during instruction tuning."
- They use Lester et al. (2021) (The Power of Scale for Parameter-Efficient Prompt Tuning) to construct continuous prompts for each of the SuperGLUE tasks
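A toy sketch of what prompt tuning (Lester et al., 2021) does mechanically: a small matrix of trainable "soft prompt" embeddings is prepended to the frozen token embeddings, and only the soft prompt is updated during tuning. The shapes are illustrative and there is no real language model here.

```python
import numpy as np

# Toy sketch of prompt tuning: trainable continuous embeddings are
# prepended to the (frozen) input token embeddings. Shapes illustrative.
rng = np.random.default_rng(0)
d_model, n_prompt_tokens, seq_len = 512, 20, 37

soft_prompt = rng.normal(scale=0.02, size=(n_prompt_tokens, d_model))  # trainable
token_embeddings = rng.normal(size=(seq_len, d_model))                 # frozen

# The frozen LM consumes the concatenated sequence; gradients flow only
# into soft_prompt during prompt tuning.
model_input = np.concatenate([soft_prompt, token_embeddings], axis=0)
print(model_input.shape)  # (57, 512)
```

The finding is that instruction-tuned models reach higher task performance from these learned continuous prompts than the untuned base model does.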