Title: Finetuned Language Models Are Zero-Shot Learners
Authors: Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le
Published: 3rd September 2021 (Friday) @ 17:55:52
Link: http://arxiv.org/abs/2109.01652v5
Abstract
This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning – finetuning language models on a collection of tasks described via instructions – substantially improves zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
Notes
- Overall idea: instruction fine-tuning improves zero-shot generalisation to new tasks
- In other words, fine-tuning on a collection of tasks with instruction-response templates improves performance on other downstream tasks without explicitly instruction-fine-tuning (IFT-ing) on them
- Tasks are grouped ("clustered") into task clusters, e.g. BoolQ is in the Reading Comprehension cluster with MultiRC and OBQA
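A minimal sketch of the cluster grouping as a lookup table. Only the Reading Comprehension members are taken from the note above; the other cluster names and memberships are illustrative placeholders, not the paper's full task list.

```python
# Hypothetical sketch of FLAN-style task clusters. Only the Reading
# Comprehension entries come from the notes; the rest are placeholders.
TASK_CLUSTERS = {
    "reading_comprehension": ["BoolQ", "MultiRC", "OBQA"],
    "natural_language_inference": ["ANLI", "RTE"],
    "closed_book_qa": ["NaturalQuestions", "TriviaQA"],
}

def cluster_of(task: str) -> str:
    """Return the cluster a task belongs to (raises KeyError if unknown)."""
    for cluster, tasks in TASK_CLUSTERS.items():
        if task in tasks:
            return cluster
    raise KeyError(task)
```

The cluster split matters because evaluation is always on task *types* (clusters) never seen during instruction tuning, not just unseen datasets.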
- Ablation 1:
- Hold out commonsense reasoning, closed-book QA and NLI clusters of specific tasks and IFT with other tasks
- Add clusters of tasks sequentially
- Observed Result: Performance on the held-out tasks goes up with the added tasks, i.e. more tasks help generalisation
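The ablation above can be sketched as a loop that grows the finetuning mixture one cluster at a time and re-evaluates on the held-out clusters after each addition. The cluster names and the `train_fn`/`eval_fn` callables are hypothetical stand-ins, not the paper's code.

```python
# Sketch of the leave-clusters-out ablation: hold out three clusters,
# then add the remaining clusters to the finetuning mixture sequentially
# and measure held-out (zero-shot) performance after each addition.
HELD_OUT = ["commonsense_reasoning", "closed_book_qa", "nli"]
TRAIN_CLUSTERS = ["sentiment", "paraphrase", "reading_comprehension",
                  "translation", "summarization", "struct_to_text", "misc"]

def ablation_curve(train_fn, eval_fn):
    """Return held-out scores as training clusters are added one by one."""
    scores = []
    for k in range(1, len(TRAIN_CLUSTERS) + 1):
        model = train_fn(TRAIN_CLUSTERS[:k])      # instruction-tune on k clusters
        scores.append(eval_fn(model, HELD_OUT))   # zero-shot eval on held-out
    return scores
```

The observed result corresponds to this curve trending upwards as `k` grows.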
- Ablation 2:
- "Using the same cluster split as in the previous ablation study, we evaluate the effect of instruction tuning on models of size 422M, 2B, 8B, 68B, and 137B parameters."
- "Whereas instruction tuning helps large models generalize to new tasks, for small models it actually hurts generalization to unseen tasks, potentially because all model capacity is used to learn the mixture of instruction tuning tasks." - caption from Figure 7
- Hurts generalisation up to 8B; helps from 68B onwards
- Takeaway: Model scale is crucial for IFT to work
- Hypothesis: IFT with lots of tasks makes small models use all their capacity to learn the different tasks - splitting their capacity
- Ablation 3 (stupid): testing whether the instructions themselves matter, or whether multi-task fine-tuning alone is responsible for the gains in zero-shot generalisation
- "In a no template setup, only inputs and outputs were given to the model (e.g., for translation the input would be 'The dog runs.' and the output would be 'Le chien court.'). In a dataset name setup, each input is prepended with the name of the task and dataset (e.g., for translation to French, the input would be '[Translation: WMT'14 to French] The dog runs.')"
- Not using instructions in a natural language template doesn't work as well
- e.g. for translation, the no-template setup gives the model only the raw source sentence as input, with no indication of what task to perform
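The three input formats from Ablation 3 can be sketched as simple formatting functions. The no-template and dataset-name examples follow the quoted passage; the instruction wording and the function names are my own illustrative choices, not the paper's exact templates.

```python
# Sketch of the three input formats compared in Ablation 3.

def no_template(src: str) -> str:
    # Raw input only: the model must infer the task from the data.
    return src

def dataset_name(src: str) -> str:
    # Task/dataset name prepended, but no natural-language instruction.
    return f"[Translation: WMT'14 to French] {src}"

def instruction_template(src: str) -> str:
    # Natural-language instruction, as in FLAN-style instruction tuning
    # (wording illustrative).
    return f"Translate this sentence to French: {src}"

for fmt in (no_template, dataset_name, instruction_template):
    print(fmt("The dog runs."))
```

Only the third format transfers well zero-shot, supporting the claim that natural language instructions (not multi-task training alone) drive the gains.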
- Instruction fine-tuning improves the model's responsiveness to soft prompts:
- "Instruction-tuned models respond better to continuous inputs from prompt tuning. When prompt tuning on a given dataset, no tasks from the same cluster as that dataset were seen during instruction tuning."
- They use Lester et al. (2021) (The Power of Scale for Parameter-Efficient Prompt Tuning) to construct continuous prompts for each of the SuperGLUE tasks
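A toy sketch of what prompt tuning (Lester et al., 2021) does mechanically: a small matrix of trainable "soft prompt" embeddings is prepended to the frozen token embeddings, and only the soft prompt is updated during tuning. The shapes are illustrative and there is no real language model here.

```python
import numpy as np

# Toy sketch of prompt tuning: trainable continuous embeddings are
# prepended to the (frozen) input token embeddings. Shapes illustrative.
rng = np.random.default_rng(0)
d_model, n_prompt_tokens, seq_len = 512, 20, 37

soft_prompt = rng.normal(scale=0.02, size=(n_prompt_tokens, d_model))  # trainable
token_embeddings = rng.normal(size=(seq_len, d_model))                 # frozen

# The frozen LM consumes the concatenated sequence; gradients flow only
# into soft_prompt during prompt tuning.
model_input = np.concatenate([soft_prompt, token_embeddings], axis=0)
print(model_input.shape)  # (57, 512)
```

The finding is that instruction-tuned models reach higher task performance from these learned continuous prompts than the untuned base model does.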