Title: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Authors: Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou
Published: 28 January 2022
Link: http://arxiv.org/abs/2201.11903v6

Abstract

We explore how generating a chain of thought — a series of intermediate reasoning steps — significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.


Chain-of-thought prompting has several attractive properties as an approach for facilitating reasoning in language models.

  1. First, chain of thought, in principle, allows models to decompose multi-step problems into intermediate steps, which means that additional computation can be allocated to problems that require more reasoning steps.
  2. Second, a chain of thought provides an interpretable window into the behavior of the model, suggesting how it might have arrived at a particular answer and providing opportunities to debug where the reasoning path went wrong (although fully characterizing a model’s computations that support an answer remains an open question).
  3. Third, chain-of-thought reasoning can be used for tasks such as math word problems, commonsense reasoning, and symbolic manipulation, and is potentially applicable (at least in principle) to any task that humans can solve via language.
  4. Finally, chain-of-thought reasoning can be readily elicited in sufficiently large off-the-shelf language models simply by including examples of chain of thought sequences in the exemplars of few-shot prompting.
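As a concrete illustration of point 4, the prompt format can be sketched as below. This is a minimal sketch, not the paper's actual eight-exemplar prompt; the exemplar is written in the style of the paper's tennis-ball example, and the function name is my own:

```python
# Sketch of few-shot chain-of-thought prompting: each exemplar pairs a
# question with a worked rationale that ends in the final answer, so the
# model learns to emit its own rationale before answering.
# The single exemplar below is illustrative, not the paper's full prompt.

COT_EXEMPLARS = [
    {
        "question": (
            "Roger has 5 tennis balls. He buys 2 more cans of tennis "
            "balls. Each can has 3 tennis balls. How many tennis balls "
            "does he have now?"
        ),
        "rationale": (
            "Roger started with 5 balls. 2 cans of 3 tennis balls each "
            "is 6 tennis balls. 5 + 6 = 11."
        ),
        "answer": "11",
    },
]


def build_cot_prompt(exemplars, new_question):
    """Concatenate (question, chain of thought, answer) triples, then the
    new question; the model is expected to continue with its own chain of
    thought followed by 'The answer is ...'."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['rationale']} The answer is {ex['answer']}.\n"
        )
    parts.append(f"Q: {new_question}\nA:")
    return "\n".join(parts)
```

Standard few-shot prompting is the same construction with the rationale removed; the chain of thought is the only difference between the two conditions.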

In empirical experiments, the authors show the utility of chain-of-thought prompting for:

  1. arithmetic reasoning (§3)
  2. commonsense reasoning (§4)
  3. symbolic reasoning (§5)

No language models were finetuned in the process of writing this paper.

Arithmetic Reasoning (§3)

Ablations (Figure 6): is it really the chain of thought that’s helping?

  1. Equation only: the LM outputs just the equation for the problem
    • helps only in cases where the equation can be easily derived from the question
  2. Variable compute only (scaling test-time compute): the LM outputs a sequence of dots (...) as long as the equation it would have derived
    • doesn’t help, so extra test-time compute alone does not explain the gains
  3. Chain of thought given after the answer
    • performance matches the standard prompting baseline, suggesting the sequential reasoning in the chain of thought matters beyond merely activating relevant knowledge
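The three ablation formats can be sketched as prompt transformations of a single exemplar. This is a hedged reconstruction of the formats described for Figure 6; the function names and the exemplar fields are my own illustrative choices, not the paper's code:

```python
# Sketch of the three ablation prompt formats (cf. Figure 6).
# Function names and formatting are illustrative assumptions.

def equation_only(question, equation, answer):
    """Ablation 1: the model is prompted to emit only the equation,
    with no natural-language reasoning."""
    return f"Q: {question}\nA: {equation} The answer is {answer}."


def dots_only(question, equation, answer):
    """Ablation 2 (variable compute): output dots whose length matches
    the equation, so extra test-time compute is spent without any
    intermediate reasoning in natural language."""
    dots = "." * len(equation)
    return f"Q: {question}\nA: {dots} The answer is {answer}."


def cot_after_answer(question, rationale, answer):
    """Ablation 3: the chain of thought appears only after the answer,
    so it cannot causally contribute to producing the answer."""
    return f"Q: {question}\nA: The answer is {answer}. {rationale}"
```

Because neither the dots variant nor the answer-first variant recovers the gains, the benefit appears to come from the reasoning content produced before the answer, not from output length or format alone.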

Sensitivity analysis

Compared:

  • Annotator A: chains of thought written by Annotator A
  • Annotators B and C: two other co-authors independently wrote chains of thought for the same few-shot exemplars (shown in Appendix H)
  • Annotator A also wrote a second chain of thought that was more concise than the original, following the style of the solutions in Cobbe et al. (2021), the GSM8K paper (Training Verifiers to Solve Math Word Problems; 8.5k grade-school math problems)

Datasets used / evaluated on: GSM8K, SVAMP, ASDiv, AQuA, MAWPS

Commonsense Reasoning

Datasets: CSQA, StrategyQA, BIG-bench Date Understanding and Sports Understanding, SayCan

Results with chain-of-thought prompting

  • PaLM 540B outperformed prior state of the art on:
    • StrategyQA (75.6% vs 69.4%)
    • sports understanding (95.4%, vs 84% for an unaided sports enthusiast)
  • The gain was minimal on CSQA

  • Figure 7 highlights these results for PaLM (full results for LaMDA, GPT-3, and different model scales are shown in Table 4)
  • For all tasks, scaling up model size improved the performance of standard prompting
  • Chain-of-thought prompting led to further gains, with improvements appearing largest for PaLM 540B

Symbolic Reasoning

The authors take the following two tasks out of distribution (evaluating on longer inputs than appear in the few-shot exemplars) and see good results with chain of thought (see the paper's figure).

  • Last letter concatenation. This task asks the model to concatenate the last letters of words in a name (e.g., “Amy Brown” → “yn”). It is a more challenging version of first letter concatenation, which language models can already perform without chain of thought. Full names are generated by randomly combining names from the top one-thousand first and last names from name census data (https://namecensus.com/).
  • Coin flip. This task asks the model to answer whether a coin is still heads up after people either flip or don’t flip the coin (e.g., “A coin is heads up. Phoebe flips the coin. Osvaldo does not flip the coin. Is the coin still heads up?” → “no”).
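The ground truth for both tasks is trivially computable, which makes out-of-distribution lengths easy to generate. A minimal sketch (function names are my own):

```python
# Reference ground truth for the two symbolic tasks, shown only to
# illustrate the task definitions; longer inputs than the exemplars
# show (more words, more people) give the out-of-distribution split.

def last_letter_concatenation(name: str) -> str:
    """Concatenate the last letter of each word in a name."""
    return "".join(word[-1] for word in name.split())


def coin_still_heads(flips: list) -> str:
    """A coin starts heads up; each True entry means a person flips it.
    Return 'yes' if it is still heads up after all events, else 'no'.
    An even number of flips leaves the coin heads up."""
    return "yes" if sum(flips) % 2 == 0 else "no"
```

In the coin-flip example above, Phoebe flips (True) and Osvaldo does not (False), so there is one flip in total and the answer is “no”.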

Background

“Explanations typically come after answers” (in previous, different work):

  • WT5! Training Text-to-Text Models to Explain their Predictions - trains T5 (base and 11B) to output explanations for NLI tasks using datasets e-SNLI, CoS-E, MultiRC and Movie Reviews
  • Reframing Human-AI Collaboration for Generating Free-Text Explanations - generates free-text explanations for classification tasks (decisions) using GPT-3
  • Can language models learn from explanations in context? - summary: explanations can support the in-context learning of large LMs on challenging tasks.
    • Explanations of the answers to few-shot examples help LMs, relative to carefully matched control explanations
    • Explanations can improve performance even without tuning
    • Explanations hand-tuned for performance on a small validation set offer substantially larger benefits
    • Building a prompt by selecting examples and explanations together substantially improves performance over selecting examples alone
    • That even untuned explanations outperform carefully matched controls suggests the benefits come from the link between an example and its explanation, rather than from lower-level features
    • Caveat: only large models benefit

Limitations

  1. Although chain of thought emulates the thought processes of human reasoners, this does not answer whether the neural network is actually “reasoning,” which remains an open question
  2. Although the cost of manually augmenting exemplars with chains of thought is minimal in the few-shot setting, such annotation costs could be prohibitive for finetuning (though this could potentially be surmounted with synthetic data generation, or zero-shot generalization).
  3. there is no guarantee of correct reasoning paths, which can lead to both correct and incorrect answers
    • improving factual generations of language models is an open field
  4. The emergence of chain-of-thought reasoning only at large model scales makes it costly to serve in real-world applications
    • follow-up research explores how to induce reasoning in smaller models