Title: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Authors: Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou
Published: 28 January 2022
Link: http://arxiv.org/abs/2201.11903v6

Abstract

We explore how generating a chain of thought — a series of intermediate reasoning steps — significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.


Chain-of-thought prompting has several attractive properties as an approach for facilitating reasoning in language models.

  1. First, chain of thought, in principle, allows models to decompose multi-step problems into intermediate steps, which means that additional computation can be allocated to problems that require more reasoning steps.
  2. Second, a chain of thought provides an interpretable window into the behavior of the model, suggesting how it might have arrived at a particular answer and providing opportunities to debug where the reasoning path went wrong (although fully characterizing a model’s computations that support an answer remains an open question).
  3. Third, chain-of-thought reasoning can be used for tasks such as math word problems, commonsense reasoning, and symbolic manipulation, and is potentially applicable (at least in principle) to any task that humans can solve via language.
  4. Finally, chain-of-thought reasoning can be readily elicited in sufficiently large off-the-shelf language models simply by including examples of chain of thought sequences in the exemplars of few-shot prompting.
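As a concrete illustration of point 4, the prompt format can be sketched as below. This is a minimal sketch, not the paper's actual eight-exemplar prompt; the exemplar is written in the style of the paper's tennis-ball example, and the function name is my own:

```python
# Sketch of few-shot chain-of-thought prompting: each exemplar pairs a
# question with a worked rationale that ends in the final answer, so the
# model learns to emit its own rationale before answering.
# The single exemplar below is illustrative, not the paper's full prompt.

COT_EXEMPLARS = [
    {
        "question": (
            "Roger has 5 tennis balls. He buys 2 more cans of tennis "
            "balls. Each can has 3 tennis balls. How many tennis balls "
            "does he have now?"
        ),
        "rationale": (
            "Roger started with 5 balls. 2 cans of 3 tennis balls each "
            "is 6 tennis balls. 5 + 6 = 11."
        ),
        "answer": "11",
    },
]


def build_cot_prompt(exemplars, new_question):
    """Concatenate (question, chain of thought, answer) triples, then the
    new question; the model is expected to continue with its own chain of
    thought followed by 'The answer is ...'."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['rationale']} The answer is {ex['answer']}.\n"
        )
    parts.append(f"Q: {new_question}\nA:")
    return "\n".join(parts)
```

Standard few-shot prompting is the same construction with the rationale removed; the chain of thought is the only difference between the two conditions.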

In empirical experiments, the authors show the utility of chain-of-thought prompting for:

  1. arithmetic reasoning (§3)
  2. commonsense reasoning (§4)
  3. symbolic reasoning (§5)

No language models were finetuned in the process of writing this paper.

Arithmetic Reasoning (§3)

Ablations (Figure 6): is it really the chain of thought that’s helping?

  1. Equation only: the LM outputs just the equation for the problem
    • helps only in cases where the equation can be easily derived from the question
  2. Variable compute only (scaling test-time compute): the LM outputs a sequence of dots (...) as long as the equation it would have derived
    • doesn’t help, so extra test-time compute alone does not explain the gains
  3. Chain of thought given after the answer
    • performance matches the standard prompting baseline, suggesting the sequential reasoning in the chain of thought matters beyond merely activating relevant knowledge
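The three ablation formats can be sketched as prompt transformations of a single exemplar. This is a hedged reconstruction of the formats described for Figure 6; the function names and the exemplar fields are my own illustrative choices, not the paper's code:

```python
# Sketch of the three ablation prompt formats (cf. Figure 6).
# Function names and formatting are illustrative assumptions.

def equation_only(question, equation, answer):
    """Ablation 1: the model is prompted to emit only the equation,
    with no natural-language reasoning."""
    return f"Q: {question}\nA: {equation} The answer is {answer}."


def dots_only(question, equation, answer):
    """Ablation 2 (variable compute): output dots whose length matches
    the equation, so extra test-time compute is spent without any
    intermediate reasoning in natural language."""
    dots = "." * len(equation)
    return f"Q: {question}\nA: {dots} The answer is {answer}."


def cot_after_answer(question, rationale, answer):
    """Ablation 3: the chain of thought appears only after the answer,
    so it cannot causally contribute to producing the answer."""
    return f"Q: {question}\nA: The answer is {answer}. {rationale}"
```

Because neither the dots variant nor the answer-first variant recovers the gains, the benefit appears to come from the reasoning content produced before the answer, not from output length or format alone.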

Sensitivity analysis

Compared:

  • Annotator A: chains of thought written by Annotator A
  • Annotators B and C: two other co-authors independently wrote chains of thought for the same few-shot exemplars (shown in Appendix H)
  • Annotator A also wrote a second chain of thought that was more concise than the original, following the style of the solutions in Cobbe et al. (2021), the GSM8K paper (Training Verifiers to Solve Math Word Problems; 8.5k grade-school math problems)

Datasets used / evaluated on: GSM8K, SVAMP, ASDiv, AQuA, MAWPS

Commonsense Reasoning

Datasets: CSQA, StrategyQA, BIG-bench Date Understanding and Sports Understanding, SayCan

Results with chain-of-thought prompting

  • PaLM 540B outperformed prior state of the art on:
    • StrategyQA (75.6% vs 69.4%)
    • sports understanding (95.4%, vs 84% for an unaided sports enthusiast)
  • The gain was minimal on CSQA

  • Figure 7 highlights these results for PaLM (full results for LaMDA, GPT-3, and different model scales are shown in Table 4)
  • For all tasks, scaling up model size improved the performance of standard prompting
  • Chain-of-thought prompting led to further gains, with improvements appearing largest for PaLM 540B

Symbolic Reasoning

The authors take the following two tasks out of distribution (evaluating on longer inputs than appear in the few-shot exemplars) and see good results with chain of thought (see the paper's figure).

  • Last letter concatenation. This task asks the model to concatenate the last letters of words in a name (e.g., “Amy Brown” → “yn”). It is a more challenging version of first letter concatenation, which language models can already perform without chain of thought. Full names are generated by randomly combining names from the top one-thousand first and last names from name census data (https://namecensus.com/).
  • Coin flip. This task asks the model to answer whether a coin is still heads up after people either flip or don’t flip the coin (e.g., “A coin is heads up. Phoebe flips the coin. Osvaldo does not flip the coin. Is the coin still heads up?” → “no”).
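The ground truth for both tasks is trivially computable, which makes out-of-distribution lengths easy to generate. A minimal sketch (function names are my own):

```python
# Reference ground truth for the two symbolic tasks, shown only to
# illustrate the task definitions; longer inputs than the exemplars
# show (more words, more people) give the out-of-distribution split.

def last_letter_concatenation(name: str) -> str:
    """Concatenate the last letter of each word in a name."""
    return "".join(word[-1] for word in name.split())


def coin_still_heads(flips: list) -> str:
    """A coin starts heads up; each True entry means a person flips it.
    Return 'yes' if it is still heads up after all events, else 'no'.
    An even number of flips leaves the coin heads up."""
    return "yes" if sum(flips) % 2 == 0 else "no"
```

In the coin-flip example above, Phoebe flips (True) and Osvaldo does not (False), so there is one flip in total and the answer is “no”.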

Background

“Explanations typically come after answers” (in previous, different work):

  • WT5! Training Text-to-Text Models to Explain their Predictions - trains T5 (base and 11B) to output explanations for NLI tasks using datasets e-SNLI, CoS-E, MultiRC and Movie Reviews
  • Reframing Human-AI Collaboration for Generating Free-Text Explanations - generates free-text explanations for classification tasks (decisions) using GPT-3
  • Can language models learn from explanations in context? - summary: explanations can support the in-context learning of large LMs on challenging tasks.
    • Explanations of the answers to few-shot examples help LMs, relative to carefully matched control explanations
    • Explanations can improve performance even without tuning
    • Explanations hand-tuned for performance on a small validation set offer substantially larger benefits
    • Building a prompt by selecting examples and explanations together substantially improves performance over selecting examples alone
    • That even untuned explanations outperform carefully matched controls suggests the benefits come from the link between an example and its explanation, rather than from lower-level features
    • Caveat: only large models benefit

Limitations

  1. Although chain of thought emulates the thought processes of human reasoners, this does not answer whether the neural network is actually “reasoning,” which remains an open question
  2. Although the cost of manually augmenting exemplars with chains of thought is minimal in the few-shot setting, such annotation costs could be prohibitive for finetuning (though this could potentially be surmounted with synthetic data generation, or zero-shot generalization).
  3. there is no guarantee of correct reasoning paths, which can lead to both correct and incorrect answers
    • improving factual generations of language models is an open field
  4. The emergence of chain-of-thought reasoning only at large model scales makes it costly to serve in real-world applications
    • follow-up research explores how to induce reasoning in smaller models