The eight examples that taught a giant to “show its work”

With eight worked examples, a 540‑billion‑parameter language model jumped to the top of a hard grade‑school math benchmark—no finetuning, just a prompt. The trick is “chain‑of‑thought”: making the model write out its steps before giving an answer.

A simple shift: add the steps, then the answer

Researchers prompted large language models with tiny demonstrations that look like this: question → short, natural‑language reasoning → “The answer is …”. They call the middle part the chain of thought: a few sentences that break a problem into intermediate steps. That’s it. The model then imitates the pattern on new questions.
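
To make the format concrete, here is a minimal sketch of such a prompt in Python. The exemplar wording paraphrases the paper’s well‑known tennis‑ball example rather than reproducing the exact eight‑shot prompt, so treat it as illustrative.

```python
# A chain-of-thought exemplar: question -> short natural-language
# reasoning -> "The answer is ...". Wording is illustrative, not the
# paper's exact few-shot prompt.
EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
)

def build_prompt(question: str) -> str:
    """Prepend the worked exemplar(s); the model continues after 'A:'."""
    return EXEMPLAR + f"Q: {question}\nA:"

print(build_prompt(
    "A juggler has 16 balls. Half of the balls are golf balls, "
    "and half of the golf balls are blue. How many blue golf balls are there?"
))
```

Under standard prompting the exemplar’s answer line would contain only “The answer is 11.”; the sentences before it are the entire difference.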

Why this matters: prompting like this is cheap and reusable. Instead of finetuning a new model for each task, a single off‑the‑shelf model can tackle math, commonsense puzzles, and even toy logic tasks, provided you show it how to think, not just what to predict.

The surprise: reasoning “emerges” at large scale

On small models, chain‑of‑thought prompting often hurts: the generated chains read fluently, but the logic is off. Past roughly 100 billion parameters, the curve bends upward.

  • On GSM8K (challenging math word problems), GPT‑3 175B went from 15.6% correct with standard prompting to 46.9% with chain‑of‑thought. PaLM 540B rose from 17.9% to 56.9%, a new state of the art using just eight exemplars, surpassing a finetuned GPT‑3 system with a verifier. A simple external calculator patched a few arithmetic slips and nudged PaLM to 58.6% (one way such a patch could look is sketched after this list).
  • The harder the problem, the bigger the lift. On easy one‑step subsets of MAWPS, gains were tiny because baselines were already high. On multi‑step subsets, the boost was large.
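
The calculator patch mentioned above is a small post‑processing step. One plausible way to implement it, sketched here as an assumption rather than the authors’ actual code, is to find “a op b = c” expressions in the generated chain and recompute the right‑hand side:

```python
import re

# Hypothetical sketch of the external-calculator patch: recompute simple
# "a op b = c" expressions in a generated chain, overwriting the model's
# possibly wrong result. Not the paper's actual post-processing code.
_EQ = re.compile(r"(\d+(?:\.\d+)?)\s*([+\-*/])\s*(\d+(?:\.\d+)?)\s*=\s*(\d+(?:\.\d+)?)")

def patch_arithmetic(chain: str) -> str:
    def recompute(m: re.Match) -> str:
        a, op, b = float(m.group(1)), m.group(2), float(m.group(3))
        value = {"+": a + b, "-": a - b, "*": a * b, "/": a / b}[op]
        text = str(int(value)) if value == int(value) else f"{value:g}"
        return f"{m.group(1)} {op} {m.group(3)} = {text}"
    return _EQ.sub(recompute, chain)

# The model slipped on 5 + 6; the corrected chain reads "5 + 6 = 11".
print(patch_arithmetic("2 cans of 3 balls is 2 * 3 = 6 balls. 5 + 6 = 10. The answer is 10."))
```

A full pipeline would also re‑derive the final “The answer is …” line from the corrected chain; this sketch only fixes the arithmetic in place.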

This isn’t a one‑off for math. On commonsense tasks that require multi‑hop reasoning, the same pattern holds. PaLM 540B with chain‑of‑thought reached 77.8% on StrategyQA (prior best: 69.3%) and 95.4% on a sports plausibility test, outscoring an unaided human sports enthusiast (84%).

Tiny “programs” written in words

Two toy tasks show something deeper: length generalization. With a few two‑step examples, the model learned procedures it could run for longer inputs.

  • Last‑letter concatenation (e.g., “Amy Brown” → “yn”): PaLM 540B hit 99.4% on names with two words and 63% on four‑word names, where no four‑word examples were shown.
  • Coin‑flip state tracking (flip/not‑flip sequences): PaLM 540B reached 100% on two steps and 90.2% on four steps, again without seeing longer sequences in the prompt (both procedures are sketched below).
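
For reference, here is a minimal sketch of the ground‑truth procedures behind the two toy tasks; the function names are mine, not the paper’s.

```python
# Ground-truth procedures for the two toy tasks. The prompt exemplars
# demonstrated only two-step cases; longer inputs test whether the model
# can run the same procedure for more steps.

def last_letter_concat(name: str) -> str:
    """'Amy Brown' -> 'yn': join the last letter of each word."""
    return "".join(word[-1] for word in name.split())

def coin_still_heads_up(flips: list[bool]) -> bool:
    """The coin starts heads up; each True in `flips` flips it once.
    It ends heads up exactly when the number of flips is even."""
    return sum(flips) % 2 == 0

assert last_letter_concat("Amy Brown") == "yn"
assert coin_still_heads_up([True, False, True, False]) is True  # two flips across four steps
```

Trivial to execute in code, but stating and running the procedure in natural language is exactly what small models could not do.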

Small models failed even in‑domain. The ability to apply the learned procedure at all appeared only in very large models.

Why the steps matter

The team stress‑tested the idea. Three revealing ablations on GSM8K (the three prompt formats are sketched after this list):

  • Equation‑only prompting (ask for “5*4=20” before the answer) helped a bit on simple sets but didn’t close the gap on the hard math; turning language into the right equation is the real challenge.
  • Variable compute (output “…” before answering) did nothing, suggesting the benefit isn’t “more tokens” but the content of the tokens.
  • Reasoning after the answer didn’t help, implying the step‑by‑step process before answering carries the gain.
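
To make the three ablations concrete, here is roughly what the exemplar’s answer text looks like under each format, reusing the earlier tennis‑ball question; the wording is illustrative rather than copied from the paper’s prompts.

```python
# Roughly what the exemplar's answer text looks like under each ablation,
# for the same question as the earlier prompt sketch. Wording illustrative.

CHAIN_OF_THOUGHT = (
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11."
)

# Equation only: just the math, no natural-language steps.
EQUATION_ONLY = "5 + 2 * 3 = 11. The answer is 11."

# Variable compute: dots standing in for the reasoning, so the model spends
# extra tokens that carry no content.
VARIABLE_COMPUTE = "." * 60 + " The answer is 11."

# Reasoning after the answer: the same steps, but the answer comes first.
REASONING_AFTER_ANSWER = (
    "The answer is 11. Roger started with 5 balls. 2 cans of 3 tennis "
    "balls each is 6 tennis balls. 5 + 6 = 11."
)
```

Only the first format moved the needle on the hard math, which is why the authors credit the content of the steps rather than the extra tokens.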

Think of it like climbing a ladder. The model does better when it takes rungs one by one rather than trying to jump to the top.

Chain‑of‑thought also offers a window into model behavior. On math problems the model solved correctly, most chains were logically sound; a few arithmetic slips were fixable with a calculator. On classification tasks, answers that were correct despite shaky chains showed up more often than on math. These are not guaranteed explanations, but they do expose where reasoning goes off the rails.

How robust is it?

Different annotators wrote different‑sounding chains for the same exemplars. All beat the standard prompt by a wide margin. Even a deliberately terse style worked. Reordering exemplars barely changed results, and three random sets of examples drawn from a public dataset worked well across several math benchmarks. In practice, about eight short exemplars were enough; adding more helped somewhat, but the method didn’t hinge on heavy prompt engineering.

Limits—and why it still matters

This isn’t proof that models “reason” like people. Chains can look confident and still be wrong. And the effect truly shines only in very large models, which are costly to serve. Even so, the result reframes what prompting can do. Adding a chain of thought turns flat scaling curves into rising ones on multi‑step tasks, expanding the range of problems a single, off‑the‑shelf model can solve without training.

If a few carefully chosen steps can unlock state‑of‑the‑art math, beat humans on sports plausibility, and generalize tiny procedures to longer inputs, then the message is clear: ask the model to show its work, and it can do work it couldn’t do before.