“Wait—That’s an aha moment.”
Halfway through an algebra problem, a language model paused and wrote: “Wait, wait. That’s an aha moment.” No one told it to stop and rethink. It learned to do that on its own.
That moment came from DeepSeek-R1-Zero, a large language model trained to reason using only reinforcement learning (RL)—no supervised warm‑up, no hand-written examples of step‑by‑step solutions. The researchers pushed a base model to solve math and coding tasks by giving it simple, checkable rewards (was the final answer right? did it follow the output format?), and something unexpected emerged: longer, more deliberate chains of thought; self‑verification; and genuine reflection.
Pure RL, emergent reasoning
Reinforcement learning here means the model tries answers, gets a numeric reward, and updates its policy to do better next time. The team used a lightweight variant called Group Relative Policy Optimization (GRPO), which compares multiple sampled answers per question and nudges the policy toward the better ones while keeping it close to a reference model for stability.
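The group-relative part is simple enough to sketch. The snippet below is a rough Python illustration, not the paper’s implementation; the reward values are invented. Each sampled answer is scored against the average of its own group, so correct answers get pushed up and incorrect ones pushed down, with no separately trained value model.

```python
# Minimal sketch of the group-relative advantage at the heart of GRPO.
# Illustrative only: rewards are made up, and a real implementation would
# apply these advantages to per-token log-probabilities and add a KL
# penalty against the reference model.
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """Score each sampled answer against its own group's average."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid division by zero if all rewards tie
    return [(r - mu) / sigma for r in rewards]

# Eight sampled answers to one question: 1.0 = correct, 0.0 = wrong (toy values).
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
# Correct answers get positive advantages (reinforced), wrong ones negative.
print([round(a, 2) for a in advantages])
```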
No “neural” reward models were used to judge process quality (a common source of reward hacking). Instead, the reward was mostly rule‑based: a math checker for exact answers, a compiler and tests for code, plus a simple format rule that separated thinking from the final answer using tags.
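A toy version of such a reward might look like the following. The `<think>`/`<answer>` tag names, the small format bonus, and the exact-match check are illustrative assumptions, not the paper’s exact recipe.

```python
import re

# Illustrative rule-based reward: a format check plus an exact-match answer
# check. Tag names and weights here are assumptions for the sketch.
def rule_based_reward(output: str, reference_answer: str) -> float:
    reward = 0.0
    # Format rule: reasoning inside <think>...</think>, answer inside <answer>...</answer>.
    pattern = r"<think>.*?</think>\s*<answer>(.*?)</answer>"
    match = re.search(pattern, output, flags=re.DOTALL)
    if match is None:
        return reward  # no format reward, no chance at the accuracy reward
    reward += 0.1  # small bonus for following the template
    # Accuracy rule: exact match against a checkable final answer.
    # For code tasks this would instead compile and run predefined tests.
    if match.group(1).strip() == reference_answer.strip():
        reward += 1.0
    return reward

print(rule_based_reward("<think>2+2=4</think><answer>4</answer>", "4"))  # 1.1
print(rule_based_reward("The answer is 4.", "4"))                        # 0.0
```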
Two things stood out:
- Test‑time compute scaled up naturally. As training progressed, the model chose to “think longer”—hundreds to thousands of tokens—on harder problems, and accuracy rose in step.
- Reflection emerged unprompted. The model learned to revisit and revise its own steps, sometimes flagging a rethink in plain language.
The numbers are striking. On the AIME 2024 math benchmark, pass@1 jumped from 15.6% at the start of training to 71.0%. When the system sampled many answers and took a majority vote over 64 attempts—“consensus@64”—accuracy reached 86.7%, edging past OpenAI’s o1‑0912. Similar gains showed up on other math and coding tests.
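For readers unfamiliar with the metric, consensus@64 is just majority voting over 64 sampled answers: generate many attempts, keep whichever final answer appears most often. A toy sketch (with fabricated answers) makes the idea concrete.

```python
from collections import Counter

# Toy illustration of cons@64 (majority voting). The sampled answers below
# are fabricated for the example.
def consensus(answers):
    return Counter(answers).most_common(1)[0][0]

sampled = ["42"] * 40 + ["41"] * 15 + ["7"] * 9   # 64 sampled final answers
print(consensus(sampled))  # "42" — a single sample would be right only ~62% of the time here
```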
From raw power to polish
R1-Zero’s chains of thought could be messy: mixed languages, hard-to-scan text. To make the model readable and stronger, the team built DeepSeek-R1 with a compact, multi-stage pipeline. First came a “cold start” phase: the base model was fine-tuned on a few thousand curated, long-form solutions. “Cold start” here means a small supervised dataset applied before any RL, aimed at giving the model a clean format and narrative style rather than teaching specific tricks.
Readability had a concrete meaning: consistent language, clear structure (headings, math formatting), a concise summary, and no sprawling code or tangled prose in the reasoning. The output format was standardized:
|special_token|<reasoning_process>|special_token|<summary>
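The paper leaves the delimiter as a placeholder, so the literal token below is an assumption made only for illustration; still, a few lines of Python show how such a template cleanly separates the reasoning from the summary.

```python
# The delimiter is shown in the paper as a placeholder (|special_token|);
# the literal string below is assumed for this sketch.
SPECIAL = "|special_token|"

def split_output(text: str):
    """Split a cold-start-formatted completion into (reasoning, summary)."""
    _, reasoning, summary = text.split(SPECIAL, maxsplit=2)
    return reasoning.strip(), summary.strip()

example = f"{SPECIAL} First, factor the quadratic... so x = 3. {SPECIAL} x = 3"
reasoning, summary = split_output(example)
print(summary)  # "x = 3"
```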
Then came large-scale RL again, now with a small “language consistency” reward to discourage mixed-language output. Next, the team used rejection sampling to harvest high-quality training data from the RL checkpoint (about 600k correct, clean reasoning traces), mixed in roughly 200k general examples (writing, factual QA, translation), and fine-tuned the base model on the combined set. A second RL stage then aligned the model for helpfulness and harmlessness using preference rewards, while accuracy on math and code remained rule-checked.
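Conceptually, the rejection-sampling step is a filter: sample several completions per prompt from the RL checkpoint and keep only those that pass the correctness and readability checks. The sketch below is schematic; `generate`, `is_correct`, and `is_readable` are assumed stand-ins for the RL model, the rule-based checkers, and the readability filters described above.

```python
# Conceptual sketch of rejection sampling for the reasoning SFT data.
# All helper functions are assumed placeholders, not a real API.
def harvest_sft_data(prompts, generate, is_correct, is_readable, samples_per_prompt=16):
    dataset = []
    for prompt in prompts:
        for completion in generate(prompt, n=samples_per_prompt):
            if is_correct(prompt, completion) and is_readable(completion):
                dataset.append({"prompt": prompt, "completion": completion})
                break  # keep one clean trace per prompt (a simplification)
    return dataset
```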
DeepSeek-R1 landed near the front of the pack: 79.8% on AIME 2024, 97.3% on MATH‑500, 65.9% on LiveCodeBench, and a 2,029 Codeforces rating—competitive with OpenAI’s o1‑1217 on math and coding, while also improving knowledge and open‑ended writing over the base model.
Small models learn big tricks
The team then distilled R1’s reasoning into smaller, dense models by training students on ~800k R1-generated examples. Distillation here means the student is simply fine-tuned to reproduce the teacher’s reasoning traces and summaries on the same prompts; no RL is required (a minimal sketch follows the results below). The results were unusually strong:
- A 14B model beat the open‑source QwQ‑32B‑Preview across tasks.
- A 32B model hit 72.6% on AIME 2024 and 57.2% on LiveCodeBench.
- A 70B model reached 70.0% on AIME 2024 and 94.5% on MATH‑500.
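The sketch below shows how little machinery distillation needs in this setting: it is ordinary supervised fine-tuning on teacher outputs. The function and data-field names are placeholders for illustration, not the team’s actual training stack.

```python
# Distillation as plain supervised fine-tuning on teacher outputs:
# no reward signal, no sampling from the student during training.
# `student`, `cross_entropy_step`, and the data fields are assumed stand-ins;
# the real recipe used ~800k curated R1-generated samples.
def distill(student, teacher_dataset, cross_entropy_step, epochs=2):
    for _ in range(epochs):
        for example in teacher_dataset:
            # The student learns to reproduce the teacher's full output:
            # the long reasoning trace followed by the final summary.
            target = example["reasoning"] + example["summary"]
            cross_entropy_step(student, prompt=example["prompt"], target=target)
    return student
```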
A head‑to‑head test underscored the point: running large‑scale RL directly on a 32B base model produced only modest gains (roughly QwQ‑32B‑level), while distilling from the stronger R1 teacher produced a clearly better 32B model. In short, if the goal is efficient, widely deployable reasoning, distillation currently wins.
What didn’t work—and why that matters
Two popular ideas fell short at scale. Process Reward Models (which score each intermediate step) ran into three snags: it’s hard to define a “step” across domains, hard to label correctness reliably, and easy for models to game the reward. Monte Carlo Tree Search struggled with the sheer branching factor of token generation; a fine‑grained value model became a bottleneck, and iterative self‑improvement stalled. Both can help at inference, but neither delivered a robust training recipe here.
Why this changes the conversation
DeepSeek’s work shows that simple, verifiable rewards and a minimalist template can teach a model to think—at length, with reflection—without human step‑by‑step labels. A small dose of curated supervision then turns that raw ability into a readable, safe, and broadly capable assistant. And distilling those patterns into smaller models makes the gains accessible.
If a model can learn to pause and say “aha” purely from incentives, the next frontier may be less about writing better examples and more about designing better rewards—and building base models big enough to use them.


