A smaller model just outgunned GPT‑3

Meta’s researchers built a 13‑billion‑parameter language model that beats GPT‑3 (175B) on most tests, trained only on public data. And it runs on a single GPU.

The gambit: smaller, longer, cheaper

Most AI teams chase ever-bigger models. LLaMA flips the question: what if the smarter choice is a smaller model trained for longer on more data? That’s the heart of the work. The team fixates on the “inference budget”—the cost and speed of serving a model once it’s deployed—rather than only the cost of training. A model that’s compact but trained hard can be cheaper to run every day. Their surprise find: even a 7B model kept improving after being fed over 1 trillion tokens (tokens are the little pieces text is split into for training).
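To see why the inference budget dominates, a quick back-of-the-envelope helps. Using the common rule of thumb that a decoder-only model spends roughly 2 × (parameter count) floating-point operations per generated token (a standard approximation, not a figure from the paper), a 13B model is about 13× cheaper to serve per token than a 175B one, no matter how long it was trained:

```python
# Back-of-the-envelope serving cost, using the rough rule of thumb that a
# decoder-only Transformer spends about 2 * N FLOPs per generated token
# (N = parameter count). The constant is an approximation, not a LLaMA figure.

def flops_per_token(n_params: float) -> float:
    return 2 * n_params

llama_13b = flops_per_token(13e9)    # ~2.6e10 FLOPs per token
gpt3_175b = flops_per_token(175e9)   # ~3.5e11 FLOPs per token

print(f"Serving-cost ratio: {gpt3_175b / llama_13b:.1f}x")  # ~13.5x cheaper per token
```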

What they built

LLaMA is a family of four “foundation” language models—7B, 13B, 33B, and 65B parameters—trained on roughly 1.0–1.4 trillion tokens drawn only from public sources: filtered CommonCrawl and C4 web data, Wikipedia in 20 languages, open‑licensed GitHub code, public‑domain and community book corpora, arXiv LaTeX, and StackExchange Q&A. No private book dumps, no mystery social‑media troves.

Under the hood, they lean on a modern GPT‑style Transformer with three quiet but potent tweaks (the first two are sketched in code after the list):

  • RMSNorm “pre‑norm” layers for stability,
  • SwiGLU activations (a smoother alternative to ReLU) sized for efficiency,
  • Rotary positional embeddings (RoPE) instead of absolute positions to better track word order.
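For the curious, here is roughly what the first two tweaks look like. This is a minimal PyTorch sketch in the spirit of common LLaMA‑style implementations, not Meta's released code; the hidden‑layer sizing is an illustrative assumption, and RoPE is left out for brevity.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Pre-norm layer that rescales by the root-mean-square of the activations
    (no mean subtraction, unlike LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """Gated feed-forward block: silu(x W1) * (x W3), projected back with W2."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate branch
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value branch
        self.w2 = nn.Linear(hidden, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(nn.functional.silu(self.w1(x)) * self.w3(x))

x = torch.randn(2, 16, 512)                       # (batch, sequence, model dim)
y = SwiGLU(512, hidden=2 * 4 * 512 // 3)(RMSNorm(512)(x))
print(y.shape)                                    # torch.Size([2, 16, 512])
```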

They also rewired training for speed: a memory‑lean attention kernel that skips masked computations, hand‑written backward passes that keep expensive activations around instead of recomputing them, and careful parallelism. Result: the 65B model chewed through about 1.4T tokens in ~21 days on 2,048 A100 GPUs—an industrial‑scale sprint.
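Those headline numbers imply a per‑GPU throughput you can back out yourself (a rough estimate that ignores warm‑up, restarts, and other overheads):

```python
# Sanity-check the training throughput implied by the figures above
# (1.4T tokens, 2,048 GPUs, ~21 days).
tokens = 1.4e12
gpus = 2048
seconds = 21 * 24 * 3600

per_gpu_rate = tokens / (gpus * seconds)
print(f"{per_gpu_rate:.0f} tokens/sec/GPU")  # roughly 380 tokens per second per GPU
```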

Did it work?

On a wide benchmark gauntlet, the answer is yes—especially where inference efficiency matters.

  • Common sense and reasoning: The 13B model beats GPT‑3 on most zero‑shot tasks. The 65B model rivals or tops giants like Chinchilla‑70B and PaLM‑540B on HellaSwag and ARC‑Challenge.
  • Closed‑book question answering: On NaturalQuestions and TriviaQA, the 65B model often leads across zero‑shot to 64‑shot settings; the 13B is competitive with much larger systems.
  • Reading comprehension: The 65B hits GPT‑3‑class scores and tracks PaLM‑540B on RACE.
  • Code generation: On HumanEval and MBPP, the 65B model matches or beats general‑purpose models far larger than itself, and trails only code‑specialized variants.

One area it lags: broad academic knowledge (MMLU). Here the 65B averages 63.4% accuracy, a few points behind Chinchilla and PaLM. The likely reason is data diet: LLaMA uses far fewer books and academic texts than those models. But a small post‑training step—light instruction finetuning—lifts LLaMA‑65B to 68.9% on MMLU, nearly closing the gap.

Open data, open release

A bold claim stands up: state‑of‑the‑art performance is possible using only publicly available data. That matters for reproducibility, scrutiny, and access. The team has released the models to the research community, and the 13B variant can run on a single modern GPU, putting serious capability within reach of labs, universities, and startups that can’t field hundred‑billion‑parameter behemoths.
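How do 13 billion parameters fit on one card? At 16‑bit precision the weights alone come to roughly 26 GB, a back‑of‑the‑envelope figure that ignores activations and the attention cache:

```python
# Rough memory footprint of the 13B model's weights alone, assuming 16-bit
# (fp16/bf16) storage; activations and the KV cache add more on top.
params = 13e9
bytes_per_param = 2          # fp16/bf16

weight_gb = params * bytes_per_param / 1e9
print(f"~{weight_gb:.0f} GB of weights")  # ~26 GB, within reach of a single high-memory GPU
```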

The fine print: bias, truth, and carbon

Scaling has trade‑offs. On a toxicity stress test (RealToxicityPrompts), toxicity tends to rise with model size—even when prompts are phrased “politely.” A gender coreference test (WinoGender) shows patterns consistent with occupational gender stereotypes, and a broader bias audit (CrowS‑Pairs) finds LLaMA competitive with GPT‑3 but still biased in categories like religion and age. On TruthfulQA, the 65B model answers correctly more often than GPT‑3, yet absolute truthfulness remains modest—hallucinations are very much a thing.

Training costs energy. Using a standardized accounting (same datacenter assumptions), the team estimates about 173 tCO2e to train the 65B model, with the development phase totaling roughly 1,015 tCO2e. Releasing strong, efficient models can amortize that footprint by reducing the need for duplicative training and by shifting use toward smaller, faster models.
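The 173 tCO2e figure can be roughly reproduced from the training numbers above using commonly assumed accounting constants (400 W per A100, a datacenter PUE of 1.1, and a grid intensity of 0.385 kg CO2e per kWh); treat those constants as assumptions rather than values quoted here:

```python
# Reproduce the ~173 tCO2e estimate from the article's own training numbers,
# under assumed standardized constants: 400 W per A100, PUE of 1.1, and
# 0.385 kg CO2e per kWh of electricity.
gpu_hours = 2048 * 21 * 24           # ~1.03M GPU-hours for the 65B run
watts_per_gpu = 400
pue = 1.1
kg_co2e_per_kwh = 0.385

energy_mwh = gpu_hours * watts_per_gpu * pue / 1e6
t_co2e = energy_mwh * kg_co2e_per_kwh   # 1 MWh = 1,000 kWh; kg -> tonnes cancels out
print(f"{energy_mwh:.0f} MWh -> ~{t_co2e:.0f} tCO2e")  # ~454 MWh -> ~175 tCO2e
```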

Why this matters

LLaMA reframes the scaling race. It shows that careful engineering and lots of clean, public text can make medium‑size models punch far above their weight—and that deployment speed and cost may beat raw parameter count in the real world. The 13B model’s ability to beat GPT‑3 while fitting on a single GPU is the headline, but the deeper story is a blueprint: train longer on open data, optimize for inference, add a dash of instruction tuning, and you get a capable, inspectable model that more people can actually use.