An open assistant that knows when to say no

Ask it to write a scam email and it refuses: politely, with reasons, and with better options on offer. That’s the surprising promise of Llama 2‑Chat, a free-to-use conversational AI from Meta that aims to match the polish of big-name chatbots without locking its methods behind closed doors.

How did an open model learn to say no—and still be useful?

What Meta built

Meta released two families: the base Llama 2 models (7B–70B parameters) and the fine‑tuned Llama 2‑Chat assistants. The base models were rebuilt on a “new mix” of public data totaling about 2 trillion tokens, doubling the context window to 4,096 tokens and adopting grouped‑query attention so the larger models run faster at inference. Training used both Nvidia InfiniBand and a cheaper Ethernet‑based interconnect; strikingly, the latter scaled almost as well up to 2,000 GPUs. The team estimates 539 tCO2e for pretraining and says Meta offset it.
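
Grouped‑query attention is easy to picture in code: many query heads share a smaller set of key/value heads, which shrinks the memory the model carries around at inference. Here is a minimal PyTorch sketch of the idea; the shapes and names are illustrative, not Meta’s implementation.

```python
import torch

def grouped_query_attention(q, k, v):
    """Minimal grouped-query attention: several query heads share each
    key/value head, so the KV cache is smaller at inference time.
    Illustrative shapes: q is (batch, n_q_heads, seq, head_dim);
    k and v are (batch, n_kv_heads, seq, head_dim)."""
    group = q.shape[1] // k.shape[1]          # query heads per KV head
    k = k.repeat_interleave(group, dim=1)     # share each KV head across its group
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: 8 query heads sharing 2 key/value heads.
q = torch.randn(1, 8, 5, 16)
k = torch.randn(1, 2, 5, 16)
v = torch.randn(1, 2, 5, 16)
out = grouped_query_attention(q, k, v)  # shape (1, 8, 5, 16)
```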

But the base models are only the foundation. The assistant comes from a carefully staged recipe: a small dose of supervised teaching followed by a very large dose of human feedback.

Quality over quantity

Instead of drowning the model in millions of noisy instructions, the team capped supervised fine‑tuning at just 27,540 high‑quality examples. That was enough to establish a solid conversational baseline. Then they shifted the human effort to where it counts: preferences.

Over months, annotators wrote prompts and compared pairs of model replies, choosing the better one and rating how clear the preference was. Those comparisons, more than a million of them, trained two separate “reward models”: one for helpfulness, one for safety. Think of a reward model as a casting director that scores every take in a batch of readings and picks the best one.
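
In code, training a reward model on those comparisons comes down to a pairwise ranking loss: the preferred reply should score higher than the rejected one, by a wider gap when annotators felt strongly. A minimal sketch, with the reward model itself left out and the margin values purely illustrative:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected, margin=0.0):
    """Pairwise ranking loss for a reward model trained on human
    comparisons: push the preferred reply's score above the rejected
    one. The optional margin, in the spirit of what Meta describes,
    asks for a bigger gap when the human preference was clearer."""
    return -F.logsigmoid(reward_chosen - reward_rejected - margin).mean()

# Toy usage: the reward model (not shown) has scored three reply pairs.
chosen   = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, 1.1])
loss = preference_loss(chosen, rejected, margin=0.5)
```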

Teaching judgment (and keeping it)

With reward models in hand, Llama 2‑Chat learned in two complementary ways:

  • Rejection sampling fine‑tuning: sample many answers per prompt, score them with the reward model, and fine‑tune on the best ones (sketched after this list). This wipes out the worst behaviors quickly; Meta’s plots show the entire quality distribution shifting right as iterations progress.
  • PPO (a reinforcement learning algorithm): nudge the model toward higher‑scoring answers while penalizing it for drifting too far from its supervised baseline.
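
The first of these is simple enough to sketch. The model.generate and reward_model.score calls below are assumed interfaces, not Meta’s actual API; the loop just illustrates best‑of‑K selection followed by ordinary supervised fine‑tuning on the winners.

```python
def rejection_sampling_batch(model, reward_model, prompts, k=8):
    """Illustrative best-of-K selection: for each prompt, sample K
    candidate replies, score them with the reward model, and keep the
    highest-scoring one. The kept (prompt, reply) pairs are then used
    as regular supervised fine-tuning data."""
    kept = []
    for prompt in prompts:
        candidates = [model.generate(prompt) for _ in range(k)]
        scores = [reward_model.score(prompt, c) for c in candidates]
        best = candidates[scores.index(max(scores))]
        kept.append((prompt, best))
    return kept  # fine-tune on these like any supervised examples
```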

A practical detail matters: the team routes between the safety and helpfulness rewards. If a prompt is flagged as risky, or the safety score falls below a threshold, training optimizes against the safety reward; otherwise it optimizes for being directly helpful.
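
That routing amounts to a piecewise reward. A sketch of the logic, with the function signature and the exact threshold value treated as illustrative:

```python
def combined_reward(safety_score, helpfulness_score,
                    prompt_flagged_risky, safety_threshold=0.15):
    """Piecewise reward routing as described above: fall back to the
    safety reward when the prompt is flagged as risky or the safety
    score dips below a threshold; otherwise optimize for helpfulness.
    The 0.15 default is illustrative, not a quoted constant."""
    if prompt_flagged_risky or safety_score < safety_threshold:
        return safety_score
    return helpfulness_score
```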

A small but clever trick called Ghost Attention helps the model remember “system” instructions (like “reply in French” or “act as Oscar Wilde”) over many turns. To build training dialogues, the instruction is quietly attached to every user turn; before fine‑tuning, it is stripped from all but the first turn, so the model learns to carry that early instruction forward on its own. The result: consistent behavior across 20+ turns.
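
A toy version of that data construction, skipping the loss masking Meta applies to earlier turns; the structure is illustrative rather than the exact pipeline:

```python
def ghost_attention_dialogue(system_instruction, user_turns):
    """Toy Ghost Attention data trick: append the system instruction to
    every user message when sampling replies (so the replies respect
    it), then keep the instruction only in the first turn of the copy
    used for fine-tuning, so the model must remember it on its own."""
    # Step 1: augment every user turn before sampling assistant replies.
    augmented = [f"{system_instruction}\n{msg}" for msg in user_turns]

    # Step 2: for the fine-tuning copy, keep the instruction only up front.
    training_turns = [augmented[0]] + user_turns[1:]
    return augmented, training_turns

sampled, training = ghost_attention_dialogue(
    "Always reply in French.",
    ["Hi there!", "What's the weather like?", "Tell me a joke."],
)
```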

Safety as a feature, not an afterthought

Safety work starts at the data level. The pretraining mix is mostly English, with measurable demographic skews (for example, “he” appears in more documents than “she”). Toxic content exists in small amounts (~0.2% of documents) by design: scrubbing too hard can blunt downstream safety training and erase minority voices.

The real safety gains come later. Annotators craft adversarial prompts across categories like criminal activity, hateful content, and unqualified medical or financial advice. The model practices refusing clearly, explaining why, and offering better paths. Context distillation goes further: a safety pre‑prompt is attached during training, the model’s safer answers are collected, and the model is then fine‑tuned on those answers with the pre‑prompt removed, so the behavior sticks in deployment without constant disclaimers. The safety reward model filters these distilled answers, keeping them only when they actually score safer, so the system doesn’t slide into vague refusals.
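
Roughly, that distillation loop looks like the sketch below, where model.generate and safety_reward_model.score are assumed interfaces and the pre‑prompt text is made up:

```python
def context_distillation_examples(model, safety_reward_model, prompts,
                                  preprompt="You are a safe and responsible assistant."):
    """Rough sketch of safety context distillation: generate an answer
    with a safety pre-prompt attached, then keep the (plain prompt,
    safer answer) pair for fine-tuning only if the safety reward model
    scores it higher than the answer generated without the pre-prompt."""
    distilled = []
    for prompt in prompts:
        plain_answer = model.generate(prompt)
        safe_answer = model.generate(f"{preprompt}\n{prompt}")
        # Filter with the reward model so distillation never makes things worse.
        if (safety_reward_model.score(prompt, safe_answer)
                > safety_reward_model.score(prompt, plain_answer)):
            distilled.append((prompt, safe_answer))  # stored without the pre-prompt
    return distilled
```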

Meta also ran extensive red‑teaming with 350+ experts. A simple metric—violations per person-hour—dropped from 1.8 to 0.45 across rounds, with roughly 90% of previously found failures fixed in the next model.

What the numbers say

  • Against open chat models, Llama 2‑Chat wins by a wide margin in human helpfulness tests across single‑ and multi‑turn prompts.
  • Versus ChatGPT (gpt‑3.5‑turbo‑0301), the 70B model wins 36% and ties 31.5% on Meta’s prompt set. It’s not ahead overall, but it’s in the conversation.
  • On a toxicity benchmark, fine‑tuned models produce effectively 0% toxic generations across sizes—lower than all baselines tested.
  • Truthfulness (TruthfulQA) jumps substantially after fine‑tuning, though state‑of‑the‑art closed models still lead.
  • A reality check: stronger safety can raise “false refusals” on edge‑case benign prompts with sensitive words (think “Christmas crack” as a dessert). On general helpfulness tests those refusals were rare (~0.05%), but higher on a deliberately borderline set. That’s a trade‑off the team tunes with targeted data and reward gating.

Why it matters now

Open assistants used to feel like prototypes while productized models held the polish. Llama 2‑Chat narrows that gap with a playbook others can reuse: small, clean supervision; large‑scale preference data; separate reward models for helpfulness and safety; a mix of best‑of‑K selection and PPO; and training hacks like Ghost Attention that improve multi‑turn control.

It’s not perfect—English dominates, safety can over‑refuse in tricky corners, and the model still trails top closed systems on some benchmarks. But an open model that can explain why it won’t help you build a bomb, then point you somewhere useful, marks real progress. If more teams adopt and refine this recipe, the next time an AI says no, it might be for the right reasons—and you’ll be able to see how it learned to do it.