A million questions, some with no answer

A new dataset from Microsoft takes on a messy truth: many real questions don’t have clear answers. Built from more than a million genuine Bing queries, MS MARCO asks machines to read snippets of the web, write a concise answer, or say “No Answer Present.” That last option is the twist.

What MS MARCO is

MS MARCO is a large-scale benchmark for two tightly linked ideas. First, machine reading comprehension (MRC): teaching a system to read provided text and answer a question about it. Second, open‑domain question answering (QA): surfacing answers drawn from anywhere, like the web. MS MARCO straddles both. It supplies, for each question, around ten short “passages” retrieved from the web. Models must judge whether those passages are enough, and if so, synthesize a grounded answer.
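To make the setup concrete, here is a sketch of what a single record might look like in code. The field names follow the dataset’s public JSON release; the values are invented for illustration.

```python
# One MS MARCO-style record sketched as a Python dict. Field names follow
# the public JSON release; the values below are invented for illustration.
example = {
    "query_id": 1102432,                 # anonymized question id (made up here)
    "query": "barack obama age",         # raw, web-style query
    "query_type": "NUMERIC",             # automatic answer-type ("segment") label
    "passages": [                        # ~10 short snippets retrieved from the web
        {
            "is_selected": 1,            # editors flagged this passage as evidence
            "url": "https://example.com/obama-biography",
            "passage_text": "Barack Obama (born August 4, 1961) is an American politician ...",
        },
        {
            "is_selected": 0,            # retrieved, but not marked as supporting
            "url": "https://example.com/presidents",
            "passage_text": "The 44th president of the United States served two terms ...",
        },
        # ... more passages
    ],
    "answers": ["Barack Obama is 55 years old."],  # editor-written, grounded in the passages
    "wellFormedAnswers": [],             # filled only for a subset of questions
}
```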

The scale is unusual. The release includes 1,010,916 anonymized questions, 8,841,823 passages extracted from roughly 3.5 million web documents, and 1,026,758 unique answers. Crucially, these aren’t crowd‑invented trivia prompts. They’re the kinds of queries people really type or say: “barack obama age” as much as “Who is Barack Obama?”, typos and all.

How the answers were made

Human editors read each question, inspect the retrieved passages, and write an answer using only what those passages support. If the passages are insufficient or contradictory, they mark the question as unanswerable. Editors also flag which passages supported the answer with an is_selected tag. Multiple passages can be tagged, because evidence often lives in more than one place. Those tags are helpful but incomplete—the annotators weren’t required to mark every supporting passage.
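A couple of small helpers show how those annotations might be consumed, assuming records shaped like the sketch above and assuming the release marks unanswerable questions with the literal string “No Answer Present.” (an assumption about the format, not a guarantee).

```python
# Helpers for reading editor annotations, assuming records shaped like the
# dict sketched earlier. The "No Answer Present." convention is an assumption
# about how the release marks unanswerable questions.
NO_ANSWER = "No Answer Present."

def supporting_passages(record):
    """Passages the editors flagged as evidence via is_selected.

    Annotators were not required to mark every supporting passage, so an
    unflagged passage is not necessarily irrelevant.
    """
    return [p["passage_text"] for p in record["passages"] if p.get("is_selected") == 1]

def is_unanswerable(record):
    """True if the editors judged the retrieved passages insufficient."""
    answers = record.get("answers", [])
    return not answers or all(a.strip() == NO_ANSWER for a in answers)
```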

Some answers went through a second pass to become a “well‑formed answer”—a rewrite that stands alone when read aloud. For example, given the query “tablespoon in cup” and a draft answer of “16,” the rewrite becomes, “There are 16 tablespoons in a cup.” MS MARCO contains 182,669 of these rewrites for a subset of questions.
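Selecting that subset is straightforward if you assume wellFormedAnswers is simply an empty list when no rewrite exists; that is an assumption about the release format, and the check below stays defensive because some dumps encode the empty case differently.

```python
# Keep only records that carry a standalone, well-formed rewrite. Assumes
# `wellFormedAnswers` is a possibly-empty list of strings (an assumption
# about the release format).
def has_well_formed_answer(record):
    wfa = record.get("wellFormedAnswers", [])
    return isinstance(wfa, list) and len(wfa) > 0

def well_formed_subset(records):
    return [r for r in records if has_well_formed_answer(r)]
```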

Each question also receives a simple “segment” label—the expected answer type—using an automatic classifier: DESCRIPTION, NUMERIC, ENTITY, LOCATION, or PERSON. Think of this as an answer‑type hint (similar to classic TREC categories), not a grammatical cue like “what” or “where.” Because the logs include terse, web‑style queries, the two don’t always line up; “barack obama age” is tagged NUMERIC even though it lacks a “how old” phrasing.
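The paper doesn’t spell the classifier out, so the toy heuristic below is only a stand-in to make the label set concrete; the real labels come from a trained model, not keyword rules.

```python
import re

# Toy answer-type heuristic, NOT the classifier used to build MS MARCO.
# It only illustrates the five "segment" labels on web-style queries.
LABELS = ("DESCRIPTION", "NUMERIC", "ENTITY", "LOCATION", "PERSON")

def guess_segment(query: str) -> str:
    q = query.lower()
    if re.search(r"\b(how many|how much|age|cost|price|distance|year)\b", q):
        return "NUMERIC"
    if re.search(r"\bwho\b", q):
        return "PERSON"
    if re.search(r"\bwhere\b|\bcapital of\b|\blocation\b", q):
        return "LOCATION"
    if re.search(r"\bwhat is\b|\bdefine\b|\bmeaning of\b", q):
        return "DESCRIPTION"
    return "ENTITY"

print(guess_segment("barack obama age"))  # NUMERIC, even without "how old" phrasing
```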

What’s different—and harder

Most benchmarks ask models to copy a span of text from a single clean article. MS MARCO asks for something closer to what smart assistants face: short, noisy queries; web passages that may disagree; evidence scattered across multiple snippets; and the possibility that no answer is justified. The goal isn’t to find a quote but to write a short, faithful reply—and to recognize when the safest move is to abstain.

The dataset also supports a retrieval task: given a question and 1,000 candidate passages (from a BM25 search), rank them by likelihood of containing the needed evidence. Here, is_selected acts as a positive label, with the caveat that positives are incomplete.
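One common way to score such a ranking is reciprocal rank: how far down the list the first flagged passage appears, averaged over questions. Treat both the metric choice and the record shape below as illustrative assumptions rather than the official protocol.

```python
# Score ranked passage lists by the reciprocal rank of the first positive,
# using is_selected as the (incomplete) relevance label. Mean reciprocal
# rank at a cutoff k is one common choice for this kind of evaluation.
def reciprocal_rank(ranked_passages, k=10):
    for rank, passage in enumerate(ranked_passages[:k], start=1):
        if passage.get("is_selected") == 1:
            return 1.0 / rank
    return 0.0  # no flagged passage within the top k

def mean_reciprocal_rank(ranked_lists, k=10):
    """`ranked_lists` holds one ranked passage list per question."""
    scores = [reciprocal_rank(ranked, k) for ranked in ranked_lists]
    return sum(scores) / len(scores) if scores else 0.0
```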

What the benchmarks show

Simple extractive and generative baselines struggle. On a subset where long answers are needed, a vanilla sequence‑to‑sequence model trained to map questions to answers reached a ROUGE‑L score of 0.089; a memory-network‑augmented variant rose to 0.119. Even a discriminative passage‑ranking baseline scored 0.177. By contrast, a “best passage” oracle—picking whichever retrieved passage overlaps the gold answer most—reached 0.351. Copying a good snippet is easier than writing a good answer.
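ROUGE‑L, the metric behind those numbers, rewards long common subsequences between a system answer and the reference. The sketch below shows a simplified version (whitespace tokens, equal precision/recall weighting) and the “best passage” oracle built on top of it.

```python
# Simplified ROUGE-L (whitespace tokens, F-measure with beta = 1) plus the
# "best passage" oracle: pick the retrieved snippet that overlaps the gold
# answer most. Real evaluations use more careful tokenization and weighting.
def _lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = _lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def best_passage_oracle(passages, reference_answer):
    """Return the passage whose text scores highest against the gold answer."""
    return max(passages, key=lambda p: rouge_l(p["passage_text"], reference_answer))
```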

On numeric questions, attention‑based models that do well on the CNN/Daily Mail cloze dataset see their accuracy drop on MS MARCO. ReasoNet scored 58.9% on MS MARCO’s numeric subset versus 74.7% on CNN; AS Reader fell from 69.5% to 55.0%. Real queries and noisy web passages are a tougher distribution than curated news text.

When the authors raised the bar with a newer v2.1 split that includes unanswerable detection and focuses on well‑formed, standalone answers, a human “ensemble” baseline (five expert editors, best answer picked per question) reached ROUGE‑L 0.737 on the novice task and 0.630 on the intermediate task. A strong span model, BiDAF, landed at 0.150 and 0.170, respectively. Two weaknesses stood out: the model didn’t know when to say “No Answer Present,” and it couldn’t produce words absent from the passages when a fluent rewrite required them.

Why this matters now

As voice interfaces spread, the challenge isn’t only finding facts—it’s deciding when the evidence is good enough and speaking a short, clear answer grounded in what was read. MS MARCO pushes on all three fronts: retrieval, reasoning across multiple passages, and faithful generation. It also nudges evaluation forward, for example by using multiple references and phrasing‑aware metrics to judge diverse but correct answers.

There are imperfections—some web pages vanished or changed after passage extraction, and passage relevance labels are incomplete—but those reflect the real web. And that’s the point. If assistants are to earn trust, they must handle ambiguity, conflict, and silence. MS MARCO’s most provocative rule closes the loop: sometimes the smartest answer a machine can give is, simply, “No Answer Present.”