A simple equation that taught machines about meaning

Subtract Spain from Madrid, add France, and you get Paris. In 2013, a Google team showed that this kind of word arithmetic isn’t a party trick but a property of word representations that can be learned from raw text, fast and at web scale.

What they actually built

The team (Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean) trained a model called Skip‑gram to learn “distributed representations” of words. Instead of giving each word its own isolated ID (“one‑hot” vectors full of zeros), the model maps every word to a short list of numbers—a point in space—so that words used in similar contexts land near each other.

“Continuous” here comes from the model’s full name (the continuous Skip‑gram) and just means those vectors use real numbers, not ones and zeros. The training task is simple: for each word, predict the nearby words in a sliding window. That humble objective turns out to encode rich structure. In a two‑dimensional projection, countries and their capital cities line up so that the “country→capital” offset points in almost the same direction everywhere.
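
To make the sliding‑window objective concrete, here is a minimal sketch of how (center, context) training pairs can be generated. The function name and the fixed window size are illustrative (implementations often shrink the window at random for each center word; this sketch keeps it fixed), but the pairing logic is the Skip‑gram idea.

  from typing import Iterable, List, Tuple

  def skipgram_pairs(tokens: List[str], window: int = 5) -> Iterable[Tuple[str, str]]:
      """Yield (center, context) pairs from a sliding window over a token list."""
      for i, center in enumerate(tokens):
          # Look at up to `window` words on each side of the center word.
          lo = max(0, i - window)
          hi = min(len(tokens), i + window + 1)
          for j in range(lo, hi):
              if j != i:
                  yield center, tokens[j]

  # Every nearby word becomes a prediction target for the center word.
  print(list(skipgram_pairs("the cat sat on the mat".split(), window=2)))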

Two small hacks, big wins

The naive way to train this model asks it to score every word in the vocabulary for each training example. That’s painfully slow when the vocabulary is in the hundreds of thousands. The paper introduced two tricks that made the whole thing fly.

  • Negative sampling: Instead of computing a giant probability distribution, the model learns to distinguish a true (center, context) pair from a handful of fake ones. For each positive pair, it samples k “noise” words (the paper suggests 5–20 for small corpora and as few as 2–5 for large ones) from a distribution tilted toward frequent words (unigram counts raised to the 3/4 power). This small logistic‑regression task per pair captured semantics better than heavier alternatives like hierarchical softmax on the word analogy benchmark.

  • Subsampling frequent words: Ultra‑common words (“the”, “in”) dominate text but add little information. The authors randomly drop them during training using a simple frequency‑based rule (threshold around 1e‑5). That single step sped training by roughly 2–10× and improved the quality of the vectors for rare words, because the model spent more of its learning budget on informative co‑occurrences. Both rules are sketched in code after this list.
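
A rough sketch of the two rules, assuming only raw word counts are available. The function names are illustrative, but the constants mirror the values quoted above: unigram counts raised to the 3/4 power for the noise distribution, k noise words per positive pair, and a subsampling threshold t around 1e‑5 with discard probability 1 − sqrt(t / f(w)).

  import random
  from collections import Counter

  def noise_distribution(counts: Counter):
      """Unigram counts raised to the 3/4 power, normalized into sampling weights."""
      words = list(counts)
      weights = [counts[w] ** 0.75 for w in words]
      total = sum(weights)
      return words, [w / total for w in weights]

  def sample_negatives(words, weights, k=5):
      """Draw k 'noise' words to contrast with one true (center, context) pair."""
      return random.choices(words, weights=weights, k=k)

  def keep_probability(freq, t=1e-5):
      """Chance of keeping a word whose relative corpus frequency is `freq`."""
      return min(1.0, (t / freq) ** 0.5)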

Put together, these choices made training so efficient that a tuned single machine could chew through more than 100 billion words in a day.

From words to phrases

Word vectors ignore word order and stumble on idioms. “Boston” plus “Globe” doesn’t equal the newspaper. So the team mined the corpus for phrases that appear together unusually often (“New York Times”, “Toronto Maple Leafs”), then treated each as a single token and retrained.
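
The phrase mining scores adjacent word pairs by how much more often they co‑occur than their parts would predict. Below is a minimal sketch of that scoring rule, with illustrative values for the discount delta and the threshold (the paper runs several passes with a decreasing threshold to build longer phrases).

  from collections import Counter

  def find_phrases(tokens, delta=5, threshold=1e-4):
      """Flag bigrams that co-occur unusually often, scored as
      (count(a b) - delta) / (count(a) * count(b)); delta discounts rare words."""
      unigrams = Counter(tokens)
      bigrams = Counter(zip(tokens, tokens[1:]))
      phrases = set()
      for (a, b), n_ab in bigrams.items():
          score = (n_ab - delta) / (unigrams[a] * unigrams[b])
          if score > threshold:
              phrases.add((a, b))  # e.g. ("new", "york") becomes the single token "new_york"
      return phrases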

On a new phrase analogy set—think “Montreal : Montreal Canadiens :: Toronto : Toronto Maple Leafs”—the best setup at small scale used hierarchical softmax plus subsampling. Scaling up was the real payoff: with about 33 billion words, 1000‑dimensional vectors, and the whole sentence as context, the system hit 72% accuracy on phrase analogies. More data mattered; a 6‑billion‑word run reached 66%.

Two flavors of “composition”

The paper showed two ways that meanings combine in this space.

  • Relation as vector offset: Many relationships align as near‑parallel directions. That’s why vec(“Madrid”) − vec(“Spain”) + vec(“France”) lands near vec(“Paris”). The same trick works for tense, gender, and more.

  • Concept mixing by addition: Adding two vectors often yields a meaningful composite. Russian + river sits near Volga River; German + airlines lands near Lufthansa.

Why does addition work? Each vector feeds a log‑linear classifier, so it roughly encodes, on a log scale, the distribution of contexts a word appears in. Summing two vectors therefore corresponds to multiplying their context distributions, which acts like an AND: words assigned high probability by both vectors come out on top. Because “Volga River” often appears alongside both “Russian” and “river,” the sum lands near it.

A few lines capture both ideas:

Madrid - Spain + France ≈ Paris
Russian + river ≈ Volga River
German + airlines ≈ Lufthansa
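
Readers who want to reproduce these queries can do so with pretrained vectors. The sketch below assumes the gensim library and a word2vec‑format file (the filename is a placeholder, and phrase token names depend on how the file was built); most_similar performs exactly the add‑and‑subtract arithmetic above on normalized vectors.

  from gensim.models import KeyedVectors

  # Placeholder path: any word2vec-format file with phrase tokens will do.
  kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

  # Relation as offset: Madrid - Spain + France should land near Paris.
  print(kv.most_similar(positive=["Madrid", "France"], negative=["Spain"], topn=3))

  # Concept mixing by addition: Russian + river should land near a Volga River token.
  print(kv.most_similar(positive=["Russian", "river"], topn=3))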

These aren’t cherry‑picked one‑offs; they reflect a widespread linear structure that the training objective uncovers.

Why this matters

Speed and simplicity unlocked scale. Compared with earlier neural language models, Skip‑gram with negative sampling ran orders of magnitude faster, which let the team train on tens of billions of words. Quality followed: on standard analogies, negative sampling with subsampling outperformed hierarchical softmax; for phrases, the phrase‑aware training with subsampling delivered crisp neighbors like “moonwalker” for “Alan Bean.”

The model also offered a new way to think about meaning. If adding and subtracting word vectors reveals capitals, currencies, airlines, and sports teams, then a lot of linguistic regularity hides in linear directions. That insight influenced everything from search to translation and laid groundwork for later advances.

There are limits. A single vector per word mixes senses (“bank” the river vs “bank” the lender), and phrase discovery depends on what the corpus repeats. Word order beyond the window is invisible. But the core idea—learn compact vectors from raw co‑occurrence, then use simple math to explore meaning—proved durable.

The opening puzzle has an answer now: machines can “do” Paris from Madrid − Spain + France because the geometry of language, learned at scale, turns relationships into straight lines.