In 2014, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio introduced a neural translator that looks back at the source sentence while it writes the translation, and it rivaled a strong phrase-based system on English–French. By replacing a single fixed-length memory bottleneck with a learnable “attention” mechanism, their model handled long sentences that stumped earlier neural approaches.
The problem: squeezing a sentence into a suitcase
Early neural translators used an encoder–decoder: one network read the source sentence and squeezed it into a fixed-length vector; a second network decoded that vector into the target language. It worked, but only up to a point. As sentences grew, quality fell. Cramming every nuance of a 30-word sentence into one vector is like packing a library into a carry-on: something important gets left behind.
A translator that looks while writing
The team’s fix is simple to say and powerful in practice: let the decoder “look back” at the source as it writes each target word. Instead of one vector, the encoder produces a sequence of source-side snapshots—called “annotations”—one per word, each summarizing that word in the context of its neighbors. The decoder then computes a set of weights over these annotations at every step and builds a context vector as their weighted sum. Those weights, αij, form a soft alignment: not a hard choice of one source word, but a probability distribution over them.
Think of it as a moving spotlight. As the translator chooses the next word, it brightens the most relevant source positions and dims the rest, then writes using the information under the light. Because the alignment is soft (the weights sum to 1), the model can split attention across multiple source words when needed.
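In symbols (following the paper’s notation: h_j is the annotation of the j-th of the T_x source words, and α_ij is the weight the decoder assigns to it while producing the i-th target word), the context vector is exactly that weighted sum:

```latex
c_i = \sum_{j=1}^{T_x} \alpha_{ij}\, h_j ,
\qquad \alpha_{ij} \ge 0 ,
\qquad \sum_{j=1}^{T_x} \alpha_{ij} = 1 .
```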
How attention is learned
Under the hood, the encoder is a bidirectional recurrent network that reads the sentence left-to-right and right-to-left, then concatenates the two views. Each annotation thus “knows” about both preceding and following words. The decoder is a gated recurrent network that, at each step, combines three things: its previous state, the last word it generated, and the current context vector built by attention.
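A rough sketch of that encoder helps make “annotation” concrete. This is not the paper’s implementation: a plain tanh RNN cell stands in for its gated units, and all names, shapes, and weight matrices here are illustrative.

```python
import numpy as np

def rnn_step(W, U, x, h):
    """One step of a plain tanh RNN cell (a simplified stand-in for the paper's gated units)."""
    return np.tanh(W @ x + U @ h)

def encode_bidirectional(embeddings, W_f, U_f, W_b, U_b):
    """Return one annotation per source word: the forward hidden state
    concatenated with the backward hidden state at that position."""
    T, d = len(embeddings), U_f.shape[0]
    fwd, bwd = np.zeros((T, d)), np.zeros((T, d))
    h = np.zeros(d)
    for t in range(T):                      # left-to-right pass
        h = rnn_step(W_f, U_f, embeddings[t], h)
        fwd[t] = h
    h = np.zeros(d)
    for t in reversed(range(T)):            # right-to-left pass
        h = rnn_step(W_b, U_b, embeddings[t], h)
        bwd[t] = h
    return np.concatenate([fwd, bwd], axis=1)   # annotations h_1 .. h_T, each of size 2*d
```

Each row of the returned matrix is one annotation: the forward half summarizes the words up to that position, the backward half the words after it.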
Crucially, the alignment itself is learned. A small neural network scores how well each source annotation matches the decoder’s current needs; a softmax turns those scores into the αij weights. Because everything is differentiable, the system learns to align and translate jointly by maximizing the likelihood of correct translations.
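A minimal NumPy sketch of one such step follows. The weight matrices W_a, U_a and vector v_a are illustrative placeholders; the paper’s alignment model is a small feedforward network of essentially this shape, scoring each annotation against the decoder’s previous state.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())                    # numerically stable softmax
    return e / e.sum()

def attention_step(s_prev, annotations, W_a, U_a, v_a):
    """One decoding step: score every annotation against the previous decoder
    state, normalize the scores into soft-alignment weights, and return the
    context vector as the weighted sum of annotations."""
    # score e_j = v_a . tanh(W_a s_prev + U_a h_j), one per source position j
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in annotations])
    alpha = softmax(scores)                    # soft-alignment weights; they sum to 1
    context = alpha @ annotations              # weighted sum over all source annotations
    return context, alpha
```

Because the scorer, the softmax, and the weighted sum are all differentiable, gradients from the translation loss flow back through the alpha weights into both the scorer and the encoder, which is what learning to align and translate jointly means in practice.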
Proof on the page
On the WMT’14 English→French task, attention delivered big gains. Trained on about 348 million words of parallel text (with a 30k-word vocabulary and unknowns mapped to [UNK]), the attention model (“RNNsearch-50”) scored 26.75 BLEU, versus 17.82 for the fixed-vector baseline trained under the same 50-word sentence-length limit. With longer training it reached 28.45. On the subset of sentences without unknown tokens, it hit 36.15 BLEU—comparable to Moses, a strong phrase-based system, at 35.63—even though Moses also used an extra 418 million words of monolingual data as a language model.
Length robustness is the headline result. As sentences got longer, the older encoder–decoder’s scores dropped sharply. The attention model held steady, showing no deterioration even beyond 50 words. That matches intuition: the model no longer needs to memorize the entire source at once; it retrieves the right pieces at the right time.
What alignment looks like
Because the model computes αij, its “gaze” is visible. Heatmaps show mostly diagonal alignments for English–French, with smart deviations when word order differs. For example, it translated “European Economic Area” as “zone économique européenne,” first aligning “Area” with “zone,” then stepping back for the adjectives—mirroring French syntax. In another case, “the man” became “l’homme.” A hard one-to-one alignment would map “the” to “l’,” missing that the article’s form depends on the following noun; the soft alignment spreads attention over both “the” and “man,” enabling the correct elision.
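Reproducing such a heatmap is straightforward once the α weights are saved during decoding. A minimal matplotlib sketch, assuming they have been collected into an array alpha of shape (number of target words, number of source words):

```python
import matplotlib.pyplot as plt

def plot_alignment(alpha, source_words, target_words):
    """Soft-alignment heatmap: columns are source words, rows are generated target words
    (darker cells mean more attention with this colormap)."""
    fig, ax = plt.subplots()
    ax.imshow(alpha, cmap="gray_r", vmin=0.0, vmax=1.0)
    ax.set_xticks(range(len(source_words)))
    ax.set_xticklabels(source_words, rotation=90)
    ax.set_yticks(range(len(target_words)))
    ax.set_yticklabels(target_words)
    ax.set_xlabel("source")
    ax.set_ylabel("target")
    fig.tight_layout()
    plt.show()
```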
These alignments don’t just look good; they solve classic headaches for statistical machine translation, like handling phrases of different lengths without inventing artificial NULL words.
Why it matters—and what’s next
This work marked a shift from bolting neural components onto phrase-based systems to a model that translates end-to-end, learning where to look and what to write in one objective. It made neural translation interpretable, more accurate on long inputs, and competitive with traditional systems—without auxiliary language models.
There are limits. Attention adds computation that scales with source length times target length. And rare or unknown words remained a weakness in this setup, with [UNK] tokens dragging overall scores. Those gaps sparked follow-on advances—subword vocabularies, copy mechanisms, coverage modeling—that built on the same core idea.
The punchline holds: once a neural translator can focus its gaze as it writes, long sentences stop being suitcases to stuff and start being stories it can read and retell.


