One model that reads and writes
A single neural network has learned to both read and write. Built at Microsoft Research, the Unified Language Model—UniLM—uses one Transformer to rival BERT on understanding tasks while setting new marks in text generation. The trick isn’t more layers. It’s how the model looks at text: by changing its “field of view” with attention masks.
The bet: masks can teach multiple minds
Modern language AI has split personalities. GPT speaks well but reads narrowly, scanning only left to right. BERT reads with full context but struggles to talk, because its bidirectional “both-sides” view doesn’t map cleanly to step-by-step generation. UniLM asks: what if one brain could do both by swapping the way it pays attention?
That swap is a mask. In a Transformer, self‑attention decides which words can “see” which other words. UniLM uses a single stack of Transformer layers and toggles three masks (sketched in code after the list):
- Unidirectional: each token can attend only to earlier tokens (left-to-right), or only to later ones (right-to-left). That’s the voice of a storyteller.
- Bidirectional: every token attends to every other token. That’s the careful reader.
- Sequence‑to‑sequence: tokens in a source passage attend bidirectionally within the source; tokens in the target (the output to generate) attend to all source tokens and only earlier target tokens. That’s an encoder‑decoder—without a separate decoder stack.
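To make the three “window blinds” concrete, here is a minimal sketch of how such masks can be built in NumPy. The function names, the toy sizes, and the 1/0 convention (1 = “may attend”) are illustrative assumptions, not UniLM’s actual implementation.

```python
import numpy as np

# mask[i, j] == 1 means token i may attend to token j; 0 means it may not.

def unidirectional_mask(n):
    """Left-to-right: each token sees itself and earlier tokens only.
    (Right-to-left is simply the transpose, i.e. np.triu.)"""
    return np.tril(np.ones((n, n), dtype=np.int32))

def bidirectional_mask(n):
    """Every token sees every other token."""
    return np.ones((n, n), dtype=np.int32)

def seq2seq_mask(src_len, tgt_len):
    """Source tokens see the whole source (and nothing else); target tokens
    see the full source plus themselves and earlier target tokens."""
    n = src_len + tgt_len
    mask = np.zeros((n, n), dtype=np.int32)
    mask[:, :src_len] = 1                              # every row may attend to the source
    mask[src_len:, src_len:] = np.tril(                # causal view within the target
        np.ones((tgt_len, tgt_len), dtype=np.int32))
    return mask                                        # source rows never see target columns

print(seq2seq_mask(3, 2))
```

The right‑to‑left variant is just the transpose of the left‑to‑right mask, and the seq2seq mask is what lets a single stack behave like an encoder‑decoder.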
Training unifies these views with a classic fill‑in‑the‑blank, or “cloze,” objective. Random tokens are replaced with "[MASK]", and the model learns to recover them, using only the context permitted by the active mask. Even "[EOS]"—the “end of sentence” token—gets masked sometimes, so the model learns when to stop generating. Segment embeddings double as “mode tags,” telling the network which role it’s playing.
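As a rough illustration of the cloze objective, the toy function below corrupts a token sequence and records what the model should recover. It assumes a whitespace tokenizer and a 15% masking rate for simplicity; UniLM’s real pre‑processing (WordPiece tokens, segment “mode” embeddings, occasionally masking short spans) is more involved.

```python
import random

MASK, EOS = "[MASK]", "[EOS]"

def cloze_mask(tokens, rate=0.15, seed=0):
    """Replace roughly `rate` of the tokens (occasionally even [EOS]) with [MASK].
    Returns the corrupted input plus the labels to recover at masked positions."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < rate:
            corrupted.append(MASK)
            labels.append(tok)      # the model must predict the original token here
        else:
            corrupted.append(tok)
            labels.append(None)     # no loss at unmasked positions
    return corrupted, labels

tokens = "the cat sat on the mat".split() + [EOS]
print(cloze_mask(tokens))
```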
It’s one set of parameters, three ways of seeing. No extra heads or auxiliary networks. Just different “window blinds” on the same room.
Proof on the scoreboard
UniLM starts with a BERT‑Large initialization (24 layers, 340M parameters), then continues pre‑training on Wikipedia and BookCorpus by mixing objectives: one‑third bidirectional, one‑third sequence‑to‑sequence, and the rest split between left‑to‑right and right‑to‑left. That shared regimen pays off across both understanding (NLU) and generation (NLG).
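That mixing schedule can be pictured as a per‑batch coin flip. The proportions in this sketch are the paper’s stated mix; the sampling helper itself is a hypothetical illustration of how a batch’s objective might be chosen.

```python
import random
from collections import Counter

def sample_objective(rng=random):
    """Pick which attention mask (and therefore which objective) a batch uses."""
    r = rng.random()
    if r < 1/3:
        return "bidirectional"
    if r < 2/3:
        return "seq2seq"
    if r < 5/6:
        return "left-to-right"
    return "right-to-left"

# Rough check of the mix: ~1/3, ~1/3, ~1/6, ~1/6.
print(Counter(sample_objective() for _ in range(100_000)))
```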
- Abstractive summarization (CNN/DailyMail): new state of the art, ROUGE‑L 40.51, surpassing strong extractive and abstractive baselines.
- Headline generation (Gigaword): best results on the full 3.8M‑example training set, and a striking jump in a 10K‑example low‑resource setting, where ROUGE‑L improves by 7.08 points over MASS, the prior pre‑trained seq2seq model.
- Question generation (SQuAD): best BLEU‑4 (22.12) and ROUGE‑L (51.07) across two data splits.
- Generative QA (CoQA): F1 82.5, a huge leap over earlier generative models and close to extractive systems that can only copy spans.
- Dialog response generation (DSTC7): wins every automatic metric; on NIST‑4, even edges past the human reference score (2.67 vs 2.65).
- Understanding benchmarks: on GLUE, UniLM matches or slightly beats BERT‑Large overall. On extractive QA (SQuAD 2.0, CoQA), it improves F1 over BERT‑Large.
There’s a bonus flywheel: using UniLM to generate millions of synthetic questions augments SQuAD 2.0 training and boosts an extractive QA model from 80.5/83.4 to 84.7/87.6 (EM/F1). Adding a small masked‑LM loss during this fine‑tuning guards against “forgetting” and delivers part of that gain.
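The anti‑forgetting trick amounts to adding a weighted cloze term to the task loss during fine‑tuning. In this sketch the 0.1 weight and the function itself are illustrative assumptions, not the paper’s reported configuration.

```python
def combined_loss(task_loss: float, masked_lm_loss: float,
                  mlm_weight: float = 0.1) -> float:
    """Task loss (e.g. extractive-QA span loss) plus a weighted cloze term."""
    return task_loss + mlm_weight * masked_lm_loss

print(combined_loss(task_loss=1.80, masked_lm_loss=2.40))  # 1.8 + 0.1 * 2.4 = 2.04
```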
Why this matters
- One backbone, many jobs: teams don’t need separate pre‑trained models for reading and writing. Masking turns the same network into a reader, a decoder, or an encoder‑decoder at fine‑tuning time.
- Strong conditional generation “from pre‑training”: because sequence‑to‑sequence behavior is baked in, UniLM fine‑tunes quickly on tasks like summarization and question generation—and shines when labeled data is scarce.
- Simpler training recipe: the cloze framework unifies unidirectional, bidirectional, and seq2seq learning under one loss and one stack. No special cross‑attention block is required; concatenating source and target with a carefully designed mask effectively emulates it.
Of course, it’s not magic. Like other large language models, UniLM can produce fluent but fabricated details when asked to generate facts, and its input length tops out at 512 tokens. But the core idea—control a model’s behavior by controlling what it can see—scales well and slots cleanly into real systems.
The bigger picture
UniLM opens a door: if attention masks can teach one model to read and write, they can likely teach it to translate across modalities and languages, too. The authors suggest extensions to cross‑lingual tasks and multi‑task fine‑tuning. The immediate payoff is practical—fewer models to train and serve, better generation with less data—but the deeper point loops back to the opening claim: with the right “blinds” over its attention, a single Transformer can be both a careful reader and a capable writer.