Machines that see and speak still stumble on something toddlers do

Modern AI can label photos and play Go, yet it struggles to combine what it knows in new ways. A team spanning DeepMind, Google Brain, MIT, and Edinburgh argues the fix is simple to name and hard to ignore: build models that think in terms of things and the relations between them.

The missing ingredient: structure

Humans make “infinite use of finite means”: a few words form endless sentences; a handful of skills compose into new plans. Deep learning’s recent triumphs came from minimizing assumptions and maximizing data and compute. That design choice helped—but it also sidestepped compositionality. The authors name what’s required: a strong “relational inductive bias,” meaning architectural assumptions that encourage a model to represent entities (objects, people, symbols), relations (heavier-than, connected-to, next-to), and rules (shared functions that operate on those entities and relations).

An inductive bias is a built-in preference that guides learning even before data arrives. Convolutions, for example, assume nearby pixels matter and that patterns repeat across an image. The claim here is starker: many real problems aren’t grids or sequences. They’re sets of things connected by irregular webs of interaction.
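
As a toy illustration of that bias (a minimal NumPy sketch; the signal and filter are arbitrary examples, not taken from the paper), reusing one small filter at every position is the “patterns repeat” assumption in miniature:

```python
import numpy as np

# Toy illustration of the convolutional bias: one small filter is reused at
# every position, so the same local pattern gets the same response wherever
# it appears. The signal and filter are arbitrary examples.

signal = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0])
kernel = np.array([1.0, 2.0, 1.0])               # the same 3-tap filter everywhere

response = np.convolve(signal, kernel, mode="valid")
print(response)  # the bump at positions 1-3 and the bump at positions 6-8 score identically
```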

From sets to graphs, and why order doesn’t matter

Sometimes only the collection matters. The center of mass of a solar system doesn’t care how you list the planets. Models for sets respect this by applying the same subnetwork to each item, then pooling with a symmetric operator (sum/mean/max). But predicting each planet’s next position is different: every planet tugs on every other. That calls for explicit pairwise interactions—and, often, sparsity. Real systems sit between “no relations” and “everyone talks to everyone.” Graphs capture this middle ground: nodes for entities, edges for relations, and an optional global context for shared conditions (say, gravity).
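
To make the order-invariance concrete, here is a minimal sketch of that set recipe, with plain arithmetic standing in for a learned per-item network (the names per_item and center_of_mass are illustrative, not from the paper):

```python
import numpy as np

# Minimal sketch of "same subnetwork per item, then symmetric pooling",
# using the center-of-mass example. Plain arithmetic stands in for a
# learned per-item network.

def per_item(mass, position):
    return mass * position                     # shared computation for every element

def center_of_mass(masses, positions):
    pooled = np.sum([per_item(m, p) for m, p in zip(masses, positions)], axis=0)
    return pooled / np.sum(masses)             # symmetric pooling: a sum, so order-free

masses = np.array([1.0, 3.0, 0.5])
positions = np.array([[0.0, 0.0], [2.0, 1.0], [-1.0, 4.0]])

perm = [2, 0, 1]                               # relist the "planets" in another order
assert np.allclose(center_of_mass(masses, positions),
                   center_of_mass(masses[perm], positions[perm]))
```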

What a graph network actually does

A graph network (GN) is a graph-to-graph module with three shared update steps:

  • Edge update: compute each relation’s new state from the sender node, receiver node, current edge, and global context.
  • Node update: aggregate each node’s updated incoming edges (with a permutation-invariant reducer like sum), then compute the node’s new state from that aggregate, the current node, and the global context.
  • Global update: pool information from all edges and all nodes to update the global attribute.

Think of it like a neighborhood: each street (edge) updates based on its two houses (nodes) and the town weather (global). Each house then updates based on the activity on its incoming streets. Finally, city hall updates its bulletin after summarizing what happened everywhere. Crucially, the same functions are reused across all edges and all nodes, and every aggregation ignores order. That’s where the generalization comes from.
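
A compact sketch of those three steps, in plain NumPy, looks like the following; phi_edge, phi_node, and phi_global are toy stand-ins for the learned update networks, and the triangle graph is an arbitrary example rather than one from the paper:

```python
import numpy as np

# Minimal sketch of one graph-network (GN) block. Plain arithmetic stands in
# for the learned update networks; the names are illustrative, not a library API.

def phi_edge(edge, sender, receiver, u):       # edge update
    return edge + receiver - sender + u

def phi_node(agg_edges, node, u):              # node update
    return node + agg_edges + u

def phi_global(agg_edges, agg_nodes, u):       # global update
    return u + 0.1 * (agg_edges + agg_nodes)

def gn_block(nodes, edges, senders, receivers, u):
    # 1. Edge update: each edge sees its sender node, receiver node, itself, and the global.
    new_edges = np.array([
        phi_edge(edges[k], nodes[senders[k]], nodes[receivers[k]], u)
        for k in range(len(edges))
    ])
    # 2. Node update: sum each node's updated incoming edges (order-invariant),
    #    then update the node from that aggregate and the global.
    new_nodes = np.array([
        phi_node(new_edges[[k for k in range(len(edges)) if receivers[k] == i]].sum(axis=0),
                 nodes[i], u)
        for i in range(len(nodes))
    ])
    # 3. Global update: pool all edges and all nodes, then update the global attribute.
    return new_nodes, new_edges, phi_global(new_edges.sum(axis=0), new_nodes.sum(axis=0), u)

# Toy triangle graph: 2-D node features, 2-D edge features, 2-D global ("weather").
nodes = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
edges = np.array([[0.5, 0.5], [1.0, 1.0], [0.0, 0.0]])
senders, receivers = [0, 1, 2], [1, 2, 0]
u = np.array([0.1, 0.1])

nodes, edges, u = gn_block(nodes, edges, senders, receivers, u)
```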

Why this helps models scale their thinking

Because the same “rules” apply at every node and edge, a GN trained on small systems can often handle bigger ones, different shapes, or longer chains of reasoning. In physics simulators, GN-based models trained on a handful of objects roll out thousands of steps and transfer to systems with more or fewer parts. In routing, planning, and SAT-style problems, GN-based solvers and policies generalize to graphs much larger than those seen in training. Even Transformers fit this picture: self-attention is a GN operating on a fully connected graph of tokens, with attention weights acting as learned edge strengths.
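
To make that reading concrete, here is single-head self-attention written so the softmax weights act as edge strengths on a fully connected token graph; the random weight matrices are placeholders, not a trained model:

```python
import numpy as np

# Single-head self-attention read as a GN on a fully connected token graph:
# every ordered pair of tokens is an edge, the softmax weight is that edge's
# strength, and each token aggregates its incoming edges' values.

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # one score per (receiver, sender) pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each token's incoming edges
    return weights @ V                               # aggregation: weighted sum of sender values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                          # 4 tokens with 8-dimensional features
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
updated_tokens = self_attention(X, Wq, Wk, Wv)       # shape (4, 8): one new state per token
```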

A practical recipe emerges: encode raw data as a graph, run a GN “core” for M rounds of message passing (each round extends the reasoning radius by one hop), then decode what you need—per edge, per node, or for the whole graph. This encode–process–decode pattern works for shortest paths, sorting, molecules, and multi-agent systems with the same basic block.
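
Shortest paths show why each round buys one hop. The sketch below hard-codes the GN “core” as min/plus updates instead of learning them, so it illustrates the encode–process–decode loop rather than reproducing the paper’s learned version; the graph is an arbitrary example:

```python
import numpy as np

# Hand-coded stand-in for encode-process-decode on shortest paths. A real GN
# would learn its edge and node updates; fixed min/plus rules play that role
# here, which makes the one-hop-per-round behaviour easy to see.

edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.0), (0, 3, 10.0)]  # (sender, receiver, weight)
num_nodes, source = 4, 0

# Encode: each node's state is its best known distance from the source.
dist = np.full(num_nodes, np.inf)
dist[source] = 0.0

# Process: M rounds of message passing; round k can reach nodes k hops away.
M = 3
for _ in range(M):
    incoming = {}                                   # per-receiver aggregation with min
    for s, r, w in edges:
        msg = dist[s] + w                           # message along edge s -> r
        incoming[r] = min(incoming.get(r, np.inf), msg)
    for r, best in incoming.items():
        dist[r] = min(dist[r], best)                # node update

# Decode: read off a per-node answer.
print(dist)  # [0. 1. 3. 4.] -- node 3 is best reached via the 3-hop path, not the direct edge
```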

Limits, and the open frontier

GNs aren’t magic. Message passing has known blind spots: some non-isomorphic graphs look identical to 1-hop schemes. Over-smoothing and over-squashing can bottleneck long-range signals. And often the graph itself isn’t given: for images and text, a common move is to start fully connected and let attention learn which edges matter, but that’s dense and may not align with true objects. Learning sparse, adaptive structure—adding edges on contact, splitting a node when an object fractures—remains an active pursuit.

That said, graph-shaped thinking tends to be more interpretable. Nodes and edges often align with human-understandable parts and interactions. Watching messages flow can reveal what the model believes “matters,” a useful property in science and engineering domains.

Why now

For a decade, cheap data and compute rewarded flexible, structure-agnostic models. The next leap may come from models that know what to pay attention to before they look. Graph networks offer a plain, reusable building block for that: explicit entities, explicit relations, and shared rules that can recombine—much like how people learn. If AI is to make “infinite use of finite means,” it will have to think in graphs.