January 22, 2026 · Speech · 25min
Closing the Intelligence Gap: Why the Transformer Was an Engineering Feat, Not an Architectural One
The co-inventor of the Transformer thinks it wasn’t actually an architectural breakthrough. It was an engineering one. And the real breakthroughs, the ones that will close the gap between AI and human intelligence, will require something the industry is currently allergic to: research without a plan.
The Panel
At DLD Munich 2026, Llion Jones (co-author of “Attention is All You Need,” founder of Sakana AI) and Raphaël Millière (professor of cognitive sciences and philosophy at Oxford, author of an upcoming free textbook on generative AI) sit down with moderator Ulrike Hoffmann-Burchardi (UBS Global Wealth Management) to dissect where AI actually stands versus human intelligence and what it will take to close the gap.
The conversation is compact (25 minutes) but densely packed with contrarian takes from someone who would know: the person who helped build the architecture that launched the current AI era, now arguing it wasn’t the breakthrough everyone thinks it was.
The Transformer: A Thousandfold Speed Trick
Jones’s most provocative claim: the Transformer wasn’t an architectural innovation. It was an engineering optimization.
“I no longer think that the Transformer was actually an architectural breakthrough. It was actually an engineering breakthrough.”
His argument is precise. The Transformer is built from the exact same components as the recurrent neural networks that preceded it: deep multi-layer perceptrons, residual connections, even attention mechanisms. All of these already existed. The Transformer simply rearranged them to process all words simultaneously rather than sequentially.
“The main difference between an RNN and a transformer is the fact that you can process all of the words at the same time. And if you’re training on batches of say a thousand words, that represents a thousandfold increase in processing speed.”
The team literally looked at TPU hardware and asked: how do we push data through this as fast as possible? How do we make the matrix multiplications as big as possible? It was hardware-aware design, not a conceptual leap.
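To make the parallelism point concrete, here is a minimal sketch (toy dimensions, PyTorch; my illustration, not code from the panel) of the difference. The RNN's hidden state forces the sequence through one step at a time, while self-attention updates every position with a few large matrix multiplications:

```python
import torch

T, d = 1000, 64                    # sequence length, model width
x = torch.randn(T, d)              # one sequence of T token embeddings

# RNN: each hidden state depends on the previous one, so the T steps
# cannot run in parallel. It is a chain of T small matrix multiplies.
W_h, W_x = torch.randn(d, d), torch.randn(d, d)
h = torch.zeros(d)
for t in range(T):                 # inherently sequential
    h = torch.tanh(h @ W_h + x[t] @ W_x)

# Self-attention: every position attends to every other position in one
# shot, as a few large matrix multiplies over the whole sequence. This
# is exactly the shape of work accelerators like TPUs are built for.
W_q, W_k, W_v = torch.randn(3, d, d)
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = torch.softmax(Q @ K.T / d**0.5, dim=-1)  # (T, T), all at once
out = scores @ V                   # all T positions updated in parallel
```

The loop's thousand steps must run one after another; the attention path is exactly the kind of big, hardware-saturating matmul the team was tuning for, which is the thousandfold speedup Jones describes.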
Jones goes further: most researchers would agree that RNNs, if they could be scaled as effectively as Transformers, would work just as well. The Transformer’s real contribution was unlocking the ability to scale deep learning, not inventing a new form of intelligence.
The Pelican Problem: Where AI Still Fails
Millière brings a vivid illustration of what AI cannot do. Ask an image model to generate “a pelican riding a bicycle” and it succeeds, because the pattern of an agent riding a vehicle is common in training data. Ask it to generate “a bicycle riding a pelican,” a simple conceptual reversal, and it fails completely, producing the same pelican-on-bicycle image.
A young child can do this reversal effortlessly. The child’s drawing lacks the AI’s visual detail but is conceptually correct, which is the part that actually matters.
The missing capability has a name: broad generalization, the ability to handle radically novel combinations of familiar concepts. Current AI generalizes narrowly (recombining known patterns) but breaks down on genuinely novel compositions.
The same problem shows up across domains:
Clock hands: Image models have seen millions of clock images, but most are advertisements where hands are positioned at 10:10 for aesthetic reasons. The models can’t separate this statistical pattern from the actual task of depicting a specific time. A child who has seen far fewer clocks can do this instantly.
Variant Sudoku: Jones’s team created a benchmark called Sudoku Bench, using puzzles with novel rule combinations handcrafted by expert setters. These require an “aha moment” of understanding how new rules interact. State-of-the-art LLMs still struggle badly on this benchmark, even though standard Sudoku is within their capability.
The Deeper Gaps: Continual Learning and Inverted Development
Millière identifies several structural problems that cognitive science highlights:
No continual learning. Current models are trained once and then frozen. When you talk to ChatGPT, it doesn’t fundamentally learn from the interaction. Animals, including humans, continuously learn from every interaction with the world. This remains an unsolved problem in AI (a toy sketch below contrasts the two modes).
“One of the tough nuts to crack in AI is called continual learning. That’s what we do, what animals do. And the open secret is that we don’t fully know how to do this well currently.”
Inverted developmental trajectory. Human babies first experience the world through sensory input. They grab their feet, figure out what is their body versus the external world. Language comes years later, mapped onto a rich foundation of sensorimotor understanding. AI development is the exact opposite: trained on millions of web pages of text first, with visual and interactive capabilities bolted on afterward. The trajectory is “completely lopsided” compared to biological agents.
No curriculum. Perhaps most surprisingly, AI training has no concept of curriculum. You might expect models to learn simple arithmetic first, then algebra, then calculus. Instead, everything is mixed into one pile: “predict the next word on all of that.” This works well enough at scale, but it’s another signal that something fundamental is missing.
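To picture the continual-learning gap, here is a toy contrast (hypothetical PyTorch; an `nn.Linear` stands in for a real LLM): deployment as it works today versus a naive always-learning loop, and a note on why the naive version isn't enough:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)           # stand-in for a trained LLM

# Today's deployment: weights are frozen, interactions leave no trace.
model.eval()
with torch.no_grad():
    reply = model(torch.randn(1, 16))        # answer, learn nothing

# Naive continual learning: update on every interaction. Mechanically
# this runs fine, but unconstrained online updates cause catastrophic
# forgetting (new data overwrites old skills), a big part of why the
# problem is still considered open.
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
stream = [(torch.randn(1, 16), torch.randn(1, 4)) for _ in range(3)]
for x, y in stream:                # one (input, feedback) pair per chat
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```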
AI Reasoning Is Externalized Monologue
Jones and Millière converge on a striking observation about how current AI “reasons.” When GPT-5.2 shows a “thinking” phase, it’s generating tokens, producing words the same way it would produce a response, just doing it for 10 minutes before answering.
Jones offers a vivid analogy: imagine if every time someone asked you a question, you couldn’t think in your head. You had to take a piece of paper and pen and write down in English every single thought, producing a book’s worth of text before answering. That would be a profoundly strange and inefficient way to think.
“Almost certainly everyone here can feel that there’s a way that we reason non-linguistically, conceptually, visually. But right now, state-of-the-art AI is forced to reason entirely in language.”
The bottleneck is at the level of individual words. The model reasons by outputting a word, and all the information it gets back is what word it just produced. True AI, Jones argues, would reason internally in its own “head,” in continuous latent space rather than tokenized language.
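A skeletal decoding loop makes the bottleneck visible (schematic Python; `TinyLM` is a made-up stand-in for any autoregressive model): at every step the model's continuous internal state is collapsed into a single vocabulary index, and that index is all that survives into the next step.

```python
import torch
import torch.nn as nn

VOCAB, D = 50_000, 512

class TinyLM(nn.Module):
    """Made-up stand-in for any autoregressive language model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        self.head = nn.Linear(D, VOCAB)

    def forward(self, token_ids):                # (seq_len,)
        h = self.embed(token_ids).mean(dim=0)    # rich continuous state...
        return self.head(h), h                   # ...projected to word scores

model = TinyLM()
tokens = torch.tensor([1])                       # the prompt

for _ in range(10):                              # the "thinking" phase
    logits, hidden = model(tokens)
    next_id = int(logits.argmax())               # D floats collapse to 1 int
    # `hidden` (512 continuous values) is discarded here; the only signal
    # that survives into the next step is this single vocabulary index.
    tokens = torch.cat([tokens, torch.tensor([next_id])])
```

Reasoning in latent space, in Jones's sense, would mean carrying `hidden` forward directly rather than quantizing it down to one word at every step.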
This connects to Sakana AI’s work on what they call the Continuous Thought Machine, inspired by the observation that biological brains exhibit synchronization patterns absent from standard neural networks. Jones describes it as not biologically plausible per se, but inspired by neural synchronization, and yielding “very interesting examples.”
The Freedom Problem
The conversation takes a pointed turn toward research culture. Jones pushes a message he says he brings to every public speaking opportunity: there isn’t enough research freedom right now, and the flood of investment and pressure is making it worse.
“It’s sort of odd to me that there’s so much excitement in AI and we’re not putting anything on the long bets.”
He draws on the philosophy of “Why Greatness Cannot Be Planned” (a book by Kenneth Stanley and Joel Lehman): to find truly interesting things, you have to not have a goal. You have to explore and play.
At Sakana AI, most resources go toward exploiting current technology, but a dedicated part of the company operates without a plan: smart people in a room researching what they find interesting and important. “That’s where the interesting stuff really happens.”
The pressure exists in both industry and academia, just with different currencies: investor/shareholder value in industry, citations and publications in academia. Both incentivize incremental work that builds on what already works, because that’s what ships quickly and publishes easily. Neither incentivizes the fundamental exploratory research that produces breakthroughs.
DeepSeek: Engineering, Not Breakthrough
When asked about the geopolitical dimension and whether DeepSeek V4 might deliver the next breakthrough, Jones is characteristically blunt:
“I have to be controversial again and say that’s engineering, not a breakthrough, because that is exploiting the current state-of-the-art.”
Millière agrees, noting that the DeepSeek team is talented at optimizing the existing Transformer architecture, especially under the compute constraints imposed by chip export controls. But they don’t have the luxury of investing in fundamental exploratory research: the pressure to achieve state-of-the-art performance is too strong.
His advice to Europe, which faces a similar question about how to compete: “Don’t try to keep up with the hyperscalers. Try to do something different. It’s a longer bet, but if it works, you win.”
Some Thoughts
A 25-minute panel that punches well above its weight, largely because Jones is willing to undercut the mythology of his own contribution.
- The Transformer-as-engineering-trick framing is quietly devastating. If the architecture doesn’t matter as much as the ability to parallelize, then the current race to optimize Transformers is an optimization treadmill, not a path to fundamentally more intelligent systems.
- The pelican reversal test is one of the cleanest illustrations of AI’s generalization failure. It’s simple enough for a child to solve and impossible for the most advanced models. That gap should concern anyone betting on current architectures reaching human-level intelligence.
- Jones’s research philosophy, “I don’t know, and that’s okay,” is the opposite of the industry’s current mode. Every major lab is sprinting toward AGI with concrete timelines and benchmarks. He’s arguing the real breakthroughs will come from people who aren’t trying to hit benchmarks at all.
- The distinction between “inspired by nature” and “biologically plausible” is subtle but important. You don’t need to replicate the brain; you need to notice what it does that your systems don’t, and ask whether that gap matters.