February 12, 2026 · Podcast · 1h 23min
Jeff Dean on Owning the Pareto Frontier: Distillation, Energy Economics, and 10,000 Tokens Per Second
The man who rewrote Google’s search index, co-designed TPUs from scratch, and now steers Gemini as Chief AI Scientist sits down to explain what “owning the Pareto frontier” actually means in practice. The answer turns out to be less about any single breakthrough and more about a relentless, full-stack optimization philosophy: you need the biggest model to make the smallest one good.
Episode Overview
Jeff Dean joins Alessio Fanelli and Swyx on the Latent Space podcast for a wide-ranging conversation that traces his path from a 1990 neural network thesis through building Google’s infrastructure to leading the Gemini effort. The discussion is technical and concrete, moving from distillation mechanics to energy economics to hardware co-design, with Dean consistently steering away from hype toward engineering specifics. What emerges is a picture of AI development driven not by singular model breakthroughs but by systematic optimization across every layer of the stack.
From 1990 Neural Nets to Google’s Search Revolution
Dean’s career arc itself is an argument for long-term conviction. His 1990 thesis at the University of Minnesota explored parallel training of neural networks, 22 years before the deep learning revolution. He believed early on that bigger models with more data would produce better results, a conviction he says he held for 15 years before the rest of the field caught up.
At Google, he joined early (roughly employee 20) and spent his first decade building infrastructure. The pivotal insight in 2001 was moving Google’s entire search index into RAM: 60 shards times 20 replicas, 1,200 machines in all, holding one full copy of the index in memory. This enabled queries to expand from a user’s 3-4 words to around 50 terms including synonyms (restaurant/cafe/bistro), achieving “semantic softening” well before LLMs existed. He draws a direct line from those retrieval pipelines to modern transformer architectures.
“We moved the entire index into memory in 2001. That was a big deal. And the retrieval pipelines we built then already resemble modern LLM systems.”
The search index evolved through 5-6 major redesigns between 1999 and 2004, with update frequency improving from once a month to sub-minute latency. BERT was deployed in Google Search soon after publication, shifting the paradigm from exact keyword matching to semantic understanding.
The Pareto Frontier Strategy
Dean frames Google’s AI model strategy around the Pareto frontier: the curve representing the best achievable trade-off between capability and cost/latency. Owning this frontier means offering both frontier “Pro” models for maximum capability and “Flash” models for low-latency, cost-effective deployment.
The crucial insight: these aren’t separate product lines. Flash models are direct descendants of Pro models through distillation. Each generation’s frontier model becomes the teacher for the next generation’s efficient model. This creates a flywheel where pushing the frontier in either direction benefits the other.
“You need the biggest model to make the smallest one good.”
When asked whether Flash will eventually make Pro obsolete, Dean points out that user demand is non-stationary. A year ago people asked models to write a for-loop; now they request entire software packages or global renewable energy deployment reports. As capabilities improve, expectations ratchet up, keeping the frontier models valuable.
Distillation: The Engine Behind Every Flash Breakthrough
Dean traces distillation back to 2014 work with Geoffrey Hinton and Oriol Vinyals. The origin story: they trained roughly 50 specialist models on 300 million images, clustered by categories. The resulting ensemble was extraordinarily capable but completely impractical to serve. Distillation was born from the need to compress that ensemble into a deployable model.
The core technique uses the large model’s output logits (soft probability distributions over all possible tokens) as supervision signals for smaller models. This is richer than hard labels because the logits encode the teacher’s uncertainty and relational knowledge between concepts. Smaller models can make multiple passes over training data because this soft supervision provides more learning signal per example.
The evolution has been:
- Ensembles to compression: compressing multiple specialist predictions into single models
- Logits as soft supervision: using full probability distributions, not just top predictions
- Progressive distillation: cascading from largest model through intermediate sizes
- Task-specific distillation: fine-tuning the process for particular capabilities
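The soft-supervision mechanics above can be sketched in a few lines of Python. This is an illustrative toy, not Gemini’s training code: the list-based logits and the temperature value are assumptions made for clarity.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature > 1 softens the distribution, exposing the
    # teacher's relative confidence across all classes, not just
    # its top prediction.
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    # Cross-entropy of the student against the teacher's softened
    # distribution: a richer signal than a one-hot label, because
    # near-miss classes still carry probability mass.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

In practice this term is typically mixed with a standard hard-label loss; because the student gets a full distribution for every token, multiple passes over the same data remain informative.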
The result: Gemini 2 Flash outperforms Gemini 1.5 Pro on most benchmarks despite being far smaller and faster. The teacher gives the student a shortcut through capability space impossible to reach by training on raw data alone at that model size. Flash’s economics enable deployment across Gmail, YouTube, Search AI Mode, and Google’s entire product line.
Energy, Not FLOPs: The Real Bottleneck
One of the most striking parts of the conversation is Dean’s insistence that the AI community is measuring the wrong thing. The real bottleneck isn’t floating-point operations per second; it’s energy, measured in picojoules per bit.
His framing: a multiply operation costs roughly 1 picojoule, but moving data from one part of a chip to another costs about 1,000 picojoules. This 1,000x gap means data movement, not computation, dominates total energy consumption. The implications cascade through every design decision:
- Batching should be understood through an energy lens: you amortize the cost of moving weights once across many inputs. A batch of 1 spends 1,000 picojoules of data movement for every 1 picojoule of useful computation; a batch of 256 amortizes that same movement to under 4 extra picojoules per multiply
- Speculative decoding is an energy optimization: predict 8 tokens with a cheap model, accept 5-6 of them, effectively boosting the batch dimension and amortizing weight-movement cost
- Lower precision is powerful because reducing bits directly reduces picojoules per transfer. Store weights at very low precision but apply scaling vectors to restore expressiveness
- Sparse models reduce data movement by only loading the weights needed for each input
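A back-of-envelope function makes the batching arithmetic concrete. The 1 pJ multiply and 1,000 pJ move are the illustrative figures from the conversation; real numbers vary by chip and process node.

```python
def energy_per_multiply(batch_size, move_pj=1000.0, mul_pj=1.0):
    # A weight is moved once (move_pj) and then reused for
    # batch_size multiplies (mul_pj each), so the movement
    # cost is amortized across the batch.
    return mul_pj + move_pj / batch_size

for b in (1, 8, 256):
    print(b, energy_per_multiply(b))
# batch 1:   1001.0 pJ per multiply (movement dominates 1000:1)
# batch 256: ~4.9 pJ per multiply (movement mostly amortized)
```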
“A multiply costs about one picojoule. Moving data costs about a thousand picojoules. That’s the real bottleneck.”
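The speculative-decoding point can be sketched as a draft-then-verify loop. The `draft` and `target` callables below are toy stand-ins for real models; in production the verification step is a single batched forward pass of the large model, which is exactly where the energy win comes from.

```python
def speculative_step(draft, target, prefix, k=8):
    # Phase 1: the cheap draft model proposes k tokens autoregressively.
    proposal = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft(ctx)
        proposal.append(token)
        ctx.append(token)
    # Phase 2: the target model checks every proposed position (one
    # batched pass in practice). Keep the agreeing prefix, then emit
    # the target's own token at the first disagreement.
    accepted = []
    ctx = list(prefix)
    for token in proposal:
        want = target(ctx)
        if want != token:
            accepted.append(want)
            break
        accepted.append(token)
        ctx.append(token)
    return accepted  # always consistent with the target model
```

If 5-6 of 8 proposals are typically accepted, the big model’s weights are moved once per 5-6 emitted tokens instead of once per token.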
On analog computing: Dean believes it’s theoretically lower power, but digital-to-analog and analog-to-digital conversion overhead at system boundaries often negates the gains. Specialized digital hardware still has enormous efficiency headroom.
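The low-precision bullet above lends itself to a sketch as well: per-row absmax quantization, where a low-bit integer code matrix travels with a small vector of float scales. This is a simplified stand-in for production quantization schemes, not a description of Gemini’s actual weight format.

```python
def quantize_rows(weights, bits=4):
    # Store each row as small integer codes plus one float scale
    # (the "scaling vector"): fewer bits moved per weight, with the
    # scale restoring dynamic range at dequantization time.
    qmax = 2 ** (bits - 1) - 1
    scales, codes = [], []
    for row in weights:
        scale = max(abs(w) for w in row) / qmax or 1.0
        scales.append(scale)
        codes.append([round(w / scale) for w in row])
    return codes, scales

def dequantize_rows(codes, scales):
    # Approximate reconstruction: error is bounded by scale / 2.
    return [[c * s for c in row] for row, s in zip(codes, scales)]
```

At 4 bits plus a shared scale, each weight costs roughly an eighth of a float32 transfer, which translates directly into picojoules saved under the framing above.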
TPU Co-design: Predicting Workloads Years Ahead
Google’s unique position co-designing TPUs alongside ML research comes with a fundamental challenge: hardware design cycles run 2-6 years, so you must predict what ML workloads will look like years in advance.
Dean’s approach: identify durable trends rather than bet on specific architectures. Matrix multiplications will remain central (true for 10+ years). Models will get larger, requiring more memory bandwidth. Sparsity will matter (though the specific mechanism keeps evolving). Lower precision arithmetic will be increasingly useful.
The co-design loop works both ways. ML researchers’ frontier ideas influence the N+2 generation TPU for major changes, or N+1 for smaller adjustments. Conversely, chip characteristics shape model architecture, as when limited on-chip memory pushed researchers toward more memory-efficient attention mechanisms.
Google adds “speculative features” to chips: if a feature costs little chip area, they include it even without certainty it will be useful. Some bets pay off enormously when the right algorithm arrives; others remain unused. TPU’s 2D/3D mesh interconnects are particularly well-suited for long-context attention and serving sparse expert models.
Sparse Models and the Trillion-Parameter Future
Dean has long championed sparsely activated models. A trillion-parameter model doesn’t need to activate all parameters for every input. Route each input to the relevant 1-5% of parameters, and you get the knowledge capacity of a trillion-parameter model with the compute cost of a much smaller one.
The 2017 “Outrageously Large Neural Networks” paper with Noam Shazeer demonstrated 10x efficiency gains over dense models. These improvements are multiplicative: Transformers delivered 10-100x efficiency over LSTMs, sparse models added another 10x, and hardware plus data improvements stack on top. This compounding explains why 2026 models vastly outperform those from 2023.
Dean connects sparsity back to the energy argument: sparse models are fundamentally about reducing data movement. You only load the weights relevant to this particular input, exactly the optimization that matters when data movement costs 1,000x more than computation. Training stability and expert load balancing remain active research areas.
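A minimal routing sketch, with toy experts as plain functions. Real MoE layers route every token inside the network and add the load-balancing losses Dean flags as open research; those details are omitted here.

```python
import math

def route_top_k(gate_scores, k=2):
    # Pick the k highest-scoring experts. Only their weights get
    # fetched and applied, so compute and data movement scale with
    # k, not with the total number of experts.
    return sorted(range(len(gate_scores)), key=lambda i: -gate_scores[i])[:k]

def moe_forward(x, experts, gate, k=2):
    scores = gate(x)
    chosen = route_top_k(scores, k)
    # Normalize over the chosen experts' scores only, then mix.
    exps = [math.exp(scores[i]) for i in chosen]
    total = sum(exps)
    return sum((e / total) * experts[i](x) for e, i in zip(exps, chosen))
```

With, say, 64 experts and k=2, each input touches about 3% of the expert parameters, squarely in the 1-5% activation regime Dean describes at trillion-parameter scale.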
The Context Window Illusion
Dean’s most provocative claim: the next leap won’t come from bigger context windows alone. Even with million-token windows, there’s a fundamental mismatch between what fits in context and what a user might need the model to reason over. Quadratic attention hits its limit around 1 million tokens and cannot scale to a trillion.
His vision: systems that give the “illusion of attending to trillions of tokens” through intelligent retrieval and hierarchical attention. The architecture resembles Google Search itself: from trillions of tokens, use lightweight methods to identify ~30,000 candidate documents (~30 million tokens), refine to roughly 117 most relevant documents with more sophisticated models, then process those with the most capable model.
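That funnel can be sketched as a staged pipeline. Everything here is a placeholder: the scorer and `answer` callables are hypothetical, and the default cut sizes simply echo the figures above.

```python
def retrieval_funnel(corpus, query, cheap_score, strong_score, answer,
                     first_cut=30_000, second_cut=117):
    # Stage 1: lightweight scoring over the whole corpus
    # (trillions of tokens in Dean's framing).
    candidates = sorted(corpus, key=lambda d: -cheap_score(query, d))[:first_cut]
    # Stage 2: a costlier model reranks the survivors.
    shortlist = sorted(candidates, key=lambda d: -strong_score(query, d))[:second_cut]
    # Stage 3: only the shortlist enters the frontier model's context.
    return answer(query, shortlist)
```

The structure is the same cheap-to-expensive cascade Google Search has always used; what changes is that the final stage is a reasoning model rather than a ranking function.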
“What you would really want is: can I attend to the internet while I answer my question?”
This connects to personalized AI: a model that has indexed everything you’ve ever seen (every email, photo, video) and can retrieve over all of it on demand. The context window is working memory; the retrieval system is long-term memory.
The Gemini Origin Story
Dean reveals he wrote a one-page memo arguing that Google was “being stupid” by fragmenting AI resources. Google Research/Brain had LLM and multimodal efforts, while DeepMind had Chinchilla and Flamingo. The fragmentation scattered not just compute but the best people and ideas across competing teams.
This memo catalyzed the merger of Google Brain and DeepMind and the launch of Gemini. Dean named it himself, with dual meaning: two organizations coming together like twins, and the NASA Gemini program as a critical stepping stone toward Apollo. The Gemini technical report lists 10 pages of co-authors. The goal: train a single unified multimodal model that’s great at everything from the start.
On unified vs. specialized models, Dean firmly sides with unified. The IMO math competition evolution is the clearest evidence: from specialized AlphaProof + AlphaGeometry systems to a single Gemini model (with more inference budget) in just one year. He envisions “installable knowledge” as a modular architecture: 200 language modules, robotics modules, healthcare modules that can be combined on demand.
Coding Agents and the Specification Mindset
Dean sees AI coding tools as vastly improved over two years, now capable of handling complex delegated tasks. His “50 interns” mental model: each person managing 50 virtual agents organized into sub-teams, with 5 human managers maintaining high-bandwidth communication among themselves.
The critical skill shift: specification quality directly determines agent output quality. Traditional software engineering always preached clear specs, but few practitioners took the discipline seriously. Now, if your specification doesn’t cover edge cases, corner conditions, and performance requirements, the agent won’t produce what you want.
“Being able to crisply specify what it is you want is going to be really important.”
Dean emphasizes that general engineering guides (a description of 20 fault-tolerance techniques for distributed systems, for example) placed in agent context would significantly improve agents’ ability to build reliable systems. He also notes that multimodal prompting, including screenshots and diagrams, provides the highest-bandwidth communication with coding agents.
An interesting debate emerges about iterative vs. one-shot approaches: three quick rounds with a fast Flash model and human correction may outperform a single carefully crafted prompt to a more capable model. The hosts joke that good prompting is “indistinguishable from sufficiently advanced executive communication.”
10,000 Tokens Per Second
The conversation ends with Dean’s prediction that 10,000 tokens per second is both achievable and meaningful. At current rates (~100 tokens/sec for fast models), chain-of-thought reasoning is output-speed-constrained. At 10,000 tokens/sec:
- Models could run massive parallel rollouts for code generation and verification
- Chain-of-thought reasoning could explore far more paths
- A response could be 1,000 tokens of carefully reasoned code backed by 9,000 tokens of thinking
- Interactive development with AI agents would feel instantaneous
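The gap is simple division, but worth spelling out. The token counts mirror Dean’s 1,000 + 9,000 example; the rates are the rough figures quoted in the episode.

```python
def latency_seconds(total_tokens, tokens_per_second):
    # Wall-clock time to emit a response, thinking tokens included.
    return total_tokens / tokens_per_second

# 1,000 code tokens + 9,000 reasoning tokens = 10,000 total:
print(latency_seconds(10_000, 100))     # 100.0 -> over a minute and a half today
print(latency_seconds(10_000, 10_000))  # 1.0   -> effectively interactive
```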
“It may not end up with 10,000 tokens of code. A thousand tokens of code with 9,000 tokens of reasoning behind it. Which would actually be probably much better code.”
The Pareto curve keeps climbing. As Dean puts it: “Onward and outward.”
Some Thoughts
This episode stands out for its engineering pragmatism. Dean doesn’t traffic in AGI timelines or existential risk. Instead, he offers a clear-eyed view of how AI systems are actually built and improved, one optimization at a time, across every layer of the stack.
- The Pareto frontier strategy reframes the “big vs. small model” debate entirely. It’s not a trade-off; it’s a pipeline. Big models exist to make small models better
- The energy framing (picojoules, not FLOPs) quietly undermines a lot of conventional wisdom about hardware scaling. If data movement is 1,000x more expensive than compute, most benchmarks are measuring the wrong thing
- Dean’s prediction about specifications mattering more than code-writing ability is already playing out. The best AI-assisted developers aren’t the best coders; they’re the best specifiers
- The “illusion of attending to trillions of tokens” suggests the next breakthrough in model capability may not be architectural at all, but a systems engineering problem of intelligent retrieval
- The Gemini origin story (a one-page memo saying Google was “being stupid”) is a masterclass in organizational intervention. Sometimes the most impactful technical contribution is a well-timed organizational argument
- His comfort with making hardware bets 2-6 years ahead of research reveals how much of Google’s AI advantage comes from infrastructure, not algorithms alone