January 22, 2026 · Speech · 54min
Yann LeCun: Why Humanoid Robots Have No Idea How to Be Smart
The current AI industry is digging one big trench. Everyone is working on LLMs. They’re stealing each other’s engineers. And none of it will produce a robot that can clear a dinner table.
The Conversation
At AI House Davos 2026, Yann LeCun sits down with Marc Pollefeys (ETH Zurich) for a fireside chat on embodied AI. What unfolds is less a polite academic exchange than a forceful argument for why the entire trajectory of current AI, from LLMs to vision-language-action (VLA) models, is fundamentally inadequate for physical intelligence. LeCun has left Meta, is launching a new company, and is betting his reputation on a non-generative paradigm he believes will trigger the next AI revolution.
Pollefeys, a computer vision veteran, pushes back at several points, defending the practical utility of current approaches. The tension between “what works in industry now” and “what intelligence actually requires” runs through the entire conversation.
The Big Secret of Robotics
LeCun opens with a provocation: every company building humanoid robots, with their impressive kung fu demonstrations and backflips, has a dirty secret.
“There’s a lot of companies building humanoid robots and they do those kinds of impressive things. This is all precomputed. None of those companies, absolutely none of them, has any idea how to make those robots smart enough to be useful.”
The flashy demos are motion-planned in advance using handwritten dynamical models, fine-tuned with a bit of reinforcement learning. The robots can execute choreographed routines but cannot handle novel situations. They lack common sense at the level of a house cat, let alone a human.
The root problem: approaches that work for language do not work for high-dimensional, continuous, noisy data. Language is “easy” because tokens operate at a semantic level. The physical world is fundamentally different.
Why Generative Models Cannot Understand Physics
This is the core technical argument, and LeCun is unequivocal.
The problem with pixel-level prediction: if you rotate a camera around a room and ask a model to continue the video, it would need to predict the texture of every surface, every face, every object. That information simply is not recoverable from the input. A generative architecture trained to predict pixels will either produce blurry averages of the possible futures or, with diffusion models, produce visually appealing outputs that completely fail to capture the underlying dynamics.
“I can take a video of this room, rotate the camera and stop here and then ask the system to continue the video. There’s no way in hell you can predict what all of you look like.”
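The regression-to-the-mean failure can be shown in a few lines. This toy sketch (setup and numbers are illustrative, not from the talk) shows that an MSE-optimal predictor facing two equally likely futures outputs their average, which matches neither:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two equally likely "futures" for the same input: a pixel is 0.0 or 1.0
futures = rng.choice([0.0, 1.0], size=10_000)

# The MSE-optimal point prediction is the conditional mean of the futures
prediction = futures.mean()

# prediction is ~0.5: a blurry average that matches neither sharp outcome
gap_to_nearest_future = min(abs(prediction - 0.0), abs(prediction - 1.0))
```

Averaging sharp alternatives is exactly what "blurry prediction" means at the pixel level; diffusion models avoid the blur by sampling, but then commit to one arbitrary future rather than modeling the dynamics.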
LeCun grounds this in empirical evidence from image representation learning. Masked autoencoders (MAE), which reconstruct pixels, produce inferior representations compared to joint embedding architectures like DINO, which learn abstract features without reconstructing inputs. This pattern, he argues, is not a minor detail; it reflects a fundamental principle.
Pollefeys pushes back, arguing that for narrow manipulation tasks with known starting positions, pixel-level prediction can work. LeCun’s response is blunt: “I hope you’ll pardon my French, but absolutely no way in hell.” He reports 15 years of trying generative approaches for video understanding, with consistent failure on natural videos.
The key insight: intelligence requires the ability to ignore irrelevant details. Generative models, by definition, cannot do this because they must reconstruct everything.
World Models and the JEPA Architecture
LeCun’s alternative is JEPA: Joint Embedding Predictive Architecture. Instead of predicting pixels, JEPA learns abstract representations and makes predictions in representation space. The system learns what information is predictable and discards the rest.
The training process: take a video, mask portions of it, run the full video through one encoder and the corrupted video through another, train a predictor to match the representations. The system learns to focus on the structural, predictable aspects of reality.
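A minimal sketch of that training setup, with toy linear-algebra stand-ins for the encoders and predictor (names, shapes, and the masking scheme are illustrative; real V-JEPA uses vision transformers and an exponential-moving-average target encoder):

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W):
    """Toy encoder: maps raw inputs to an abstract representation."""
    return np.tanh(x @ W)

D, H = 8, 4                              # input and representation dims (toy)
W_ctx = rng.normal(size=(D, H)) * 0.1    # context-encoder weights
W_tgt = W_ctx.copy()                     # target encoder (an EMA copy in practice)
P = np.eye(H)                            # predictor (identity here; learned in practice)

x = rng.normal(size=(16, D))             # a "clip" of 16 frames as feature vectors
mask = rng.random((16, D)) < 0.5
x_corrupt = np.where(mask, 0.0, x)       # masked / corrupted view

z_tgt = encoder(x, W_tgt)                # representation of the full input
z_ctx = encoder(x_corrupt, W_ctx)        # representation of the corrupted input
pred = z_ctx @ P

# the JEPA loss lives in representation space, never in pixel space
loss = float(np.mean((pred - z_tgt) ** 2))
```

The point of the sketch is the last line: nothing is ever reconstructed in input space, so the encoders are free to discard unpredictable detail.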
Their latest model, V-JEPA 2, was trained on 100 years of video (roughly one day of YouTube uploads). That sounds enormous, yet it amounts to only about 10^15 to 10^16 bytes, still 10 to 100 times more than the roughly 10^14 bytes of text used to train the biggest LLMs. This bandwidth gap is precisely why LeCun believes text-only training will never reach human-level intelligence.
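The back-of-envelope arithmetic behind those figures, assuming a compressed-video data rate of roughly 0.3 MB/s (the bitrate is my assumption, not a number from the talk):

```python
# 100 years of video, converted to bytes at an assumed ~0.3 MB/s bitrate
seconds = 100 * 365 * 24 * 3600      # 100 years ~ 3.15e9 seconds of video
bytes_video = seconds * 0.3e6        # ~ 9.5e14 bytes at 0.3 MB/s
bytes_text = 1e14                    # rough scale of the largest LLM text corpora
ratio = bytes_video / bytes_text     # ~10x here; ~100x at a 3 MB/s bitrate
```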
The common sense test: they show the model videos of physically impossible events (a ball stopping mid-air, changing shape, disappearing). The prediction error spikes dramatically. “That’s the first time I’ve seen any kind of model that has some level of common sense.”
This mirrors how developmental psychologists test infants: six-month-old babies don’t notice objects floating in mid-air, but ten-month-olds stare in surprise because their world model has been violated.
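A surprise detector of this kind can be caricatured in a few lines: a toy constant-acceleration world model watches a ball free-fall and then freeze mid-air, and its prediction error jumps at the violation (the dynamics and numbers are illustrative, not the V-JEPA test):

```python
import numpy as np

dt, g = 0.1, -9.8
# observed trajectory: free fall for 10 frames, then the ball "freezes" mid-air
steps = np.arange(20)
free_fall = 10.0 + 0.5 * g * (steps * dt) ** 2
obs = np.where(steps < 10, free_fall, free_fall[9])

# world model: constant-acceleration (gravity) extrapolation from two frames
pred = 2 * obs[1:-1] - obs[:-2] + g * dt ** 2
errors = np.abs(pred - obs[2:])
# errors stay ~0 while physics holds, then jump when the ball stops falling
```

The error signal itself is the "surprise": no labels, no reward, just a world model noticing that reality stopped matching its predictions.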
The New York to Paris Problem
LeCun uses a vivid example to explain hierarchical planning, which he calls a “completely unsolved problem in AI.”
Planning a trip from NYU to Paris cannot be done in terms of millisecond-by-millisecond muscle control. Instead, we plan at decreasing levels of abstraction: get to the airport and catch a plane; get to the airport means go down to the street and take a taxi; getting to the street means walking to the elevator. At each level, the world model operates at a different timescale and a different level of detail.
This requires a multi-level world model: low-level models predicting short-term with fine detail (millisecond muscle control), high-level models predicting long-term with coarse abstraction (taking a taxi to the airport). Low-level actions cannot be described in language. Some high-level actions can.
He draws an analogy to physics: you could in principle describe everything happening in this room using quantum field theory, but it would require measuring the wave function of a cubic kilometer of space. Instead, we use the right level of abstraction: psychology and economics, not particle physics.
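One way to caricature the decomposition in code (the task hierarchy here is invented for illustration; real hierarchical planning over learned world models is, as LeCun says, unsolved):

```python
# toy hierarchical planner: each abstract action expands into finer sub-actions,
# each level implicitly relying on a world model at its own timescale
PLANS = {
    "go to Paris": ["go to airport", "catch plane"],
    "go to airport": ["go down to street", "take taxi"],
    "go down to street": ["walk to elevator", "ride elevator down"],
}

def expand(action):
    """Recursively refine an abstract action into primitive steps."""
    subs = PLANS.get(action)
    if subs is None:
        return [action]            # primitive action: no finer decomposition
    steps = []
    for sub in subs:
        steps.extend(expand(sub))
    return steps

plan = expand("go to Paris")
# -> ['walk to elevator', 'ride elevator down', 'take taxi', 'catch plane']
```

The hard, open problem is that here the hierarchy is handwritten; an intelligent system would have to learn both the levels of abstraction and the world model at each level.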
The Cake Analogy, Revisited
LeCun revisits his famous cake analogy from a decade ago. The intelligence “cake” has three layers:
The cake itself (self-supervised learning): the vast majority of learning. Observing the world, building representations, learning world models. No expert needed, no rewards. Most of your parameters, most of what you know. This is also embodiment-agnostic: you learn how the world works before you have a specific body.
A thin layer of icing (supervised/imitation learning): imitating expert behavior. A smaller contribution. LeCun notes that most animals never go through this phase because they never meet their parents. Octopuses become highly intelligent in months without any parental guidance.
The cherry (reinforcement learning): minor fine-tuning. “So inefficient” that training a self-driving car from scratch with RL would require it to drive off a cliff thousands of times before learning not to.
From Generic Understanding to Specific Embodiment
A key practical question from Pollefeys: how do you transfer generic world understanding to a specific robot body?
LeCun describes the V-JEPA 2 pipeline: pre-train on 100 years of natural video to learn generic representations, then fine-tune by adding action-conditioned prediction (given robot state + action, predict next state). This fine-tuning phase requires surprisingly little data, which can come from simulation. Crucially, it’s simulation of dynamics, not simulation of specific tasks.
The resulting model is generic: you can use it for any task, from picking up a glass to pouring water. The robot can accomplish novel tasks zero-shot because it has a world model, just like a ten-year-old can clear a dinner table for the first time.
“Ask a 10-year-old who’s never done it before to clear out the dinner table and fill up the dishwasher. A 10-year-old can do it the first time. Doesn’t need to be trained for it. Why? Because of a world model.”
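How an action-conditioned world model turns into a planner can be sketched with a toy linear model and a random-shooting search (the dynamics, dimensions, and planner are illustrative stand-ins, not the V-JEPA 2 method):

```python
import numpy as np

rng = np.random.default_rng(0)

# assumed action-conditioned predictor z' = f(z, a); here a toy linear system
A = 0.9 * np.eye(2)     # how the representation evolves on its own
B = 0.1 * np.eye(2)     # how an action perturbs the representation

def predict(z, a):
    return A @ z + B @ a

def plan(z0, z_goal, horizon=5, samples=200):
    """Random-shooting planner: keep the action sequence whose predicted
    final representation lands closest to the goal representation."""
    best_actions, best_cost = None, np.inf
    for _ in range(samples):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, 2))
        z = z0
        for a in actions:
            z = predict(z, a)           # roll the world model forward
        cost = float(np.linalg.norm(z - z_goal))
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost

z0 = np.zeros(2)                    # current representation
z_goal = np.array([0.3, -0.2])      # representation of the desired outcome
actions, cost = plan(z0, z_goal)
```

Because the search happens inside the model, the same machinery applies to any goal you can express as a target representation: that is the sense in which a world model enables zero-shot behavior.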
The Hardware Bottleneck
When Pollefeys asks what’s missing to reach brain-like efficiency, LeCun’s answer is surprising: it’s not algorithms, it’s hardware.
The human brain runs at approximately 10 Hz. Visual processing takes 100ms, motor reaction another 100ms. A braking reaction to a visual obstacle takes 300ms. Cats are faster because their brains are smaller.
The fundamental problem: in biological brains, each synapse has its own dedicated physical element. In silicon, we reuse the same hardware for multiple computations (hardware multiplexing), which means constantly shuffling data between memory and compute. Almost all energy goes to data movement, not computation.
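The imbalance is easy to quantify with published circuit-level figures; the numbers below are approximate 45 nm values often cited from Horowitz's ISSCC 2014 survey, not from the talk:

```python
# approximate energy per operation, 45 nm process (Horowitz, ISSCC 2014)
fp32_mult_pj = 3.7          # one 32-bit floating-point multiply
dram_read_pj = 640.0        # one 32-bit read from off-chip DRAM

# fetching a weight from DRAM costs ~170x the multiply it feeds,
# which is why multiplexed hardware spends its energy on data movement
ratio = dram_read_pj / fp32_mult_pj
```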
The solution would require exotic analog technology: spintronics, carbon nanotubes, optical computing, or something that doesn’t exist yet. Nanometer-scale nonvolatile analog memory, where each “weight” has its own physical device.
LeCun adds a fascinating detail: at small scales, each device would have unique characteristics due to manufacturing variability. Systems would have to be trained in-the-loop, making each chip unique and non-reproducible. “It’s mortal, if you want. Like a human brain, you can’t make another copy.”
ConvNets vs. Transformers: A Practical Reality
A notable aside: in academic papers, vision transformers (ViTs) dominate. But every single real-time vision system deployed in the real world uses convolutional nets. Every automatic emergency braking system, every highway driving assist in Europe uses ConvNets, because transformers are too computationally expensive for real-time video processing.
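The cost gap follows from scaling alone: self-attention compares every patch token with every other, so its cost grows quadratically with resolution, while convolution stays linear in pixel count. A quick illustration (the 16-pixel patch size and the resolutions are illustrative):

```python
# why transformers strain real-time video: self-attention compares all token
# pairs, so cost grows quadratically with the number of patches
def attn_pairs(h, w, patch=16):
    n = (h // patch) * (w // patch)   # number of patch tokens per frame
    return n * n                      # pairwise comparisons per frame

low = attn_pairs(224, 224)       # 196 tokens -> 38,416 pairs
high = attn_pairs(1920, 1088)    # 8,160 tokens -> ~66.6M pairs (1080p, padded)
growth = high / low              # ~1,700x more attention work per frame
```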
LeCun cites work by his former colleague Saining Xie (ConvNeXt), showing that with equivalent engineering effort, ConvNets match transformer performance. The architecture isn’t the magic; the training methodology is.
LeCun’s Next Bet
LeCun closes by revealing he’s starting an ambitious new company built on the JEPA paradigm. The thesis: train world models from video, use them for hierarchical planning, and create AI systems that genuinely understand the physical world.
“I’m seeing a future where this is going to be the next AI revolution. We’re going to have another AI revolution brought about by this.”
He believes the timing is right because results already validate the approach: V-JEPA demonstrates common sense, action-conditioned models enable robotic planning, and the path to more powerful systems is clear.
Closing Notes
This conversation crystallizes a position LeCun has been sharpening for years, now backed by concrete results and a commercial bet:
- The most provocative claim is also the most testable: that no current humanoid robot company has a viable path to general-purpose intelligence. If VLA-based robots achieve meaningful generalization in the next two years, this prediction fails publicly.
- The 100 years of video vs. 10^14 bytes of text comparison is striking. If intelligence requires grounding in physical reality, then text-only training is fundamentally data-starved, no matter how much text you collect.
- LeCun’s comparison of current VLA systems to 1980s expert systems is pointed. Expert systems failed not because they were useless (they weren’t), but because the cost of hand-engineering knowledge couldn’t scale. He sees the same pattern: VLAs work for narrow scripted tasks but won’t generalize.
- The hardware argument is underappreciated. If the bottleneck to brain-like efficiency is analog memory technology that doesn’t exist, then embodied AI at human efficiency may be decades away regardless of algorithmic progress.
- The fact that LeCun left Meta specifically because the company became “LLM-pilled” signals a genuine intellectual conviction. He’s not hedging; he’s all-in on a paradigm that most of the industry has ignored.