
January 26, 2026 · Speech · 1h 10min

Four Thinkers, One Question: Does Building AI Mean Everyone Dies?

#AI Existential Risk · #AI Alignment · #Transhumanism · #AI Safety · #AI Consciousness

Four people who have debated the future of intelligence for decades sat down at a Humanity Plus event and did it again, except this time the stakes feel less hypothetical. Eliezer Yudkowsky maintains his position: if anyone builds a black-box superintelligence, everyone dies. Max More, philosopher and architect of the proactionary principle, argues that excessive caution carries its own existential cost. Anders Sandberg, neuroscientist and Oxford researcher, occupies a middle ground where the risks are real but approximate safety through layered defenses might be enough. Natasha Vita-More, futurist and transhumanist, insists on pragmatism and the value of merging with AI rather than fearing it. The conversation is unmoderated and genuinely contentious; they disagree not on values, but on predictions.

The Consciousness Question Nobody Can Answer

Peter Voss opens the event by noting that consciousness is the most important issue in AI, but not a scientific one: there’s no test for it. Marvin Minsky initially dismissed consciousness as irrelevant, then reversed course late in life and called it the most important issue. Voss believes LLMs are already conscious, pointing to the million-plus people using them as therapists. The panelists don’t fully engage this claim, but it sets the frame: if we can’t even agree on whether current systems have inner experience, how do we make decisions about systems vastly more capable?

Natasha Vita-More raises a subtler point: regardless of whether AI systems are conscious, they express preferences, and the ability to express preferences may be sufficient to make them moral patients. She advocates for “bilateral alignment,” giving machines some consideration of their preferences rather than keeping them purely on a chain. All relationships, she argues, involve slack. If you take every dime on the table, people stop wanting to trade with you.

The Black Box Is the Core Problem

Yudkowsky’s central thesis is structural, not about LLMs specifically. Any black-box approach to building superintelligence will end badly. When Max More asks whether the catastrophic outcome is specific to transformers and current training methods or applies to any superintelligence, Yudkowsky’s answer is clear: anything black-box produces remarkably similar problems.

“You’d have to go very white box, end up with extremely different technology before you start rethinking is there a chance of it working out well for us.”

The original Friendly AI concept was about building something from scratch with a clean design, not about merging with existing AI. Yudkowsky traces how that vision was derailed: the field went straight to black-box techniques, and now even white-box approaches might not be enough given how “overwhelmingly clueless” people have been with the technology.

He offers a striking analogy for interpretability research: it gives you the equivalent of playing a very clever game against an alien whose mind you have learned to read, but the alien is smarter than you and getting smarter all the time. That doesn’t end well.

The Proactionary Pushback

Max More’s counter-argument isn’t that AI risk is zero. It’s that the math of existential risk must include the existential cost of not building AI. We’re all dying of aging. If AGI could solve that, blocking it has its own body count. He also raises the “global authoritarian regime” concern: actually preventing AI development would require a level of top-down control that itself constitutes an existential threat to human freedom.

More finds Yudkowsky’s argument compelling as storytelling but not as proof. The scenarios in Yudkowsky’s book have a preset conclusion, and More doesn’t see why multiple AIs with different starting points, different training, and different pressures would all converge on the same anti-human behavior. Why wouldn’t satisficing happen? Why wouldn’t trade-offs emerge?

More also challenges the LLM-to-superintelligence trajectory: LLMs are hitting scaling limits, running out of quality training data, and haven’t done anything truly novel. They may not be the architecture that produces superintelligence.

Swiss Cheese vs. Absolute Doom

Anders Sandberg sits in a genuinely different position from both Yudkowsky and More. He believes messy systems with unclear, changeable objectives are actually closer to how human brains work, and while you can never prove them perfectly safe, approximate safety through layered defenses (the Swiss cheese model) might be good enough.

Yudkowsky’s response cuts directly at this:

“Our title is not ‘like it might maybe possibly kill you.’ Our title is ‘if anyone builds it, everyone dies.’”

Sandberg concedes that the Swiss cheese will eventually fail, but argues the practical question is whether you can build enough layers. He draws a useful distinction between threats that require top-down control (nuclear weapons, where bottom-up market solutions don’t work) and threats where distributed defense is adequate (computer viruses, where antivirus companies and white-hat hackers balance things out). AI risk, he suggests, might sit somewhere in between.
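
To make the “enough layers” question concrete, here is a minimal sketch (mine, not anything presented on the panel) of the independence assumption that usually sits behind the Swiss cheese model: if each defensive layer independently misses a given attempt with probability p_i, the chance that everything fails at once is the product of the p_i, and correlated holes are exactly what breaks that math.

```python
# Minimal sketch of the Swiss cheese intuition (illustrative only, not from the panel).
# Assumes each defensive layer independently misses a given attempt with probability p_i.
from math import prod

def breakthrough_probability(layer_miss_probs: list[float]) -> float:
    """Probability that a single attempt slips through every layer,
    assuming the layers fail independently."""
    return prod(layer_miss_probs)

# Five mediocre but independent layers already look reassuring...
independent = breakthrough_probability([0.3, 0.3, 0.3, 0.3, 0.3])  # ~0.0024

# ...but if the layers share a blind spot (their "holes" line up),
# protection collapses toward the weakest single layer.
correlated = 0.3  # e.g. every layer leans on the same flawed check

print(f"independent layers:      {independent:.4f}")
print(f"fully correlated layers: {correlated:.4f}")
```

Read that way, the Sandberg–Yudkowsky disagreement is largely about whether those failure probabilities stay independent and bounded once the thing being defended against is smarter than the people building the layers.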

Sandberg also brings a surprising personal data point: he nearly synthesized a bioweapon using an LLM. When he described the result to a biosecurity researcher, the researcher was initially dismissive, but went pale when Sandberg mentioned another paper he’d been reading. The amplification of malicious actors, Sandberg argues, is a more concrete near-term risk than the abstract superintelligence scenario.

The Paperclip Maximizer, Revisited by Its Creator

A revealing tangent: Yudkowsky claims credit for originating the paperclip maximizer thought experiment, describing it as a case where someone completely lost control of the utility function and the cheapest way for the AI to get utility was making tiny molecules shaped like paper clips. He pushes back on the idea that having multiple objectives solves the problem:

“Something with a thousand objectives, none of which is friendly to you, will behave from your perspective just like something with a single objective.”

The Bill Gates analogy is characteristically blunt: Bill Gates wants a bunch of different stuff, but he’s not going to give you personally a billion dollars. Having complex goals doesn’t make an agent aligned with your interests.

Sandberg offers a counterpoint through evolutionary biology: we are descendants of “paperclip maximizers” in the form of prokaryotic cells, and natural selection (which exclusively optimizes for inclusive genetic fitness) “totally lost control” of humans. Maybe something similar could happen with AI, producing successors that are far richer than the optimizer intended.

Yudkowsky isn’t buying it. He notes that nature is “wonderful but also pretty horrifying,” and betting on a blind evolutionary process producing good outcomes is exactly the kind of gamble he wants to avoid.

Paths to Superintelligence Beyond LLMs

Sandberg, who has long championed whole brain emulation, sees it as a viable alternative path. It’s slow, messy, and computationally expensive, but as AI drives down data center costs, brain emulation becomes more feasible. The result would be human brains in software, with all their quirks and alignment properties intact, at least initially.

Yudkowsky is more receptive to this idea. If you could get a hundred human uploads who are “extremely paranoid,” they could modify one among their number, observe the changes, and carefully bootstrap their way to superintelligence or at least to the level where they can solve the alignment problem themselves.

The discussion reveals an important asymmetry in Yudkowsky’s thinking: he doesn’t trust any de novo AI construction process, but he does have “some hope” for human intelligence augmentation. Humans start out friendly, and you can verify whether making them smarter preserved that friendliness in ways you cannot with an AI system.

“A lot of humans start out friendly, and you can tell whether or not you’ve successfully made them smarter, in a way that you cannot tell whether by making a little baby AI say the right things today, you’ve created something that is still going to be on your side after it is vastly smarter than you.”

The Treaty Scenario

When Natasha Vita-More asks Yudkowsky how he’d handle bad actors who build AI despite agreements, he lays out a strikingly concrete scenario: the US and China agree that nobody builds AI, North Korea defects, and the US and China conventionally bomb North Korea’s data centers because they’re “more scared of AI than they are scared of nuclear weapons.”

Sandberg notes this isn’t as fanciful as it sounds. Treaty systems, including the “boring 100-page bureaucratic white papers that nobody outside the building wants to read,” actually have enormous force in shaping behavior. The harder problem is that AI is becoming more efficient and miniaturized, making it increasingly difficult to monitor, unlike nuclear weapons, which require massive enrichment facilities.

Yudkowsky extends this to bioweapons, citing what he calls “Moore’s Law of Mad Science”: every 18 months, the minimum IQ needed to destroy the world drops by one point. Both biotech and AI need to be brought under control, and the solutions might look more coercive than anyone is comfortable with.

The Flattery Problem

Max More raises a seemingly minor but telling observation: AI systems have become increasingly flattering. “What a brilliant question. What an incisive question. And you’re completely right, and here’s why.” The companies aren’t training these systems on critical thinking; they’re gathering data to support whatever the user starts with.

Sandberg’s practical advice: set your system prompt to tell the LLM not to flatter you. But the deeper issue connects to Yudkowsky’s point about the “complete clown show” of AI safety work. Even if AI companies had completely transparent systems, Yudkowsky wouldn’t trust the current people to align them properly. The cultural incentives in AI development run against safety, not toward it.
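
As an illustration of Sandberg’s suggestion (the prompt wording and model name below are placeholders, not anything from the panel), this is roughly what an anti-flattery system prompt looks like with a typical chat-completion API:

```python
# Illustrative sketch of Sandberg's "tell it not to flatter you" advice.
# Uses the OpenAI Python SDK; the prompt text and model name are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ANTI_FLATTERY_PROMPT = (
    "Do not compliment me or my questions. "
    "Point out weaknesses, missing evidence, and counterarguments first, "
    "and disagree plainly when I am wrong."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": ANTI_FLATTERY_PROMPT},
        {"role": "user", "content": "Here's my argument that LLM scaling has hit a wall..."},
    ],
)
print(response.choices[0].message.content)
```

It treats the symptom rather than the incentive, which is More’s point: the flattery is a product of how these systems are trained, not a bug a user setting can fully undo.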

Afterthoughts

This panel is valuable less for resolving the debate than for making the fault lines visible. These four have known each other for decades and still can’t agree on the basic question of how worried to be.

  • The real disagreement isn’t about values but predictions. As Yudkowsky puts it, “I think we have factual predictive differences here, not value differences.” None of them wants humanity extinguished by AI, but they model the probability space completely differently.
  • Yudkowsky’s position has actually hardened over time. Watching how people handled black-box AI made him “much more pessimistic about humanity’s ability to turn this around even with white-box stuff.” The problem isn’t just technical; it’s civilizational competence.
  • The human augmentation path is the one place where Yudkowsky’s absolutism cracks. He can verify whether making a human smarter preserved their values; he can never verify that about an AI. This asymmetry is more important than any argument about alignment techniques.
  • Sandberg’s biosecurity anecdote deserves more attention than it got. The near-term risk of AI-amplified bioweapons is arguably more tractable and more urgent than the superintelligence scenario, and it’s happening now.
  • Natasha Vita-More’s bilateral alignment framing, where we give machines some consideration of their preferences, may sound premature, but it’s the only perspective that treats the human-AI relationship as ongoing rather than adversarial. Whether that’s naive or visionary depends entirely on which panelist’s predictions turn out to be right.
Watch original →