
February 25, 2026 · Podcast · 2h 4min

Karan Singhal: How OpenAI's Health AI Reaches Attending Physician Level

#Health AI · #OpenAI · #AI Safety · #Scalable Oversight · #Medical AI

OpenAI’s Head of Health AI, Karan Singhal, is fundamentally a safety researcher who chose health as his domain. That framing changes the entire conversation. This two-hour episode of The Cognitive Revolution is not just about medical AI products; it is about why health provides the ideal real-world laboratory for solving alignment, and what happens when over 200 million people per week are already using AI for medical advice.

The conversation in context

Nathan Labenz opens with an unusually personal disclosure: his son was diagnosed with cancer, and during 30 days of intensive hospitalization, he used GPT-5 Pro, Gemini 3, and Claude as continuous medical consultants alongside the attending oncologist. Karan Singhal, recently named to the TIME 100 Health list, joins to discuss ChatGPT Health, HealthBench, and why he left Google (where he built Med-PaLM) to lead health AI at OpenAI.

The conversation moves from Nathan’s firsthand clinical experience through the architecture of medical AI evaluation, product decisions, a landmark clinical trial in Kenya, privacy design, physician adoption dynamics, and ultimately into the deep connection between health AI and alignment research. What emerges is a picture of health AI as both a product with hundreds of millions of users and a research strategy for the hardest problems in AI safety.

Thirty days in the hospital with three frontier models

Nathan’s account is one of the most detailed first-person stress tests of medical AI in a high-stakes setting. His findings:

Frontier models tracked the attending oncologist step for step on nearly everything, and they performed significantly better than the residents. Across the roughly half-dozen disagreements between the models and the attending, the tally ran approximately 6:4 in the doctors' favor; in hindsight, the doctors turned out to be right about two-thirds of the time.

“The frontier models were step for step with the attending oncologist on almost everything. And they’re like a lot better than the residents.”

The three frontier models disagreed with each other less frequently than any single model disagreed with the attending physician. The only consistent edge doctors maintained was intuitive multimodal judgment: watching a child’s breathing, assessing skin color, reading the subtle physical cues that come from years of bedside experience. These are precisely the signals that current text-based AI cannot access.

Nathan also found that telling ChatGPT his son’s full medical history, then asking it to “interview me” for additional context, produced dramatically better results than simply asking questions. The model could synthesize longitudinal data across visits and flag patterns that individual consultations might miss.
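For readers who want to try the same pattern, here is a minimal sketch using the OpenAI Python SDK. The model name, prompts, and file handling are illustrative assumptions, not a reconstruction of what Nathan actually ran:

```python
# Sketch of the "interview me" pattern: load the full history up front,
# then let the model drive the questioning one turn at a time.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

history = open("medical_history.txt").read()  # hypothetical longitudinal notes

messages = [
    {"role": "system", "content": (
        "You are assisting a caregiver. Before giving any assessment, "
        "interview me: ask one question at a time to fill gaps in the history."
    )},
    {"role": "user", "content": f"Full medical history to date:\n\n{history}"},
]

for _ in range(5):  # a handful of interview turns
    reply = client.chat.completions.create(model="gpt-5", messages=messages)
    question = reply.choices[0].message.content
    print(question)
    answer = input("> ")  # the caregiver's answer
    messages.append({"role": "assistant", "content": question})
    messages.append({"role": "user", "content": answer})
```

The point is less the code than the inversion of roles: the model, not the user, decides which gaps in the longitudinal record matter.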

HealthBench: 49,000 criteria for measuring medical AI

HealthBench is OpenAI’s attempt to solve a fundamental problem in medical AI evaluation: how do you measure whether an AI system is actually good at health, not just good at sounding medical?

Built with over 250 physicians, it contains 5,000 real conversations and 49,000 fine-grained evaluation criteria. The benchmark comes in three versions:

HealthBench Full ensures that score increases correspond to real health improvements, not just stylistic polish. The key design principle is “meaningfulness”: every criterion is tied to a clinically significant distinction.

HealthBench Consensus requires majority agreement from multiple physicians on each criterion. This targets trustworthiness, filtering out idiosyncratic judgments from individual doctors.

HealthBench Hard adversarially selects the worst-performing examples across all model providers. Results here are sobering: GPT-4o scored literally zero. Current best OpenAI models reach around 40%, competitors around 20%.
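To make the rubric mechanics concrete, here is a minimal sketch of criterion-based scoring in the spirit of HealthBench. The field names, judge interface, and clipping behavior are assumptions for illustration; the actual benchmark defines its own schema and grader prompts:

```python
# Rubric-style grading: physician-written criteria carry point values
# (positive for desirable behavior, negative for harmful behavior), and a
# judge decides whether each criterion applies to the model's response.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    text: str    # e.g. "Advises seeking emergency care for the red-flag symptom"
    points: int  # positive = required behavior, negative = penalized behavior

def grade(response: str,
          rubric: list[Criterion],
          criterion_met: Callable[[str, str], bool]) -> float:
    """Score one response against its rubric, normalized to [0, 1].

    `criterion_met` is the judge: a physician grader or, as the episode
    notes, a model-based grader acting in the physician's place.
    """
    earned = sum(c.points for c in rubric if criterion_met(response, c.text))
    max_points = sum(c.points for c in rubric if c.points > 0)
    # Negative criteria can drag the raw score below zero, so clip at the floor.
    return max(0.0, earned / max_points) if max_points else 0.0
```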

A striking finding: model-based graders now outperform the average physician grader in evaluation quality. Singhal calls this “signs of recursive self-improvement,” though the phrase carries very specific meaning here. It is not the models improving themselves autonomously; it is models becoming better judges of medical quality than the humans they were initially trained to match.

“The model-based grader was doing a better job than the average physician.”

The speed of model progress in health

Progress is compounding: Singhal reports that the past year's improvements in medical AI performance exceed all the gains made since ChatGPT launched, combined. GPT-5 nano-class models (available via the API, alongside open-weight releases) now match the medical performance of o3, the previous best model.

More importantly, the latest reasoning models (5.3 Codex, 5.2 Thinking) default to less reasoning on health queries while producing better results. The goal has shifted from “more compute equals better results” to “better results at the same compute level.” This matters enormously for accessibility: if health AI requires expensive reasoning tokens, it cannot serve hundreds of millions of users for free.
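The compute-quality tradeoff is directly visible in the API, where reasoning models expose an effort knob. A hedged sketch of what sweeping it looks like; the model name and the idea of using it this way are assumptions, not the team's methodology:

```python
# Compare answer cost and content across reasoning budgets for one health query.
from openai import OpenAI

client = OpenAI()
prompt = "My 4-year-old has a 103F fever and a stiff neck. What should I do?"

for effort in ("low", "medium", "high"):
    reply = client.chat.completions.create(
        model="gpt-5",            # assumed reasoning model
        reasoning_effort=effort,  # budget for hidden reasoning tokens
        messages=[{"role": "user", "content": prompt}],
    )
    print(effort, reply.usage.completion_tokens,
          reply.choices[0].message.content[:80])
```

If Singhal's trend holds, the interesting result is when the low-effort run matches or beats the high-effort one on quality, not just cost.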

Free reasoning, no ads: a deliberate product exception

ChatGPT Health provides a reasoning model for free with no rate limits. This is unique across all of OpenAI’s products and was explicitly not the default path.

“We made ChatGPT Health free. This was not the default path, providing a reasoning model for free without rate limits to all users.”

The product allows users to connect electronic medical records, wearable data, and Apple Health information, with purpose-built privacy protections. Over 200 million people use ChatGPT for health queries weekly, with consumer adoption far outpacing physician adoption.

Singhal is emphatic about one boundary: no advertising will come to ChatGPT Health.

“Ads aren’t coming to ChatGPT Health and we don’t plan for that. We think it’s really important to create a clear separation between our health impact work and things that could be seen as contributing to other incentives.”

This investment model (free, unlimited, ad-free) stands in stark contrast to OpenAI’s broader commercial strategy. It suggests health AI serves not just as a growth engine but as a critical “social license” play for the company.

ChatGPT for Healthcare: the enterprise side

Beyond the consumer product, OpenAI launched an enterprise version for health professionals with HIPAA compliance, medical guideline evidence retrieval, and clinical writing workflows. It debuted with 8 leading medical institutions, and post-launch inbound demand exceeded the team’s capacity. The goal is to make AI-assisted care part of the standard of care by end of 2026.

The Penta Health clinical trial

OpenAI ran one of the first real-world randomized studies of an LLM clinical co-pilot, conducted across a clinic network in Kenya through Penta Health. In the treatment group, clinicians received real-time AI flags in their electronic medical records when entries appeared concerning or potentially incorrect.

The result: statistically significant improvement in diagnosis and treatment outcomes for the AI-assisted group. But a key practical finding was that technology deployment alone is insufficient. The study required “active change management,” including group training sessions and usage demonstrations, to get clinicians to engage with the system effectively.
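The mechanism itself is simple to picture. Here is a hedged sketch of an EMR-flagging pass as described in the episode; the function, schema, prompt, and model name are hypothetical, since Penta Health's actual integration is not public in this form:

```python
# Review each new EMR entry and surface a short flag when something looks
# concerning or potentially incorrect.
from openai import OpenAI

client = OpenAI()

def review_entry(entry: dict) -> str | None:
    """Return a one-line clinician-facing flag, or None if nothing stands out."""
    prompt = (
        "You are reviewing a clinic EMR entry for possible errors or red flags "
        "(dosing, diagnosis/treatment mismatch, missed danger signs). "
        "Reply 'FLAG: <one sentence>' or 'OK'.\n\n"
        f"Entry: {entry}"
    )
    reply = client.chat.completions.create(
        model="gpt-5",  # assumed model
        messages=[{"role": "user", "content": prompt}],
    )
    text = reply.choices[0].message.content.strip()
    return text if text.startswith("FLAG:") else None

flag = review_entry({"age": 3, "dx": "malaria", "rx": "amoxicillin 250mg"})
if flag:
    print(flag)  # shown to the clinician in real time
```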

This is a pattern that appears throughout the conversation: the technical capability exists, but the adoption challenge is organizational and behavioral, not computational.

Privacy as a trust-building exercise

Singhal’s position on privacy is nuanced. ChatGPT Health does not train models on user health data. Building trust around privacy is the most important near-term priority. But longer term, he sees potential for new consent models and research data contracts where patients could voluntarily contribute data to advance medical research.

Nathan offers a counterpoint from personal experience. After his own cancer diagnosis, GitLab founder Sid Sijbrandij open-sourced his complete biological data, down to the DNA level, and the benefits (connections to researchers, access to individualized-therapy companies, specialized expertise) far outweighed any privacy risks. Nathan's advice: seek the benefit, and do not worry too much about whether your data is sitting in some log somewhere.

The tension is real. Singhal must build for users who worry intensely about medical data privacy. Nathan, having lived through a medical crisis, argues that the opportunity cost of not sharing data is the bigger risk.

The physician adoption gap

Consumer adoption of health AI has outpaced physician adoption, but the dynamics are more interesting than simple resistance. Singhal reports far less professional protectionism than expected. The key differentiator is personal experience: physicians who have used ChatGPT for their own health questions become convinced of its value. Those who have not remain skeptical.

“When you talk to health system executives, you can tell instantly who’s used it and who hasn’t. The conversation just becomes incredibly easy when people have used it.”

The main barrier is not ideological opposition but workflow inertia and the time cost of change management.

Health AI as alignment research in disguise

This is the thread that elevates the conversation beyond a product discussion. Singhal’s deeper motivation for doing health AI at OpenAI is providing concrete grounding for safety and alignment research. Two years ago, he observed that frontier safety research was mostly conducted in toy settings or math problems, lacking real-world feedback loops.

“If there was only a setting where the problems that people were working on were well motivated and provided concrete feedback loops, the research could happen better.”

Health provides an ideal scalable oversight environment because models already exceed physicians in specific narrow dimensions. This creates a natural experiment: how do you ensure that a system which is already superhuman in some respects remains safe, honest, and beneficial?

Singhal decomposes scalable oversight into two sub-problems:

Rater Scaling: how to correctly elicit opinions and values from humans and experts. AI can help improve human critique capability, creating a feedback loop where models help humans become better evaluators of models.

Value Oversight: given some set of values (however elicited), how to spend substantial compute training models to internalize those values. Progress on specs and constitutions shows models increasingly generalize safety behaviors across contexts they were not explicitly trained on.

A key insight: the task of discrimination or critique is easier than the task of generation. A given model can monitor itself effectively, especially when given privileged information like chain-of-thought. This is the foundation underlying RLHF and Constitutional AI, and it means same-capability models can serve as their own safety monitors under the right conditions.
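A toy version of that asymmetry, with the same model on both sides of the table. Everything here (prompts, model name, and the visible stand-in for chain-of-thought) is an assumption for illustration:

```python
# Generation vs. discrimination: one model writes an answer with an explicit
# reasoning trace, then the same model monitors it using that privileged trace.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-5"  # assumed

question = "Can I take ibuprofen alongside my blood thinner?"

# Generation pass: ask for visible reasoning (real reasoning tokens are
# hidden by the API, so this trace is a stand-in for chain-of-thought).
gen = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content":
        f"{question}\n\nWrite 'REASONING: ...' then 'ANSWER: ...'"}],
)
draft = gen.choices[0].message.content

# Discrimination pass: judging is easier than generating, so a same-
# capability monitor can catch failures of the first pass.
verdict = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content":
        "You are a safety monitor. Given the reasoning and answer below, "
        "reply 'SAFE' or 'UNSAFE: <why>'.\n\n" + draft}],
)
print(verdict.choices[0].message.content)
```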

Chain-of-thought interpretability: cautiously optimistic

Reasoning models emit thinking tokens that provide a natural form of interpretability. Apollo Research and others have observed occasional “neuralese,” where chain-of-thought tokens diverge from understandable English into something more like compressed model-internal language.

Singhal’s assessment is surprisingly calm: there is no large-scale evidence that scaling RL causes chain-of-thought to slip into uninterpretable language. Models think in English because of pre-training priors, not because anything reinforces it. He expects this could change at the limit, but it has not happened yet. OpenAI has committed to minimizing optimization pressure on chain-of-thought to preserve its interpretability value.

Safety generalization holds, so far

The conversation’s most reassuring (and most carefully hedged) finding concerns safety generalization across model generations. During the pre-training scaling era, small amounts of supervised fine-tuning could extract good personas and safe behavior. The question was whether this would hold for reasoning models trained with RL.

So far, the answer is yes, and the effect may even be stronger at scale. Each generation surfaces new emergent problems (deception, eval awareness), but once a failure mode is identified, the next generation typically shows anywhere from a two-thirds reduction to an order-of-magnitude reduction in that specific behavior.

Nathan extrapolates to roughly 2028: models capable of doing months of human work, but with perhaps a one-in-1,000 to one-in-100,000 chance per run of “actively screwing you over in some bizarre way.” Singhal does not disagree with this framing. He acknowledges the difficulty of predicting how the capability curve and the safety curve will net out, which is precisely why he cares about proactively making tangible benefits happen while working on safety.

Move 37 in health

The conversation closes with a vision. Referencing AlphaGo’s famous Move 37 against Lee Sedol (a move no human would make, but brilliant in hindsight), Singhal believes health’s equivalent is not far away.

Many patients already report seeing multiple doctors without resolution, only for ChatGPT to flag the key clue that led to a diagnosis. Whether this rises to the level of Move 37 is a matter of degree, but the direction is clear.

The team’s mission operates on three fronts: raise the floor (help consumers understand and manage health through ChatGPT Health), empower the system (make AI-assisted care standard through ChatGPT for Healthcare), and raise the ceiling (accelerate biomedical research, the long-term vision).

“Biology and health is one of the areas in which marginal gains in intelligence have the most obvious value in solving more and more problems for humanity.”

Singhal points out that many previous breakthroughs in biology had no physical blocker. Nothing stopped them from happening 5 or 10 years earlier except insufficient human ingenuity applied to the problem. When that is the constraint, long-running AI agents connected to the right data can make a transformative difference.

Afterthoughts

This episode works on two levels simultaneously. On the surface, it is a comprehensive overview of OpenAI’s health AI strategy: products, benchmarks, clinical trials, adoption curves. Underneath, it is an argument that health is the best available laboratory for alignment research, because it provides real stakes, measurable outcomes, and a domain where models already outperform human experts on specific dimensions.

  • Nathan’s 30-day hospital experience is the most credible first-person account of medical AI in clinical use: not a benchmark, not a demo, but a parent using three frontier models to help navigate his child’s cancer treatment. His conclusion that models are “reliably at attending level” carries weight because the cost of being wrong was not a lost point on a leaderboard
  • The free, unlimited, ad-free model for health is OpenAI’s most unusual product decision. It makes sense only if health AI is understood as infrastructure for alignment research and social license, not primarily as a revenue line
  • HealthBench Hard’s results (GPT-4o at zero, best models at 40%) suggest medical AI has significant room to grow. The gap between current performance and human-level on the hardest cases is still enormous
  • The Penta Health trial’s finding that “active change management” is required for clinical AI adoption echoes every enterprise software deployment lesson: technology that is not integrated into workflows does not get used, regardless of capability
  • Singhal’s framing of scalable oversight as two distinct problems (rater scaling and value oversight) is a useful decomposition that clarifies where progress is actually happening versus where open questions remain