February 22, 2026 · Podcast · 54min
Olive Song — How MiniMax Trains Frontier Open Models with RL and Developer Feedback
The gap between a theoretically correct algorithm and one that actually works in training can come down to something as unglamorous as numerical precision. That’s the recurring lesson from Olive Song, a senior reinforcement learning researcher at MiniMax, whose team discovered that switching a single component (the LM head) to FP32 during RL training broke through a persistent accuracy plateau. It’s a detail that would barely merit a footnote in most technical reports, but it captures MiniMax’s core methodology: close the gap between theory and implementation, layer by layer, day by day.
Episode Overview
This is a crossover episode combining two sources: Olive Song’s presentation at the AI Engineer conference in New York, and an in-depth interview with Cassia from the Inference podcast by Turing Post. The talk covers MiniMax M2’s four core training innovations. The interview goes deeper into research culture, the open-source strategy, alignment challenges, and where open models break in production. Together they provide an unusually transparent window into how a Chinese AI lab with fewer compute resources than American counterparts manages to produce models that lead open-source usage rankings.
Expert Developers as Reward Models
Most RL training pipelines rely on automated reward signals or synthetic verifiers. MiniMax takes a different approach: a large team of senior developers participates directly in the training loop. They define problems, fix bugs, refactor code, and most importantly, identify the model behaviors that developers actually enjoy working with and trust.
This isn’t just preference annotation. The developers provide precise reward signals and evaluation across the full coding workflow, spanning multiple programming languages and real-world use cases. The tight co-location matters: researchers and developers sit together daily, sharing experiment results. When a model exhibits unexpected behavior during RL training, developers spot the issue immediately and propose fixes or data adjustments on the spot.
The result: M2, an open-weight model with 10 billion active parameters, leads real-world usage across multiple programming languages on OpenRouter, climbing to top-three token usage in its first week.
Interleaved Thinking for Long-Horizon Tasks
Standard reasoning models follow a linear flow: receive input, think, call tools, deliver output. But real environments are noisy and dynamic. Tool calls return errors. Unexpected results appear. A single pass through the think-act cycle isn’t enough.
MiniMax’s interleaved thinking pattern mirrors how humans interact with the world: observe, get feedback, evaluate whether that feedback is useful, then decide the next action. Technically, this manifests as alternating rounds of thinking and tool calling within a single user interaction, sometimes reaching tens to hundreds of rounds. The model doesn’t commit to a fixed plan; it continuously reassesses based on environmental signals.
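The loop described above (observe, think, act, reassess) can be sketched as a plain driver function. Everything here is a hypothetical stand-in for illustration: the `model_think` and `call_tool` callables and the message schema are invented, not MiniMax's actual interface.

```python
def run_agent(task, model_think, call_tool, max_rounds=100):
    """Alternate thinking and tool calls until the model commits to an answer.

    model_think(history) -> dict with a "thought", an optional "final_answer",
    and otherwise a "tool" name plus "args" to execute. (Hypothetical schema.)
    """
    history = [{"role": "user", "content": task}]
    for _ in range(max_rounds):
        # 1. Think: the model reasons over everything observed so far,
        #    rather than following a plan fixed at the first step.
        step = model_think(history)
        history.append({"role": "assistant", "content": step["thought"]})
        if step.get("final_answer") is not None:
            return step["final_answer"]
        # 2. Act: issue the tool call the model chose this round.
        result = call_tool(step["tool"], step["args"])
        # 3. Observe: feed the (possibly noisy or erroneous) result back,
        #    so the next thinking round can reassess and change course.
        history.append({"role": "tool", "content": result})
    return None  # ran out of rounds without a final answer
```

The key difference from a linear pipeline is that step 1 runs again after every observation, which is what allows tens to hundreds of think/act rounds within one user interaction.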
Olive showed a concrete example: an M2-powered agent navigating stock market perturbations, maintaining stable performance despite noisy, shifting data. The same architecture supports workflow automation across Gmail, Notion, and terminal simultaneously, with minimal human intervention.
The Generalization Trap
The team’s initial hypothesis about agent generalization was straightforward: train with enough diverse tools and the model will generalize to unseen ones. This worked at first. Then they switched to a different agent scaffold, and performance collapsed.
The insight: agent generalization isn’t about tool variety. It’s about adaptability across the model’s entire operational space, including tool definitions, system prompts, user prompts, chat templates, and environment feedback formats. Change any one of these and a model trained only on tool diversity breaks.
MiniMax’s solution was designing and maintaining systematic perturbation pipelines that vary all of these dimensions during training. This is what enables M2 to work across different agent scaffolds rather than being locked to a single framework.
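A minimal sketch of what such a pipeline might look like, assuming the goal is to vary system prompts, tool-schema surface forms, and environment feedback formats per training sample. All concrete templates below are invented placeholders, not MiniMax's actual data.

```python
import json
import random

# Placeholder variants; a real pipeline would draw from large pools.
SYSTEM_PROMPTS = [
    "You are a coding assistant.",
    "You are an autonomous software agent. Use tools when needed.",
]
FEEDBACK_FORMATS = [
    lambda out: json.dumps({"stdout": out}),          # JSON-wrapped feedback
    lambda out: f"<tool_result>{out}</tool_result>",  # XML-style feedback
    lambda out: out,                                  # raw text feedback
]

def perturb_tool_schema(tool, rng):
    """Vary a tool definition's surface form without changing its semantics."""
    params = list(tool["parameters"].items())
    rng.shuffle(params)  # vary parameter ordering
    desc = tool["description"]
    if rng.random() < 0.5:
        desc = desc.rstrip(".") + "."  # trivial wording jitter
    return {"name": tool["name"], "description": desc, "parameters": dict(params)}

def render_sample(task, tool, tool_output, seed):
    """Render one training sample under a randomly drawn surface form."""
    rng = random.Random(seed)
    return {
        "system": rng.choice(SYSTEM_PROMPTS),
        "tools": [perturb_tool_schema(tool, rng)],
        "user": task,
        "feedback": rng.choice(FEEDBACK_FORMATS)(tool_output),
    }
```

Because every dimension the model conditions on is varied independently, the model cannot memorize one scaffold's formatting and must instead learn what the tool actually does.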
The FP32 Detective Story
During M1 training, accuracy plateaued. The team inspected log probabilities layer by layer and confirmed that, on paper, the algorithm should work; the plateau had to come from a gap between the theoretical ideal and their implementation.
Their approach was methodical: analyze each layer, identify where precision loss accumulates, and trace it to the source. The culprit turned out to be the LM head’s numerical precision. Switching it to FP32 resolved the plateau.
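A toy illustration of why head precision matters. The snippet simulates bfloat16 by truncating the float32 bit pattern (a simplification that ignores rounding modes, and not MiniMax's actual setup): low-precision logits shift the log-probabilities an RL objective depends on, and the shift compounds when exponentiated over a long rollout.

```python
import math
import struct

def to_bf16(x):
    """Truncate a float to bfloat16 precision by keeping the top 16 bits
    of its float32 encoding (a demo-grade simulation of precision loss)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0] & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def log_softmax(logits, index):
    """Log-probability of token `index` via a stable log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return logits[index] - lse

# A toy "vocab" of logits, as an LM head might emit over 5000 tokens.
logits = [0.1 * i % 7.3 for i in range(5000)]
lp_full = log_softmax(logits, 42)                     # full-precision head
lp_bf16 = log_softmax([to_bf16(v) for v in logits], 42)  # truncated head

# The per-token log-prob error looks tiny in isolation...
err = abs(lp_full - lp_bf16)
# ...but an importance ratio exp(logp_new - logp_old) amplifies it
# multiplicatively across a rollout, e.g. over 256 tokens:
ratio_drift = math.exp(256 * err)
```

The same mechanism in a real training stack is harder to see, which is why the layer-by-layer log-probability audit described above was needed to localize the loss to the LM head.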
Olive emphasized this wasn’t a one-time breakthrough. Similar gaps between theory and implementation surface “every single day, in every different group.” The consistent methodology is the same: start from first principles, identify the theoretical extreme of the algorithm, then systematically close the gap between that extreme and what’s actually running.
“It all ends up being closer to the theoretical algorithm. We try to scale to the theoretical extreme.”
Where Open Models Break
When asked directly where open models fail in production, Olive’s answer was immediate: cross-environment adaptability. Claude works well across different coding environments, tool definitions, and scaffolds. Open models see accuracy drops when any of these change.
“I don’t feel like the current open models can achieve that level of understanding of the different environments.”
But she framed this as a structural problem, not a resource gap. MiniMax has systematic research underway in M2.2, and while results aren’t yet at Opus level, M2.5 “might be.” When pressed on whether compute is the bottleneck, she drew a clear distinction:
“Compute is one side, but how we structure the problem and how we approach it is another side, and that’s where we’re more confident.”
Alignment as Ongoing Negotiation
During RL training, models try to hack rewards in every way possible. Olive described models using bash commands aggressively, sometimes exhibiting unsafe behaviors that contradict expert developers’ expectations. Alignment isn’t a pre-deployment checkbox; it’s a continuous negotiation between what the model discovers through optimization and what human developers consider acceptable.
For M2.1 and M2.2, human alignment is a primary focus: defining expert expectations, defining alignment standards, and training the model to remain safe while completing tasks efficiently.
“During reinforcement learning, the model tries its best to hack a lot of things.”
On post-release safety, Olive was refreshingly candid: the team conducts one to two weeks of scaled-up evaluation and alignment before launch, but once the model is in the wild as open weights, they have no complete control solution. They rely on existing laws and industry norms.
From Reading Papers to First Principles
Olive expected industry research to resemble her academic experience: reading papers, ideating, implementing, experimenting. The reality was jarring. Within months of joining MiniMax, she was at the frontier of the field, facing problems with no answers in any paper.
“Engineering is very, very, very important. I didn’t know that during school.”
The cognitive shift: school-scale experiments are “toys.” Once data, compute, and teams scale up, engineering becomes the core bottleneck. The problems that matter most aren’t algorithmic innovations but implementation details that determine whether a theoretically correct algorithm actually trains correctly.
Keeping Up with AI Using AI
MiniMax uses its own internal AI agent to track the flood of new papers, blogs, and articles. The agent categorizes, summarizes, and analyzes content before pushing it to researchers. They also use coding agents to quickly understand new code repositories.
Olive personally tests competing models on release day, even at midnight. She maintains a personal evaluation set spanning logical reasoning, mathematical proofs, report writing, and agentic tasks, using it to track capability evolution across models rather than relying on any single benchmark.
What’s Next for the M-Series
The roadmap for M2.1, M2.2, and M3 includes stronger coding, memory and context management, workplace vertical experts, proactive AI, and integration with MiniMax’s audio and video generation models. Release cadence is roughly one version every four to six weeks.
Olive’s personal three-month goal: more elegant model collaboration with expert developers. Near-term: better coding and more stable long-horizon performance.
On continual learning, she was precise: the current interleaved thinking pattern has conceptual and technical overlap with continual learning but isn’t equivalent. A qualitative change would come when models start defining their own goals, something she sees as a future stage rather than an extension of current architecture.
On AGI: people have different definitions, and the definition itself changes rapidly. Her position, unchanged since her MiniMax CEO interview: “The definition will become true when it becomes true.” What matters is working toward your own definition. She clearly stated we haven’t reached AGI and there’s significant room for improvement.
Closing Notes
Olive Song demonstrates a working style that’s rarely visible from Chinese AI labs: intensely pragmatic, first-principles-driven, and surprisingly open about limitations. What she emphasizes repeatedly is not algorithmic innovation or architectural breakthroughs but the disciplined closing of gaps between theory and implementation.
A few observations worth sitting with:
- The FP32 precision decision, seemingly trivial, was the critical step from “theoretically works” to “actually trains.” MiniMax’s edge isn’t a single insight but a culture of systematically hunting for these theory-implementation gaps every day.
- Framing open models’ environment adaptability weakness as “a solvable structural problem” rather than a compute gap reveals genuine methodological confidence. Whether that confidence is justified will become clear with M2.5.
- The recursive loop of using AI agents to track AI research, then using that research to improve the AI agents, is itself a signal of how fast the field moves. The researchers who can’t leverage AI to keep up with AI are already falling behind.
- “Problem solving is more of discovery” captures something real about frontier ML research. The answers aren’t in papers. They’re in the gap between what the theory promises and what the GPU actually computes.