
January 21, 2026 · Podcast · 40min

Harrison Chase on Context Engineering and the Rise of Long-Horizon Agents

#Context Engineering · #AI Agents · #LangChain · #Developer Tools · #Agent Memory

The core algorithm behind today’s most capable AI agents is almost embarrassingly simple: run the LLM in a loop. What separates the agents that actually work from those that don’t isn’t architectural cleverness or novel frameworks. It’s context engineering, the art of controlling what information the model sees at each step. Harrison Chase, who built LangChain around this exact problem before the term even existed, maps out how we got here and where it’s heading.

The Conversation

Harrison Chase joins Sequoia Capital’s Sonya Huang and Pat Grady on Training Data as their inaugural guest. The conversation covers the evolution from scaffolding to harnesses, why building agents is fundamentally different from building software, and the emerging role of memory as a competitive moat. Chase is characteristically candid about what he doesn’t know, which turns out to be quite a lot about where this is all going.

Three Eras of Agent Architecture

Chase identifies three distinct phases in how people have built around LLMs:

Era 1: Chains and single prompts. Early LLMs were text-in, text-out. No tool calling, no content blocks, no reasoning. People built simple chains because that’s all the models could support.

Era 2: Custom cognitive architectures. Model labs trained tool-calling capabilities into models. They got decent at deciding what to do at any given point, but still needed heavy scaffolding. Developers built elaborate branching logic: “what do I do here?” followed by specific paths. Some loops emerged, but the architecture was still mostly developer-designed.

Era 3: Harnesses. Around June-July 2025, something shifted. Claude Code, deep research tools, and Manus all took off using the same underlying approach: the LLM running in a loop with clever context engineering. Sub-agents, skills, compaction strategies: all context engineering. The models finally got good enough that the simple algorithm worked.
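The era-3 pattern is simple enough to sketch in a few lines. This is an illustrative skeleton, not any real framework's API: `call_model` is a stub standing in for an LLM that can either request a tool or return a final answer, and the tool set is hypothetical.

```python
# Minimal sketch of the "LLM running in a loop" harness pattern.
# `call_model` is a stub, not a real model API.

def call_model(messages):
    # Stand-in for an LLM call. A real harness would send `messages` to a
    # model that decides between a tool call and a final answer.
    last = messages[-1]["content"]
    if "TOOL_RESULT" in last:
        return {"type": "final", "content": "done"}
    return {"type": "tool_call", "name": "search", "args": {"q": last}}

# Hypothetical tool registry: name -> callable.
TOOLS = {"search": lambda q: f"TOOL_RESULT for {q!r}"}

def run_agent(task, max_steps=10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = call_model(messages)
        if action["type"] == "final":
            return action["content"]
        # Execute the requested tool and feed the result back into context.
        result = TOOLS[action["name"]](**action["args"])
        messages.append({"role": "tool", "content": result})
    return "max steps reached"

print(run_agent("find the docs"))  # prints: done
```

Everything interesting happens in what gets appended to `messages` between iterations; that editing of the message list is the context engineering.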

“Context engineering is such a good term. I wish I came up with that term. It actually really describes everything we’ve done at LangChain without knowing that that term existed.”

The inflection point may have coincided with Opus 4.5’s release and winter break, when developers went home and discovered how capable Claude Code had become. Chase isn’t sure exactly when the shift happened, but at some point the models crossed the threshold where scaffolds became unnecessary and harnesses became sufficient.

Why Coding Agents Lead the Way

One of the conversation’s most interesting threads is whether coding agents are a subcategory of agents or whether all agents are essentially coding agents. Pat Grady frames it sharply: the job of an agent is to get a computer to do useful stuff, and code is a pretty good way to get a computer to do useful stuff.

Chase’s position is nuanced. He’s completely sold on file systems as essential infrastructure for any long-horizon agent. The reasoning is concrete: compaction strategies can summarize conversation history but store full messages in the file system for later retrieval; large tool call results can be written to files rather than bloating the context window. You can implement this with a virtual file system backed by Postgres, but actual code execution opens up capabilities that virtual file systems can’t match.
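The tool-result offloading move Chase describes can be sketched concretely. This is a generic illustration under assumed names and thresholds, not LangChain's implementation: results over a budget are written to the agent's file system and replaced in context by a short pointer the agent can read back later.

```python
# Sketch of file-system context engineering: large tool results go to
# disk, and only a short pointer enters the model's context.
import tempfile
from pathlib import Path

MAX_INLINE_CHARS = 200  # illustrative per-result context budget

def store_tool_result(result: str, workdir: Path) -> str:
    """Return the text to place in context: the result itself if small,
    otherwise a reference to a file the agent can open later."""
    if len(result) <= MAX_INLINE_CHARS:
        return result
    path = workdir / f"tool_result_{abs(hash(result))}.txt"
    path.write_text(result)
    # The pointer keeps context small while preserving retrievability.
    return f"[result too large: {len(result)} chars, saved to {path.name}]"

workdir = Path(tempfile.mkdtemp())
small = store_tool_result("ok", workdir)
big = store_tool_result("x" * 10_000, workdir)
print(small)  # "ok"
print(big)    # pointer line naming the saved file
```

The same shape works for compaction: keep a summary in context, write the full message history to a file, and let the agent retrieve it on demand.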

However, he distinguishes between “a general-purpose agent that can code” and “today’s coding agents being general purpose.” Current coding agents are heavily optimized for coding tasks. Whether they become the general-purpose interface is one of the biggest open questions on his mind.

“I very, very strongly believe that right now if you’re building a long-horizon agent, you need to give it access to a file system.”

Traces Are the New Source of Truth

The most substantive section of the conversation lays out Chase’s thesis on why building agents is fundamentally different from building software, not as a cliche but with specific, actionable implications.

In traditional software, all logic lives in the code. You can look at the code and know what the software will do. In agents, the logic is split between code and the model’s behavior. You can’t just read the harness code and predict what the agent will do at step 14, because 13 preceding steps could have pulled arbitrary things into context.

This creates a cascade of consequences:

Traces become the primary artifact. In software, traces are something you turn on in production when you suspect errors. In agent development, people use traces from day one, even in local development. They tell you what’s actually in the context at each step, which is the only way to understand what the agent is doing.

Collaboration shifts from code to traces. When something goes wrong with an agent, the response isn’t “show me the code,” it’s “send me the trace.” LangChain’s own open source support now defaults to asking for LangSmith traces rather than code snippets.

Testing requires human judgment. Software testing relies on programmatic assertions. Agent testing often requires evaluating outputs that only humans can meaningfully judge. LangSmith’s annotation queues bring human evaluators into the trace review process. LLM-as-judge systems attempt to proxy this human judgment, but calibrating them against actual human preferences is critical and hard.

Online testing matters more. Agent behavior doesn’t fully emerge until it encounters real-world inputs. You can do some offline unit testing of the harness, but the important testing happens in production with real traces.
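A minimal version of day-one tracing looks something like the following. This is a generic JSONL-style recorder, not the LangSmith API: the point is simply that a failure at step 14 is only explainable if you captured what was in context at steps 1 through 13.

```python
# Hedged sketch of day-one tracing: record what entered the context at
# every step, one JSON object per step.
import json
import time

class Trace:
    def __init__(self):
        self.steps = []

    def record(self, step, context_snapshot, output):
        self.steps.append({
            "step": step,
            "ts": time.time(),
            "context": context_snapshot,
            "output": output,
        })

    def dump(self):
        # One JSON object per line: easy to tail, grep, and diff runs.
        return "\n".join(json.dumps(s) for s in self.steps)

trace = Trace()
trace.record(1, ["user: fix the bug"], "tool_call: read_file")
trace.record(2, ["user: fix the bug", "tool: <file contents>"], "final: patched")
print(trace.dump())
```

In this framing, "send me the trace" means sending exactly this artifact: the per-step context snapshots, not the harness code.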

The Self-Improving Agent Loop

Chase outlines an emerging pattern where agents use their own traces to improve themselves. LangSmith now offers an MCP server and a CLI that coding agents can use to pull down traces, diagnose what went wrong, and then modify the codebase to fix it.

This sounds like recursive self-improvement, and it sort of is, but with an important caveat: there’s still a human in the loop. The agent proposes changes to prompts and instructions, the human reviews them. It’s a first-draft mechanism, not autonomous evolution.

LangSmith’s Agent Builder takes this further with built-in memory. When you interact with an agent and say “you should have done Y instead of X,” the agent edits its own instruction files. The next step Chase wants to build is “sleep time compute,” a term he credits to Letta: a process that runs nightly, reviews the day’s traces, and updates instructions.
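The nightly-review idea can be sketched in outline. Everything here is hypothetical: `summarize_feedback` stands in for an LLM pass over the day's traces, and the output is a proposed edit for a human to review, matching the first-draft framing above rather than autonomous self-modification.

```python
# Illustrative sketch of "sleep time compute": scan the day's traces for
# user corrections and draft an instruction update for human review.

def summarize_feedback(traces):
    # Stand-in for an LLM pass over traces; here it just collects
    # explicit user corrections recorded during the day.
    return [t["feedback"] for t in traces if t.get("feedback")]

def draft_instruction_update(instructions: str, traces) -> str:
    corrections = summarize_feedback(traces)
    if not corrections:
        return instructions
    # Proposed edit only -- a human reviews before it is adopted.
    notes = "\n".join(f"- {c}" for c in corrections)
    return instructions + "\n\n## Learned from today's traces\n" + notes

traces = [
    {"run": 1, "feedback": "you should have labeled the email, not archived it"},
    {"run": 2},
]
print(draft_instruction_update("Triage my inbox.", traces))
```

The key design choice is that the instructions file, not the model weights, is what improves: the harness stays fixed while its context accumulates knowledge.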

Memory as Competitive Moat

Chase makes an unexpectedly strong case for memory as a defensibility mechanism. He recounts migrating his personal email agent from a custom setup to LangSmith’s Agent Builder. Despite having the same starter prompt and the same tools, the new version was noticeably worse because it lacked the accumulated memory of the original. He still hasn’t fully switched over.

This illustrates something important: for agents that handle repeated, domain-specific tasks, the accumulated knowledge of user preferences and domain patterns creates real switching costs. This is different from ChatGPT’s memory feature, which Chase says hasn’t created any stickiness for him because his ChatGPT usage is all one-off queries across diverse topics.

The distinction is between general-purpose chat (where memory adds little value) and purpose-built workflow agents (where memory is the entire value proposition).

Async-Sync Switching and the Agent Inbox

Long-horizon agents need both asynchronous and synchronous interaction modes. If an agent runs for a day, you’re not going to sit and watch. You’ll kick off multiple agents and manage them like tasks on a Kanban board.

But at some point, you need to drop into synchronous mode: review what the agent produced, give feedback, course-correct. LangChain’s first version of its agent inbox only supported async mode: the agent would ping you, you’d respond, then wait for the next ping. Chase found this insufficient when using his own email agent; when switching into a conversation, he wanted full synchronous back-and-forth, not message-and-wait.

The addition of a synchronous chat mode to the inbox was a significant unlock. Pure async doesn’t work yet because agents still need too much correction. The Anthropic Claude Co-work model, where you designate a directory as the agent’s workspace, is a good mental model for how this evolves: shared state that both human and agent can view and modify.

On Code Sandboxes, Browser Use, and File Systems

Chase ranks these capabilities by current viability:

File systems: completely essential. Even for non-coding agents, file systems enable context engineering by storing overflow information that would otherwise bloat the context window.

Code execution: he’s about 90% convinced it’s essential. For the long tail of use cases, writing and running scripts is irreplaceable. Repeated tasks might need less code, but the context management benefits of file systems remain.

Browser use: models aren’t good enough at it yet. You can approximate some browser tasks by giving a coding agent a CLI for browser interaction, but native browser use is still immature.

Can Existing Software Companies Make the Leap?

Pat Grady raises the on-prem-to-cloud analogy: very few incumbents survived that transition because building cloud software was fundamentally different. What about the transition to agents?

Chase offers a two-part answer. At the people level, he’s consistently heard that agent engineering teams skew younger, with more junior developers who don’t have preconceived notions about how software should be built. LangChain’s own applied AI team skews young. But senior developers who adopt agentic coding can also make the transition; it’s more a mindset shift than an age thing.

At the company level, data is the key asset. If you’re an existing software vendor with valuable data and well-built APIs, you should be able to plug those into agent harnesses and extract real value. One finance-sector contact told Chase that data value is “going up and up and up.” The piece that’s genuinely new is the instructions, the knowledge of what to do with that data, which was previously handled by humans and never codified.

This is why vertical AI startups with deep domain expertise are doing well: agents are driven by domain-specific knowledge about how to perform specific patterns, and that knowledge needs to come from somewhere.

Some Thoughts

A few threads worth pulling on:

  • The “harness, not scaffold” framing is more than semantic. It captures a real shift in where intelligence lives: from the developer’s architecture to the model’s judgment, with the developer providing constraints and context rather than decision trees.

  • Chase’s admission that building your own harness is “actually way harder than building a framework” is notable coming from someone who sells frameworks. His prediction that most people won’t build their own harnesses is essentially a bet on consolidation in the agent infrastructure layer.

  • The trace-as-source-of-truth thesis has a provocative corollary: if you can’t understand an agent by reading its code, then code review, the bedrock practice of software engineering, becomes insufficient. We need new practices for a world where behavior is emergent.

  • The memory-as-moat insight deserves more attention. If agents accumulate irreplaceable context through use, then the switching cost isn’t features or data lock-in in the traditional sense. It’s the loss of a trained collaborator who knows how you work.

  • Perhaps the most honest moment: “I look forward to being back on sometime in the future and being completely wrong about everything I said today.” In a field where everyone presents certainty, acknowledging that the future is genuinely unpredictable is refreshing.

Watch original →