February 4, 2026 · Podcast · 1h 58min
Infinite Code Context: AI Coding at Enterprise Scale
Enterprise software development sits in an awkward spot. The best AI coding tools are built for individual developers, but real enterprise work means navigating codebases with tens of millions of lines, intricate dependency chains, and institutional conventions that no single prompt can capture. Blitzy’s bet is that solving context, not model capability, is what unlocks AI-driven enterprise development at scale.
The Episode
Nathan Labenz hosts Brian Elliott (CEO) and Sid Pardeshi (CTO) of Blitzy on the Cognitive Revolution. This is a technically dense, two-hour deep dive into how Blitzy has built a system that ingests entire enterprise codebases, dynamically generates specialized agents, and autonomously completes major software projects. Brian, a former enterprise software executive, handles the product and business narrative. Sid, a prolific inventor formerly at NVIDIA, gets into the architectural details. The conversation covers everything from graph-based code representation to model selection strategy to the economics of charging 20 cents per line of code.
Context Is the Bottleneck, Not Intelligence
Blitzy’s core thesis: the gap between what frontier models can do and what they actually deliver in enterprise settings is almost entirely a context problem. Brian frames it bluntly: you can get “AGI effects without AGI” if you solve context engineering well enough.
Their system ingests codebases of 100 million lines or more. Not by summarizing or embedding them into a vector store, but by building a full graph representation of the codebase, mapping every function call, every dependency, every naming convention, every architectural pattern. When a task comes in, the system doesn’t search for “relevant code.” It traverses the graph to understand exactly which files, functions, and patterns matter for that specific task, then constructs a context window that gives the model everything it needs and nothing it doesn’t.
“We use all of that to create context windows that give the model the highest chance of getting the output right the first time.”
This isn’t retrieval-augmented generation in the usual sense. RAG typically retrieves chunks of text by semantic similarity. Blitzy’s approach is structural: it understands the topology of the codebase. If you’re modifying a function, it knows which other functions call it, which tests cover it, which API contracts it must honor.
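Blitzy hasn't published its graph implementation, but the structural idea can be sketched with a toy traversal: starting from the symbol a task touches, walk outward over edges (callers, callees, covering tests) to collect the context the model should see. The graph shape, symbol names, and `max_hops` cutoff here are all invented for illustration.

```python
from collections import deque

def related_symbols(graph, start, max_hops=2):
    """Breadth-first walk over a toy code graph to collect context.

    `graph` maps a symbol to the symbols it is structurally linked to
    (callers, callees, covering tests) -- a stand-in for the full
    graph representation described in the episode.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # stop expanding past the hop budget
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen

# Toy graph: edges link a function to its callers, callees, and tests.
graph = {
    "billing.charge": ["billing.retry", "test_charge", "api.invoice"],
    "api.invoice": ["test_invoice"],
}
context = related_symbols(graph, "billing.charge")
```

Unlike similarity search, membership in `context` is a structural fact: a symbol is included because an edge connects it to the task, not because its text looks related.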
Agents That Build Agents
The most architecturally novel part of Blitzy’s system is what Brian calls the “dynamic harness.” Instead of pre-defining a fixed set of agents with fixed prompts and fixed tools, Blitzy generates agents just in time. An orchestrator agent looks at the task, examines the relevant code context, and then writes the prompts and selects the tools for the worker agents that will actually do the job.
This means the system’s behavior is not hard-coded. A task involving a React frontend gets agents configured with React-specific prompts and validation tools. A task involving a Python microservice gets a completely different agent configuration. The orchestrator is itself an agent, so the whole system is recursive: agents building agents, each customized for the specific slice of work.
Brian estimates they have about 78 different agent templates, but the actual behavior space is much larger because each agent’s prompt is dynamically composed based on the code context and task requirements.
The Model Zoo
Blitzy doesn’t bet on a single model. They run what Sid describes as a “model zoo,” selecting different models for different subtasks based on benchmarks they maintain internally.
The selection logic is granular. For planning tasks, they might use one model. For code generation in a specific language, another. For code review and validation, a third. They run cross-checks between models: if two independent models agree on a code change, confidence goes up. If they disagree, the system flags it for deeper analysis or human review.
They prioritize three properties in model selection: raw reasoning ability, instruction following (does the model actually do what you ask?), and context window utilization (does the model degrade gracefully at 100K+ tokens?). Brian notes that models often have wildly different strengths across these axes, so the optimal choice varies by task.
The cross-checking approach is particularly interesting for security-sensitive code. Sid describes using one model to generate code and a different model family to audit it, on the theory that different training distributions create different blind spots.
Memory Over Fine-Tuning
A recurring theme in the conversation is Blitzy’s preference for memory systems over fine-tuning. Their argument: fine-tuning is expensive, slow to iterate, and creates model-specific lock-in. Memory, stored as structured context that gets injected into prompts, achieves similar customization effects while remaining model-agnostic and instantly updatable.
Their memory system captures several layers:
- Codebase conventions: naming patterns, architectural rules, style preferences extracted from the existing code
- Project history: what changes have been made, what patterns were established in previous tasks
- Error patterns: when the system makes a mistake and a human corrects it, that correction becomes memory that prevents the same mistake in future tasks
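The layers above can be sketched as structured context assembled into a prompt at request time. The section headings and memory keys are invented; the point is that updating memory is a data write, not a training run, which is what keeps it model-agnostic and instantly updatable.

```python
def build_prompt(task, memory):
    """Compose a prompt from layered memory: conventions, project
    history, and corrections from past mistakes. Empty layers are
    simply omitted."""
    sections = [
        ("Codebase conventions", memory.get("conventions", [])),
        ("Project history", memory.get("history", [])),
        ("Known pitfalls (from past corrections)", memory.get("errors", [])),
    ]
    lines = []
    for title, items in sections:
        if items:
            lines.append(f"## {title}")
            lines.extend(f"- {item}" for item in items)
    lines.append(f"## Task\n{task}")
    return "\n".join(lines)

memory = {
    "conventions": ["snake_case for functions"],
    "errors": ["do not use floating point for currency"],
}
prompt = build_prompt("add a refund endpoint", memory)
```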
“We’ve always been memory-forward. We believe that memory is going to be more important than fine-tuning for most enterprise use cases.”
This is a bet on the continued improvement of frontier models. If base models keep getting better, memory gives you customization without the maintenance cost of retraining. Fine-tuning makes more sense if you need capabilities that base models fundamentally lack, but Brian argues that frontier models already have the raw capabilities; they just need the right context.
Planning at Scale
For large projects, Blitzy breaks work into what Brian calls a “planning tree.” A top-level planner decomposes the project into modules, each module into tasks, each task into subtasks. The tree can go 4-5 levels deep for major modernization projects.
The key insight is that planning and execution are interleaved, not sequential. As agents complete lower-level tasks, the results feed back into the planner, which can adjust the remaining plan. If an agent discovers that a particular API doesn’t behave as documented, the planner can restructure downstream tasks that depend on that API.
Parallelism is aggressive. Independent subtasks run concurrently, with the system tracking dependencies to ensure correct ordering where it matters. Brian describes projects where hundreds of agents are working simultaneously on different parts of the codebase, with a coordination layer preventing conflicts (like two agents modifying the same file).
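One simple way to realize "independent subtasks run concurrently, dependencies ordered correctly" is wave scheduling over the dependency graph: every task whose dependencies are done joins the current wave, and everything in a wave can run in parallel. This is a generic topological-sort sketch with invented task names, not Blitzy's actual coordinator.

```python
def schedule_waves(tasks, deps):
    """Group tasks into waves: a task enters a wave once all of its
    dependencies are complete, so each wave is safe to parallelize."""
    done, waves = set(), []
    remaining = set(tasks)
    while remaining:
        wave = {t for t in remaining if deps.get(t, set()) <= done}
        if not wave:
            raise ValueError("dependency cycle detected")
        waves.append(sorted(wave))
        done |= wave
        remaining -= wave
    return waves

# Toy project: deploy depends on build and test; test depends on build.
deps = {"deploy": {"build", "test"}, "test": {"build"}}
waves = schedule_waves(["build", "test", "deploy"], deps)
```

The cycle check matters at scale: with hundreds of agents in flight, a planner bug that creates circular dependencies should fail loudly rather than deadlock.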
The 20-Cent Line and the Last 20%
Blitzy charges approximately 20 cents per line of generated code, which Brian frames as roughly 100x cheaper than having a human engineer write it. The pricing reflects their confidence in the output quality and their goal of making AI-generated code a commodity input to enterprise development.
But they’re candid about limitations. Their current system autonomously completes about 80% of tasks in major enterprise projects. The remaining 20% requires human intervention, usually because the task involves ambiguous requirements, novel architectural decisions, or edge cases that the codebase context doesn’t fully resolve.
Brian sees the path to 99%+ autonomous completion as primarily a context and memory problem, not a model capability problem. Better memory means fewer repeated mistakes. Better context means fewer ambiguous situations. Better planning means the system catches integration issues earlier instead of at the end.
Strange Behaviors and Judges
Nathan steers the conversation toward AI reliability, and Sid shares several examples of strange model behaviors they’ve observed at scale:
Hallucinated APIs: Models sometimes generate calls to functions that don’t exist in the codebase, often “inventing” reasonable-sounding function names that follow the naming convention but have no implementation. Their graph-based validation catches this by checking every function call against the actual codebase graph.
Style drift: Over long generation sessions, models gradually drift away from the codebase’s style conventions. Their solution is periodic “style anchoring,” where the system re-injects style examples from the codebase into the context window.
Confident errors: Models occasionally produce code that is syntactically correct, logically coherent, and completely wrong in the business logic. These are the hardest to catch automatically and are the primary reason they use cross-model validation.
They’ve built what they call “judge” models, separate agents whose only job is to evaluate the output of generation agents. The judges have their own specialized prompts that focus on common failure modes, and they can flag issues before code is committed.
Reasoning Budgets and Autonomy
An interesting operational detail: Blitzy assigns “reasoning budgets” to different tasks. Simple, well-defined tasks (rename a variable across a codebase) get small reasoning budgets, meaning cheaper, faster model calls. Complex tasks (architect a new microservice that integrates with 15 existing services) get large reasoning budgets, meaning more expensive models, more cross-checks, and more planning iterations.
The budget allocation is itself automated. An initial classifier agent looks at the task description and the relevant code context, then assigns a budget tier. Brian notes that this has been one of their most impactful optimizations: spending reasoning compute where it matters and saving it where it doesn’t.
This connects to their broader autonomy philosophy. They don’t aim for full autonomy on every task. Instead, they want the system to be maximally autonomous where it can be confident, and to escalate to humans early when it can’t. The worst outcome is a system that spends hours going down a wrong path before asking for help.
“We’d rather the system say ‘I see two valid approaches here, which do you prefer?’ than silently pick one and have to redo everything.”
Security of AI-Generated Code
Sid addresses security directly. Their approach has several layers: static analysis tools run on all generated code, the cross-model audit catches patterns that single-model generation misses, and they maintain a library of known vulnerability patterns that get checked against every output.
They also run generated code in sandboxed environments before it touches the actual codebase, checking for unexpected behaviors like network calls, file system access outside expected paths, or resource consumption anomalies.
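The post-run audit can be sketched as a scan over a sandbox event log for the three anomaly classes mentioned. The event schema, path whitelist, and CPU threshold here are invented; a real sandbox would capture these signals via syscall tracing or container instrumentation.

```python
EXPECTED_PATHS = ("/workspace",)

def audit_sandbox_log(events):
    """Scan a sandbox execution log for the behaviors flagged above:
    network calls, writes outside expected paths, resource spikes."""
    findings = []
    for e in events:
        if e["type"] == "network":
            findings.append(f"unexpected network call to {e['host']}")
        elif e["type"] == "write" and not e["path"].startswith(EXPECTED_PATHS):
            findings.append(f"write outside workspace: {e['path']}")
        elif e["type"] == "cpu" and e["seconds"] > 60:
            findings.append("resource consumption anomaly")
    return findings

events = [
    {"type": "write", "path": "/workspace/out.py"},   # expected
    {"type": "network", "host": "example.com"},       # flagged
]
findings = audit_sandbox_log(events)
```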
Brian notes that in their experience, AI-generated code has a roughly comparable vulnerability rate to human-written code, but the types of vulnerabilities are different. AI tends to be good at avoiding common injection attacks (it’s seen the patterns in training data) but occasionally introduces subtle logic bugs in authentication flows or access control that a human reviewer might catch by reasoning about the business logic.
The Future of Software Work
The conversation closes with a surprisingly nuanced take on what this means for software engineers. Brian doesn’t predict mass unemployment. Instead, he sees a shift toward what he calls “specification engineering”: the ability to precisely describe what you want a system to do becomes more valuable than the ability to write the code that does it.
Sid adds that deep systems knowledge becomes more valuable, not less. Someone who understands database internals, network protocols, or security models can specify tasks that AI completes well. Someone who only knows surface-level coding patterns will find that AI can do exactly that surface-level work.
They both predict that the number of software projects will increase dramatically because the cost drops so much, meaning total demand for people who can envision, specify, and validate software systems goes up, even as the per-project need for hands-on coding goes down.
A Few Observations
This is a sponsored episode, and Brian and Sid are obviously selling their product, but the technical depth is genuine. The conversation reveals a team that has spent significant time in the weeds of enterprise AI deployment and has non-obvious opinions.
- The “context over fine-tuning” thesis is a strong bet that aligns with how the model landscape is evolving. If frontier models keep improving, systems that inject the right context will outperform systems that fine-tune on yesterday’s model.
- The dynamic agent architecture, where agents generate agents, is an elegant solution to the combinatorial explosion of enterprise use cases. You can’t pre-build an agent for every combination of language, framework, and business domain. But you can build an agent that builds the right agent.
- The “reasoning budget” concept deserves wider adoption. Most AI coding systems spend the same compute on trivial and complex tasks, overspending on the former and underserving the latter.
- Their candor about the “last 20%” is refreshing. The honest framing is not “AI replaces developers” but “AI handles the 80% of work that is well-specified, freeing humans for the 20% that requires judgment.”
- The cross-model validation approach to security is genuinely clever. Using different model families as independent auditors exploits the fact that different training distributions create different blind spots.