Throughput Without Verification Isn’t a Superpower—It’s a Liability
I gave six agents a week’s worth of work and they finished in two hours. Then I spent three days fixing what they broke. Here is how you need to think about coding with agents.
In the “Vibe-coding” honeymoon phase, we marvel at the ability to generate features with a single prompt. But as we move from simple experiments to managing complex agent fleets, we hit a vertical wall: throughput without verification is a liability, not a superpower.
This piece explores why the rise of Agentic Engineering is forcing a return to the architectural rigor of the platform era. I examine the new economics of “Token Burn,” the logistical chaos of the “Worktree Explosion,” and why the most critical role in the AI-native SDLC isn’t the implementer—it’s the Architect who can define the contracts and build the “Verification Fortress” to keep the agents in check.
The Two-Hour Regression
I recently spent four days doing the deep, quiet, and often lonely work of a founder: grooming a backlog. I was sharpening requirements, refactoring small pieces of technical debt, and getting my project ready for a massive push. It felt like I had laid the tracks perfectly. Every ticket was clear, every acceptance criterion was defined.
I decided to put my own “Agentic Engineering” thesis to the ultimate test. I handed that groomed backlog to six concurrent Claude agents.
For about two hours, it was magical. My screen was a blur of activity—terminal commands flying by, files being created, and “Done” markers appearing in rapid succession. The agents reported success with the absolute, unshakeable confidence that only a machine can muster.
I felt that brief, intoxicating rush of agentic velocity—the feeling that I had just compressed a week of high-intensity engineering into a single lunch break.
Then I opened the app.
It was a disaster. It wasn’t just that things were “buggy”—it was that the entire soul of the application had been mangled. Features I had already shipped, polished, and verified a week prior were now completely re-written in a way that made no sense. I had obsessed over the UI, spending hours on the “feel” and the responsiveness of the interface, and in just two hours, it had drastically regressed into a broken, generic mess.
The agents hadn’t just “done the work”; they had unilaterally decided to refactor functioning code into a broken state because it fit their local “reasoning” at that moment. I didn’t spend the next three days building new features; I spent them chasing regressions, untangling hallucinations, and acting as a high-priced triage nurse for a fleet of over-eager AI bots.
That was the moment I realized I was still thinking like a developer from six months ago. I was skimping on tests to “write them later” once the basic functionality was settled. Partly, I was worried about token costs—thinking I was being “economical.” But after seeing my UI shredded, I realized I could no longer afford not to verify.
In a world where the PRD is generated, the API is generated, the code is generated, and the tests are generated... how the hell do you know if any of it actually works?
The Shift: From “Vibe-Coding” to Agentic Engineering
We are currently transitioning out of the “Vibe-Coding” honeymoon. Vibe-coding is a seductive, low-friction loop. It’s that feeling when you ask a chatbot to “make me a landing page” and it works. It’s magic for a weekend project, but it fails the moment the system grows beyond a few files or requires a team (even a team of agents) to collaborate.
The Vibe-Coding SDLC:
Idea → Plan → Push (and hope the “vibe” is right).
As I’ve moved into serious building with agents, I’ve had to adopt a fundamentally different lifecycle—Agentic Engineering. It’s a disciplined SDLC that treats the agent as a high-throughput, probabilistic implementation engine that must be constrained by deterministic boundaries.
The Agentic Engineering SDLC (first pass):
Idea → Plan (Overview → Contracts & Specs) → Code → Tests → Push.
The next iteration loop:
Idea → Plan → Code → Verify Regressions → Push.
The critical phase in this new workflow is Sharpening Contracts & Specs. This is where the real engineering happens. You cannot let the agent touch a line of source code until you have forced it to define the API boundaries, the UI constraints, and the success criteria.
By “freezing” the contract before a single line of implementation is written, you create a deterministic cage for a probabilistic machine. If the agent generates code that doesn’t match the contract, the code is discarded. You aren’t just telling the agent what to build; you are telling it what it is forbidden to change.
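As a minimal sketch of what a “frozen” contract can look like in practice (the names here are hypothetical, not from any real project), you can pin the boundary as an explicit interface and structurally reject any generated implementation that drifts from it:

```python
from typing import Protocol, runtime_checkable


# The frozen contract: agents may rewrite the implementation,
# but they are forbidden to change this interface.
@runtime_checkable
class TaskStore(Protocol):
    def add(self, title: str) -> int: ...
    def complete(self, task_id: int) -> None: ...
    def pending(self) -> list[str]: ...


class InMemoryTaskStore:
    """One possible agent-written implementation."""

    def __init__(self) -> None:
        self._tasks: dict[int, tuple[str, bool]] = {}
        self._next_id = 0

    def add(self, title: str) -> int:
        self._next_id += 1
        self._tasks[self._next_id] = (title, False)
        return self._next_id

    def complete(self, task_id: int) -> None:
        title, _ = self._tasks[task_id]
        self._tasks[task_id] = (title, True)

    def pending(self) -> list[str]:
        return [t for t, done in self._tasks.values() if not done]


store = InMemoryTaskStore()
# Structural gate: code that breaks the contract is discarded.
# (runtime_checkable only verifies method presence, not signatures;
# a real gate would also run behavioral tests against the contract.)
assert isinstance(store, TaskStore)
```

The point isn’t the `Protocol` machinery; it’s that the interface lives in a file the agent is told it may read but never edit.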
Lessons from the Java Platform Days: The TCK Mindset
This reality check took me straight back to a formative era of my career in the Java organization at Sun Microsystems. When you are building a platform or a standard like JavaEE, you are essentially building a contract for a world you cannot control.
We were writing specifications that would be implemented by companies as diverse as IBM, BEA, and Oracle, and by open-source communities like JBoss. You couldn’t “trust” their implementations by default. You had to obsess over the Interface and the Verification. Java EE certification meant passing the TCK (Technology Compatibility Kit), a suite of 100,000+ tests that an implementation had to clear to be certified.
We didn’t write those tests because we were pedantic. We wrote them because when you have multiple parties (or multiple agents) building on a shared foundation, the contract and the test suite are the only things that prevent total chaos.
In 2026, the “other party” is an LLM—the ultimate “unpredictable implementer.” To build non-trivial systems today, we have to treat our own projects like a platform standard:
Interface over Implementation: The “how” matters less than the “what.” As long as the contract remains unbroken, the agent can iterate.
The TCK for your App: You need a suite of tests so rigorous that even if the agent refactors the entire backend, if it doesn’t pass the TCK, it doesn’t get shipped.
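A “TCK for your app” can start as nothing more exotic than a behavioral test battery that runs against any implementation of the contract, never against a specific one. A toy sketch, with a made-up `KVStore` contract standing in for your real boundaries:

```python
from typing import Callable, Optional, Protocol


class KVStore(Protocol):
    def put(self, key: str, value: str) -> None: ...
    def get(self, key: str) -> Optional[str]: ...
    def delete(self, key: str) -> None: ...


def run_tck(make_store: Callable[[], KVStore]) -> None:
    """Behavioral certification: any implementation, agent-written
    or not, must pass every check or it doesn't ship."""
    s = make_store()
    assert s.get("missing") is None      # reads on an empty store
    s.put("k", "v1")
    assert s.get("k") == "v1"            # read-your-writes
    s.put("k", "v2")
    assert s.get("k") == "v2"            # last write wins
    s.delete("k")
    assert s.get("k") is None            # delete actually removes


class DictStore:
    """Reference implementation (the agent is free to replace it)."""

    def __init__(self) -> None:
        self._d: dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._d[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._d.get(key)

    def delete(self, key: str) -> None:
        self._d.pop(key, None)


run_tck(DictStore)  # raises AssertionError on any contract violation
```

Because the battery takes a factory rather than an instance, the same suite certifies every candidate the fleet produces, exactly the way the Java EE TCK certified IBM and JBoss alike.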
The Logistics of the Fleet: Worktree Explosion
This isn’t just a philosophical shift; it’s a logistical nightmare.
Right now, I’m coding with a fleet of 4 to 6 agents simultaneously. Each of these agents spins up its own worktree. At peak, that’s six concurrent worktrees, each with its own set of changes, each running its own test suite.
This is a “Director of Engineering” problem scaled down to a single person. I’m no longer a coder; I’m a Triage Director. I’m looking at 4x to 6x the test failures I’m used to. And when you see that many failures, the temptation to “vibes-check” them and move on is immense.
But every failed test in an agentic workflow is a “reasoning leak” that will cascade into the rest of your system.
If I scale this to 10s or 100s of agents—the Gastown horizon—the volume of verification noise becomes a vertical wall. The “human in the loop” can no longer monitor the output of the fleet without a radically different approach to verification.
The New P&L: Token Economics vs. CI Minutes
In traditional DevOps, we obsessed over CI minutes. In the agentic world, that equation has been turned upside down.
The bottleneck is now the Agent Re-contextualization Cost.
Every time a test fails, the agent doesn’t just “fix it.” It has to read the failure log, re-read the source, and re-evaluate the context of the system.
This “thinking time” (tokens) is a direct tax on your progress.
Paradoxically, AI has made test creation cheap, but it has made test execution (and reasoning) extremely expensive.
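To make the tax concrete, here is a back-of-envelope calculation with made-up but plausible numbers (every constant below is an assumption, not a measurement):

```python
# Hypothetical cost model for agent re-contextualization.
TOKENS_PER_FAILURE_TRIAGE = 20_000  # re-read logs, source, and context
FULL_SUITE = 1_000                  # tests run per iteration
FAILURE_RATE = 0.02                 # 2% of tests fail or flake
WORKTREES = 6                       # concurrent agent worktrees
ITERATIONS = 10                     # iterations per working session

failures = FULL_SUITE * FAILURE_RATE * WORKTREES * ITERATIONS
tokens = failures * TOKENS_PER_FAILURE_TRIAGE
print(f"{failures:.0f} failures -> {tokens:,.0f} triage tokens per session")
# -> 1200 failures -> 24,000,000 triage tokens per session
```

Even at these modest rates, triage tokens dwarf generation tokens, which is why shrinking the number of tests each iteration runs matters more than shrinking the tests themselves.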
This is where my time as CEO of Launchable feels eerily relevant again. Years ago, we pitched Smart Test Selection to 500-person engineering teams.
Today, it’s a survival tool for the solo dev.
I cannot afford to run 100,000 tests on every agent iteration across 6 worktrees.
I need a “judgment layer” that says: “Based on these 4 files the agent changed, only run these 50 tests.” This preserves the token budget and keeps the velocity high without the regression tax.
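A crude version of that judgment layer is just a reverse dependency map from source files to the tests that exercise them. Real tools (Launchable included) use much richer signals, but the shape is roughly this; the file names are invented for illustration:

```python
# Which source files each test suite touches. In practice you would
# record this with coverage tooling; here it's a hand-written stand-in.
TEST_DEPS: dict[str, set[str]] = {
    "test_auth.py":    {"auth.py", "session.py"},
    "test_billing.py": {"billing.py", "auth.py"},
    "test_ui.py":      {"ui.py"},
}


def select_tests(changed_files: set[str]) -> set[str]:
    """Return only the tests whose dependencies intersect the diff."""
    return {
        test for test, deps in TEST_DEPS.items()
        if deps & changed_files
    }


# The agent touched one file; run the two suites that care, not all three.
print(sorted(select_tests({"auth.py"})))
# -> ['test_auth.py', 'test_billing.py']
```

The judgment call is in building `TEST_DEPS` accurately and cheaply; the selection itself is a set intersection.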
The Death of the Pipeline Configurator
In my time at Launchable and CloudBees, I sat through thousands of calls where leaders were terrified of touching their YAML/CI pipelines for fear of impacting hundreds of devs. The pipeline was a “black box” of legacy fear.
In the agentic world, that fear has to die. The pipeline is a dynamic extension of the agent’s intent.
We are moving from Implementation to Intent.
I don’t debug deployments in a dashboard anymore.
I tell Claude: “Use the Railway API to pull the latest logs so you can analyze why the staging build is failing.” (Not entirely there yet, imho, but it’s a matter of weeks, if not months.)
This is the future of Self-Healing Infrastructure.
If a build is slow, the agent proactively analyzes the bottleneck and pulls in tools to prune the test suite. We are moving from “Developer Experience” (DX) to “Agent Experience” (AX).
Conclusion: The Return of the Architect
We are in a brief moment where we are fascinated that the bear can dance. But as code generation becomes a commodity, its value will inevitably drop toward zero.
The value in software engineering is moving back to the Architect. The most important person in the room is no longer the one who can churn out the most lines of code; it’s the person who can define the contracts, sharpen the specifications, and build the verification fortress that keeps the agents in check.
As I manage my own fleet of agents today, I realize the most critical work I do isn’t “coding” in the traditional sense. It’s defining the TCK for my own vision.
Throughput without verification isn’t a superpower—it’s a liability. In the agentic age, the winners won’t be the ones with the fastest agents; they’ll be the ones who can verify those agents the fastest. It’s time to stop “vibing” and start engineering.

