Why spec-driven development isn't enough

A spec is only as good as the decisions underneath it, and those keep changing.

Spec-driven development works. Teams that write a clear spec before they let an agent build get better results than teams that prompt and pray. That's why GitHub shipped Spec Kit, AWS shipped Kiro, and Thoughtworks put the practice on its Technology Radar.

But the spec stops short of where the real problem is.

Output was never the bottleneck

The bet of the last year was that volume was the win: more agents, more tokens, more code. The receipts don't support it.

METR's randomized controlled trial found experienced developers were 19% slower with AI tools, while they reported feeling faster. The sensation of speed and the measurement of speed had come apart.
Faros AI, across roughly 22,000 developers, found teams merged 98% more pull requests but spent 91% more time reviewing them. The work didn't disappear. It moved downstream, from writing code to checking code nobody wrote.
Atlassian found 89% of executives say AI sped up work, but only 6% can point to an organization-wide result. Speed moved at the individual layer; ROI gets measured at the org layer, and the two stopped lining up.

There's a name for the pattern these numbers describe. Tokenmaxxing: optimizing for volume of output as if it were the goal. Uber's COO told Business Insider it's getting harder to justify the spend, after the company burned a year's token budget in a single quarter without proportional gains. This is Goodhart's Law on a build pipeline: once tokens became the target, tokens stopped being a useful measure. More output is not more value.

Spec-driven development is the industry's first honest answer to that problem: write down what you actually want before the agent builds the wrong thing fast. Where it stops short is what happens after.

A spec is a snapshot

Most spec-driven development is an old discipline in new clothing: a tech spec or RFC that describes how to build something. The how matters, but on its own it's thin. A tech spec is only worth the why underneath it: the PRD, the customer requirement, the reason the team chose this approach over the alternatives. Hand an agent the how without the why and you've told it how to build something without telling it what the thing is for. That holds up right until reality shifts and the how needs to bend, and the agent has no idea which way to bend it.

But a PRD isn't done when the work starts. That's the entire premise of agile. You can't know everything up front, and most of what you learn about a problem you learn by building it. The first real implementation surfaces a constraint nobody saw. A customer call moves a priority. An engineer finds a better approach halfway through. Each of those moves a decision the spec was built on, while the spec itself sits exactly where you left it, still describing the old plan with total confidence.

The gap is where decisions live

Spec Kit's loop (specify, then plan, then tasks, then implement) keeps the spec and the code in sync inside the repo. But the decisions that make a spec stale don't happen in the repo. They happen in Slack, on calls, in the meeting where someone says "actually, let's not build that."

So the spec drifts. It was true the day you wrote it and a little less true every day after. The agent doesn't know that. It reads the document, trusts it completely, and builds exactly what you decided three weeks ago. Fast, and wrong.

That's the output trap again, wearing a process as a disguise. And it's the most expensive kind of waste, because it looks like progress. On our own platform, across 2,444 companies, 82 cents of every dollar spent on AI coding never reaches a shipped product: 44 cents fixing bugs the agents created, 27 cents reworking code, 11 cents lost to review and merge friction. A stale spec feeds that number directly: the agent executes quickly against a decision the team already moved past.

Bigger specs make it worse

The obvious fix is to write more down. It backfires twice.

First, longer specs rot faster, because there's more to keep current and more to contradict the next decision. Second, more context makes the agent worse, not better. Research from Chroma shows model accuracy degrades as the context window fills, even on simple retrieval, and across every frontier model they tested. They call it context rot. A 40-page spec doesn't make the agent smarter. It buries the part that matters in the part that doesn't.

This is why the practice that matters is context engineering, not document length. The goal isn't to hand the agent everything. It's to hand it the few things that are true right now. A spec fails the same way an overloaded context window does: too much, too stale, and no signal about which line still holds.
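To make "the few things that are true right now" concrete, here is a minimal sketch in Python. It assumes the team keeps a simple decision log; the Decision fields, the current_decisions helper, and the sample entries are all hypothetical, invented for illustration, and don't describe how Spec Kit, Kiro, or Brief work. The idea is only this: when a later decision on the same topic supersedes an earlier one, the agent should see the later one, and only a handful of them.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Decision:
    # Hypothetical record shape; a real log would likely carry more fields.
    topic: str        # what the decision is about
    text: str         # the decision as the team stated it
    decided_on: date  # when the team made it


def current_decisions(log: list[Decision], limit: int = 5) -> list[Decision]:
    """Keep only the latest decision per topic, newest first, capped at `limit`."""
    latest: dict[str, Decision] = {}
    for d in log:
        seen = latest.get(d.topic)
        # A later decision on the same topic supersedes the earlier one.
        if seen is None or d.decided_on > seen.decided_on:
            latest[d.topic] = d
    newest_first = sorted(latest.values(), key=lambda d: d.decided_on, reverse=True)
    return newest_first[:limit]


# Made-up sample log: the export decision was reversed three weeks later.
log = [
    Decision("export", "Build CSV export", date(2025, 1, 6)),
    Decision("auth", "Use SSO only", date(2025, 1, 8)),
    Decision("export", "Drop CSV export for now", date(2025, 1, 27)),
]

for d in current_decisions(log):
    print(f"{d.decided_on} [{d.topic}] {d.text}")
```

With the sample log, the agent would see "Drop CSV export for now" and "Use SSO only", and never the original plan to build the export. The hard part is not the filter. It's getting the decisions out of Slack and calls and into the log in the first place.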

What the spec depends on

A spec is only as good as the decisions underneath it, and those decisions live outside the document, scattered across the tools where your team actually works.

That layer, the live record of what the team has decided and why, is what spec-driven development assumes and never provides. Brief keeps it current, so the spec the agent reads matches the decision the team actually made.

Telling the agent what to build is the first half of the job. The harder half is keeping that instruction current as the team's decisions change, and that's the half spec-driven development leaves to you.
