Why your engineers got coding agents and shipping didn't speed up

The bottleneck moved from code generation to review, coordination, and scoping.

You bought every engineer a coding agent. Cursor, Claude Code, Copilot. A year in, your engineers are writing more code and your release calendar looks about the same.

The problem is real, and it isn't a tooling failure. Your engineering team feels faster while delivery stays flat because the bottleneck moved. Code got cheap to produce, so the constraint shifted to the steps a model can't do for you: review, coordination, and deciding what to build. The next model won't close that gap, because the bottleneck isn't model capability.

A controlled trial found AI made developers slower

In July 2025, the nonprofit METR published the results of a randomized controlled trial, which is rare for AI productivity claims. Sixteen experienced open-source developers completed 246 real tasks on codebases they had worked in for five years on average. Each task was randomly assigned to allow or disallow AI tools, which were mostly Cursor Pro with Claude 3.5 and 3.7 Sonnet.

The developers predicted AI would make them 24% faster. Afterward, they estimated it had made them about 20% faster. The measured result was the opposite: tasks took 19% longer with AI allowed (METR, 2025).
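To make the size of that gap concrete, here is a small worked example. It is a sketch under two assumptions that are not from the study: a hypothetical 60-minute baseline task, and reading "X% faster" as "X% less time."

```python
# Hypothetical baseline: a task that takes 60 minutes without AI.
BASELINE_MINUTES = 60.0


def minutes_with_change(baseline: float, pct_change_in_time: float) -> float:
    """Return task time after a percentage change in completion time.

    Negative values mean less time (faster), positive values mean more time.
    """
    return baseline * (1 + pct_change_in_time / 100)


# Assumption: "24% faster" and "20% faster" are read as 24% and 20% less time.
predicted = minutes_with_change(BASELINE_MINUTES, -24)
perceived = minutes_with_change(BASELINE_MINUTES, -20)
# The measured result: tasks took 19% longer with AI allowed.
measured = minutes_with_change(BASELINE_MINUTES, +19)

# Distance between what developers believed and what the clock showed.
perception_gap = measured - perceived

print(f"predicted: {predicted:.1f} min")
print(f"perceived: {perceived:.1f} min")
print(f"measured:  {measured:.1f} min")
print(f"gap between perceived and measured: {perception_gap:.1f} min")
```

Under those assumptions, a developer who believes the task took about 48 minutes actually spent about 71, a gap of roughly 23 minutes on a one-hour task.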

The size of that gap matters. People who write software for a living, on code they know cold, were slower with AI and did not notice. They expected AI to make them faster, believed afterward that it had, and the measurement showed otherwise. The tools were not failing them. The lost time went into prompting, reviewing, and verifying AI output, which they were not tracking.

Three caveats are worth naming. The sample was small, sixteen developers, and the confidence interval is wide. The models were early 2025, and frontier agents have improved since. METR started a follow-up study and reported in February 2026 that they could not get a clean signal, partly because too many developers refused to work without AI at all. The METR result doesn't prove AI always slows people down. It shows that measured speed and the feeling of speed can diverge, and that the feeling is a poor proxy for the measurement.

Where the time goes: review, not coding

When code becomes cheap to produce, the constraint moves somewhere else. The METR developers spent less time typing and searching, and more time prompting, waiting, reading, and double-checking output they did not write.

The same pattern shows up well past sixteen people. Faros AI, tracking roughly 22,000 developers, found that teams with high AI adoption merged 98% more pull requests and completed about 21% more tasks. In the same teams, pull request review time rose 91% (Faros AI, 2025). The added work concentrated at the one step a model cannot do for you: a human deciding whether the change is correct and ready to ship.

Your commits climb, your PRs climb, the repo looks busy, and deployment frequency stays flat. Once code got cheap to produce, activity and delivery stopped moving together. The scarce resource now is your attention: first to confirm the code is right, and before that, to decide it was the right thing to build at all.
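A toy model shows why activity and delivery come apart. This is a sketch with made-up numbers, not data from Faros AI: it assumes a fixed weekly review capacity and that nothing ships without review. The function name and parameters are illustrative.

```python
def simulate(weeks: int, prs_opened_per_week: int, review_capacity_per_week: int):
    """Return (total PRs shipped, PRs still waiting for review).

    Assumes every PR must pass human review before it ships and that
    review capacity does not change when PR volume does.
    """
    backlog = 0
    shipped_total = 0
    for _ in range(weeks):
        backlog += prs_opened_per_week
        reviewed = min(backlog, review_capacity_per_week)
        backlog -= reviewed
        shipped_total += reviewed
    return shipped_total, backlog


# Hypothetical team: reviewers can get through 10 PRs a week.
before = simulate(weeks=4, prs_opened_per_week=10, review_capacity_per_week=10)
# Same team after agents double the number of PRs opened.
after = simulate(weeks=4, prs_opened_per_week=20, review_capacity_per_week=10)

print("before agents (shipped, waiting):", before)
print("after agents  (shipped, waiting):", after)
```

In this toy model, doubling the PRs opened ships the same 40 changes over four weeks and leaves 40 more sitting in the review queue. Throughput is set by the slowest stage, so adding volume upstream of it only lengthens the queue.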

Context rot: why longer inputs degrade output

LLMs get measurably worse as their input gets longer, and that hidden degradation is why agents add overhead instead of removing it. The degradation starts well before they run out of room. Chroma's 2025 research tested 18 frontier models, including GPT-4.1, Claude 4, and Gemini 2.5, and found that every one of them degrades as context grows, even on simple retrieval tasks (Chroma, 2025). Chroma's researchers named the effect context rot. A model with a 200K token window can show real degradation at 50K. The decline is gradual, not a cliff, which is exactly what makes it hard to notice.

It compounds with a problem researchers have documented since 2023: models pay most attention to the start and end of their context and lose information buried in the middle, the "lost in the middle" effect first measured at Stanford. Performance gets worse still on tasks that need the model to connect two facts rather than retrieve one.

Now apply that to a coding agent. As it searches your repo, opens files, backtracks, and reasons through a task, it accumulates tokens. Most of those tokens are noise: dead ends, stale file contents, half-explored paths. That noise degrades every output that follows. The model is usually smart enough to solve your problem. It just gets worse at it as the noise builds up. Clean the input and the model's full capability comes through. That is the one variable here you actually control.
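As a rough illustration of what cleaning the input can mean, the sketch below prunes an agent's working history before the next model call. The `Event` type, its fields, and the pruning rules are hypothetical, chosen for this example only. Real agents manage context in their own ways.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str    # hypothetical categories: "file_read", "dead_end", "note"
    key: str     # file path or short description
    tokens: int  # rough token cost of keeping this event in context


def prune(history: list[Event]) -> list[Event]:
    """Drop dead ends and keep only the most recent read of each file."""
    latest_read = {
        e.key: i for i, e in enumerate(history) if e.kind == "file_read"
    }
    kept = []
    for i, e in enumerate(history):
        if e.kind == "dead_end":
            continue  # abandoned approach: noise for later steps
        if e.kind == "file_read" and latest_read[e.key] != i:
            continue  # stale copy of a file that was read again later
        kept.append(e)
    return kept


history = [
    Event("file_read", "billing.py", 4000),
    Event("dead_end", "tried patching the cache layer", 6000),
    Event("file_read", "billing.py", 4200),
    Event("note", "tax must be applied before discount", 50),
]

pruned = prune(history)
tokens_before = sum(e.tokens for e in history)
tokens_after = sum(e.tokens for e in pruned)
print(f"{tokens_before} tokens before pruning, {tokens_after} after")
```

In this made-up history, the one-line note that constrains the task costs 50 tokens, and most of the rest is a stale file copy and an abandoned approach.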

Why a bigger context window won't save you

The instinct is to reach for more capacity. Pick the model with the million-token window, load in the whole codebase, every doc, every Slack thread, and let it sort things out.

Context rot says capacity is the wrong thing to optimize for. What matters is the ratio of signal to noise. Adding more to the window adds more noise, and the research shows the agent performs worse as that noise grows. More context means more for the agent to filter, not more help.

Giving the agent the right context is a different thing from giving it all of it. Loading the full codebase into the window feels like the safe default and tends to hurt performance instead.

Coordination is the real constraint

Step back from the individual developer and the picture gets clearer. Atlassian's State of Teams 2026 report surveyed more than 12,000 knowledge workers and 170-plus Fortune 1000 executives. Among the executives, 89% said AI increased the speed of work, and 6% felt confident they could point to specific, organization-wide ROI (Atlassian, 2026). Speed went up at the individual layer; ROI gets measured at the organization layer, and the two stopped lining up.

Atlassian's read matches what the engineering data shows: organizations are spending as if writing code is the bottleneck. It is not, and it has not been for a while. The constraint is coordination: alignment on what to build, shared understanding of why a past decision was made, and the review step where a human confirms the work is right. Point ten agents at a codebase and the coordination constraint becomes the loudest thing in the room. It is also the place to invest if you want all ten to pay off.

How we think about this at Brief

We build a tool for exactly this gap, so treat the following as our bias, not a neutral verdict. But the conclusion holds whether or not you ever use us.

Curation fixes context rot. An agent needs the three decisions that constrain this task at the moment it acts on them, not your entire decision history. The hard part is selecting only what is relevant and leaving out the rest.
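Here is a minimal sketch of that kind of selection. It is an illustration only, not a description of how Brief or any other product works. The decision records, the tags, and the tag-overlap score are all hypothetical stand-ins for whatever relevance signal a real system would use.

```python
def select_decisions(task_tags: set[str], decisions: list[dict], limit: int = 3) -> list[dict]:
    """Return up to `limit` decisions that share the most tags with the task.

    Decisions with no overlap are left out entirely rather than used as filler.
    """
    scored = [(len(task_tags & d["tags"]), d) for d in decisions]
    relevant = [(score, d) for score, d in scored if score > 0]
    # sort() is stable, so ties keep their original order in the log.
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in relevant[:limit]]


# Hypothetical decision history.
decisions = [
    {"id": "D1", "summary": "Prices are stored in integer cents", "tags": {"billing", "money"}},
    {"id": "D2", "summary": "Marketing site uses a static generator", "tags": {"website"}},
    {"id": "D3", "summary": "Refunds go through the ledger service", "tags": {"billing", "refunds", "ledger"}},
    {"id": "D4", "summary": "Mobile app targets the two latest OS versions", "tags": {"mobile"}},
    {"id": "D5", "summary": "All ledger writes are idempotent", "tags": {"ledger"}},
    {"id": "D6", "summary": "Invoices are emailed as PDFs", "tags": {"billing", "email"}},
]

# Hypothetical task: change how refunds are recorded.
task_tags = {"billing", "refunds", "ledger"}
chosen = select_decisions(task_tags, decisions)
for d in chosen:
    print(d["id"], d["summary"])
```

The point of the design is the hard cap and the zero-overlap filter: the agent gets a short list it can act on, and unrelated decisions never enter the window.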

That is also the fix for the coordination bottleneck. Most delivery drag is not engineers typing too slowly. It is re-deriving decisions that were already made, scattered across Slack threads, old specs, and someone's memory. A spec written on Monday is stale by Friday once the team makes a new call, and an agent reading the stale spec will confidently ship the wrong thing. The answer is a decision layer that stays current and hands the agent the right context on demand, instead of a static document it has to grep or a context window you hope is "big enough."
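One way to picture the difference between a static spec and a decision layer that stays current is a log where a later decision on a topic supersedes the earlier one. The sketch below assumes a hypothetical record shape with `topic`, `made_on`, and `decision` fields, and the example decisions are invented.

```python
from datetime import date

# Hypothetical decision log. The team revisits "auth" later in the same week.
decision_log = [
    {"topic": "auth", "made_on": date(2026, 3, 2), "decision": "Use session cookies"},       # Monday
    {"topic": "pagination", "made_on": date(2026, 3, 2), "decision": "Cursor-based pages"},  # Monday
    {"topic": "auth", "made_on": date(2026, 3, 6), "decision": "Use short-lived JWTs"},      # Friday
]


def current_decisions(log: list[dict]) -> dict[str, dict]:
    """Return the most recent decision per topic; later entries supersede earlier ones."""
    latest: dict[str, dict] = {}
    for entry in sorted(log, key=lambda e: e["made_on"]):
        latest[entry["topic"]] = entry
    return latest


current = current_decisions(decision_log)
for topic, entry in current.items():
    print(topic, "->", entry["decision"])
```

A spec copied on Monday would still tell the agent to use session cookies. A lookup against the log at the moment the agent acts returns Friday's call instead.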

The agents are a genuine unlock. They moved the bottleneck to review and coordination rather than removing it, and that new bottleneck is solvable. So instead of "how do we get the agents to write more code," ask "where does the work pile up after the code is written, and what context would each agent need to get the work right the first time?" Answer that and the agents you already pay for start shipping faster, not just typing faster.

Sources

METR, Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity (July 2025): arxiv.org/abs/2507.09089
METR, We Are Changing Our Developer Productivity Experiment Design (February 2026): metr.org
Chroma, Context Rot: How Increasing Input Tokens Impacts LLM Performance (2025): trychroma.com
Faros AI, The AI Productivity Paradox Report (2025): faros.ai
Atlassian, State of Teams 2026 / The AI Efficiency Paradox (2026): atlassian.com
