The context window is not the bottleneck. Relevance is.
Why a ten-million-token window makes selecting the right decisions harder, not optional.
The pitch writes itself. Context windows are blowing past a million tokens and heading for ten, so the obvious move is to stop curating anything and hand the agent everything: the whole codebase, every document, the entire chat archive, all of it. With enough room, the reasoning goes, the model finally has all of your context, and the problem of agents ignoring your decisions dissolves on its own. It is a clean story. It is also wrong in a way that gets worse, not better, as the window grows.
It is wrong because capacity was never the scarce thing. The agent that re-proposed the architecture you killed last sprint did not fail because your decision would not fit in its window. It failed because, out of everything that could have been in the window, the one decision that mattered was not selected, and even if it had been, nothing would have marked it as current and authorized. A bigger window does not select, and it does not authorize. It only makes room for more. The real job is relevance: putting the few right, current, authorized decisions in front of the agent for the task in front of it, and that job gets harder as capacity grows. Two separate obstacles stand in the way, and it is worth keeping them apart, because one of them might yield to scale and the other cannot.
The obstacle that scale might erode
The first is that adding context is not free. The tempting assumption is that surplus context is harmless, that at worst the model ignores what it does not need. It is not harmless. In the simplest terms, a model's attention is a finite budget spread across the tokens present, so adding low-value ones tends to cost the relevant ones some share; the mechanism is a simplification, but the empirical result is not. Chroma's context rot study shows that frontier long-context models degrade well before the window is full, and that adding low-relevance text measurably lowers performance on the part that matters. A ten-million-token window does not let you pour in ten million tokens of mostly-irrelevant history and reason cleanly over it. It lets you degrade more thoroughly. The marginal irrelevant decision is not ignored. It is a cost.
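To make the budget picture concrete, here is a toy calculation, not a model of any real system. It assumes a single softmax over one relevant token and n equally scored distractors. The function name `attention_share` and the scores 5.0 and 0.0 are invented for illustration and do not come from the Chroma study.

```python
import math


def attention_share(relevant_score: float, distractor_score: float, n_distractors: int) -> float:
    """Softmax weight on one relevant token competing with n identical distractors.

    Toy assumption: one attention head, one relevant token, and every
    distractor has the same (lower) score.
    """
    relevant = math.exp(relevant_score)
    noise = n_distractors * math.exp(distractor_score)
    return relevant / (relevant + noise)


if __name__ == "__main__":
    # The relevant token's own score is fixed; only the crowd grows.
    for n in (10, 1_000, 100_000):
        print(f"{n:>7} distractors -> share {attention_share(5.0, 0.0, n):.4f}")
```

The relevant token's score never changes in this sketch. Only the number of low-value tokens around it does, and its share of the budget falls anyway.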
This is the obstacle scale might chip away at. Perhaps some future model truly pays nothing for distractors. Grant it, for the sake of argument, and the second obstacle stands entirely untouched, because it was never about capacity or attention at all.
The obstacle scale cannot touch
Whether a decision is still in force is a fact about your team's history, not about the sentence that states it. "We use eventual consistency here" does not carry, in its own text, whether it was overturned three weeks later. That is the relation "supersedes," and it lives between decisions across time, not inside any one of them. Whether a decision was authorized is a fact about who decided, not about how plausible the wording sounds. A context window holds text. It does not hold the supersession relation, and it does not hold the fact "ratified by someone with standing." So a perfect, infinite window that paid nothing for irrelevant tokens still could not tell the agent which of two contradictory decisions is current, or which was ever actually decided, because those are not questions a larger context answers. They are questions about a governed record. This is the floor, and it does not move with model size.
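One way to picture a record that does hold those facts is the minimal sketch below. The `Decision` type, its field names, and the `tech-lead` identity are hypothetical, chosen only for illustration; the point is that currency is computed from a link between records, never from a record's own text.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Decision:
    id: str
    text: str
    decided_by: Optional[str] = None  # who ratified it (hypothetical field)
    supersedes: Optional[str] = None  # id of the decision this one replaces


def is_current(decision: Decision, record: Sequence[Decision]) -> bool:
    """A decision is current if nothing in the record supersedes it."""
    overturned = {d.supersedes for d in record if d.supersedes is not None}
    return decision.id not in overturned


d1 = Decision("D1", "We use eventual consistency here", decided_by="tech-lead")
d2 = Decision("D2", "We use strong consistency here", decided_by="tech-lead", supersedes="D1")

if __name__ == "__main__":
    # D1's text is identical in both calls; only the surrounding record differs.
    print(is_current(d1, [d1]))      # record before D2 was made
    print(is_current(d1, [d1, d2]))  # record after D2 was made
```

Nothing about `d1.text` changes between the two calls. Its status flips only because the record gained an entry that points at it.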
Selection, and its three parts
Put the two obstacles together and the agent's real need at the moment it edits a file comes into focus. Not your whole decision history. The small set of decisions that are relevant (they govern the code being touched), current (they have not been superseded), and authorized (someone with standing decided them, rather than the model or a passerby having asserted them). Call the act of returning exactly that set selection. All three parts are load-bearing. A relevant decision that was overturned last week is worse than useless, because it points the agent confidently in the wrong direction. A relevant, current statement that no one authorized is a guess in a decision's clothes. Relevance on its own is precisely how an agent ends up confidently consistent with the wrong thing.
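The three parts can be written down as three filters. What follows is a sketch under stated assumptions, not a description of any particular product: the `Decision` fields, the path-prefix notion of "governs," the `STANDING` set, and the rule that only an authorized decision can overturn another are all illustrative choices.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Set, Tuple


@dataclass(frozen=True)
class Decision:
    id: str
    text: str
    governs: Tuple[str, ...]          # path prefixes this decision applies to (assumed)
    decided_by: Optional[str] = None  # None: asserted, never ratified
    supersedes: Optional[str] = None  # id of the decision this one replaces


def select(record: Sequence[Decision], path: str, standing: Set[str]) -> List[Decision]:
    """Return decisions that are relevant to `path`, current, and authorized."""
    # Authorized: someone with standing decided it.
    authorized = [d for d in record if d.decided_by in standing]
    # Current: not overturned. Assumption: only authorized decisions can overturn.
    overturned = {d.supersedes for d in authorized if d.supersedes is not None}
    # Relevant: the decision governs the file being touched.
    return [
        d for d in authorized
        if d.id not in overturned
        and any(path.startswith(prefix) for prefix in d.governs)
    ]


STANDING = {"architect"}  # hypothetical set of people with standing
record = [
    Decision("D1", "Orders uses eventual consistency", ("services/orders/",), "architect"),
    Decision("D2", "Orders uses strong consistency", ("services/orders/",), "architect", supersedes="D1"),
    Decision("D3", "Orders should cache reads for 60s", ("services/orders/",)),  # nobody ratified this
    Decision("D4", "Billing uses idempotency keys", ("services/billing/",), "architect"),
]

if __name__ == "__main__":
    for d in select(record, "services/orders/handler.py", STANDING):
        print(d.id, d.text)
```

In this example three of the four entries are relevant to the orders handler, and only `D2` survives all three filters. Dropping either of the other two filters would hand the agent the overturned `D1` or the unratified `D3`.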
A larger model reasons better over whatever you selected. It does not perform the selection, because selection depends on facts the weights do not carry: which decisions exist, which are in force, who had the standing to make them. Those live in a store, not in the model. Scaling the model improves the reasoning over the selected context and leaves the choice of context exactly where it was.
Why scale makes selection harder, not optional
Here is what the "windows will save us" story has backwards. When little fit, the window's smallness did your filtering for you, and "load the obvious files" was a passable heuristic. As everything becomes loadable, that free filtering disappears and the burden lands on you. Out of thousands of decisions, which few belong in this task's window, and can you show they are current and authorized? The window will now just as readily swallow the stale and the unauthorized, and every irrelevant decision carries the cost described under the first obstacle. Capability moves the binding constraint from "can it fit and can it reason" to "is this the right, current, authorized context for this task." That question is not commoditized by scale. It is enlarged by it.
The honest version of "just retrieve it"
The obvious objection is that this is already solved: embed everything, retrieve the relevant decisions per task, done. Retrieval is necessary and it is not sufficient, because similarity is neither currency nor authority. An embedding search ranks by semantic closeness, so it will return a superseded decision sitting contentedly beside the one that replaced it, with no notion of which is in force and no record of who decided either. And here is the concession that makes the point rather than dodging it: an index that also carries governed status and authorship is exactly the store this argument is asking for. The read mechanics are ordinary information retrieval. The hard part is not the query. It is maintaining a corpus in which "current" and "authorized" are true and known, which is the write-side governance from the previous piece. A bare vector dump of undated, unauthored text returns relevant guesses. Put the governance in and you have rebuilt the layer, whatever you call it.
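The gap between similarity and governance shows up even in a toy index. In the sketch below, word-set overlap (Jaccard) stands in for embedding similarity, which is a deliberate simplification; the `Entry` type, its fields, and the sample corpus are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Entry:
    id: str
    text: str
    decided_by: Optional[str] = None  # None: unauthored text
    supersedes: Optional[str] = None  # id of the entry this one replaces


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets; a crude stand-in for embeddings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def retrieve(corpus: Sequence[Entry], query: str, k: int) -> List[Entry]:
    """Bare retrieval: rank by similarity alone and keep the top k."""
    return sorted(corpus, key=lambda e: similarity(e.text, query), reverse=True)[:k]


def governed(corpus: Sequence[Entry], hits: Sequence[Entry]) -> List[Entry]:
    """Keep only hits that are authored and not overturned by an authored entry."""
    overturned = {e.supersedes for e in corpus if e.decided_by and e.supersedes}
    return [h for h in hits if h.decided_by and h.id not in overturned]


corpus = [
    Entry("D1", "orders service uses eventual consistency", decided_by="architect"),
    Entry("D2", "orders service uses strong consistency", decided_by="architect", supersedes="D1"),
    Entry("N1", "maybe orders service should use causal consistency"),  # unauthored note
    Entry("D3", "billing service uses idempotency keys", decided_by="architect"),
]
query = "which consistency does the orders service use"
hits = retrieve(corpus, query, k=3)

if __name__ == "__main__":
    print("similarity only:", [h.id for h in hits])
    print("with governance:", [h.id for h in governed(corpus, hits)])
```

The ranking step has no way to prefer `D2` over `D1`, or either over the unauthored note, because the words that distinguish them are not the facts that matter. The second step works only because the corpus already carries `decided_by` and `supersedes`, which is the write-side work, not the query.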
What this does and does not claim
It does not claim retrieval is useless; retrieval is half of selection. It does not claim bigger windows have no value; they genuinely improve reasoning over whatever context you supply. The load-bearing claim is narrower and model-independent: currency and authority are properties of a governed record, and a context window cannot supply them at any size, so the job of choosing the right, current, authorized decisions for a task survives every increase in capacity. And none of this is specific to a vendor. Build the governed, queryable store yourself if you prefer; the requirement does not move, because it follows from what "current" and "decided" mean. We built Brief to be that store and to do the read at code time, for that reason and no other.
A growing context window is the best thing that could happen to the case for a product context layer, not the worst. It takes "make it fit" off the table, which was always the easy half, and leaves the hard half in plain view: selecting the few decisions relevant to this task, proving they are current, and proving they were authorized, at the moment the agent acts. A window can hold your context. It cannot tell the agent which part of it is true right now, and which part you ever actually decided.