7. Agent Limits and Failure Modes

You hand the agent a task: refactor the connection-management module of a networking library.

The module handles TCP connect, keepalive, timeouts, and graceful shutdown. It's not enormous—around eight hundred lines—but the logic is delicate. Timeout handling has three nested layers of context cancellation. Graceful shutdown has to wait for in-flight requests to finish. Keepalive probing has to deal with half-open connections.

The agent reads the code and produces a refactor plan. The first few steps go smoothly. It changes the connection pool from a slice to a sync.Pool. It moves the lock granularity from a single global lock down to per-connection locks. The code looks cleaner. Performance should improve.

Then it starts simplifying error handling.

It looks at the three nested contexts and decides the structure is overcomplicated. It collapses them into one. The comment it leaves behind says: the original three-level context nesting added unnecessary complexity; the merged version is cleaner.

The thing is, those three layers weren't unnecessary complexity. The first was a request-level timeout. The second was a connection-level timeout. The third was a global shutdown signal. The cancellation semantics of the three were not the same. A request timeout shouldn't close the connection. A connection timeout shouldn't trigger global shutdown. After the merge, any one timeout fires all three levels: a request timeout closes the connection, a connection timeout triggers global shutdown.
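A minimal sketch of the difference, under loud assumptions: the Scope class below is a toy invented for illustration, not the library's real code and not any real cancellation API. It models only one rule, that cancelling a parent cancels its children and never the reverse, which is enough to show what the merge throws away.

```python
class Scope:
    """Toy cancellation scope (hypothetical): parents cancel children, not the reverse."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self._cancelled = False

    def cancel(self):
        self._cancelled = True

    @property
    def cancelled(self):
        # A scope counts as cancelled if it, or any ancestor, was cancelled.
        scope = self
        while scope is not None:
            if scope._cancelled:
                return True
            scope = scope.parent
        return False


def nested_layers():
    """Original shape: one scope per lifetime, nested shutdown -> connection -> request."""
    shutdown = Scope("shutdown")
    connection = Scope("connection", parent=shutdown)
    request = Scope("request", parent=connection)
    return shutdown, connection, request


def merged_layers():
    """The 'simplified' shape: a single scope standing in for all three."""
    single = Scope("merged")
    return single, single, single


def after_request_timeout(layers):
    """Fire a request-level timeout and report which layers are now cancelled."""
    shutdown, connection, request = layers
    request.cancel()
    return {
        "request": request.cancelled,
        "connection": connection.cancelled,
        "shutdown": shutdown.cancelled,
    }


print(after_request_timeout(nested_layers()))
# {'request': True, 'connection': False, 'shutdown': False}
print(after_request_timeout(merged_layers()))
# {'request': True, 'connection': True, 'shutdown': True}
```

The second call is the bug in miniature. An assertion like the one implied here, "a request timeout leaves the connection alive," is the kind of test the existing suite was missing.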

The agent has no idea any of this is wrong. It runs the tests, all green, because the existing tests don't cover these edge cases. It reports back with a confident summary: refactor complete, all tests passing, line count reduced by 30%.

You see the "line count reduced by 30%" line and something feels off. You go back, read the diff, and find the bug. If you hadn't checked, this bug would have surfaced in production in the most insidious way possible: occasional reports of connections dropping for no apparent reason, only reproducing under high concurrency.

This kind of failure is not an exception. In the daily life of an agent it can happen at any moment.

The previous chapters have all been about what an agent can do. The ReAct loop lets it run multi-step tasks. MCP gives it standardized tools. Skills give it predefined capability packs. Multi-agent setups let it handle more complex scenarios. All of those capabilities are real. All of them are valuable. But if you only see the capabilities and never see the limits, you'll trust the agent in places where you shouldn't, and the failure will arrive without warning.

So this chapter asks a more basic question: what actually determines an agent's limits?

Is it that the model isn't strong enough? That's most people's instinctive first answer. But over the past two years models have gone through several generations, each one noticeably smarter than the last—and the way agents fail on long tasks has barely changed. Still the same root-cause misjudgments. Still the same drift mid-execution. Still the same confident "done" that turns out not to be done. If the underlying problem were really not enough intelligence, you'd expect each model generation to relieve a chunk of it. That hasn't happened. Which suggests the problem is structural, not capability-level.

7.1 One Slip Is Fine. Twenty Slips in a Row Is an Avalanche.

Start with a phenomenon anyone who has used an agent has seen.

In single-shot Q&A it looks fairly smart. Ask how to write a SQL query, explain an error, write a utility function, translate a snippet from one language to another—it does these almost without fail, and when it does fail you spot it immediately and just ask again. But the moment you stretch the task out, let it run a dozen-plus steps, call several tools, edit a few files, and finally hand back something that should run, the whole thing can go off the rails. Even if you walk back through it step by step, each step looks reasonable on its own—and the final delivery is still wrong.

The easy explanation is that the model isn't strong enough. So you swap in a stronger one and run the same task again. You'll find the long-task success rate hasn't moved much. That's the counter-intuitive part: it's clearly smart at single steps, so why does it fall apart the moment the task gets long?

To make sense of this, look at one number: the agent's per-step error rate.

The actual number varies by task and by model, but one fact is stable—it is never zero. No matter how simple the task, every single step has some non-trivial probability that the agent reads the wrong file, passes the wrong argument, misjudges a boundary, or skips a nil check. On a one-shot task you barely feel it; if it's wrong you just retry. But on a long task it isn't a single step. It's a string of steps in a row.

Stringing them together carries a cost most people never compute. Suppose per-step success is 95%, which is already an optimistic estimate. Two steps in a row: 0.95 × 0.95 ≈ 90%. Five steps: 0.95⁵ ≈ 77%. Twenty steps: 0.95²⁰ ≈ 36%. If per-step success is 90%, twenty steps gets you to 12%. You can't intuit this without doing the math, because our brains are wired to add, not multiply. Each step feels solid, so the sum feels fine. But the actual composition is multiplicative, and by step twenty your odds are worse than a coin flip.
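The arithmetic is small enough to check yourself. A minimal sketch, assuming every step succeeds independently with the same probability (a simplification that, if anything, flatters the agent):

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step ** steps


for per_step, steps in [(0.95, 2), (0.95, 5), (0.95, 20), (0.90, 20)]:
    print(f"per-step {per_step:.0%}, {steps:>2} steps -> {chain_success(per_step, steps):.0%}")
```

Run it with your own guesses for per-step success and task length. The shape of the curve matters more than the exact inputs.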

[Figure: How errors propagate and amplify down an execution chain]

What makes it worse is that the per-step errors aren't independent events. A wrong call at step three writes that wrong call into the context. At step four the agent reads that context and treats the error as a fact. Step five makes a decision based on that fact. Step six builds on top of step five. Errors in this chain don't just fail to self-correct—they get re-cited and amplified. The agent hallucinates a function that doesn't exist at one step; the next step writes a call site against it; the step after that modifies surrounding code to fit; by the time the build finally fails seven steps later, the agent has no idea the root cause is buried back at step three.

The important corollary is this: letting the agent check itself doesn't get you out of this.

It's the most natural-sounding fix. Since each step can be wrong, have it stop every few steps and reflect, then weed out the bad ones. The problem is that reflection is itself a step. It's still a probabilistic prediction over the same context. If forward decisions are wrong 5% of the time, reflection decisions are also wrong about 5% of the time. Adding reflection to the chain just makes the chain longer—you've added another factor to the multiplication. You can't grade your own exam in a way that catches systemic errors, because the grading and the writing use the same pen.

So can the multiplicative chain be cut at all?

Yes—but the thing that cuts it cannot be the agent itself. It has to be something outside the model that doesn't make probabilistic mistakes: the compiler, the type checker, unit tests, assertions, linters, the CI pipeline. None of these are smart. They don't write code, don't reason, don't infer your intent. But they have one property no version of the agent will ever provide. They are deterministic. The compiler is not going to mistake compilable code for uncompilable 5% of the time. The type system is not going to mistake an int for a string. In a chain that is otherwise pure probabilistic multiplication, every deterministic feedback node flattens the error and resets state, letting the next step start from a clean baseline.

If you look back at how AI coding tools have evolved over the past two years from this angle, you'll notice the better ones are all doing the same thing: pushing more deterministic feedback into the chain. Cursor pipes lint errors back into the next agent turn. Claude Code runs a build right after a code edit, with the build error becoming the next step's input. Better agent workflows force a unit-test run after every change, and on failure they roll back a step. The common shape of all these mechanisms is the same: take "agent does it → agent checks itself" and replace it with "agent does it → something deterministic checks it." The point is not making the checker smarter; the point is that the checker doesn't make the same class of mistakes.
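The shape of that loop can be sketched in a few lines. This is an illustration under stated assumptions, not how any of the tools above are implemented: the candidate edits are hard-coded strings standing in for agent output, and the gate is Python's built-in compile(), which catches only syntax errors. A real pipeline would put a type checker, linter, or test run in the same slot.

```python
def passes_gate(source: str) -> bool:
    """Deterministic check: does the candidate parse? Same input, same verdict."""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def run_chain(initial: str, candidates: list):
    """Apply candidates in order; keep one only if the gate accepts it.

    A rejected candidate never becomes the baseline, so the next step starts
    from the last state that passed the check.
    """
    accepted = initial
    log = []
    for step, candidate in enumerate(candidates, start=1):
        if passes_gate(candidate):
            accepted = candidate
            log.append(f"step {step}: accepted")
        else:
            log.append(f"step {step}: rejected, rolled back")
    return accepted, log


# Hard-coded stand-ins for three agent edits; the second has an unclosed paren.
edits = ["x = 1\n", "x = (1\n", "x = 1\ny = x + 1\n"]
final, log = run_chain("", edits)
print(log)
```

The gate knows nothing about intent. It only guarantees that the broken second edit never becomes the context the third step builds on.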

So when you face a long-chain agent task, the real question isn't whether the model is smart enough. It's how dense the deterministic feedback is along the chain. Which steps have a compiler waiting at the bottom, and which steps only have the agent itself? That question predicts how far the agent can run better than any difference between models. An agent runs further inside a Rust project not because the model is smarter at Rust, but because the Rust compiler does more checking, so every step has something deterministic giving it a free physical. Move the same model and the same task to Python and the drift rate jumps a tier. The model didn't get dumber. It's just walking a road with fewer guards.
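To put a rough number on "density," here is a toy model. Its assumptions are mine, not measurements: steps fail independently, a gated step is retried up to a fixed number of attempts, and the gate catches every failure on that step. Real checkers catch only some classes of error, so treat the output as an upper bound on what gating buys.

```python
def gated_chain_success(per_step, steps, gated_steps, attempts=3):
    """Chance the whole chain succeeds when `gated_steps` of the steps sit
    behind a deterministic check that triggers a retry on failure.

    Toy assumptions: independent steps, and a gate that catches every failure.
    """
    # A gated step fails only if all of its attempts fail.
    gated_ok = 1 - (1 - per_step) ** attempts
    ungated_steps = steps - gated_steps
    return (gated_ok ** gated_steps) * (per_step ** ungated_steps)


for gated in (0, 10, 20):
    rate = gated_chain_success(0.95, 20, gated)
    print(f"{gated:>2} of 20 steps gated -> {rate:.1%}")
```

With these inputs the chain goes from roughly 36% with no gates, to about 60% with half the steps gated, to just under 100% with every step gated. The model never changed; only the number of guards did.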

This sounds simple, but it represents a fairly central engineering judgment: the reliability of a complex system is never guaranteed by every part being mistake-free. It's guaranteed by the system being able to absorb mistakes from every part. The agent is no different. The next section takes us to another failure mode—one that doesn't come from per-step errors at all, but from the agent getting stuck inside the first judgment it wrote down.

7.2 Stuck Inside Its Own First Hypothesis

If you've used an agent to fix bugs, you've probably seen this scene.

You ask it to fix a bug. It reads a few files and reaches an early judgment: the issue is probably in OrderService's transaction handling. Then it starts editing. First round: changes go in, tests run, fail. It keeps editing OrderService, trying a different transaction-boundary style. Tests still fail. It edits again, adjusting lock granularity, adding logging in the rollback path, moving exception capture up a level. Every round's diff is different. Every round it tries hard. And every round it stays inside OrderService.

Eventually you can't help yourself and tell it: go look at setUp in OrderTest. It opens the file, the root cause is sitting right there—a mock object inside the test fixture is configured wrong, and it has nothing to do with OrderService.

It isn't that the agent is incapable of finding the test file. Open a fresh session, describe the same bug, and it will often go straight to the test fixture. It can find it. In that particular conversation, it can't.

The first time this happened I assumed the model was being careless. After seeing it enough times I realized it isn't carelessness. It's a property of how the thing works.

When the model generates a token, it's distributing attention over the prior context. Whatever is written in the prior context gets weighted into the prediction. Once the agent has written "the issue is probably in OrderService" a few steps in, that sentence becomes a gravity well for everything that comes after. Every subsequent step (reading code, analyzing errors, deciding the next move) does its attention math under the pull of that sentence. The pull steers the lens. The more it pulls, the less attention goes anywhere outside OrderService's neighborhood.

This is similar to how a person sometimes gets stuck on a problem and only finds the answer after stepping away from the desk to do something else. There's one thing in that move the agent can't do: drop what it's currently thinking about and go look elsewhere. There is no "put it down" gesture in its repertoire. It can only keep generating along the existing context. The whole conversation is like a car with no steering: once an early judgment has been written down, every subsequent thought has to take the same lane. It can rephrase, reorder, add logging, soften the wording. But none of those moves leave the lane.

This is why "have it think harder" almost never works.

A common first reaction to this loop is to nudge the agent with more prompts. "Look more carefully." "What if the issue is somewhere else?" "Consider other possibilities." These nudges sound like they're guiding it out, but in practice the effect is limited. Those prompts are also just appended to the context as additional weight. Meanwhile the original "the issue is in OrderService" sentence is still there, still pulling. The two weights partially cancel, and the agent will earnestly tell you "I have considered other possibilities," then go right back to OrderService and try yet another phrasing. It isn't being stubborn. It genuinely thinks it has considered other options. It just can't actually let go of the original judgment.

So is there a way out? Yes, and the way out is direct: close the session and start a new one.

Throw the current conversation away, open a fresh chat, describe the bug again. In the new session there is no anchored prior context. Its attention is clean. There's a real chance the very first step lands on the test fixture. Everything you ran in the old session—every step, every version it tried, every diff that looked like progress—gets discarded. It sounds wasteful. Do the math: continuing to grind in the original session usually costs more tokens and more time than starting over.

I've watched experienced colleagues use Cursor, and they clear context dramatically more often than someone using AI coding for the first time. The newcomer feels "we've made progress in this conversation, throwing it away is a waste." The veteran does not. The veteran knows the agent's progress isn't in the conversation; it's in the code that has already landed. If nothing in the code is right yet, no length of conversation is an asset. It's a liability. Wipe it, and let the agent see the problem with a fresh field of view. That's far more reliable than trying to talk it into changing its mind.

The agent does not have "put down a thought and look again" as a primitive. Its reflection looks like reflection but is mechanically identical to its generation: a probabilistic prediction over the current context.

So the engineering solution to this isn't making it reflect better. It's giving it an external reset. The tooling has actually had this for a while: Cursor's New Chat, Claude Code's /clear, the unassuming new session button in every IDE plugin. None of these are decorative product features. They are engineering fallbacks forced into existence by exactly the failure mode this section describes.
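If you drive an agent from a script rather than a chat window, the same fallback can be written as a policy. A minimal sketch, with everything in it hypothetical: Session stands in for one conversation, attempt is whatever runs a fix-and-test round, and the fake agent at the bottom is scripted purely to make the demo deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical stand-in for one agent conversation."""
    task: str
    messages: list = field(default_factory=list)


def run_with_resets(task, attempt, max_failures=3, max_sessions=3):
    """Retry inside a session up to max_failures times, then start over clean.

    `attempt(session)` is a hypothetical callable that runs one fix-and-test
    round and returns True when the tests pass.
    Returns the 1-based number of the session that succeeded, or None.
    """
    for session_number in range(1, max_sessions + 1):
        session = Session(task=task)  # fresh context: only the task survives
        for _ in range(max_failures):
            if attempt(session):
                return session_number
    return None


# Scripted fake agent: its first hypothesis is fixed per session, and it
# never leaves that hypothesis for the rest of the session.
first_hypotheses = iter(["OrderService", "OrderTest.setUp"])


def fake_attempt(session):
    if not session.messages:
        session.messages.append(next(first_hypotheses))
    return session.messages[0] == "OrderTest.setUp"


print(run_with_resets("fix the failing order test", fake_attempt))  # prints 2
```

The design choice worth noticing is what the reset keeps: the task statement and nothing else. Carrying a summary of the failed attempts into the new session would carry the anchor along with it.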

7.3 Its World Is Whatever It Read Into Context

Pick up from the bug scenario in the last section. The agent is stuck inside OrderService because the sentence "the issue is in OrderService" anchored its attention. But there's a more general and more insidious version of this: the agent isn't anchored at all. It simply has no idea OrderTest exists. It hasn't read the file. As far as it's concerned, the file is not in the world.

This shows up most clearly during planning.

An agent picks up a task with even mild scope: "upgrade this interface in module A." It reads a few related files first: module A's interface definition, a couple of callers, the corresponding tests. Then it gives you a plan: change A's interface signature, adjust callers B and C, add a caching layer, run tests to verify.

The plan looks reasonable. You nod and let it execute.

Then it sinks into the swamp. First module D blows up—that's another team's service, it calls A's interface via reflection, A changed and D didn't, and the runtime panics. Fix that and a new problem surfaces: a version pin in the monorepo's root lockfile no longer matches, the build fails. Fix that and another problem surfaces: a CI script that only runs on the release branch validates that an old field of A's still exists, so the merge into main passes but the release pipeline breaks.

The plan wasn't wrong. Each step's logic was fine. The agent simply had no idea it was planning inside a box much smaller than the real system.

This looks like the agent didn't read enough files. The instinctive next step is to make it read more. Try it and you'll find it doesn't help. The agent doesn't know what it hasn't read. After reading A, B, C, and the tests, it will tell you in good faith "I've reviewed all relevant files." But module D's reflection call, the lockfile in the root, the validator on the release branch: it has no awareness those things might exist. It didn't miss them. It doesn't have a "might have missed" category at all.

The hidden assumption agents bring into planning is: "what I've read is everything."

There's no "I might have blind spots" concept in the picture. Code it didn't read, configs it didn't read, env vars it didn't read, downstream services it didn't read: those aren't "might exist but I didn't see them"; they're "don't exist." Those two states are clearly distinct in an experienced engineer's head. When a seasoned engineer is dropped into an unfamiliar project, their first instinct is "there are things I don't know yet, I need to ask around." They go talk to ops, ping the frontend team, look at recent CI failures, skim the release notes. Every one of those moves rests on acknowledging blind spots.

The agent doesn't have that move. Its world equals its context, and what's outside the context is not something it actively probes. Even if you tell it "there might be other relevant files I haven't shown you," when generating the next step it will still plan based on the files it has read. To the agent, "unknown" and "nonexistent" have no mathematical difference; both end up with near-zero weight in its probability distribution.

This becomes especially painful in real software systems, for one simple reason: real software systems are full of implicit coupling.

Explicit coupling is easy to handle. A imports B, you read A and follow the import to B. Static analysis sees this clearly. Agents almost never break on these. Where they break is on the implicit coupling.

Reflection calls are one class. A's interface name is built from a string, the caller never has that name in source, grep won't find it, there's no import edge. It only surfaces at runtime. Build-system dependencies are another. A Makefile runs codegen at build time that synthesizes new call sites; a go generate directive writes code into a sibling package; that visually unremarkable data reference in a Bazel BUILD file. Runtime configuration is yet another. The same code takes different branches in different environments depending on env vars and the config service—reading the source tells you nothing. Below that, cross-service protocol contracts: a field in this service's interface is shared with a downstream service's protobuf schema, you change one without the other, integration tests are the first to find out. And shared storage: multiple services reading the same table, the same Redis key, the same message queue, with no visible coupling in code at all.
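The reflection case is the easiest one to see in miniature. A minimal sketch with invented names (ModuleA, ModuleARenamed, and module_d_lookup are hypothetical, chosen to echo the module D story above): the caller assembles the method name from a string, so the name it depends on never appears in its source.

```python
class ModuleA:
    """Before the interface change."""

    def get_order(self, order_id):
        return {"id": order_id}


class ModuleARenamed:
    """After the change: same behavior, but the method was renamed."""

    def fetch_order(self, order_id):
        return {"id": order_id}


def module_d_lookup(service, resource, *args):
    """Hypothetical caller in another team's module.

    The method name is built at runtime, so searching this function for
    "get_order" finds nothing and there is no import edge to follow.
    """
    method_name = "get_" + resource
    return getattr(service, method_name)(*args)


print(module_d_lookup(ModuleA(), "order", 42))

try:
    module_d_lookup(ModuleARenamed(), "order", 42)
except AttributeError as exc:
    # The break only shows up when the call actually runs.
    print("runtime failure:", exc)
```

Nothing in the renamed class or in the caller looks wrong in isolation. The coupling lives in the string concatenation, and only execution reveals it.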

The shared property of all of these is that they rarely sit inside a single file, but they really are coupled. The picture an agent builds from reading code is always going to differ from the picture that code paints when actually running in production. The gap narrows with more context, but it never closes, because the total volume of implicit coupling will always exceed any single context window. No window, however large, can hold all the implicit dependencies of a real system.

There's a counter-intuitive corollary here: stronger models make this problem more dangerous, not less.

A direct consequence of a stronger model is that the output looks closer to what you wanted. A vague plan becomes a clear, well-structured, professionally worded plan. The believability rises sharply. The actual correctness does not. What's bounding it isn't expressive ability; it's field of view. The field of view didn't expand. A more polished plan is just a more carefully sculpted artifact inside the same box.

This means the looks-more-correct effect of a stronger model makes you more likely to skip the question that matters most: might it be missing something it can't see? Early agents produced halting plans, and you naturally read them with skepticism. Today's agents produce neat, bullet-pointed plans that even consider edge cases. The reflexive reaction is "they've thought it through, let me run it." That reflex is subtle, and its consequences show up in production constantly.

So is there a fix?

The honest answer: there's no complete fix. There's only a division of labor.

The agent's field of view will always be its context, and the context will always be smaller than the real system. This is structural—it doesn't go away with smarter models or larger context windows. The only thing you can change is this: the critical planning decisions need a human covering the field-of-view gap. The parts the agent can see, let it handle. The parts it can't see, you handle. Your role isn't to second-guess its judgments inside its visible scope (where it's often quite good); your role is to look at the things it can't see.

In day-to-day workflow, this shows up as: when an agent hands you a plan that looks complete, the question to ask isn't "did it write this correctly?" It's a different question entirely: of the things this plan doesn't mention, is that because the agent judged them irrelevant, or because the agent doesn't know they exist?

The agent can't answer that question. Only you can, because only you stand in a vantage point larger than its field of view. Build the habit: don't audit what the agent gave you, audit what it didn't give you. The former it writes increasingly well. The latter it will never know about.
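One way to make that habit mechanical is a checklist you own and the agent doesn't. The sketch below is only that, a sketch: the checklist entries are my shorthand for the coupling classes this section listed, and the plan's topics are tags a human assigns while reading the plan, not something extracted automatically.

```python
# Hypothetical checklist: the implicit-coupling surfaces a reviewer wants
# accounted for before an interface change goes ahead.
COUPLING_CHECKLIST = {
    "reflection callers",
    "build and codegen",
    "runtime config",
    "cross-service contracts",
    "shared storage",
    "lockfiles and version pins",
    "release-branch CI",
}


def unmentioned(plan_topics, checklist=COUPLING_CHECKLIST):
    """Return the checklist items the plan says nothing about, sorted."""
    return sorted(checklist - set(plan_topics))


# Topics a reviewer tagged in the agent's plan (illustrative values).
plan_topics = {"cross-service contracts", "runtime config"}

for item in unmentioned(plan_topics):
    print("not addressed:", item)
```

The output is not a list of bugs. It is a list of questions the plan cannot answer for you, which is exactly the part of the review that stays with the human.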

7.4 Reversibility Is the Real Boundary of Agent Autonomy

The previous three sections were all about reducing how often the agent makes mistakes: cut the multiplicative chain with deterministic feedback, break anchored attention with a session reset, cover blind spots with humans. Do all three well and the error rate drops, but it never reaches zero. The multiplication earlier already showed why: as long as per-step error isn't truly zero, a long enough task will almost certainly hit a wrong step somewhere.

Is the mistake itself the end? Not yet. Not all mistakes are the same size. Writing a wrong line of code and botching a production database migration are not in the same league. The cost distribution of failure matters more than the failure itself.

Have the agent edit some code; if it gets it wrong, you run git checkout and it's undone in a few seconds. Have it run a local test; if it fails, run it again. Have it read a file; if it's the wrong file, it reads another. The cost of failure for these operations is so small it's basically free. The agent can fail many times and it doesn't matter; every attempt can return the world to the state it was in before.

Have it run a database migration script and get it wrong, and millions of rows in production might not come back. Even with backups, the recovery window can mean the entire business is down. Have it call an external API with side effects (a payment endpoint, a push notification, an email send) and the money is wired, the message is delivered, the notification is on someone's phone. The cost of recall is astronomical. Have it publish a package to the central registry; downstream consumers have already pulled it; you cannot unship that. One failure on this kind of operation can erase every win that came before it.

Both kinds of operations are agents executing tasks. But they live in fundamentally different categories from a system-design perspective.

If you treat them under one standard, you fall into a dilemma immediately. Trust everything—you're betting it never errs on critical paths, which the previous sections showed is mathematically impossible. Audit everything—every single step pops a confirmation modal, the agent decays into a slow-motion script, all the automation value evaporates.

What works is sorting operations by one simple criterion: if it goes wrong, can it be undone?

Once you split operations into reversible and irreversible, the autonomy design becomes clear.

Start with the reversible side. Writing code, reading files, running local tests, reading docs, grep, building locally, running a script in a sandbox, committing on a local branch—the shared trait is either no external side effect, or all side effects are inside a container that can be reset with one command (Git, the local file system, container images). When it goes wrong, the cost of undoing is near zero.

In this region, the agent should have full autonomy. Every confirmation popup that pauses the agent here is a pure productivity tax. Reversible-region operations should let the agent run free; if it gets it wrong, you reset.

The irreversible side is the other end. Database migrations (schema changes and data mutations both), production deploys, file deletion (especially rm -rf style), external-side-effect API calls (payment, email, push, third-party writes), npm publish / cargo publish and other registry operations, merges into the main branch, merges into the release branch. The shared trait is that once the side effect ships, you either can't recall it, or recalling costs more than not doing it.

In this region, the agent should not have autonomy. No matter how strong the model is, no matter how many times it's gotten it right before, this one needs a human checkpoint. Not because we don't trust the agent, but because the cost of error in this region exceeds anything you can backstop. The cost of confirmation is one extra Enter press. The cost of misfire is a production incident. Those are not on the same scale, so the checkpoint is always worth keeping.

There's a gray zone in the middle. Local edits across many files, config changes not yet committed, sandbox runs that haven't merged, base-library changes that may affect downstream where downstream hasn't pulled yet. Operations here are technically reversible but the rollback has a cost. How you handle this depends on how much rework you're willing to absorb, your project, and your team's habits.
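The three zones translate naturally into a small policy table. This is a sketch under assumptions: the operation names are invented labels for the examples above, and sending unknown operations to the irreversible side is my own fail-closed choice, not something the zones themselves dictate.

```python
from enum import Enum


class Zone(Enum):
    REVERSIBLE = "agent runs freely"
    GRAY = "team policy decides"
    IRREVERSIBLE = "human checkpoint"


# Hypothetical operation labels, grouped as in the text above.
REVERSIBLE_OPS = {"read_file", "edit_code", "run_local_tests", "grep",
                  "local_build", "sandbox_script", "local_commit"}
GRAY_OPS = {"multi_file_edit", "uncommitted_config_change",
            "unmerged_sandbox_run", "base_library_change"}
IRREVERSIBLE_OPS = {"db_migration", "prod_deploy", "delete_files",
                    "external_side_effect_api", "registry_publish",
                    "merge_to_main", "merge_to_release"}


def classify(operation: str) -> Zone:
    """Map an operation label to its autonomy zone."""
    if operation in REVERSIBLE_OPS:
        return Zone.REVERSIBLE
    if operation in GRAY_OPS:
        return Zone.GRAY
    # Known-irreversible and unknown operations both land here: when in
    # doubt, fail closed and ask a human (an assumption of this sketch).
    return Zone.IRREVERSIBLE


def needs_human(operation: str, gate_gray_zone: bool = False) -> bool:
    """True if a human must confirm before the operation fires."""
    zone = classify(operation)
    if zone is Zone.GRAY:
        return gate_gray_zone
    return zone is Zone.IRREVERSIBLE


for op in ("edit_code", "base_library_change", "db_migration", "something_new"):
    print(f"{op}: {classify(op).value}")
```

The gate_gray_zone flag is where the "how much rework you're willing to absorb" decision lives. The other two zones have no flag because they are not supposed to be negotiable.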

[Figure: Agent autonomy zones, partitioned by reversibility]

Why is version control the foundational infrastructure of the agent era? Because it provides a stable rollback mechanism for editing code, an operation with side effects. Sandboxes and containers do the same thing at a larger radius.

Approval flows and code review do the inverse job. They take operations that would otherwise sit in the reversible zone but have residual risk you don't want to delegate to an agent, mark them explicitly as irreversible, and force in a human-check interaction. This isn't always about preventing agent error. Often the team needs an accountability owner—someone has to press the button for the action to fire.

7.5 How Far an Agent Goes Has Less to Do With the Model Than You Think

If the model got twice as strong, would the previous problems be solved by the model alone?

Per-step error from 5% down to 1%? A 50-step task is still 0.99⁵⁰ ≈ 60%, less than two-thirds. Down to 0.5%? A 100-step task lands at 60%. The cost of multiplication is exponential. Knock the per-step rate down a tier, the chain holds for a few more steps. The underlying multiplicative structure didn't move. The problem is structural, not technological.
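You can turn the question around and ask how long a chain each per-step rate can carry before the odds of finishing cleanly drop below a target. A minimal sketch, again assuming independent steps with a constant success rate; the 50% target is an arbitrary threshold picked for illustration.

```python
import math


def max_steps(per_step: float, target: float = 0.5) -> int:
    """Largest step count whose overall success probability is still >= target.

    Solves per_step ** n >= target for n, assuming independent steps.
    """
    return math.floor(math.log(target) / math.log(per_step))


for per_step in (0.95, 0.99, 0.995):
    print(f"per-step {per_step:.1%}: about {max_steps(per_step)} steps "
          f"before success falls below 50%")
```

Under these assumptions the budget goes from 13 steps at 95%, to 68 at 99%, to 138 at 99.5%. Each improvement buys a longer chain, but the curve still bends toward zero.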

Stronger models relieve each problem somewhat, but none of them disappear. The way attention works guarantees it will always be weighted by the prior context—that doesn't go away with more parameters. The context window can grow from tens of thousands to millions—the implicit coupling of a real system will always exceed it. The cost of an irreversible operation is set by the physical world. No matter how smart the model is, the money you wired out doesn't come back.

So when you put the four preceding sections side by side, you find they're describing four faces of the same fact. Error compounding has to be cut by deterministic external feedback. Attention anchoring has to be broken by an external session reset. Field-of-view blind spots have to be filled by a human. Irreversible cost has to be partitioned by separating autonomy zones. Every typical failure mode has its solution outside the agent, not inside. The model can't solve them on its own. That's not because this generation isn't strong enough, but because these are things that sit outside the model's rules of operation by their very nature.

If the solutions all live outside, then how far an agent goes is no longer a property of the model. It's a property of the layer around the model. The same agent and the same model dropped into different engineering environments will travel very different distances. A repo with full CI, mandatory PR review, healthy coverage, and human-confirm gates on critical paths versus a repo with main pushes allowed, sparse tests, and manual deploy scripts—give the same agent the same task in both, and the long-term failure rate differs by an order of magnitude. The difference isn't the model. It's how many gates the surrounding chain has.

When an agent feels great in a particular project, don't credit the model first. Look at the project's scaffolding. Most of the time the project itself has plugged the holes the agent tends to fall through—it has strong types, it has coverage, it has lint, it has clean module boundaries, it has humans gating critical operations. Conversely, when an agent feels useless in a project, don't curse the model first. Count how many of those holes are still open. The gap between those two situations predicts whether an agent task lands better than the gap between model generations.

The reliability of complex systems was never guaranteed by every part being mistake-free. It was guaranteed by the system absorbing every part's mistakes. Aviation, nuclear power, distributed systems, manufacturing lines—none of them assume parts don't fail. They assume parts will fail, and they cap the cost of failure with redundancy, validation, rollback, and isolation. Agents are the same. Their reliability has to be built in the same way.