6. When One Agent Is Not Enough

You ask the Agent to refactor a payments module.

Three thousand lines of code. It covers order creation, the payment-gateway integration, callback handling, refunds, and reconciliation. The code was written three years ago, and the architectural decisions made back then no longer fit today's traffic and product surface. You want the Agent to break the module into a few independent submodules, redesign the interfaces, migrate everything to the new error-handling pattern, and keep all existing tests passing.

The Agent gets to work. It reads every source file, analyzes the module structure, and lays out a refactoring plan. Then it starts rewriting function by function—first order creation, then the gateway integration, then callback handling. After each function, it runs the tests to confirm nothing was broken.

Around step 15, things start to drift.

The Agent's context is now stuffed: chunks of the three-thousand-line legacy code, the refactoring plan, the new code that has already been rewritten, every test run's output, plus a few edge cases it picked up along the way. Context-window utilization has crossed 70%. Subtle misbehavior begins to surface. While rewriting the refund logic, the Agent forgets the change it made to the order-creation interface back at step 5, and generates refund code that calls the old signature. A test fails, the Agent reads the error, tries to fix it—but in the wrong direction. It thinks the bug is in the refund logic itself, never realizing the real cause is an interface mismatch.

By step 25, the context window is nearly full. The Agent's behavior gets unsteady—it repeats checks it has already done, occasionally produces code that contradicts code it generated earlier, and its diagnoses of test failures get worse and worse. You end up interrupting the task, manually inspecting what it actually did, and finishing the rest yourself.

This is not a "bug" in the Agent. It is a structural limit of the single-Agent architecture.

6.1 The Ceiling of a Single Agent

The opening scenario is not a one-off. Any sufficiently complex task (multiple modules, multiple roles, large amounts of context) will push a single Agent up against a structural ceiling. The ceiling exists not because the Agent is "not smart enough" but because a single-Agent architecture has several structural bottlenecks, and those bottlenecks do not go away as models get better.

The most direct one is the physical limit of the context window. Everything an Agent works with—system prompt, tool descriptions, Skill instructions, conversation history, intermediate results—is crammed into the same window. The window has a fixed size. Even at hundreds of thousands or a million tokens, complex tasks will run it out. What matters is not the absolute size of the window but the rate at which information accumulates. Every step of a ReAct loop—reasoning, calling a tool, receiving a result—appends content to the context. Tool returns can be long: the contents of a file, the results of a search, a block of test output. Each step eats window space. By the back half of a 30-step task, the context is already piled high with the history of earlier steps, and the room left for the current step keeps shrinking.
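The accumulation argument can be made concrete with a back-of-the-envelope model. This is only a sketch: the window size, the fixed overhead, and the per-step token count are assumed numbers chosen for illustration, not measurements of any real Agent.

```python
# Rough model of how a ReAct loop fills a context window.
# Every number below is an illustrative assumption, not a measurement.

WINDOW_TOKENS = 200_000    # assumed context-window size
FIXED_OVERHEAD = 20_000    # assumed system prompt + tool descriptions + Skills
TOKENS_PER_STEP = 7_000    # assumed reasoning + tool call + tool result per step


def utilization_after(steps: int) -> float:
    """Fraction of the window consumed after `steps` loop iterations."""
    return (FIXED_OVERHEAD + steps * TOKENS_PER_STEP) / WINDOW_TOKENS


def first_step_over(threshold: float) -> int:
    """First step count at which utilization exceeds `threshold`."""
    steps = 0
    while utilization_after(steps) <= threshold:
        steps += 1
    return steps


for steps in (5, 15, 25):
    print(f"after step {steps:>2}: {utilization_after(steps):.1%} of the window used")
```

The point is the shape, not the numbers: usage grows with every step, so a larger window moves the ceiling but does not remove it.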

It is worse than just "not enough room." As Chapter 1 noted, the model does not pay even attention across the context. Information in the middle gets "forgotten" most easily; the beginning and the end get attended to most strongly. When the context is long, key information from earlier steps—say, a constraint in the refactoring plan—can drown in the flood of later intermediate results. The model "saw" it, but it never registered.

The context-window limit pulls two more bottlenecks behind it. They look independent on the surface but share the same root: everything sits in one context, so everything interferes with everything else. Those two bottlenecks are role confusion and capability conflict.

The clearest example of role confusion is "writing code" versus "reviewing code." When writing code, the Agent is in creation mode—biased toward considering its own approach reasonable and toward pushing forward rather than questioning. When reviewing code, the Agent needs to switch into critical mode—doubt every line, hunt for problems. Asking the same Agent to write code and then review what it just wrote rarely works well. It tends to assume its own code is correct, because the very thinking that produced the code is still sitting in the context, biasing the review. It is like asking someone to proofread their own essay—you have a hard time spotting your own mistakes, because your brain keeps "filling in" what you meant to write instead of seeing what you actually wrote.

Capability conflict is the same disease in a different form. As Chapter 5 covered, different Skills can carry conflicting instructions. In a single-Agent architecture, every Skill is loaded into the same context, and the conflict has nowhere to go. An Agent that has loaded both a "rapid prototyping Skill" (favoring speed, tolerating technical debt) and a "production-grade Skill" (favoring quality, demanding error handling and tests) is left rudderless when it tries to write code—is the goal "ship fast" or "polish carefully"? The two sets of instructions sit in the context simultaneously, and the model's behavior turns unpredictable.

These three bottlenecks are about quality—the context cannot hold everything, the roles bleed into each other, the instructions fight each other, and the output degrades. There is one more bottleneck that is about speed: a single Agent runs serially. It can only do one thing at a time. Inside the ReAct loop, every step has to wait for the previous step to finish. But many real tasks contain naturally parallel subtasks—write unit tests for ten functions in a module, where the ten functions have no dependencies on one another and the tests share no state. In principle, ten things could happen in parallel; with a single Agent, they happen one after another. If each function takes two minutes to test, ten functions take twenty minutes; if it could be parallel, the whole batch could finish in three.
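The arithmetic behind the twenty-versus-three comparison can be written out. The sketch below assumes one minute of coordination overhead for dispatching the work and collecting results; that overhead figure is an assumption made here to account for the gap between two minutes and three, and is not stated in the text.

```python
# Serial vs. parallel wall-clock time for independent subtasks.
# COORDINATION_MINUTES is an assumed overhead for dispatching the work
# and collecting results; the text only gives the per-function time.

MINUTES_PER_FUNCTION = 2
FUNCTION_COUNT = 10
COORDINATION_MINUTES = 1   # assumption


def serial_minutes(durations: list[float]) -> float:
    """One Agent works through the subtasks one after another."""
    return sum(durations)


def parallel_minutes(durations: list[float], overhead: float) -> float:
    """Independent sub-agents run at once; the slowest one sets the pace."""
    return max(durations) + overhead


durations = [MINUTES_PER_FUNCTION] * FUNCTION_COUNT
print(serial_minutes(durations))                            # 20
print(parallel_minutes(durations, COORDINATION_MINUTES))    # 3
```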

These bottlenecks do not stay isolated. They amplify each other. The fuller the context window, the worse the role confusion (because information from different roles is now mixed together); the more roles, the more Skills get loaded, and the higher the chance of capability conflict; the more complex the task, the more steps required, the longer the serial execution, and the faster the context fills.

The diagram below shows the four bottlenecks side by side and how they amplify each other:

The four structural bottlenecks of a single Agent

It is a vicious cycle. Once task complexity crosses some threshold, the single Agent's performance does not degrade linearly—it falls off a cliff. It goes from "basically working" to "completely unreliable" in a very short stretch.

How do you break the cycle?

The idea is straightforward: if one Agent cannot hold everything, use several Agents, and let each one hold only what it needs.

6.2 Sub-agents: The Basic Pattern of Division of Labor

The most basic pattern in multi-agent collaboration is orchestrator-and-workers: one orchestrator agent does the planning and coordination, and several sub-agents handle individual subtasks.

Back to the refactoring scenario. With a multi-agent setup, the flow looks like this:

The orchestrator agent receives the refactoring task, analyzes the module structure, lays out a refactoring plan, and breaks the work into subtasks:

Subtask 1: refactor the order-creation module
Subtask 2: refactor the payment-gateway integration
Subtask 3: refactor the callback-handling module
Subtask 4: refactor the refund-logic module
Subtask 5: update all tests

Each subtask is handed to its own sub-agent. Each sub-agent has its own context. It only receives information relevant to its own subtask: the parts of the refactoring plan that apply to it, the source files it needs to modify, the relevant interface definitions. It does not need to know what the other sub-agents are doing, does not need to see the code in other modules, and does not need to track the global progress of the refactor.

The core advantage of this design is context isolation.

Each sub-agent's context is "clean"—it only contains information relevant to the current subtask. No history from unrelated tasks to interfere, no irrelevant tool descriptions to occupy space, no conflicting Skill instructions to muddy the waters. The sub-agent can stay focused on its own task, very much like a developer responsible for a single module: it does not need to understand the whole system, only its own area and the interfaces to neighbors.

Context isolation brings a side benefit: role clarity. Each sub-agent can load whichever Skill best fits its job—the code-writing sub-agent loads the coding-standard Skill, the test-writing sub-agent loads the testing-standard Skill, the review sub-agent loads the review-checklist Skill. Different roles run in different contexts, and they do not step on each other.

The orchestrator's role becomes clearer too. It no longer has to execute every subtask itself. It only has three jobs:

First, split a complex task into subtasks that can be executed independently. This is the most critical capability, because a clean split lets every sub-agent work efficiently and a bad split makes them constantly collide.

Second, decide what information each sub-agent needs, and pass each sub-agent precisely what it needs and nothing more. "Broadcast everything to everyone" would defeat the point of context isolation. A project manager works the same way, handing each developer the spec and interface they need rather than dumping the full project archive on them.

Third, collect the results from every sub-agent, check for conflicts or omissions, and produce the final integrated outcome. If two sub-agents made conflicting changes (both modifying the same interface signature, for instance), the orchestrator has to detect the conflict and negotiate the resolution.
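The second job, passing each sub-agent only what it needs, might look roughly like the sketch below. The Subtask type, the build_brief helper, the file names, and the interface strings are all hypothetical, invented here to illustrate context isolation; no particular framework works exactly this way.

```python
from dataclasses import dataclass, field


@dataclass
class Subtask:
    name: str
    files: list[str]          # source files this sub-agent may modify
    interfaces: list[str]     # interface definitions it needs to see
    plan_notes: list[str] = field(default_factory=list)


def build_brief(subtask: Subtask,
                repo: dict[str, str],
                interfaces: dict[str, str]) -> dict:
    """Assemble only the context one sub-agent needs, and nothing more."""
    return {
        "task": subtask.name,
        "plan_notes": list(subtask.plan_notes),
        "files": {path: repo[path] for path in subtask.files},
        "interfaces": {name: interfaces[name] for name in subtask.interfaces},
    }


# Hypothetical repository contents and interface definitions.
repo = {
    "orders.py": "<order-creation source>",
    "gateway.py": "<gateway-integration source>",
    "refunds.py": "<refund source>",
}
interfaces = {
    "create_order": "create_order(request) -> (order, error)",
    "charge": "charge(order, method) -> (receipt, error)",
}

refund_task = Subtask(
    name="refactor the refund-logic module",
    files=["refunds.py"],
    interfaces=["create_order"],
    plan_notes=["migrate to the new error-handling pattern"],
)
brief = build_brief(refund_task, repo, interfaces)
print(sorted(brief["files"]), sorted(brief["interfaces"]))
```

The refund sub-agent never sees gateway.py or the charge interface. What it does see is decided by the orchestrator, which is why a wrong decision at this step is so costly.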

The pattern is divide-and-conquer at heart—split a big problem into smaller problems, hand each smaller problem to a focused worker. The orchestrator splits and coordinates; the sub-agents execute. They communicate over well-defined interfaces and do not reach into one another's internal state.

But communication between Agents is far more complicated than ordinary inter-process communication—what flows across the boundary is not just data, but context, intent, and judgment. The diagram below shows the full collaboration loop—the orchestrator's three responsibilities, each sub-agent's independent execution, and the structured report flowing back at the end:

Orchestrator and sub-agents: the basic multi-agent pattern

The "structured report" on the right side of the diagram is the central design choice in multi-agent communication. When a sub-agent finishes, it does not ship every execution detail back to the orchestrator. It reports the key information in a fixed shape. That choice opens the next question directly: how much should that report contain?

6.3 Inter-Agent Communication: Compression vs. Completeness

One of the most important design decisions in a multi-agent system is how Agents pass information to each other.

A sub-agent finishes its subtask and needs to send the result back to the orchestrator. The question is: send what?

Two extremes show up in practice.

Send the entire execution trace. Pipe the sub-agent's whole context back into the orchestrator: every step of reasoning, every tool call, every intermediate result. Information is complete. The orchestrator can see what the sub-agent did, why it did it, and what obstacles it ran into along the way. The cost is enormous. A sub-agent running a 15-step task might carry tens of thousands of tokens of context. With five sub-agents, the orchestrator may have to absorb hundreds of thousands of tokens, enough to crowd or overflow its own window. And most of it is useless to the orchestrator: it does not need to know which file the sub-agent read at step 7 or which test command it ran at step 12. It only needs the final result and the key findings.

Send only the final result. The sub-agent returns a one-line summary—"task complete, modified 3 files, all tests passing." Information is heavily compressed. The orchestrator's context is barely affected. But the risk is just as obvious. If the sub-agent discovered something important during execution—"the refund logic depends on an API that is being deprecated, recommend replacing it as part of this refactor"—and that finding never made it into the summary, the orchestrator will miss it. Decisions made on incomplete information will be wrong.

Worse: a sub-agent's task can "look done" while hiding real problems. A sub-agent changes a function's signature but does not update every caller—because some callers are outside its scope. If the report just says "changes complete, tests passing," the orchestrator will not know there are unupdated callers, until another sub-agent—or the eventual integration test—runs into the problem.

The pragmatic middle ground is the structured report. The report has a fixed set of fields: execution status (success / partial / failure); a change summary (which files were modified, a high-level description of the changes); key findings (important issues, risks, or recommendations encountered during execution); declared dependencies (the assumptions this subtask relied on; the other modules it touches); open issues (problems the orchestrator or another sub-agent needs to handle). This format strikes a balance between completeness and manageability. The orchestrator can absorb each sub-agent's outcome quickly without drowning in detail. Key findings and dependency declarations make sure important information does not vanish.
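One way to pin those fields down is a small typed record. The sketch below is an assumption about how such a report could be modeled; the class name, the field names, and the needs_attention helper are illustrative and are not drawn from any specific framework.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    SUCCESS = "success"
    PARTIAL = "partial"
    FAILURE = "failure"


@dataclass
class SubAgentReport:
    agent: str
    status: Status
    change_summary: str
    files_modified: list[str] = field(default_factory=list)
    key_findings: list[str] = field(default_factory=list)
    declared_dependencies: list[str] = field(default_factory=list)
    open_issues: list[str] = field(default_factory=list)

    def needs_attention(self) -> bool:
        """True when the orchestrator should not accept the result as-is."""
        return self.status is not Status.SUCCESS or bool(self.open_issues)


# Hypothetical report from the refund sub-agent in the running example.
report = SubAgentReport(
    agent="refund-refactor",
    status=Status.SUCCESS,
    change_summary="Migrated refund logic to the new error-handling pattern.",
    files_modified=["refunds.py"],
    key_findings=["Refund logic depends on an API that is being deprecated."],
    declared_dependencies=["Assumes the new create_order signature."],
    open_issues=["Callers outside this subtask's scope are not updated."],
)
print(report.needs_attention())
```

A report that says "success" but still lists open issues is flagged, which is exactly the "looks done while hiding problems" case described earlier.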

The catch is that this format is only as good as the sub-agent's reporting ability. Can it accurately judge what counts as a "key finding" and what counts as an "open issue"? That judgment is itself an intelligence-heavy task. If the sub-agent's judgment is shaky, it may swallow critical information as irrelevant detail or escalate routine detail as critical information.

The deeper issue is this: who decides what counts as "key information"? Defined well, the orchestrator can locate problems fast; defined poorly, signal gets buried in noise. There is no clean general solution—only different trade-offs in different scenarios.

6.4 Toward a Protocol: A2A and the "Internet Between Agents"

Section 6.3 is about the cognitive layer of inter-agent communication—what to send, how much to compress. There is a more foundational question sitting behind it: what "language" do Agents use to talk to each other in the first place?

So far in this chapter, the multi-agent collaboration we have described has all been "inside the same framework." The orchestrator and the sub-agents are created by the same system, run inside the same process or under the same scheduler, and their "communication" is essentially a function call, a message queue, or a structured dictionary. That is a homogeneous multi-agent system.

The real world is starting to look different. Your coding Agent is from Vendor A, your security-audit Agent is from Vendor B, your deployment Agent is from Vendor C—each running on a different model, a different framework, a different toolchain—and you want them to collaborate on the same task. A coding Agent finishes a critical change in a payments path and needs to hand it to an independent security-audit Agent for review; once the audit passes, it goes to an independent deployment Agent for rollout. Three Agents, three vendors. How do they discover each other? How do they hand off tasks? How do they pass intermediate artifacts? How do they report failures?

In that scenario, the "function call" model of orchestrator-and-workers stops being enough. Heterogeneous Agents share no memory, no scheduler, not even a language runtime. They need a protocol—a standard way for one Agent to speak to another.

The most prominent attempt in this direction so far is A2A (Agent2Agent), originally proposed by Google and now maintained by the Linux Foundation. Its scope is clean. A2A does not solve how an Agent reasons internally, and it does not solve how an Agent calls tools—the former belongs to the model, the latter is already being handled by MCP. What A2A solves is interoperability between Agents themselves.

Its core abstractions roughly come in three pieces. First, the Agent Card—each Agent declares "who I am, what I can do, how to reach me, how to authenticate against me," the agent-world equivalent of a service profile. Other Agents can discover and connect to it without prior acquaintance. Second, a task lifecycle—tasks have an explicit state machine: submitted, in progress, awaiting input, completed, failed. The caller does not have to poll a vague "is it still running?"; the state is part of the protocol. Third, a message channel—bidirectional transport for multi-modal payloads (text, files, structured data); long-running tasks can push progress updates instead of forcing the caller to block.
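The task lifecycle is easiest to see as a small state machine. The sketch below uses the five states named above; the transition table is an assumption, and the card's keys and the example.com endpoint are invented for illustration. None of this is the actual A2A schema, which should be taken from the specification itself.

```python
from enum import Enum


class TaskState(Enum):
    SUBMITTED = "submitted"
    IN_PROGRESS = "in progress"
    AWAITING_INPUT = "awaiting input"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed transition table; the text names the states but not the edges.
ALLOWED = {
    TaskState.SUBMITTED: {TaskState.IN_PROGRESS, TaskState.FAILED},
    TaskState.IN_PROGRESS: {TaskState.AWAITING_INPUT, TaskState.COMPLETED, TaskState.FAILED},
    TaskState.AWAITING_INPUT: {TaskState.IN_PROGRESS, TaskState.FAILED},
    TaskState.COMPLETED: set(),
    TaskState.FAILED: set(),
}


class Task:
    """A task whose state can only move along the allowed edges."""

    def __init__(self, task_id: str) -> None:
        self.task_id = task_id
        self.state = TaskState.SUBMITTED
        self.history = [TaskState.SUBMITTED]

    def advance(self, new_state: TaskState) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} is not allowed")
        self.state = new_state
        self.history.append(new_state)


# Illustrative card only; these keys are NOT the A2A schema.
security_audit_card = {
    "name": "security-audit-agent",
    "capabilities": ["review payment-path changes"],
    "endpoint": "https://agents.example.com/security-audit",
    "auth": "bearer-token",
}

task = Task("audit-001")
task.advance(TaskState.IN_PROGRESS)
task.advance(TaskState.AWAITING_INPUT)
task.advance(TaskState.IN_PROGRESS)
task.advance(TaskState.COMPLETED)
print([state.value for state in task.history])
```

Because the state is explicit, a caller can tell "waiting for your input" apart from "still working" without guessing from silence.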

Lined up next to MCP, the division of labor becomes clear:

Dimension	MCP	A2A
Communication between whom	Agent ↔ tools / resources / data	Agent ↔ Agent
One-line positioning	"How an Agent uses the world"	"How an Agent works with peers"
Analogy	USB-C / a tool bus	HTTP / an internet between Agents
What the protocol cares about	Tool description, invocation, capability exposure	Identity discovery, task orchestration, state synchronization

The two are not substitutes; they are complementary layers. Internally, an Agent uses MCP to call tools. Externally, it uses A2A to coordinate with other Agents. A complete multi-vendor Agent system usually needs both layers.

But honestly: A2A is nowhere near as mature as MCP. MCP has effectively become the industry standard for tool calling, and the major Agent platforms are following it. A2A has stabilized as a specification and has early implementations, but its ecosystem is thin. Most teams are still struggling to make a single Agent run reliably and have not yet reached the point where they need cross-vendor interoperability.

So this section is not here because A2A will solve your team's problem today. It is here because A2A draws out a structural direction: as multi-agent collaboration grows from "function calls inside one framework" to "service calls across vendors," a protocol layer becomes inevitable. Homogeneous multi-agent systems do not need A2A, because they already share a runtime. But the moment you start treating "Agents" as objects that can be deployed independently, evolve independently, and come from different sources—the way we now think of microservices—the slot for an A2A-shaped protocol opens up.

Back to the spine of the chapter. From the single-Agent ceiling in 6.1, to orchestrator-and-workers in 6.2, to the cognitive question in 6.3, to the protocol layer in this section, the multi-agent picture has been unfolding from inside out. The next section pulls back inward and looks at coordination problems: dependencies between subtasks, the trade-off between parallel and serial execution, and how to handle conflicts.

6.5 Parallel and Serial: The Subtask Dependency Graph

A central advantage of multi-agent systems is parallelism—several sub-agents working at the same time can sharply cut total execution time. But parallelism is not just "run them all at once." It requires handling the dependencies between subtasks.

The most favorable case is subtasks with no dependencies. Take writing unit tests for 10 independent functions: the 10 functions do not call each other and the tests share no state, so 10 sub-agents can run simultaneously, each owning one function's tests, and total time approaches the time of the slowest sub-agent.

The hardest case is subtasks with hard dependencies: first design the interface, then write the implementation, then write the tests. The three subtasks have a strict order. Implementation depends on the interface definition, and tests depend on the implementation. They cannot start at the same time.

The most common case is partial dependencies. Inside a refactor, some submodules depend on each other (module A calls into module B's interface) and others do not (module C and module D are completely independent). The independent submodules can be refactored in parallel; the dependent ones run in order.

Subtask dependencies form a directed acyclic graph (DAG). Each node is a subtask; each edge is a dependency. Nodes with no incoming edges can start immediately; nodes with incoming edges have to wait for all their predecessors. This is the same model as job orchestration in CI/CD pipelines—some jobs can run in parallel, others have to wait for upstream jobs to finish.
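Given a dependency graph, working out what can run together is mechanical. The sketch below uses Python's standard graphlib module to group subtasks into waves; the subtask names and the parallel_waves helper are hypothetical, and the graph is assumed to be already known, which is the part the orchestrator actually struggles with.

```python
from graphlib import TopologicalSorter


def parallel_waves(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group subtasks into waves; every subtask in a wave can run in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()                        # raises graphlib.CycleError on a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all predecessors are done
        waves.append(ready)
        sorter.done(*ready)                 # unlocks the next wave
    return waves


# Hypothetical subtasks: each key maps to the subtasks it must wait for.
depends_on = {
    "design_interface": set(),
    "refactor_module_c": set(),
    "refactor_module_d": set(),
    "implement_interface": {"design_interface"},
    "write_tests": {"implement_interface"},
}

for number, wave in enumerate(parallel_waves(depends_on), start=1):
    print(f"wave {number}: {wave}")
```

The independent modules share the first wave with the interface design, while implementation and tests wait their turn.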

In the Agent world, building this dependency graph is itself the hard part.

In a CI/CD pipeline, dependencies are defined by humans—the developer writes "Job B depends on Job A" in the config. In a multi-agent system, the orchestrator has to figure them out itself. It has to read the structure of the task, understand the data flow and control flow between subtasks, and then decide what can run in parallel and what must run in series.

This judgment is probabilistic. The orchestrator may miss a dependency—it might fail to realize that refactoring module A will affect module C's interface, and it sends two sub-agents off to work in parallel. The two sub-agents each modify a different aspect of the same interface, and a conflict appears.

Conflicts during parallel execution are one of the trickiest problems in multi-agent systems, and they show up in three different shapes.

The most visible shape is the file-level conflict: two sub-agents modify the same file at the same time. This is the same family as Git merge conflicts—two people change the same chunk of the same file, and the merge breaks. It is harder in the Agent world, because Agents do not "negotiate" the way human developers do. Each one works in its own context and has no idea what the other is doing.

Harder still is the semantic conflict, which is more hidden. Two sub-agents do not modify the same file, but their changes are incompatible at the semantic level. One sub-agent changes a function's return type from error to (result, error). Another sub-agent calls that function in its own code but still handles the return value as if it had the old signature. Each sub-agent's code is internally correct, but the two do not compose.

The third shape is the state-level conflict: two sub-agents both modify a shared piece of state, such as a config file, a database schema, or a global constant. Each modification is reasonable on its own, but together they produce inconsistency.

The responsibility for handling these conflicts lands on the orchestrator. When it integrates results from sub-agents, it has to detect conflicts and coordinate the resolution. But detecting semantic and state-level conflicts requires deep understanding of the code—itself a high-difficulty task.
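The cheap part of conflict detection can be automated from what sub-agents declare. The sketch below assumes each sub-agent reports the files it modified, the interfaces it changed, and the interfaces it relied on; those inputs and both helper functions are hypothetical. It catches overlapping files and declared interface clashes only, and it cannot see a semantic conflict that nobody declared.

```python
from collections import defaultdict


def file_conflicts(modified: dict[str, set[str]]) -> dict[str, list[str]]:
    """Files that more than one sub-agent reported modifying."""
    owners = defaultdict(list)
    for agent, paths in modified.items():
        for path in paths:
            owners[path].append(agent)
    return {path: sorted(agents) for path, agents in owners.items() if len(agents) > 1}


def interface_conflicts(changed: dict[str, set[str]],
                        relied_on: dict[str, set[str]]) -> list[tuple[str, str, str]]:
    """(changer, dependant, interface) triples where one sub-agent changed an
    interface that another sub-agent declared it relied on."""
    hits = []
    for changer, names in changed.items():
        for dependant, assumed in relied_on.items():
            if dependant != changer:
                hits.extend((changer, dependant, name) for name in names & assumed)
    return sorted(hits)


# Hypothetical declarations pulled from three sub-agent reports.
modified = {
    "orders": {"orders.py", "config.yaml"},
    "refunds": {"refunds.py", "config.yaml"},
    "gateway": {"gateway.py"},
}
changed = {"orders": {"create_order"}}
relied_on = {"refunds": {"create_order"}, "gateway": {"charge"}}

print(file_conflicts(modified))
print(interface_conflicts(changed, relied_on))
```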

A practical rule of thumb: prefer less parallelism over having to resolve complex conflicts. If you are not sure whether two subtasks have a dependency, run them in series. The cost of serial execution is time. The cost of a conflict is correctness. In most scenarios, correctness matters more than speed.

6.6 Multi-Agent Topologies

Orchestrator-and-workers is the most basic multi-agent pattern, but not the only one. As task complexity grows, the way Agents are organized grows with it.

The simplest is the hub-and-spoke (star) topology—the orchestrator-and-workers pattern we have been discussing. One orchestrator at the center, multiple sub-agents around it, all communication going through the center, and no direct communication between sub-agents. This is the easiest topology to reason about, and it fits well when subtasks are relatively independent. Its strength is centralized control—the orchestrator has a complete global view and can make globally optimal decisions. Its weakness is that the orchestrator becomes a bottleneck—everything has to flow through it, and if there are too many sub-agents the orchestrator's own context fills up with their reports.

One layer up is the hierarchical topology—the orchestrator manages a few "mid-tier Agents," and each mid-tier Agent manages its own group of sub-agents. A "backend refactor" mid-tier Agent might manage several sub-agents responsible for different backend modules, while a "frontend refactor" mid-tier Agent manages a different group of sub-agents. The orchestrator only talks to the mid-tier Agents, never directly to the leaf sub-agents. This relieves the orchestrator's bottleneck—information is summarized and compressed at every level, and the orchestrator only handles mid-tier reports rather than every leaf detail. But the deeper the hierarchy, the more information loss—important findings at the leaves can quietly disappear as they get summarized upward layer by layer.

Another shape is the pipeline topology—Agents arranged in sequence, each one's output feeding into the next. For example: analysis Agent → design Agent → implementation Agent → testing Agent → review Agent. Each Agent owns one phase, hands its result to the next, and stops there. It fits tasks with clean phase boundaries. The strength is that each Agent's responsibility is sharp and its context is very clean—it only deals with the input and output of its own phase. The weakness is rigidity—if the review Agent finds a problem rooted in the design phase, the information has to flow upstream against the pipeline, which is awkward.

The diagram below contrasts these three topologies and where each one fits:

Three typical multi-agent topologies

There is also a fourth shape that is theoretically appealing—the peer-to-peer topology. No clear orchestrator; Agents talk, negotiate, and argue with each other directly.

It is appealing because it mirrors how human teams collaborate. In practice, the coordination cost is enormous. Without a central coordinator, communication can fall into deadlocks (A waits on B's result, B waits on A's result), or produce inconsistent decisions (A wants approach X, B wants approach Y, no one to break the tie).
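The deadlock case is a cycle in the "who is waiting on whom" graph, and a cycle is easy to detect once the waits are recorded. The sketch below uses the standard graphlib module for the check; the find_wait_cycle helper and the way waits are recorded are assumptions made for illustration.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Optional


def find_wait_cycle(waits_on: dict[str, set[str]]) -> Optional[list[str]]:
    """Return one cycle of Agents waiting on each other, or None if there is none."""
    try:
        TopologicalSorter(waits_on).prepare()
    except CycleError as exc:
        return list(exc.args[1])      # graphlib reports the cycle as the second argument
    return None


# Hypothetical wait-for relations between peer Agents.
deadlocked = {"A": {"B"}, "B": {"A"}}
healthy = {"A": {"B"}, "B": set()}

print(find_wait_cycle(deadlocked))
print(find_wait_cycle(healthy))
```

Detection is the easy half. Deciding which Agent should give way still needs a tie-breaker, which is what a central coordinator provides.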

In AI coding specifically, hub-and-spoke is the most common topology, because it is the simplest and most controllable. Hierarchical topologies appear occasionally on very complex tasks. Pipeline topologies show up where the task has clean phases—code review, CI/CD integration. Peer-to-peer is still experimental, with very few real deployments.

Which topology to pick depends on the structure of the task. If subtasks are relatively independent, hub-and-spoke is enough. If the task has clean phase boundaries, a pipeline reads more naturally. If the task is large and deep, a hierarchy manages complexity better. There is no topology that is best in every situation—the choice tracks your task's structure and complexity.

6.7 The Coordination Problems of Multi-Agent Systems

Multi-agent is not "split the task across a few Agents and call it done." It introduces a new family of complexity that does not exist in a single-Agent architecture. The complexity falls along two main axes—one is how to decompose the task and how the cost of doing so adds up; the other is what happens after decomposition, when sub-agents drift out of sync and you cannot tell who broke what.

Start with the first axis: the quality of task decomposition, and the cost ledger behind it.

A multi-agent system's effectiveness rests first on the quality of its task decomposition. The shape of the subtasks the orchestrator produces directly determines the system's outcome. Too coarse—the scope of a single subtask is too large, and the sub-agent will hit the same single-Agent ceiling. Treating "refactor the entire payments module" as one subtask is not meaningfully different from giving the whole task to a single Agent. Too fine—dependencies between subtasks become extremely complex, and coordination cost outweighs the gains from parallelism. Treating every line of every function as its own subtask forces sub-agents to communicate constantly to stay consistent, and the orchestrator does more work coordinating than it would have done executing. Wrong cuts—the boundaries between subtasks land in the wrong places, splitting a logically unified change across two sub-agents. A function's signature change and the updates to its callers end up in different sub-agents that must agree on the change but execute independently in their own contexts. Consistency becomes very hard to guarantee.

Good task decomposition requires deep understanding of the task itself—knowing which parts are cohesive (and should stay together), which parts are loosely coupled (and can be separated), and which parts have implicit dependencies (that need extra attention). Today's models are not yet stable at this kind of judgment.

Decomposition quality is one face. The cost ledger is the other. One more Agent means one more context, and one more context means more tokens. The orchestrator's context holds the task description, the decomposition plan, and every sub-agent's report. Each sub-agent's context holds its subtask description, the relevant source files, and its own execution history. A task that costs 50K tokens with a single Agent might cost 30K for the orchestrator plus 20K per sub-agent: with five sub-agents you are at 130K, 2.6× the single-Agent cost. With more sub-agents and more complexity, the multiplier climbs further. And the cost is not only token count. Every step of every Agent is a model inference call, and a multi-agent run accumulates many more calls than a single-Agent run. The conclusion: not every task is worth multi-agent. If a task takes 10 steps for a single Agent, it might take 3 Agents at 5 steps each in a multi-agent setup. That is 15 steps total, 50% more cost, while wall-clock time might drop by only 30%, because planning, hand-offs, and integration still run serially. The math does not always pay off.
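The ledger is simple enough to check by hand, and writing it as a function makes it easy to plug in your own figures. The sketch below reuses the numbers from the paragraph above; the function name is hypothetical, and the model ignores everything except context size and step count.

```python
def multi_agent_tokens(orchestrator: int, per_sub_agent: int, sub_agents: int) -> int:
    """Total tokens: one orchestrator context plus one context per sub-agent."""
    return orchestrator + per_sub_agent * sub_agents


single_agent = 50_000
multi_agent = multi_agent_tokens(orchestrator=30_000, per_sub_agent=20_000, sub_agents=5)
print(multi_agent, multi_agent / single_agent)      # 130000 2.6

single_steps = 10
multi_steps = 3 * 5                                  # 3 Agents at 5 steps each
print(multi_steps / single_steps - 1)                # 0.5, i.e. 50% more calls
```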

Now the second axis: inconsistency between sub-agents, and the cost of debugging it.

When sub-agents work independently, they can make decisions that do not line up. A sub-agent refactoring module A decides to change a shared function's parameter from string to []byte because that is more efficient inside module A's context. A sub-agent refactoring module B is still calling the same function with the old string signature. Each sub-agent's decision is reasonable inside its own context. Together they are inconsistent. More subtle inconsistencies are stylistic. One sub-agent puts ctx context.Context first in every function signature; another puts it last. One sub-agent writes error messages in English, another in Chinese. None of this causes a compile error, but the code reads like it was written by different people, because it really was written by different "people."

Resolving inconsistency is the orchestrator's job. But to detect it, the orchestrator has to cross-check every sub-agent's output—itself a complex task—and the orchestrator's context is finite, so it may not be able to hold every sub-agent's full output side by side for comparison.

Worse, once inconsistency happens, debugging cost rises sharply. When a single Agent goes wrong, you read its execution log: it is linear, and you trace its reasoning and actions step by step to the point where things broke. When a multi-agent system goes wrong, you have to read multiple execution logs and understand the interactions between them. What instruction did the orchestrator give the sub-agent? What did the sub-agent return? What did the orchestrator decide based on that? Did this sub-agent fail because of its own logic, because the orchestrator's instruction was wrong, or because another sub-agent's output upstream contaminated it? Problems often live not inside any single Agent, but in the interactions between Agents.

You need a global view to understand the system's behavior, but the global view is itself hard to construct. Observability becomes critical in multi-agent systems: you need to trace each Agent's execution, log every cross-agent message, and reconstruct the root cause when something fails. This area has long been underinvested in. A few leading tools have made meaningful progress on visualizing multi-agent execution traces and tracking inter-agent interactions, but across the broader tooling ecosystem you can usually see an Agent's final output and very little of its full execution trace or its conversations with other Agents.
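To make the idea of a reconstructable trace concrete, here is a minimal sketch of a cross-agent event log. It is not the design of any particular tool. The Event fields, the TraceLog class, and the agent names are assumptions made for illustration. The one idea it shows is that each message records which earlier event caused it, so a failure can be walked back to its origin.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    event_id: int
    agent: str                       # who emitted the event
    kind: str                        # "instruction", "report", or "error"
    detail: str
    caused_by: Optional[int] = None  # id of the event this one reacts to


class TraceLog:
    def __init__(self) -> None:
        self.events: dict[int, Event] = {}

    def record(self, agent: str, kind: str, detail: str,
               caused_by: Optional[int] = None) -> int:
        """Store one event and return its id so later events can point to it."""
        event_id = len(self.events) + 1
        self.events[event_id] = Event(event_id, agent, kind, detail, caused_by)
        return event_id

    def causal_chain(self, event_id: int) -> list[Event]:
        """Walk caused_by links back to the root, returned oldest first."""
        chain = []
        current: Optional[int] = event_id
        while current is not None:
            event = self.events[current]
            chain.append(event)
            current = event.caused_by
        return list(reversed(chain))


log = TraceLog()
e1 = log.record("orchestrator", "instruction", "refactor module A")
e2 = log.record("sub_agent_a", "report", "changed shared parameter to []byte", e1)
e3 = log.record("orchestrator", "instruction", "refactor module B", e2)
e4 = log.record("sub_agent_b", "error", "call uses old string signature", e3)

for event in log.causal_chain(e4):
    print(event.agent, event.kind, event.detail)
```

Walking the chain from the error shows that the failure in sub_agent_b traces back through the orchestrator to a change reported by sub_agent_a, which is exactly the kind of cross-agent cause that a per-Agent log cannot show.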

6.8 When to Use Multi-Agent—And When Not To

Multi-agent is not "a stronger Agent." It is an architectural choice that trades coordination complexity for capability scaling. Like every architectural choice, it has scenarios where it fits and scenarios where it does not.

There are roughly three kinds of fit. One: the task naturally decomposes into independent subtasks—write tests for several modules in parallel, fix several unrelated bugs in parallel, run code review across several files in parallel. Subtasks have few or no dependencies, and parallelism cuts total time meaningfully. Two: the task needs different roles. Writing code and reviewing code call for different mental modes; designing architecture and building details call for different focus. Letting different Agents take on different roles—each loaded with the Skills that fit its role—works better than asking one Agent to keep switching modes. Three: a single Agent's context simply cannot hold enough. The task references too many source files, documents, and historical records to fit into one window. Multi-agent uses context isolation so each Agent only handles the slice of information it actually needs.

The misfits are just as common. When the task itself is not complex (write a function, fix a bug, answer a question), a single Agent is plenty, and adding a multi-agent layer only buys coordination overhead. When subtasks are tightly coupled and each one depends on the results of others, parallelism has nowhere to live, and the multi-agent system collapses back into serial execution while still paying the inter-agent communication overhead. When consistency requirements are extremely high, as in an atomic refactor that touches many files, independent execution by multiple sub-agents is hard to keep consistent, and a single Agent running serially is more reliable. And when debugging and observability are not yet mature, multi-agent is a real risk for critical tasks: if your tooling cannot trace and debug multi-agent runs, you will not be able to diagnose failures.

A simple rule of thumb: if you are not sure whether to use multi-agent, do not use it. Multi-agent is a tool you reach for when you need it, not a default architecture. Try a single Agent first; if you hit the ceiling—context too small, roles too tangled, execution too slow—then bring in multi-agent. Reaching for multi-agent too early is like splitting one person's work across three people: communication and coordination overhead can outrun whatever speedup you got from doing things in parallel.

Even when you have correctly decided "multi-agent fits here," the system should still be set up with a clear fallback path, not built on the assumption that coordination will always succeed. The pragmatic version usually has four layers: cap each sub-agent's execution time and reclaim the task on timeout; allow partial results to be preserved instead of throwing the whole batch away when one subtask fails; fall back automatically to single-Agent serial execution when coordination cost clearly exceeds the gains; and request explicit human intervention when even automatic fallback cannot resolve the situation. The principle behind them is simple: better a slower correct result than a fast wrong one.
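The four layers can be sketched in a few lines. This is one possible arrangement under simplifying assumptions, not a reference implementation. The run_sub_agent and run_single_agent callables are hypothetical stand-ins for real Agent invocations, a single batch deadline stands in for the per-sub-agent cap, and the fallback here simply retries unfinished subtasks serially rather than measuring coordination cost. Note also that a Python thread cannot be killed on timeout, so the sketch stops waiting for a slow sub-agent rather than terminating it.

```python
import concurrent.futures as cf
import time


def run_with_fallback(subtasks, run_sub_agent, run_single_agent, timeout_s):
    """Run subtasks in parallel, keep partial results, then fall back serially."""
    pool = cf.ThreadPoolExecutor(max_workers=max(1, len(subtasks)))
    futures = {pool.submit(run_sub_agent, task): task for task in subtasks}

    # Layer 1: cap execution time and stop waiting once the deadline passes.
    done, _not_done = cf.wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False, cancel_futures=True)

    # Layer 2: keep every result that did succeed instead of discarding the batch.
    results, pending = {}, []
    for future, task in futures.items():
        if future in done and future.exception() is None:
            results[task] = future.result()
        else:
            pending.append(task)

    # Layer 3: fall back to serial single-Agent execution for what is left.
    # Layer 4: anything that still fails is handed to a human.
    needs_human = []
    for task in pending:
        try:
            results[task] = run_single_agent(task)
        except Exception:
            needs_human.append(task)
    return results, needs_human


# Hypothetical stand-ins: one subtask is slow, one always fails.
def fake_sub_agent(task):
    if task == "slow":
        time.sleep(0.6)
    if task == "broken":
        raise ValueError("sub-agent failed")
    return f"sub:{task}"


def fake_single_agent(task):
    if task == "broken":
        raise ValueError("single Agent failed too")
    return f"single:{task}"


results, needs_human = run_with_fallback(
    ["ok", "slow", "broken"], fake_sub_agent, fake_single_agent, timeout_s=0.2
)
print(results, needs_human)
```

In this run the quick subtask keeps its parallel result, the slow one is redone serially, and the one that fails both ways ends up in the list for human attention.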

For exactly this reason, multi-agent does not make failure go away. It changes the shape of failure from "one Agent did something wrong" to "a group of Agents went wrong together." As capability stacks higher, failure modes get more complex along with it—and that is the question the next chapter takes head-on.