
5. Skill as a Packaged Capability

Your team has a Go coding standard. It is not long—maybe two thousand words—covering error-handling patterns, logging format, package naming conventions, and interface design principles. Every new hire reads it once on day one, then gets reminded of it in code review again and again until it turns into muscle memory.

Now you want your AI coding assistant to follow that same standard.

The first instinct: cram the standard into the system prompt. You try it, and it actually works—the Agent's code does start to look more like the team's style. But the system prompt has finite room. Your standard takes up two thousand words; add tool descriptions, task instructions, and project context on top, and the system prompt has already ballooned to several thousand tokens. And it is not just Go—you also have Python projects, TypeScript projects, each with its own conventions. You cannot reasonably stuff every language's coding standard into one system prompt.

The second instinct: turn it into an MCP tool. Write a get_coding_standard tool, let the Agent fetch the standard whenever it needs it. You try this and run into an awkward problem—the Agent keeps "forgetting" to call the tool. It just starts writing code, falling back on the generic style it learned from training data, ignoring your team's standard entirely. You add a line in the system prompt: "before writing code, please call get_coding_standard to fetch the coding standard." Things improve, but not reliably—sometimes it calls, sometimes it does not.

Where is the mismatch?

A coding standard is not a "tool." It is not a one-shot operation you "invoke once and you're done." It is a set of constraints that need to continuously shape the Agent's behavior. It is closer to the Agent's working habits than to a hammer in its toolbox. You do not check the screwdriver manual every time you turn a screw—you already know how to use a screwdriver, and that knowledge is internalized into how you work. The same should be true for a coding standard: not "look it up when needed" but "know it from the start."

The trouble is, the system prompt cannot hold every "know it from the start." Different projects need different standards, different tasks need different knowledge, different roles need different behavioral modes. What you actually need is a mechanism that can load behavioral constraints on demand—pull in the Go standard when writing Go code, pull in the review checklist when reviewing code, pull in the migration playbook when handling a database migration.

That mechanism is the Skill.

5.1 Not Every Capability Is a "Tool"

In an Agent's world, "capabilities" can be split, roughly, into two kinds.

The first kind is "do a specific thing." Read a file, search a piece of code, run a command, query a database. These capabilities have well-defined inputs and outputs, a clear moment of execution (the Agent decides "I need to do this now"), and an unambiguous completion signal (the operation finishes and returns a result). MCP tools are a clean fit for this kind.

The second kind is "do things in a particular way." Write code following the team's coding standard, review code against a security checklist, design a system following a particular architectural pattern, run code review through a fixed process. These look much more like working modes or behavioral constraints. They do not have the discrete "trigger moment" or sharp "input/output" of a tool call. Instead, while they are active—whether across an entire task or loaded on demand at a specific step—they act as background context that continuously shapes how the Agent thinks and what it produces.

If you try to express the second kind as an MCP tool, you hit exactly the problem we opened with: the Agent may forget to call the tool, or call it once and stop following its guidance afterward. That is because the tool abstraction is designed for "do a specific thing," not for "continuously influence how things are done."

Skills are the abstraction designed for the second kind. A Skill is not a function, not a tool, not a prompt template. It is a capability package: a bundle of instructions, resources, workflow, and tool orchestration that can be loaded into an Agent on demand.

5.2 What's Inside a Skill: Instructions, Resources, Workflow, and Tool Orchestration

Once you accept that Skills exist to "continuously shape behavior," the natural next question is: what does a Skill actually contain? For a capability package to change an Agent's behavior and be reusable across scenarios, a single block of natural-language instructions is not enough. A complete Skill usually contains four kinds of content.

The most central layer is the behavioral instructions—a set of rules that tell the Agent "here is how you should do this." This is the primary lever a Skill uses to change model behavior, and it is the sharpest difference between a Skill and a tool description. A tool description tells the model "this is a hammer; it drives nails." Behavioral instructions tell the model "when you drive a nail, hold the hammer perpendicular and press straight down—do not strike at an angle." For example, a "Go coding standard Skill" might include rules like: "error handling must use the error-wrapping pattern, in the form fmt.Errorf("context: %w", err)," "use a structured logging library; do not use fmt.Println," "every exported function must have a comment, and the comment must begin with the function name." These rules are not "called" at a particular moment—they constrain every token the model generates for as long as the Skill is loaded.

Pure instructions are often not enough. Many constraints are far easier to communicate by showing a concrete example than by describing them in prose. That is where reference resources come in. A Skill can point to external documents, code samples, configuration templates that the Agent can consult when needed. A "microservice architecture Skill" might reference an architecture design document, a set of service template files, and an API design specification. We saw in Chapter 2 how few-shot examples interact with the attention mechanism: examples guide the model far more effectively than abstract description. Reference resources play the same role inside a Skill—when the behavioral instructions say "create new services using the team's service template," the reference resource is the actual template the Agent looks at, and seeing the "right answer" in concrete form is much stronger than reading an abstract description of it.

Behavioral instructions answer "how to do it." Reference resources answer "what to model it on." But some Skills are not just constraints on how things should be done—they encode an entire way of doing things. Code review is not a single action; it is a process: read the PR description, read the diff, review file by file, write up a report. Test-driven development is not a single rule; it is a loop: write a failing test, write the implementation, refactor. These cases call for workflow definitions—explicit steps, an explicit order, a goal and an output for each step. A "code review Skill" might define a five-step process: check style, check error handling, check security, check performance, generate the review report. With a workflow, a Skill stops being merely "a set of constraints" and becomes "a script": the Agent is no longer just "following the rules"—it is "following the script."

There is one more layer that is often overlooked: each step in a workflow usually involves calling specific tools. Should you run tests first or lint first? Should you read every file before analyzing, or read and analyze incrementally? "The order and combination in which tools get called" is itself part of what a Skill knows. This is tool orchestration: a Skill can specify which tools to use during execution and in what sequence. A "test-driven development Skill" might specify: "run the existing tests to confirm a baseline → write the new test → run the test and confirm it fails → write the implementation → run the test and confirm it passes → run the linter." Tool orchestration ties "which tools are used" to "the pattern in which they are used"—the same set of tools, orchestrated differently, produces a fundamentally different result.

Figure: The four ingredients of a Skill: instructions, resources, workflow, and tool orchestration
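
The four layers described above can be written down as a data structure. The sketch below is only an illustration: the Skill type, its field names, and the Render function are hypothetical and are not taken from any product. It shows one way an Agent might hold the four ingredients separately and flatten them into text when the Skill is loaded into context.

```go
package main

import (
	"fmt"
	"strings"
)

// Skill is a hypothetical in-memory shape for a capability package.
// The field names are illustrative only.
type Skill struct {
	Name         string
	Instructions []string // behavioral rules: "how to do it"
	Resources    []string // paths to reference material: "what to model it on"
	Workflow     []string // ordered steps: "the script"
	ToolPlan     []string // which tools to call, in what order
}

// writeList appends one titled section; empty sections are skipped.
func writeList(b *strings.Builder, title string, items []string, numbered bool) {
	if len(items) == 0 {
		return
	}
	fmt.Fprintf(b, "## %s\n", title)
	for i, item := range items {
		if numbered {
			fmt.Fprintf(b, "%d. %s\n", i+1, item)
		} else {
			fmt.Fprintf(b, "- %s\n", item)
		}
	}
}

// Render flattens the Skill into the text an Agent would inject into context.
func (s Skill) Render() string {
	var b strings.Builder
	fmt.Fprintf(&b, "# Skill: %s\n", s.Name)
	writeList(&b, "Instructions", s.Instructions, false)
	writeList(&b, "Resources", s.Resources, false)
	writeList(&b, "Workflow", s.Workflow, true)
	writeList(&b, "Tool orchestration", s.ToolPlan, true)
	return b.String()
}

func main() {
	tdd := Skill{
		Name:         "test-driven-development",
		Instructions: []string{"Never write implementation code without a failing test."},
		Resources:    []string{"templates/table_test.go"}, // hypothetical path
		Workflow:     []string{"write a failing test", "write the implementation", "refactor"},
		ToolPlan:     []string{"run tests", "write test", "run tests", "write code", "run tests", "run linter"},
	}
	fmt.Print(tdd.Render())
}
```

Keeping the layers as separate fields, rather than one string, is what makes later decisions possible: an Agent could inject only the instructions at first and pull in resources or the tool plan when a step needs them.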

Once you put these four layers side by side, the boundaries between Skills and the concepts they get confused with become much sharper.

The difference from an MCP tool is the level you are operating at. A tool is "do a specific thing." A Skill is "do things in a particular way." If MCP tools are "the hammer, the screwdriver, the wrench," a Skill is "the carpenter's manual": the manual tells you when to use the hammer, how to use the screwdriver, and in what order to assemble the pieces.

The difference from a system prompt is when it takes effect. A system prompt is global and fixed: it is in force for the entire conversation. A Skill is local and dynamic: it can be loaded and unloaded on demand. You do not pile every book in the library onto your desk; a Skill lets you "pull whichever book you need, when you need it."

A more precise engineering analogy: a Skill is like middleware. In a web framework, middleware does not implement business logic—it changes how requests are processed. Authentication middleware ensures every request is authenticated; logging middleware ensures every request is recorded. Skills do something similar for an Agent. A coding standard Skill ensures every piece of code conforms to the standard; a security review Skill ensures every change goes through a security check. Middleware does not replace business logic; Skills do not replace tool calls. What they change is the way things get done, not what things get done.

The four ingredients sound abstract, but they are very concrete in the actual product surface. Take Claude Code's Skills as an example: a Skill is a directory; inside the directory there is a SKILL.md file; at the top of the file there is a YAML frontmatter block with at least a name and a description; below it is the full body of instructions the Agent will read. The directory can also hold reference materials, template files, even executable scripts. So the four layers are not flattened into a single block of text. The body of SKILL.md carries the behavioral instructions, the workflow, and the tool orchestration written out as steps; the supporting files in the directory carry the reference resources; and the scripts in the directory let a Skill cross the line from "knowledge" into "execution," because the Agent can run them directly from the shell. That makes a Skill more than just a behavior-shaping document. It is a capability package with executable artifacts, and that detail will matter again when we walk through the interaction flow.
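
To make the file layout concrete, here is a minimal sketch of how a loader might separate the frontmatter summary from the body of a SKILL.md file. Everything in it is an assumption for illustration: parseSkillFile and Summary are hypothetical names, the example file content is invented, and the parser only understands flat "key: value" lines. A real loader would use a proper YAML parser.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

// Summary holds the two frontmatter fields the text mentions.
type Summary struct {
	Name        string
	Description string
}

// exampleSkill is an invented SKILL.md used only for this sketch.
const exampleSkill = `---
name: go-coding-standard
description: Go coding conventions, including error-handling patterns, logging format, package naming
---
# Go coding standard

- Wrap errors with fmt.Errorf("context: %w", err).
- Use a structured logging library; do not use fmt.Println.
`

// parseSkillFile splits a SKILL.md document into its summary and its body.
// It handles only flat "key: value" frontmatter lines (a simplification).
func parseSkillFile(doc string) (Summary, string, error) {
	var sum Summary
	sc := bufio.NewScanner(strings.NewReader(doc))
	if !sc.Scan() || strings.TrimSpace(sc.Text()) != "---" {
		return sum, "", errors.New("missing frontmatter opening delimiter")
	}
	closed := false
	for sc.Scan() {
		line := sc.Text()
		if strings.TrimSpace(line) == "---" {
			closed = true
			break
		}
		key, value, ok := strings.Cut(line, ":")
		if !ok {
			continue // not a key: value line; ignored in this sketch
		}
		switch strings.TrimSpace(key) {
		case "name":
			sum.Name = strings.TrimSpace(value)
		case "description":
			sum.Description = strings.TrimSpace(value)
		}
	}
	if !closed {
		return sum, "", errors.New("missing frontmatter closing delimiter")
	}
	if sum.Name == "" || sum.Description == "" {
		return sum, "", errors.New("frontmatter needs both name and description")
	}
	var body []string
	for sc.Scan() {
		body = append(body, sc.Text())
	}
	return sum, strings.TrimSpace(strings.Join(body, "\n")), nil
}

func main() {
	sum, body, err := parseSkillFile(exampleSkill)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("summary: %s: %s\n", sum.Name, sum.Description)
	fmt.Printf("body: %d bytes, kept out of context until the Skill is chosen\n", len(body))
}
```

The split matters because the two halves have different lifetimes: the summary is cheap enough to show the model up front, while the body is only worth its context cost once the Skill is actually needed.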

5.3 Progressive Disclosure: The Core Mechanism Behind Skills

Knowing what a Skill is matters less than the next question: how does a Skill actually work?

How a Skill Changes Model Behavior

Recall the central observation from Chapter 1: every step of generation is P(next token | all preceding tokens)—the probability distribution over the next token, conditioned on everything that came before. The contents of the context directly shape that distribution.

That is exactly the foundation a Skill operates on. When a Skill is loaded, its instructions and resources are injected into the context window. From that moment on, every token the model generates is conditioned on those instructions. A Skill is not "invoked" at a particular moment—it sits in the context and, by being there, reshapes the model's probability distribution across the entire task.

This is the essential difference between a Skill and a tool. A tool description sits in the context too, but what it tells the model is "here are the operations you can perform"—read a file, search code, run a command. After reading the description, the model decides whether to call the tool when it needs to, calls it, and that interaction ends. A Skill tells the model "here is how you should do things"—what pattern to use for error handling, what style to use when writing code, in what order to perform the steps. After reading the Skill's instructions, the model is not "calling" it at a particular moment; it is continuously constrained by it throughout generation. One is a capability declaration; the other is a behavioral constraint—they affect the probability distribution in completely different ways.

A concrete example. Suppose the Agent is writing a piece of Go error-handling code. With no Skill loaded, the model is likely to generate:

if err != nil {
    return err
}

This is the most common pattern in the training data, so it has the highest probability. Now load a Go coding standard Skill, and a new instruction shows up in the context: "error handling must use the error-wrapping pattern, in the form fmt.Errorf("context: %w", err)." The probability of the fmt.Errorf token sequence rises sharply, because the context now has an explicit instruction pointing at it. The model becomes much more likely to produce:

if err != nil {
    return fmt.Errorf("create user: %w", err)
}

The Skill did not "teach" the model anything new—the model already knew the error-wrapping pattern. What the Skill did was raise the probability of that pattern and lower the probability of others. This is the same mechanism we described for system prompts in Chapter 2; Skills are simply the dynamic, pluggable version of it.
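
For readers less familiar with Go, the difference between the two snippets above is observable at runtime. The sketch below uses hypothetical names (ErrDuplicate, insertUser, createUserBare, createUserWrapped) to show what the %w verb buys: the wrapped error carries the extra context in its message, and errors.Is can still find the original error underneath.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrDuplicate is a hypothetical sentinel error for this sketch.
var ErrDuplicate = errors.New("duplicate user")

// insertUser stands in for a storage call; here it always fails.
func insertUser(name string) error { return ErrDuplicate }

// createUserBare returns the error unchanged, as in the first snippet.
func createUserBare(name string) error {
	if err := insertUser(name); err != nil {
		return err
	}
	return nil
}

// createUserWrapped adds context with %w, as in the second snippet.
func createUserWrapped(name string) error {
	if err := insertUser(name); err != nil {
		return fmt.Errorf("create user: %w", err)
	}
	return nil
}

// isDuplicate reports whether err is, or wraps, ErrDuplicate.
func isDuplicate(err error) bool { return errors.Is(err, ErrDuplicate) }

func main() {
	fmt.Println(createUserBare("ada"))                 // duplicate user
	fmt.Println(createUserWrapped("ada"))              // create user: duplicate user
	fmt.Println(isDuplicate(createUserWrapped("ada"))) // true
}
```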

Once that mechanism is clear, a natural question follows: if Skills work by injecting context, how many should you load? Why not load all of them?

That is exactly where the system breaks. Which brings us to the most important design principle in Skill systems—progressive disclosure.

Why Progressive Disclosure Is Necessary

The core idea of progressive disclosure is simple: do not dump every piece of information on the model at once; reveal the right information at the right moment.

Why not dump it all in? Two reasons.

First, the context window is finite. If you load every Skill that might possibly be relevant, the window fills up and there is no room left for the actual task.

Second, and more importantly: more information dilutes the influence of each piece. Chapter 2 covered the relevant property of attention—the more content there is in the context, the more the model's attention is spread across it. If you load the Go coding standard, the Python coding standard, the TypeScript coding standard, the security review checklist, the performance optimization guide, and the database best-practices guide all at once, the model is staring at a stack of instructions and the "probability of being followed" for each of them goes down. It is like a person being talked at by ten people simultaneously—each of them has something reasonable to say, and the listener catches none of it.

Picture a full-stack engineer's day. Mornings spent on a Go backend service, afternoons on a React frontend, evenings on code review. Writing Go does not need React component patterns. Reviewing code does not need the database migration guide. Irrelevant knowledge is not just "wasted space"—it actively interferes with the model's attention. The more unrelated material there is in the context, the less the model focuses on what is actually relevant.

Progressive disclosure resolves this by only revealing the information needed at the current stage of the task. While writing Go, load the Go coding standard Skill. When switching to React, unload the Go-related Skill and bring in the TypeScript standard. When the task shifts to code review, swap in the review Skill. The Agent's effective capability set changes as the task changes, and the context window stays occupied by whatever is most relevant right now.

Figure: Skills loading dynamically: different scenarios bring in different capability sets

Three Layers of Progressive Disclosure

Progressive disclosure is not a crude on/off switch. It is a layered information-management strategy.

Session level: set the baseline. At the start of a session, load foundational Skills based on the project type. Open a Go project and the Agent automatically loads the Go coding standard Skill; open a React project and it loads the TypeScript standard Skill. This layer is driven by project configuration—coarse-grained disclosure.

Task level: set the approach. When a task begins, load extra Skills based on the task type. The user says "help me run a code review," and the Agent loads the code review Skill. The user says "help me write a database migration script," and the Agent loads the database migration Skill. This layer is driven by task semantics—medium-grained disclosure.

Step level: focus on the moment. During task execution, dynamically adjust what is in the context based on the current step. A code review Skill might define a five-step process: check style → check error handling → check security → check performance → write the report. While inside the "check security" step, the Skill can inject the detailed security checklist into the context, and compress or evict the detailed results from earlier steps to make room. This layer is driven by step semantics—fine-grained disclosure.

The three layers stack. Each one narrows the model's "field of attention," keeping it focused on whatever matters most at that moment.
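
The stacking of the three layers can be sketched as a single lookup that goes from coarse to fine. The tables and the activeContext function below are hypothetical; a real system would read them from project configuration and Skill metadata rather than hard-coding them.

```go
package main

import "fmt"

// sessionSkills: baseline Skills chosen by project type (session level).
var sessionSkills = map[string][]string{
	"go":    {"go-coding-standard"},
	"react": {"typescript-standard"},
}

// taskSkills: extra Skills chosen by task type (task level).
var taskSkills = map[string][]string{
	"code-review":  {"code-review"},
	"db-migration": {"database-migration"},
}

// stepDetail: detailed material injected only during one step (step level),
// keyed as "task/step".
var stepDetail = map[string][]string{
	"code-review/check-security": {"security-checklist"},
}

// activeContext returns what should be in context right now, coarse to fine.
func activeContext(project, task, step string) []string {
	var ctx []string
	ctx = append(ctx, sessionSkills[project]...)
	ctx = append(ctx, taskSkills[task]...)
	ctx = append(ctx, stepDetail[task+"/"+step]...)
	return ctx
}

func main() {
	fmt.Println(activeContext("go", "code-review", "check-style"))
	// [go-coding-standard code-review]
	fmt.Println(activeContext("go", "code-review", "check-security"))
	// [go-coding-standard code-review security-checklist]
}
```

Note that the security checklist only appears while the review is inside the "check security" step. Once the step changes, the next call simply no longer returns it, which is the eviction half of step-level disclosure.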

But who decides which Skills to load at which layer? Three patterns are common in practice. The first is configuration-driven: a project config file declares "this project uses these Skills." It is simple and direct, but inflexible. The second is LLM-driven autonomous selection: expose every Skill's full body to the model and let it decide what to use. Flexible, but expensive and unstable. The third, now the dominant pattern in practice, is summary-driven automatic matching: at startup the Agent injects only a short summary of each Skill (in Claude Code's Skills, that is the description field of the SKILL.md frontmatter), and lets the LLM choose which one matches the task. Only the chosen Skill is then expanded into full content. The third pattern is essentially the marriage of "lightweight declaration" and "model-driven selection," and it is the combination that makes progressive disclosure practical at all. In real systems these patterns are usually mixed: the project config provides a baseline set, and the LLM picks additional Skills from the summary list as needed.
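
The mixed pattern can be sketched as a small merge step. The selectSkills function below is a hypothetical helper, not part of any product: it takes the baseline declared in project configuration, adds whatever names the model picked from the summary list, removes duplicates, and drops any pick that is not a known Skill, since a model can name a Skill that does not exist.

```go
package main

import "fmt"

// selectSkills merges a config-declared baseline with the names the model
// picked. Unknown names are dropped and duplicates are kept only once.
func selectSkills(baseline, picks []string, available map[string]bool) []string {
	seen := make(map[string]bool)
	var out []string
	for _, group := range [][]string{baseline, picks} {
		for _, name := range group {
			if available[name] && !seen[name] {
				seen[name] = true
				out = append(out, name)
			}
		}
	}
	return out
}

func main() {
	available := map[string]bool{
		"go-coding-standard": true,
		"security-review":    true,
		"code-review":        true,
	}
	baseline := []string{"go-coding-standard"} // from project config
	// From the model; "sql-tuning" is not an available Skill.
	picks := []string{"security-review", "go-coding-standard", "sql-tuning"}

	fmt.Println(selectSkills(baseline, picks, available))
	// [go-coding-standard security-review]
}
```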

The Agent ↔ Skill ↔ LLM Interaction Flow

Putting progressive disclosure into the full interaction flow makes the collaboration between the three parties concrete. The flow below is not pure theory—it is essentially how Claude Code's Skills work in practice. The description field at the top of SKILL.md is the "summary" referenced below; the body of the file is the "full content."

Figure: Agent ↔ Skill ↔ LLM interaction flow

A complete interaction loop runs like this:

  1. The user starts a task. "Help me write a Go HTTP server with proper error handling and structured logging."
  2. The Agent injects the Skill summary list. It places the summaries of every available Skill (name plus a one-line description) into the context and sends them to the LLM. Note: only summaries at this point, not full instructions or resources. The context might contain entries like go-coding-standard: "Go coding conventions, including error-handling patterns, logging format, package naming" and security-review: "Security review checklist, including injection defense and authentication checks." These summaries cost very few tokens, but they are enough for the LLM to know "what capabilities are on offer."
  3. The LLM decides which Skill it needs. Looking at the task description and the summary list, the model reasons that "this task involves Go coding and error handling, so I need the full content of go-coding-standard." This decision uses the same mechanism the model uses to choose tools—a probabilistic selection based on context.
  4. The Agent loads the full Skill content. On receiving the LLM's request, the Agent retrieves the full Skill (behavioral instructions, reference resources, workflow definition) from the Skill library and injects it into the context window. From this moment on, the full Skill instructions shape every subsequent token.
  5. The LLM generates code under the Skill's constraints. With instructions like "error handling must use the error-wrapping pattern" and "use a structured logging library" present in the context, the generated code tends to reflect them. This is not the model "understanding" the standard; it is the standard's presence reshaping the probability distribution.
  6. The LLM decides to call tools; the Agent executes. When the model decides it needs to write a file or run tests, it emits tool-call requests; the Agent runs them and folds the results back into the context.
  7. The loop continues; new Skills load on demand. When the task moves into a new phase (say, from "writing code" to "writing tests"), the LLM may pick a new Skill from the summary list, and the Agent loads its full content while unloading Skills no longer needed. That is step-level disclosure in action.
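
Steps 2 through 4 of the loop above can be sketched in a few lines. This is a toy under stated assumptions: the Skill type, the Keywords field, and every function name are hypothetical, and chooseSkill uses keyword matching purely as a stand-in for the LLM's decision in step 3. A real Agent would send the task and the menu to a model and act on its answer.

```go
package main

import (
	"fmt"
	"strings"
)

// Skill is a hypothetical record: a one-line summary plus the full body.
type Skill struct {
	Name        string
	Description string   // the summary shown in the menu
	Keywords    []string // used only by the stub chooser below
	Body        string   // full instructions, loaded on demand
}

// buildMenu renders one line per Skill: the summary list from step 2.
func buildMenu(skills []Skill) string {
	var b strings.Builder
	for _, s := range skills {
		fmt.Fprintf(&b, "%s: %s\n", s.Name, s.Description)
	}
	return b.String()
}

// chooseSkill stands in for the LLM's choice in step 3.
// Keyword matching is a placeholder, not how a model decides.
func chooseSkill(task string, skills []Skill) (Skill, bool) {
	lower := " " + strings.ToLower(task) + " "
	for _, s := range skills {
		for _, kw := range s.Keywords {
			if strings.Contains(lower, kw) {
				return s, true
			}
		}
	}
	return Skill{}, false
}

// buildContext assembles what the model sees next: the menu always,
// the full body only for the chosen Skill (step 4).
func buildContext(task string, skills []Skill) string {
	var b strings.Builder
	b.WriteString("Available skills:\n")
	b.WriteString(buildMenu(skills))
	if s, ok := chooseSkill(task, skills); ok {
		fmt.Fprintf(&b, "\nLoaded skill %s:\n%s\n", s.Name, s.Body)
	}
	fmt.Fprintf(&b, "\nTask: %s\n", task)
	return b.String()
}

// library returns an invented Skill set for the example.
func library() []Skill {
	return []Skill{
		{
			Name:        "go-coding-standard",
			Description: "Go coding conventions, including error-handling patterns, logging format, package naming",
			Keywords:    []string{" go "},
			Body:        "Error handling must use the error-wrapping pattern.\nUse a structured logging library.",
		},
		{
			Name:        "security-review",
			Description: "Security review checklist, including injection defense and authentication checks",
			Keywords:    []string{"security"},
			Body:        "Check every input path for injection.",
		},
	}
}

func main() {
	task := "Help me write a Go HTTP server with proper error handling and structured logging."
	fmt.Print(buildContext(task, library()))
}
```

The security-review body never enters the context for this task. Only its one-line summary does, which is the whole point of handing over the menu first.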

This is the full picture of progressive disclosure: hand over a "table of contents" first, then "open the specific chapter" only when needed. The Agent does not blast every Skill's full body into the context up front—that would blow the window and dilute the influence of every instruction in it. Instead it offers a lightweight "menu" and lets the LLM decide what it actually wants opened.

Notice the division of labor in this flow: the LLM decides (which Skill to choose, which tool to call, what code to generate), and the Agent executes (loading Skills, invoking tools, managing context). The LLM is the brain; the Agent is the hands and feet. The LLM says "I need the Go coding standard," and the Agent fetches it. The LLM says "write to this file," and the Agent writes it.

One more thing worth being clear about: a Skill does not "command" the model to do something—it changes the context, and through the context shifts the probability distribution. With the coding standard Skill loaded, the Agent's code follows the standard most of the time, but not by guarantee. A Skill makes "the right thing" more likely; it does not eliminate the possibility of "the wrong thing" being generated.

5.4 Skill Orchestration: When Multiple Capability Packs Need to Cooperate

In simple cases, a single Skill is enough for one task. Real engineering tasks rarely stay that simple.

"Implement this feature, write the code, write the tests, then do a self-review." That single sentence already pulls in three Skills: the coding standard Skill, the testing standard Skill, and the code review Skill. All three need to cooperate inside the same task.

Multi-Skill cooperation introduces a few problems.

Order of execution. In what order should the three Skills come into play? Is it write-then-test-then-review (waterfall), or write a bit and test a bit (TDD-style)? If you let the Agent decide on its own, sometimes it will pick something reasonable, sometimes it will not. If you encode the order in predefined orchestration logic, you lose flexibility.

Conflicting instructions. This is the hardest problem in multi-Skill cooperation. Your coding standard Skill says "function bodies must not exceed 50 lines." Your performance optimization Skill says "reduce function-call overhead; inline critical paths where possible." When the Agent is writing a performance-critical function, who does it listen to?

A model facing contradictory instructions behaves unpredictably. It may follow whichever appears first (because it sits earlier in context), or whichever appears later (because attention weights tilt toward it), or attempt a compromise, or simply ignore the conflict altogether.

Common mitigations today include priority declaration (the security Skill has top priority, the style Skill has lower priority), scope isolation (the coding standard Skill only takes effect during the "write code" phase, the testing standard Skill only during the "write tests" phase), and conflict detection (when multiple Skills are loaded, conflicts in their instructions are flagged so a human resolves them). None of these is elegant—conflicting instructions are at heart a "multi-constraint satisfaction" problem, and the Agent's ability to handle that kind of judgment is still far from mature.
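
Two of these mitigations, priority declaration and scope isolation, are simple enough to sketch. The Skill fields and the activeSkills function below are hypothetical, and the priority numbers are an invented convention (higher wins). Conflict detection is left out, because deciding whether two natural-language instructions actually contradict each other is the hard part.

```go
package main

import (
	"fmt"
	"sort"
)

// Skill carries just the metadata needed for this sketch.
type Skill struct {
	Name     string
	Priority int      // higher wins; an invented convention
	Phases   []string // phases where the Skill applies; empty means all
}

// appliesTo reports whether the Skill is in scope for the given phase.
func appliesTo(s Skill, phase string) bool {
	if len(s.Phases) == 0 {
		return true
	}
	for _, p := range s.Phases {
		if p == phase {
			return true
		}
	}
	return false
}

// activeSkills filters by scope, then orders by priority, highest first.
// Skills with equal priority keep their original relative order.
func activeSkills(skills []Skill, phase string) []string {
	var active []Skill
	for _, s := range skills {
		if appliesTo(s, phase) {
			active = append(active, s)
		}
	}
	sort.SliceStable(active, func(i, j int) bool {
		return active[i].Priority > active[j].Priority
	})
	names := make([]string, len(active))
	for i, s := range active {
		names[i] = s.Name
	}
	return names
}

func main() {
	skills := []Skill{
		{Name: "go-coding-standard", Priority: 10, Phases: []string{"write-code"}},
		{Name: "testing-standard", Priority: 10, Phases: []string{"write-tests"}},
		{Name: "performance-optimization", Priority: 50, Phases: []string{"write-code"}},
		{Name: "security-review", Priority: 100}, // applies in every phase
	}
	fmt.Println(activeSkills(skills, "write-code"))
	// [security-review performance-optimization go-coding-standard]
	fmt.Println(activeSkills(skills, "write-tests"))
	// [security-review testing-standard]
}
```

Ordering alone does not make the model obey the higher-priority Skill. An Agent would still have to state the precedence in the injected text, and the model would still follow it only probabilistically.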

Competition for context space. Loading several Skills at once means each one occupies space in the context. There is a delicate trade-off here: the more detailed a Skill's instructions, the better the Agent follows them, but the more context space they consume. In a multi-Skill setting, you may have to choose between "writing every Skill in detail" and "loading enough Skills at once." A practical rule of thumb: write the core Skill in detail; keep auxiliary Skills lean.

Progressive disclosure pays off again here. Rather than loading three Skills simultaneously in full, load different Skills at different stages: the coding standard Skill while writing code, then the testing standard Skill when writing tests, then the review Skill when reviewing. Each stage has only one Skill in context, so the space pressure relaxes, and conflicts between Skills from different stages disappear, because those Skills are never in context together.

Stage switching is moving things around inside the same context. A more thoroughgoing solution, now becoming mainstream, pushes isolation further: hand the complex Skill to a sub-agent running in its own context. A Skill like code review, which carries a full multi-step workflow and needs to walk through files one by one, is a natural fit for a separate sub-agent task. The orchestrator agent passes the review request and the necessary inputs to the sub-agent; the sub-agent loads the full review Skill in its own context, calls the tools it needs, produces a report, and returns only the conclusion to the orchestrator. The thousands of tokens of review-Skill instructions, the source code read along the way, and the intermediate analysis never pollute the orchestrator's context. The instruction-conflict problem also recedes down this path: different Skills are not even in the same context, so their instructions never compete inside one window. Claude Code's Sub Agents take exactly this route, and it is not a separate mechanism from Skills; it is the same idea expressed at a different scale. Progressive disclosure layers things inside a single context; sub-agents push the disclosure boundary out across separate contexts.
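
The isolation described above can be sketched with two separate context objects. Everything here is hypothetical: Context is a stand-in for a model context window, word counts are a rough proxy for tokens, and the "review" is a stub that counts files instead of calling a model. The point is only to show what crosses the boundary and what does not.

```go
package main

import (
	"fmt"
	"strings"
)

// Context is a stand-in for a context window: an ordered list of entries.
type Context struct{ entries []string }

// Add appends one entry to the context.
func (c *Context) Add(s string) { c.entries = append(c.entries, s) }

// Words is a rough size proxy; real systems count tokens, not words.
func (c *Context) Words() int {
	n := 0
	for _, e := range c.entries {
		n += len(strings.Fields(e))
	}
	return n
}

// runReviewSubAgent loads the full Skill and every file into its OWN
// context and returns only a short report plus how much context it used.
func runReviewSubAgent(skillBody string, files map[string]string) (string, int) {
	var sub Context
	sub.Add(skillBody)
	for name, src := range files {
		sub.Add(name + "\n" + src)
	}
	return fmt.Sprintf("reviewed %d files", len(files)), sub.Words()
}

// orchestrate keeps the orchestrator's context small: it holds the task
// and the sub-agent's conclusion, never the Skill body or the sources.
func orchestrate(task, skillBody string, files map[string]string) (Context, int) {
	var main Context
	main.Add(task)
	report, subWords := runReviewSubAgent(skillBody, files)
	main.Add("sub-agent report: " + report)
	return main, subWords
}

func main() {
	files := map[string]string{
		"a.go": "package a\n\nfunc A() {}",
		"b.go": "package b\n\nfunc B() {}",
	}
	skill := "step one read the PR description step two read the diff"
	ctx, subWords := orchestrate("review this PR", skill, files)
	fmt.Println("orchestrator words:", ctx.Words()) // orchestrator words: 8
	fmt.Println("sub-agent words:", subWords)       // sub-agent words: 23
}
```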

5.5 Design Philosophy and Limits

When you design a Skill, there is one fundamental choice: should the Skill be declarative or procedural?

A declarative Skill describes the "target state"—"error handling must use the error-wrapping pattern," "test coverage must not drop below 80%." The Agent decides on its own how to reach those targets. Flexible, but uncertain.

A procedural Skill describes the "execution steps"—"step one, read the source file → step two, locate the error-handling code → step three, check whether error wrapping is used → step four, generate the fix." Controllable, but rigid.

The most effective Skills in practice are usually hybrid—declarative at the high level for goals and constraints, procedural at the critical-path steps for concrete operations. A code review Skill, for example: the declarative part defines "reviews must cover correctness, error handling, security, performance, and maintainability," while the procedural part defines "first read the PR description → then read the diff → review file by file → produce the report." The critical workflow is fixed, ensuring no important step is skipped; the specific judgments are flexible, so the Skill can adapt to different code and different scenarios.

Finally, Skills have limits worth being honest about.

A Skill cannot replace model capability. If the model is not good at a class of task, no amount of Skill engineering will fix that. What a Skill can do is "guide the model to use the capabilities it already has in the right way." It cannot "give the model capabilities it does not have."

A Skill's effect is probabilistic. This thread runs through the whole chapter—Skills shape probability distributions through context, and probability is not certainty. In edge cases the model can still generate output that violates the Skill's instructions.

A Skill needs ongoing maintenance. Coding standards get updated. Architectural patterns evolve. If a Skill drifts away from the current reality—the standard has been updated, the Skill still encodes the old version—the Agent will write code against an outdated standard. This kind of "Skill rot" is gradual and silent, and it requires regular review and updates.

A side note worth flagging: the word "Skill" does not mean exactly the same thing in every product. What this chapter describes is closer to Skills in the Claude Code sense—SKILL.md directories, on-demand loading, progressive disclosure, organized at roughly individual or team granularity. OpenAI Codex also uses the name "Skills," but its product framing leans more toward organization-level capability accumulation—encoding a team's workflows, standard documents, and operational standards so the whole organization can reuse them. The two are solving different layers of the same problem, but the underlying design principle is the same: bundle behavioral constraints and reference resources into a package, inject the package into context on demand, and make the Agent "do things in a particular way."

There is one more boundary that gets blurred easily and is worth stating directly: the model never loads a Skill by itself. In a configuration-driven setup, a human or an external orchestrator decides which Skills to load before the model is called. In the summary-driven flow described in 5.3, the model can ask for a Skill by name, but that is only a request; the Agent fetches the content and injects it into the context. Either way, what the model actually does is operate inside the constraints and resources of a Skill once that Skill's content has entered the context. This is genuinely different from how a Tool works: a Tool call is an operation that runs and returns a result, whereas loading a Skill changes the context that every later step of generation is conditioned on.


Let's look back at the path traced through Volume Two. Chapter 3 gave the Agent the ability to "do things." Chapter 4 gave it standardized tools. This chapter has given it a "way of doing things"—through progressive disclosure, Skills inject the right knowledge into the model at the right moment and reshape its behavior pattern.

But the more complete the Agent's capability stack becomes, the sharper a structural tension grows: the more roles a single Agent takes on, the more crowded its context becomes—tool descriptions, Skill instructions, task history, intermediate results, all crammed into the same window. The wider its decision space, the lower the probability of making the right decision. That tension does not go away as models get better, because it is a direct consequence of the context window being finite.