11. Spec-Driven Development: When Requirements Become Engineering Artifacts
Monday morning. You sit down in Cursor and tell the AI: add a timeout-and-retry to this HTTP client.
Round one comes back with a flat three retries, one second apart. You read it and say no, make it exponential backoff: one second, two seconds, four seconds. It rewrites. You look again and say the total wait must not exceed thirty seconds; past that, fail outright. It rewrites again. In round four you remember the endpoint is non-idempotent, so you ask it to add a request ID for deduplication. Round five, you tell it to emit metrics: retry count and final outcome. Round six it looks fine. You merge.
Three months later, a teammate inherits the file. What they see is the merged code: exponential backoff, the thirty-second cap, the request-ID dedup, the metrics, all sitting there in neat rows. What they cannot see is the six rounds it took to get there — why exponential and not fixed, why thirty seconds and not sixty, why dedup belongs there, how those metric fields were chosen. Every "why" from those six rounds has evaporated. Or worse: the inheritor is not a teammate, it is another AI session. It reads the same code. The six rounds of conversation are just as invisible to it. The next time someone asks it to wire timeout-and-retry into a neighboring endpoint, it reads this file, and it almost certainly does not write the new one the same way. It will pick something that looks reasonable to it, by the style it can see, which may or may not match what you originally meant.
This looks like nothing more than forgetting to write documentation. But traditional software teams never recorded the verbal back-and-forth in a code review either; those whys also lived only between the reviewer and the author, in a conversation. That used to be fine. The moment you swap one side of that conversation for an AI, it stops being fine. Why?
11.1 Where That Conversation Went
Start by sharpening the picture from the opening.
The six rounds themselves are nobody's idea of an artifact. You do not paste them into a wiki. You do not attach them to the PR. What you do is take the polished result, merge it, and close the chat window. Once the window is gone, so is the process.
Nothing is technically lost in the code: the implementation is still there, the behavior is still there, everything still runs. What is lost is the discussion behind every "why." When you stare at the merged code, it looks like it could only have been written that way, because its current logic is reasonable. But it ended up reasonable because in those six rounds you eliminated the unreasonable options one by one. The elimination is gone. What remains is a final state that looks self-evidently right.
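To make that concrete, here is a minimal sketch of what the merged result of those six rounds might look like. It is an illustration, not the code from the story: call_with_retry, RetryBudgetExceeded, the send callable, and the metric keys are all hypothetical names, and counting only sleep time against the thirty-second cap is an assumption. Every line records a decision. None records the reason for it.

```python
import time
import uuid
from typing import Callable, Dict, Optional


class RetryBudgetExceeded(Exception):
    """Raised when the next backoff would exceed the total wait budget."""


def call_with_retry(
    send: Callable[[str], str],
    base_delay: float = 1.0,
    max_total_wait: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    metrics: Optional[Dict[str, object]] = None,
) -> str:
    """Call send(request_id), retrying on TimeoutError with exponential backoff."""
    metrics = metrics if metrics is not None else {}
    # The same ID is reused on every attempt so the server can deduplicate.
    request_id = str(uuid.uuid4())
    waited = 0.0   # assumption: only sleep time counts against the budget
    attempt = 0
    while True:
        try:
            result = send(request_id)
        except TimeoutError as exc:
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            if waited + delay > max_total_wait:
                metrics["retry_count"] = attempt
                metrics["final_outcome"] = "failure"
                raise RetryBudgetExceeded(
                    f"gave up after waiting {waited:.0f}s"
                ) from exc
            sleep(delay)
            waited += delay
            attempt += 1
        else:
            metrics["retry_count"] = attempt
            metrics["final_outcome"] = "success"
            return result
```

With these defaults, a call that always times out sleeps 1, 2, 4, and 8 seconds and then fails, because a further 16-second sleep would bring the total to 31. A reader can recover that behavior from the code. They cannot recover why the base is 2 or why the cap is thirty.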
Did this process exist in traditional software work? It did. It just lived somewhere else. Two engineers go back and forth in a PR review thread, reach a conclusion, the PR merges, and the thread sits in GitHub or GitLab forever. The "why" never goes into the code, but it has a storage location bound to that exact code change. Three months later you can click into the PR and read it again.
Working with an AI today, that storage location does not exist. The chat window is a session-scoped, ephemeral container — close it and it is gone. The repo holds only the merged final state. The PR review thread holds only the review of that merge action, not the six rounds with the AI behind it.
Could you write a record after the fact? In theory yes. You finish the code, spend five more minutes summarizing why you made each call, and either paste it into the PR description or have the AI summarize it. Anyone who has tried this seriously a few times knows it does not survive contact with real work. After-the-fact and at-the-moment are not the same thing. By the time the code is done you are tired, your attention is fully on the final state, and what you remember most clearly is round five and round six — the early reversals are already blurred. And that summary has no structural binding to the code. It is just a paragraph in a PR description, and the next AI session reading this file will not read it. Worse, no team is going to enforce discipline around this; the default is "code review passed, we are done."
So the six rounds really do evaporate, and the after-the-fact patch path does not survive in practice. But traditional development was full of execution results with no recorded process and that was somehow fine. Why is it not fine anymore?
11.2 Who Used to Fill in What Nobody Said Out Loud
Look back at how traditional collaboration actually worked.
A PM hands a developer a one-pager: "after login, show personalized recommendations." Can that line be implemented straight from the page? Not even close. It does not say what to recommend, how many items, what to do on failure, what to show when the user has no history, what the cache strategy is, what the loading state looks like. None of it is on the page.
But the developer does not immediately go pin down the PM. They fill in the gaps themselves. They draw on the product's history, the team's coding conventions, what the company does on similar surfaces, five or eight years of accumulated engineering judgment. "Personalized recommendations" gets translated, in their head, into a concrete set of choices: which service to call, which cache, which fallback, which error codes, where the logs go, which three or four metrics to emit. Some of those choices end up in a doc, others go straight into code.
Between the fuzzy requirement and the concrete code, there is this filling-in step. The developer's head is an invisible translation layer that turns fuzzy into concrete, using project history, engineering experience, team habits, and a feel for the business. The better that layer is, the more roughly the requirement can be written. A title-level requirement handed to a senior internal developer becomes shippable software. The same requirement handed to a fresh hire takes five or six rounds before it is even close.
So traditional software collaboration ran on an unspoken truth: the precision of a requirement is not produced by the document, it is filled in by the developer's experience. Documents could be vague and code could still come out precise, because the vagueness was being absorbed by the developer's head. Now swap "the developer" for "the AI."
The AI has no project history. It does not know what conventions were left behind by the refactor six months ago. It has no depth of engineering experience: it has seen ten thousand timeout-retry implementations, but it does not know which one passes muster in your codebase. It has no team-level tacit knowledge: it does not know your error codes are five digits, does not know where you put logs, does not know you have already decided not to take on new dependencies. Most importantly, it has no professional judgment as a stable property. Hand the same fuzzy requirement to it ten times and you will get ten reasonable but mutually inconsistent answers. Today it picks A; tomorrow it picks B; when somebody else edits the file later, it gets pulled toward C. The filling-in is unstable.
This is not a matter of the AI not being smart enough. Its language ability, its coding ability, its reasoning ability are already past the level of a fresh new hire. What it lacks is what a person who has worked in this project for a year carries — the kind of background knowledge it cannot accumulate by itself, because every session starts from zero.
The vagueness that used to be tolerable is no longer tolerable. Somewhere between the fuzzy requirement and the precise code, something has to absorb that conversion. Either the vagueness leaks into the code and the code grows crooked, or the requirement itself becomes precise.
So is the answer just "write a more detailed PRD"?
11.3 The PRD Isn't Failing Because It Wasn't Detailed Enough
The PRD has not exactly been left untended. The format has been polished, templated, and proceduralized for over a decade. A mature PRD can run thick: user stories, acceptance criteria, state diagrams, interface contracts, error scenarios — all the things you might think to include are in there.
Hand that PRD straight to an AI and it is still not enough. The reason is not that the PRD is not detailed enough. It is that the PRD rests on four underlying assumptions, all of them designed for a human recipient, and none of them survive the move to AI collaboration.
The first assumption is that a PRD is written, then archived. A PRD's lifecycle is: drafted, reviewed, signed off, filed away. The developer reads it, internalizes it, and goes off to work. It is not consulted at the moment of writing each line of code, because the developer has already digested it. The AI does not digest. Every moment it is generating code, it is reading from the context immediately in front of it. A PRD that is not in that visible context might as well not exist. "Archive" is, for AI collaboration, the equivalent of cutting the power.
The second assumption is that the PRD's unit is a feature. A typical PRD describes "build a personalized recommendations module" or "build a user authentication flow" — a complete product feature. The actual unit of AI collaboration is a change: add a field, alter a piece of logic, patch an edge case. Those two units differ by an order of magnitude. You are not going to write a PRD just because some function needs a new timeout parameter. But that kind of change is exactly what AI gets asked to do most often. The PRD does not reach down that far, and the layer below is left exposed, filled in only by chat — which puts you back in the evaporation scene from the opening.
The third assumption is that the PRD does not go through review. A PRD change does not require a PR. There is no such thing as a "diff against the previous version of the PRD." It is owned by product; engineering is just a consumer. In an AI-collaboration setting the coupling between requirements and code is too tight for that to hold: change a paragraph in the requirements, and the AI may rewrite a swath of code. If the requirement-level change cannot be reviewed the way code is reviewed, its real-world effect is invisible: you see the code PR, you do not see the requirement change that triggered it. A requirements document that does not pass through review, in AI collaboration, is a document that is silently drifting in the background forever.
The fourth assumption is that the PRD describes "what to do," not "what done looks like." A PRD says "the user should be able to log in"; it does not say which state counts as done. Form submitted does not count, token returned does not count, token written to storage and readable in the next request does count. "What to do" is enough for a person, because a person fills in the rest: they know what "done" means for a login endpoint. The AI does not fill in. It will stop somewhere that looks "close enough." If you do not explicitly tell it what done means, its "done" and your "done" are not even pointing at the same concept.
Put the four assumptions together and the picture changes: the PRD is not failing on quality, it is failing on audience. It was built for a human recipient who has professional judgment, project history, team-level habits, and the ability to fill in gaps. Every one of those holds for a person; not one of them holds for the AI.
So fine — drop the PRD format and write something new that closes all four gaps. Put it in the repo so the AI can see it. Slice it down to per-change granularity. Run it through review. Spell out done explicitly. Is that enough?
11.4 You Aren't Writing a Document. You're Signing a Contract with a Probability Cloud.
Hidden in that "fix the four gaps" plan is a fatal assumption: that the collaborator on the other side is still a person, just reading a different format.
The collaborator is not a person. It is a machine that picks the next word according to how many constraints you have given it.
Chapters 1 and 2 already covered how it works. Every word it produces is sampled from a distribution. Give it enough constraints and the candidate space narrows down to something tight; the output reads like a steady, reliable engineer. Give it too few constraints and the candidate space stays wide; it has to pick from several options that all "sound reasonable," and the pick itself is probabilistic. Today it picks A, tomorrow it picks B.
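A toy model can make the narrowing visible. This is deliberately not how a language model samples; it only shows the shape of the argument, with four hypothetical candidate implementations and a spec expressed as filter rules. All names here are invented for illustration.

```python
import random
from typing import Callable, Dict, List

Candidate = Dict[str, object]
Rule = Callable[[Candidate], bool]

# Four plausible answers to "add timeout-and-retry" (hypothetical labels).
CANDIDATES: List[Candidate] = [
    {"name": "fixed-3x-1s", "backoff": "fixed", "cap_seconds": None, "dedup": False},
    {"name": "exp-no-cap", "backoff": "exponential", "cap_seconds": None, "dedup": False},
    {"name": "exp-cap-30", "backoff": "exponential", "cap_seconds": 30, "dedup": False},
    {"name": "exp-cap-30-dedup", "backoff": "exponential", "cap_seconds": 30, "dedup": True},
]

# The constraints a spec would state explicitly.
SPEC: List[Rule] = [
    lambda c: c["backoff"] == "exponential",
    lambda c: c["cap_seconds"] == 30,
    lambda c: c["dedup"] is True,
]


def surviving(constraints: List[Rule]) -> List[str]:
    """Names of the candidates that satisfy every constraint."""
    return [str(c["name"]) for c in CANDIDATES if all(rule(c) for rule in constraints)]


def pick(constraints: List[Rule], seed: int) -> str:
    """Choose uniformly among the survivors; the seed stands in for 'today' vs 'tomorrow'."""
    return random.Random(seed).choice(surviving(constraints))
```

With no constraints, surviving([]) returns all four names and pick can land on any of them depending on the seed. With SPEC applied, only one candidate survives, so pick returns the same answer for every seed.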
Where it is stable is on the parts it knows well — syntax, common APIs, generic patterns. Where it is unstable is on the project-internal definition of "good enough": what counts as a passing timeout-retry in this codebase, what your error logs are supposed to look like, what counts as introducing a new dependency. None of that is built in. You have to supply it. If you do not, the model is blindly picking among reasonable-looking candidates.
The consequence of this is not "write the document with more detail." It is that the structure of the spec has to be different.
First, you have to write down what done means.
Not because the AI lacks professional judgment, but because "done" is an open-ended concept until you pin it down. "Add timeout-and-retry" — done could mean the code compiles, or the unit tests pass, or the integration tests pass, or all three plus the metrics are wired up. All of those sound plausible; the model can only walk in one direction. Writing down the completion criteria closes the open-endedness: "done" means the code is merged, unit tests cover this path, error logs are structured, three named metrics are emitted, each of those checkable. When the model samples its way to "I'm done," it has a concrete checklist to compare against, not a vague sense of direction.
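One way to picture a pinned-down "done" is as a list of named checks over facts gathered from CI. The sketch below assumes such facts are available as a plain dictionary; the Criterion class, the fact keys, and the three metric names are hypothetical, chosen only to mirror the checklist in the paragraph above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Facts = Dict[str, object]


@dataclass(frozen=True)
class Criterion:
    name: str
    check: Callable[[Facts], bool]


# Hypothetical metric names; the text only says "three named metrics".
REQUIRED_METRICS = {"retry_count", "final_outcome", "total_wait_seconds"}

DONE: List[Criterion] = [
    Criterion("code is merged", lambda f: f.get("merged") is True),
    Criterion("unit tests cover the retry path",
              lambda f: f.get("retry_path_covered") is True),
    Criterion("error logs are structured",
              lambda f: f.get("structured_logs") is True),
    Criterion("three named metrics are emitted",
              lambda f: REQUIRED_METRICS <= set(f.get("metrics", ()))),
]


def unmet(facts: Facts, criteria: List[Criterion] = DONE) -> List[str]:
    """Return the names of the criteria that are not yet satisfied."""
    return [c.name for c in criteria if not c.check(facts)]
```

"Done" then has exactly one meaning: unmet(facts) returns an empty list. Anything else is a named gap rather than a vague sense of direction.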
Second, every criterion has to be verifiable.
Not because the AI cannot self-evaluate, but because un-checkable criteria do not converge on a word-picking machine. "Write clean code." Every time that phrase is read, it activates a different slice of training data, and "clean" lands on different concrete features in each slice. Today it converges on short functions and shallow nesting. Tomorrow it converges on extensive comments and long names. It is not being capricious; the phrase itself is wandering between several interpretations. A verifiable criterion shuts the wandering down: "function bodies under forty lines"; "every error return must wrap the underlying context." Each of those is a closed judgment. There is no way to satisfy it in two opposite directions at once.
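The difference shows up as soon as you try to automate the check. "Clean code" cannot be turned into a function; "function bodies under forty lines" can. A minimal sketch using Python's standard ast module, assuming the body is measured from its first statement to the function's last line (the helper name is hypothetical):

```python
import ast
from typing import List


def functions_at_or_over(source: str, limit: int = 40) -> List[str]:
    """Return names of functions whose body spans `limit` lines or more."""
    offenders = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Body length: first body statement through the function's last line.
            body_lines = node.end_lineno - node.body[0].lineno + 1
            if body_lines >= limit:
                offenders.append(node.name)
    return offenders
```

The result is either an empty list or a list of names. There is no reading of the criterion under which both outcomes count as passing.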
Third, you have to explicitly forbid things it might do.
Not because the AI is undisciplined, but because without an explicit no-go zone, its candidate set will include actions you assumed it would never take. Delete existing code, introduce a new dependency, rename a public function: every one of those appeared in its training data, often. Without a forbidden list, the moment it hits a "this would be more elegant if we just refactored…" temptation, the probability that it walks over the line is non-zero. The forbidden list exists not because the AI loves making trouble, but because naming the line is the most direct way to push that probability toward zero, and it gives review something concrete to check against.
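A forbidden list is also checkable after the fact. The sketch below compares a file before and after a change and reports two of the actions named above. It is an approximation under stated assumptions: "public" means a top-level function without a leading underscore, and "new dependency" is approximated by "new top-level import," which would also flag standard-library modules. The function names and the fancyretry module are hypothetical.

```python
import ast
from typing import List, Set


def _public_functions(source: str) -> Set[str]:
    """Top-level function names that do not start with an underscore."""
    return {
        node.name
        for node in ast.parse(source).body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and not node.name.startswith("_")
    }


def _imported_modules(source: str) -> Set[str]:
    """Top-level package names imported anywhere in the file."""
    modules: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return modules


def forbidden_actions(before: str, after: str) -> List[str]:
    """List violations of two rules: keep public names, add no imports."""
    problems = []
    for name in sorted(_public_functions(before) - _public_functions(after)):
        problems.append(f"public function removed or renamed: {name}")
    for module in sorted(_imported_modules(after) - _imported_modules(before)):
        problems.append(f"new import introduced: {module}")
    return problems
```

For example, if a change renames fetch to fetch_data and adds import fancyretry, the function reports both; a change that only edits function bodies reports nothing.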
Put the three together and the new spec is structurally a different thing from the old PRD. The old PRD describes "what to do." The new spec defines "what done is," "what done looks like," and "what cannot be touched." It is no longer a narrative product description. It is a checkable, verifiable, completion-defined, no-go-zone-bounded engineering contract.
OpenSpec, and the family of tools around it, makes its first appearance here. Why those tools insist on the shape they do — a forced spec → tasks → acceptance criteria triad, every task written as a checkbox, the completion state pulled out as its own field — is not a product preference. It is not for the look of professionalism. It is what the three constraints above force into existence. If a product has to carry specs that are structurally different, this is the shape it ends up in: you want specs to be checkable, so tasks become a checklist; you want done to be explicit, so acceptance criteria become a separate slot; you want the two kinds of information separated, so spec and tasks become layered. Spec Kit takes a similar route — different directory structure, same skeleton.
These look like new tools. They are, more accurately, the carrier of this structure. Which product you pick is not the point. What matters is why this structure had to be designed this way at all.
11.5 Which Kind of Spec Are You Talking About?
Picture the "spec" you had in your head a moment ago. Which one was it? The one from the previous section — the one that captures this timeout-retry change, describes what to do this round, what done looks like, then gets archived once the change merges? Or a long-lived document describing how the retry mechanism in this project actually works — one that lives alongside the retry code and changes whenever the code changes?
Most people have never separated the two, because day-to-day they are both called "spec." But their lifecycles are nearly opposite. Argue about whether to maintain "the spec" without separating them and the conversation runs in circles forever: you say "yes, maintain it," and someone counters with "the PR description gets archived after merge, why bother." You say "no, don't bother," and someone counters with "API documentation cannot be allowed to go stale." Both sides are right. They are not talking about the same thing.
The two kinds can be called change-level specs and capability-level specs.
A change-level spec is at the granularity of one change. The "add timeout-and-retry" task above, from kickoff to merge, corresponds to one change-level spec. In OpenSpec it lives under the changes/ directory; each change gets its own subdirectory containing files like proposal.md and tasks.md. It describes "this round, push the code from state A to state B; here is how we know we got there." Its lifecycle is single-shot: written, executed, merged, archived (OpenSpec ships an archive command for exactly this). After archive, it does not change.
If you want an analogy, it is the union of "PR description" and "commit record." It is not a living document. It is a decision archive pinned to a moment in time. Three months later, you read it to see why it was decided this way back then — not just what the current logic happens to be.
A capability-level spec is at the granularity of one long-lived capability. As long as "the retry mechanism in this project" exists, there is a corresponding capability-level spec. In OpenSpec it lives under the specs/ directory (a different directory from changes/). It describes the current shape of this capability: what parameters exist, where the boundaries are, what cannot be moved, how it degrades, what makes it throw. Its lifecycle equals the lifecycle of the code itself — when the code evolves, the spec must evolve with it. The closest analogy is the union of "API documentation" and "list of critical constraints." It is not a write-once-then-frozen artifact. It is a living document that must always reflect the present.
The two have nearly opposite lifecycles. One is single-shot, has a terminus, gets frozen on completion. The other evolves continuously, lives as long as the code, must always be the current truth. Mash them into a single conversation about "do we maintain the spec" and of course no answer will ever fit.
OpenSpec, Spec Kit, Kiro and their kin carry both kinds in their directory structure; in OpenSpec's layout, changes/ serves one and specs/ serves the other. But because the outer label is unified ("specs," "the spec"), first-time users often collapse the two into a single concept and then run into "do we just throw it away after writing it?" That confusion comes entirely from asking the question of both at once.
Separated, the answer falls out cleanly — and the answers point in opposite directions. Once the lifecycle distinction is clear, you also get a direct working rule: which kinds of code change should go through a spec, and which should not. Teams argue about this constantly — "every change must go through one" versus "that's too heavyweight." The real answer is not "strict" or "loose." It is what kind of thing this change touches.
Renaming a variable, fixing a typo, tweaking a magic number — none of that touches any capability-level spec, and the discussion behind it has nothing worth re-reading three months from now. Just edit the code. No spec needed.
Changing a retry policy, adding a new parameter, redefining the completion criterion: that touches what a capability-level spec is supposed to describe. It must go through specs: first update the capability-level spec to describe the new state, then change the code to implement that state, then file a change-level spec describing why we made this jump. The capability-level update prevents the description of "current state" from drifting; the change-level spec preserves the archive of why we did it.
This boundary is the same idea as the "collaboration half-life" check coming up in 11.7, just stated differently: does the meaning of this change still need to be understood three months from now? If yes, go through a spec. If no, just edit the code.
11.6 Why Both Contracts Have to Live in Git, but for Different Reasons
There is a line that gets repeated in traditional software work: the code is the truth. All truth eventually lives in the code; documents can be vague, stale, lost.
In AI collaboration, that line breaks. The next time the AI edits this file, what it reads is not just the code — it is also the spec describing this code. The code is still the ultimate truth, but code alone is not enough. The code can tell it "this is how it is currently written." Code cannot tell it "this is why it is written this way," and code cannot tell it "these are the constraints this implementation must not violate." Both of those have to come from the spec. A spec that is not in the repo means the code is missing half of its collaboration signal.
So both kinds of spec have to live in Git. Right direction — but the two kinds have to live in Git for different reasons.
Change-level specs go into Git to anchor a decision archive at a point in history.
Their value comes from being bound to that exact code change. Half a year later another developer (or another AI session) reads retry_delay = 2 ** attempt and wants to know why exponential and not fixed, why base 2 and not 1.5, why thirty seconds. They are not going to dig through a wiki. They follow the Git history back to the change that introduced this line, find changes/2024-Q3-add-retry/proposal.md, and read the decision that was made at that moment.
That binding is the whole reason change-level specs must be in Git. A decision archive that lives outside Git — say, in Notion or Confluence — has no time anchor. It says "back then we decided X," but back when, exactly? Against what state of the code? You cannot replay it. Git provides that binding: every change-level spec belongs to the same commit / PR as a code change, and three months later you can line up the code state and the decision precisely.
After it is in Git, it stops changing. Like a commit log entry, it freezes on archive. Its value is in "what was decided at that moment." Not in "what it has become now."
Capability-level specs go into Git to evolve in lockstep with the code, so there is always one truth.
Their value comes from staying perfectly synchronized with the current state of the code. The next time the AI edits the retry mechanism, it reads specs/retry/spec.md to know the current constraints: max retries, what kinds of errors are retryable, how idempotency is enforced, which key metrics are emitted. The moment that document drifts an inch from the code, the collaboration intent the AI reads is already off.
We have already said it once: the staleness of code memory is not a function of time; it is a function of causality. A memory becomes stale not because it has been a long time since it was written, but because the reality it describes has changed. The same rule applies to capability-level specs. Whether one is stale or not is determined by whether the constraint it describes still holds, not by when it was written.
So the reason capability-level specs must live in Git is this: only being in Git lets them be reviewed and evolved with the code, in lockstep. A PR that changes the code without changing the spec should be blocked by review. A PR that changes the spec without changing the code should be questioned by review. That kind of forced binding is the only way to keep one truth under non-deterministic collaboration. Outside of Git, relying on people to remember to update Notion or the wiki, the two will desynchronize.
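The two review rules above are mechanical enough to sketch. The function below takes the list of paths a PR changes and returns a signal for the reviewer. The path src/retry/ is a hypothetical location for the retry code, and the sketch ignores the trivial edits from 11.5 (renames, typos), which a real check would need some way to exempt.

```python
from typing import Iterable


def review_signal(
    changed_paths: Iterable[str],
    code_prefix: str = "src/retry/",          # hypothetical code location
    spec_path: str = "specs/retry/spec.md",
) -> str:
    """Classify a PR by whether code and capability-level spec moved together."""
    paths = list(changed_paths)
    touches_code = any(p.startswith(code_prefix) for p in paths)
    touches_spec = spec_path in paths
    if touches_code and not touches_spec:
        return "block"       # code changed, spec did not
    if touches_spec and not touches_code:
        return "question"    # spec changed, code did not
    return "ok"              # both moved together, or neither was touched
```

The asymmetry is deliberate: a spec-only PR may be the legitimate first step of a spec-first change, so it earns a question rather than a block.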
And the entry point for changing a capability-level spec must be spec-first: change the spec to describe the new state, then change the code to match. The reverse (change the code first, patch the spec later) is the same after-the-fact documentation pattern that human-only teams already had, and it fails for the same reason: it depends on individual discipline, and there is no mechanism to enforce that discipline. Spec-first is not ceremony. It is the structural way to take "lazy" off the table.
11.7 Two Failure Modes
Spec-driven work is not free. There is maintenance cost, there is collaboration friction, and yes, it really can fall apart. There is more than one way for that to happen.
The first failure mode is rot, and it only happens to capability-level specs.
A capability-level spec lives or dies on its synchronization with the code. The instant the synchronization breaks, it begins to rot. Week one, code changes, spec does not, nobody calls it out. Month one, three out of ten PRs touch the code without touching the spec, review lets them through. Six months later, the spec file is still in Git, still looks like a serious document — clean structure, careful wording — but what it describes as "the current state" diverges from the actual code by a noticeable margin. The next AI session that reads this spec to edit the code is reading a misleading collaboration intent. It is trusting a document that lies.
A capability-level spec drifting from the code is collaboration-intent-level memory poisoning. What gets poisoned is "why this is done this way" and "what the boundaries are." It is not that the spec was poorly written. It can be written carefully, with clean structure and accurate phrasing. The problem is that nobody is on the hook for keeping it true over time. A capability-level spec cannot just be written. It needs a discipline that forces it to live and die with the code, and that discipline is the spec-first + same-PR-touches-both rule from 11.6. If that discipline lapses, rot starts.
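Where the constraints in a capability-level spec can be expressed as values, drift can at least be detected mechanically. The sketch below assumes both the spec and the code expose their constraints as plain dictionaries; how those values get extracted is left open, and the keys used in the example are hypothetical.

```python
from typing import Dict, List


def find_drift(spec_values: Dict[str, object], code_values: Dict[str, object]) -> List[str]:
    """Report every constraint on which the spec and the code disagree."""
    drift = []
    for key in sorted(set(spec_values) | set(code_values)):
        if key not in code_values:
            drift.append(f"{key}: in spec, missing from code")
        elif key not in spec_values:
            drift.append(f"{key}: in code, missing from spec")
        elif spec_values[key] != code_values[key]:
            drift.append(
                f"{key}: spec says {spec_values[key]!r}, code has {code_values[key]!r}"
            )
    return drift
```

If the spec says max_total_wait_seconds is 30 and the code has 60, the function returns one line naming that key and both values. A check like this catches only value-level drift; it says nothing about whether the reasons recorded in the spec still hold.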
Change-level specs cannot rot in this sense. They are one-shot archives by design; freezing on completion is the intended behavior, not a failure. A three-year-old change-level spec that no longer matches the current code is not stale — its job is to describe what was decided at that moment three years ago, of course it does not reflect the present. Calling that "rot" is conflating the lifecycles.
The second failure mode is theater. Both kinds can suffer from this, but they wear different costumes.
Change-level theater looks like this: every PR is forced to carry a change-level spec, but the contents are written after the fact. The code is already done. The developer goes into the changes/ directory, creates a file, and restates "what was done" in slightly more formal language with a few official-looking headers. It has no relationship to the actual decision process — that process happened in the six rounds of chat, and this archive is just the final result re-stated in a different format. Three months later, you go back and read it and you can see what was done, not why X and not Y, because the why was bypassed.
Capability-level theater wears a different costume. The team, for the appearance of professionalism or for compliance, writes specs that look immaculate: complete sections, full fields, the cadence of a serious engineering document. In practice, nobody works against them. The AI does not consult them (because nobody loads them into context). The developer does not update them (because nobody is checking). Review does not enforce them (because everyone has tacitly classified them as wallpaper). They get pulled out for compliance audits, technical reviews, external presentations. They look alive. They have already died.
The common thread between the two flavors of theater: writing the spec stops being something that affects the collaboration and becomes something done for show. It consumes the team's energy without buying any precision in return.
Looking at these two failure modes alone, the reflex is "fine, then don't do specs at all." That reflex is also wrong. Earlier sections already dug out why you have to do this, and those reasons do not go away just because there are failure modes. The real question is not "do specs, yes or no." It is "when is it worth doing them?"
The most tempting answer is a checklist by task type: exploratory tasks no, throwaway tasks no, immature domains no — the kind of list that sounds professional and turns out to be useless in practice. The reason is simple: nobody actually classifies their task on a checklist before deciding. When a developer picks up a task, they are not thinking "is this exploratory or non-exploratory?" They are thinking "do I roughly know how this should go, or not?" and "is this thing going to keep getting changed after?" Checklists are a post hoc taxonomy. They are not a pre-emptive judgment.
The honest pre-emptive judgment is one line: Will the meaning of this change still need to be understood three months from now?
If three months from now somebody (or another AI session) is going to come back to this code, modify it, depend on it — the collaboration half-life is long. Worth a spec. The change-level spec preserves "why we decided this." The capability-level spec keeps tracking "what this thing is now." Three months later both of them speak for you when you are not in the room. If three months from now this code has likely been replaced, rewritten, or simply fallen out of use — the half-life is short. Whatever got hammered out in the chat window is enough. The cost of writing a spec exceeds the precision it would buy.
Notice the criterion is not the kind of task, it is the half-life of this particular collaboration. Adding timeout-and-retry to a core payment path has a half-life of years; spec required. Adding the same timeout-and-retry to a weekend POC tool an intern wrote has a half-life of days; not required. The tasks sound identical. Their half-lives are not, and neither is the need for a spec. "Half-life" is a plainer criterion than a checklist, but applied to your own work it is harder, because it asks you to be engineering-honest about the thing you are doing right now: is anyone going to read this in three months? It is not a question you can fudge to yourself.
11.8 How This Relates to Memory
By now the line between "spec-driven" in this chapter and "memory boundary" from earlier chapters is starting to blur. On one side there are the files Chapter 8 covered: .cursorrules, AGENTS.md, CLAUDE.md. Each is a file in the repo root that says "this project uses TypeScript, naming convention is camelCase, error handling goes through a single exception base class, no new dependencies are introduced." It rides along with Git and gets auto-injected into every AI session. On the other side is the change-level spec from 11.5: a file under changes/, one per change, describing "this round, push from A to B, here is what done looks like," archived once merged.
Both get called "spec-driven." They are not the same animal. The lifecycles differ, the entry points differ, and the problems they solve differ.
| Dimension | Persistent Behavioral Preferences | Change-Level Spec | Capability-Level Spec |
|---|---|---|---|
| Granularity | Cross-cutting: affects every task | Vertical slice: one change | Vertical slice: one capability |
| Lifecycle | Long-lived, stable, occasionally updated | Single-shot, frozen on archive | Lives as long as the code, evolves continuously |
| Question it answers | How should the AI work in general | How do we push from A to B this time; what does done look like | What does this capability look like right now |
| Typical form | .cursorrules, AGENTS.md | OpenSpec proposals / task lists under changes/ | Specs under specs/ in OpenSpec |
| Entry point for changes | Occasional manual review and update | Written, then frozen; new changes go in new files | Spec first, then code |
| Analogy | Working agreement / team charter | PR description / commit record | API documentation / critical constraints |
Lined up like this, you can see the three columns sit at different layers. Behavioral preferences cover the base color of work: naming, errors, dependencies, log placement — every task in this project, whatever it is, runs against them. Change-level specs cover not the base color but this specific change in motion: where this code goes this round, what done means, why it was decided this way. Done, archive. Capability-level specs cover not the change either, but the long-lived capability the change affected: after the action lands, what does this capability now require? Next time someone changes it, this is the description they implement against.
So in any given project, persistent behavioral preferences are the always-on base color, change-level specs are the records each passing change leaves behind, and capability-level specs are the living documents for each long-lived capability. Three different granularities, three different cadences, three different readers.
Why are they all called "spec-driven"? Because the underlying move is the same: take an implicit agreement out of someone's head and turn it into an explicit engineering artifact. Behavioral preferences used to be the team conventions a developer carried in their head — when you worked with a senior engineer, you did not need to say "use the unified exception base class," they just did. AI collaboration does not have that tacit channel, so the preference has to become an explicit .cursorrules file. Both of this chapter's specs used to live in PR comment threads and ad hoc chat — when humans worked with humans, that "why" propagated naturally inside the conversation. AI collaboration does not propagate it, so it has to become explicit changes/ and specs/ artifacts.
All three are "implicit becomes explicit." The artifacts they produce are not the same. One is cross-cutting behavioral preference. One is a single-shot decision archive. One is the continuously evolving truth of a capability.
Stacked together, the division of labor is clear, and none of the three substitutes for any other. A new requirement comes in. You write a change-level spec describing what to do this time and what done looks like. While the AI executes, .cursorrules-style behavioral preferences run in the background — it knows naming, error handling, dependency policy without being told. In the foreground, it reads the change-level spec to know exactly which state to push the code into. After the change lands, the capability-level spec reflects what this capability has become. The next AI session that touches this code reads that spec to know the current constraints.
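The division of labor above can be sketched as the way a session's reading list might be assembled. This is a minimal sketch in Python under stated assumptions: the file names .cursorrules, AGENTS.md, and CLAUDE.md and the changes/ and specs/ directories come from this chapter, but the `collect_context` function, the one-subdirectory-per-change and one-subdirectory-per-capability layout, and the use of Markdown files are hypothetical and not necessarily how OpenSpec or any particular tool organizes things.

```python
from pathlib import Path

# Always-on preference files named in this chapter, looked up in the repo root.
PREFERENCE_FILES = (".cursorrules", "AGENTS.md", "CLAUDE.md")


def collect_context(repo: Path, change: str, capability: str) -> dict[str, list[Path]]:
    """Group the files a session would read, one list per layer.

    Assumed layout (hypothetical): changes/<change>/*.md and specs/<capability>/*.md.
    """
    # Background layer: cross-cutting preferences that apply to every task.
    preferences = [repo / name for name in PREFERENCE_FILES if (repo / name).is_file()]

    # Foreground layer: the spec for the one change currently in motion.
    change_dir = repo / "changes" / change
    change_spec = sorted(change_dir.rglob("*.md")) if change_dir.is_dir() else []

    # Current-truth layer: the living description of the capability being touched.
    spec_dir = repo / "specs" / capability
    capability_spec = sorted(spec_dir.rglob("*.md")) if spec_dir.is_dir() else []

    return {
        "preferences": preferences,
        "change_spec": change_spec,
        "capability_spec": capability_spec,
    }
```

Keeping the three lists separate rather than concatenating them mirrors the point of the section: each layer has its own lifecycle, so none of them can stand in for another, and an empty list tells you which agreement is still only in someone's head.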
11.9 When Requirements Become Engineering Artifacts
Starting from "where did those six rounds go" at the top of the chapter, the chain of questioning lands on one line:
The point of spec-driven development is to move collaboration precision from being an individual communication skill to being shared engineering infrastructure.
In traditional software collaboration, precision rode on "experienced developers fill in the gaps," which is an individual capacity. Hand the same fuzzy requirement to a senior engineer and to a fresh hire and the precision of the output differs by a wide margin. For decades, that has been treated as natural: of course teams lean on senior engineers, of course juniors lean on learning. With AI, that path stops working. The vagueness can no longer be quietly absorbed by an invisible head. It can only end up in one of two places: as disorder inside the code, or resolved into explicit deposits on the requirement side. Spec-driven development chooses the second.
You are not making the AI work to a spec. You are building, for a collaborator that cannot be fully controlled, a working contract that can be reviewed, replayed, and evolved. The AI is a probabilistic prediction machine, not a deterministic execution machine. Spec-driven development is the attempt, in collaboration with that machine, to take collaboration intent out of the air and deposit it as engineering artifacts.
The next chapter turns the lens from how we talk to the AI to what happens when it actually starts acting — when an agent can call real APIs and change real data, the trust boundary stops being something you can patch on later, and the security model has to be designed in from day one.