
13. Rebuilding Judgment: How AI-Written Code Should Be Reviewed

You ask the agent to fix an off-by-one in a pagination endpoint. It chews on it for a bit, hands back a diff, and adds a confidently-worded explanation: why it changed what it changed, which edge cases it considered, why this fix is correct. You skim the explanation, you skim the diff, looks fine, you merge.

A week later production alerts fire — the paginated endpoint is missing one record. You open the PR. The agent had changed len(items) - 1 to len(items) - 2. Its explanation is still sitting there, articulate, internally consistent. The only thing wrong with it is that the boundary condition it described was itself wrong.

In that moment you realize there were actually two different things mixed together in the agent's output. One was the real artifact — the code that gets compiled, executed, that affects production. The other was meta-information about that artifact — the explanation, the self-assessment, the commit message. The first can be verified by execution. The second can only be evaluated by a human. But your attention is naturally pulled toward the second, because explanations are cheaper to read than code, and because an explanation feels as if it has read the code for you.

This is not just one PR getting merged wrong. It points at a deeper observation: AI coding has one feature most other AI applications do not — its outputs are natively executable. A chatbot's reply can only be graded by another model. A search result can only be evaluated by a human for relevance. Marketing copy is left to subjective taste. Code is different. Code can be compiled, tested, linted, run in a sandbox against real input. Whether the model itself is reliable is one question. Whether the code runs, passes tests, and produces the right output in a real environment is a separate, deterministic one.

We have already discussed in-loop feedback — the agent inside a single task pulling itself back on track via the compiler, the type system, the linter. That was the coach view; the goal there was to keep the agent from drifting. This chapter is about the same idea at a different scale. The agent has already left the room. The code has been handed in. The PR is sitting in front of the team. What we are deciding now is no longer "what should the next step be" but "is this artifact allowed into the repo, allowed into production." When the model itself cannot be trusted, you take the judgment out of its output and hand it to things that do not lie.

13.1 The Cost Scissors Between Production and Judgment

Traditional software engineering has a judgment apparatus that has been running for decades: the author reads their own code, a teammate reviews the PR, CI runs tests and static checks, problems get caught in canary or rolled back. It is not a fancy stack. It works. AI coding has begun to break it. The intuitive read is "the AI is not smart enough, the AI makes mistakes." That is not actually the issue. The AI does make mistakes, but humans make mistakes too, and the traditional apparatus was designed for a coder who makes mistakes. The problem is not that the AI introduced errors. The problem is that the AI dropped production-side cost by an order of magnitude, while judgment-side cost did not move at all.

The marginal cost of writing a piece of code used to be measured in human-hours. An engineer takes a few days for a mid-sized module, half a day to file a respectable PR. That speed naturally bounded the rate at which PRs arrived, and naturally gave the judgment side enough time to keep up. Today an agent can produce the same piece of code in a few minutes; the submission frequency can be ten-plus times an engineer's. But a person still takes thirty minutes to review one PR, CI still takes ten minutes a run, canary still takes days to observe. Production-side throughput jumped, judgment-side throughput did not — and the gap keeps widening.

This is not an abstract economic claim. It shows up inside teams in four very concrete forms, and every one of them is already happening, quietly.

The first is attention dilution. An engineer used to file two or three PRs a week; the reviewer could keep up, was willing to read each PR end to end for thirty minutes. Now an agent in one engineer's hands can produce a dozen PRs a day. The reviewer is still one person, the day is still eight hours; the denominator went up by 10×, the attention each PR gets is squeezed down to a few minutes. With a few minutes, what can you actually do? Read the title, skim the diff, check that CI is green, click approve. Worse: AI-generated PR descriptions are exceptionally tidy — I did A, B, C; followed convention D; considered edge case E — and the reviewer reads the description and ends up with the illusion of having understood the PR, then glances at the diff and lets it through. The description itself can be wrong; the AI may have made it up to look reasonable. That off-by-one in the opening got in exactly that way.

The second is that the shape of bugs changed, and traditional tests do not cover the new shape. Human bugs concentrate at boundaries and details: off-by-one, null pointer, race conditions, leaked resources. Those bugs are local; edge-case tests catch most of them. AI-coding bugs sit in a different category. The AI will write find where filter was intended: the call looks the same, unit tests with a single matching record pass, but the semantics flip from "find all" to "find one." It will change a function signature in file A and miss the caller in file B; in a dynamically typed language nothing complains until that call actually runs, and the tests do not cover that path. It calls APIs that do not exist, uses long-deprecated library versions. It writes pass inside a try/except, swallowing the exception silently. A project at 80% coverage was a star student in the old era. In the AI era, that 80% can still be punched through, because the tests were designed against what humans get wrong, not against what the AI gets wrong.
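The find-versus-filter slip is easy to reproduce in a few lines. The sketch below is illustrative only: the helper names and the invoice data are hypothetical, and both helpers are given the same signature on purpose, to mirror the "looks the same" property. The point it shows is that a fixture with a single matching record cannot tell the two apart.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def select_all(items: Iterable[T], pred: Callable[[T], bool]) -> List[T]:
    """Intended behavior: return every matching item ("filter")."""
    return [x for x in items if pred(x)]


def select_first_only(items: Iterable[T], pred: Callable[[T], bool]) -> List[T]:
    """Buggy variant: stops at the first match ("find"), same signature."""
    for x in items:
        if pred(x):
            return [x]
    return []


def is_overdue(invoice: dict) -> bool:
    return invoice["days_late"] > 0


# A typical unit-test fixture: exactly one overdue invoice.
one_match = [{"id": 1, "days_late": 0}, {"id": 2, "days_late": 5}]
# A fixture that can distinguish "find one" from "find all".
two_matches = one_match + [{"id": 3, "days_late": 9}]

# Both variants agree on the weak fixture, so the suite stays green.
weak_fixture_agrees = (
    select_all(one_match, is_overdue) == select_first_only(one_match, is_overdue)
)
# Only the second fixture exposes the semantic flip.
strong_fixture_agrees = (
    select_all(two_matches, is_overdue) == select_first_only(two_matches, is_overdue)
)
```

A test written against the first fixture raises coverage without adding any power to catch this class of bug.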

The third is more insidious: CI is loosening, and nobody notices. When PR volume goes up 10×, the CI queue clogs into a parking lot, and the team will very naturally do two things: mark flaky tests as allowed-to-fail, push slow integration tests into the nightly run. Each of those moves has a perfectly reasonable justification — the queue is jammed, the iteration has to ship, headcount is what it is — and each one quietly weakens the judgment layer. Judgment apparatus does not collapse on a single day; it accumulates from a sequence of reasonable relaxations. This is the same pattern as the security drift in 12.4, except what is drifting here is the quality gate. And it is more day-to-day than security drift, harder to push back on, because every loosening looks like we are doing this so the team can move faster.

The fourth lives in the cognitive layer. In the old era, the coder was the first checkpoint: they typed every character, they had a judgment about every line, they were inside the judgment loop. In the AI era, the coder fires an instruction at the keyboard and watches the agent crank out work. They have moved from member of the loop to external acceptor. An external acceptor only sees output, not reasoning. Why the agent picked A over B is opaque to you. Where the agent started drifting is opaque to you. Reviewing the output of a system whose thinking you do not see, versus reviewing the output of someone whose judgment you trust, is the same action with completely different difficulty.

Stitched together, the four forms say one thing: the traditional judgment apparatus is not insufficient — its design premises have changed. Attention budget, bug shape, CI capacity, the coder's position — all four premises were built around the cost structure of a human writing code. AI coding has replaced the production-side cost structure. The judgment-side premises do not catch up automatically.

What follows? If the problem is that judgment-side cost structure has not kept up with production-side, the real direction is to make the judgment layer something that can be amortized, reused, and run as fast as production runs. Reviewing one PR by hand is linear cost. Reviewing a thousand is a thousand times that. But one lint rule, one test suite, one CI check — once written, running it once or running it a thousand times is essentially the same cost. Turning judgment from linear cost into amortized fixed cost is the only path by which the judgment layer's throughput can match the production layer's.

13.2 Hand the Judgment Over to Runtime

The previous section explained why the traditional judgment apparatus does not hold. This section is about how to rebuild it.

The starting point is not a tool choice. It is a judgment: judgment authority has to move off the two high-cost paths — model self-rating and human line-by-line review — and onto mechanisms that can be engineered, amortized, and run as fast as production runs. AI coding has a natural advantage here that pure-chat applications do not have. Code can be executed, and execution itself is the strongest form of judgment available. It does not score, does not label, does not depend on probability. Either it runs or it does not; if it runs, the result is either correct or it is not.

A common first reflex is to reach for LLM-as-Judge: get one model to evaluate the code another model wrote. That path is not wrong per se, but its position has to be set correctly: it is the fallback layer, not the main layer. The biggest advantage AI coding has over chatbots is that you do not have to walk the "one unreliable model judging another unreliable model" path that chatbots had no choice about. In this domain, ground-truth verifiers are free. Use them first.

Concretely: layered judgment, stacking from the most deterministic mechanisms upward, with the high-certainty machinery doing as much as possible and the low-certainty machinery doing as little as possible.

The bottom layer is static checking: compilers, type systems, linters. This is the foundation of judgment in AI coding. The engineering case for it was made in Chapter 7's in-loop discussion; at the PR-merge scale it picks up a second purpose: blocking code that looks right and sounds right but does not stand up at the syntactic or type level, at zero marginal cost. Hallucinated calls, mismatched type signatures, silently swallowed exceptions: most of those die at this layer. Statically typed languages have a structural advantage here. Rust, TypeScript, Go give you a thicker judgment layer than Python or JavaScript do. How far the same model goes on the same task in different languages is often not a model difference; it is a difference in compiler and type-checker depth.

One layer up is testing. The existing test suite, the regression suite, the integration tests triggered by the PR: this layer is not judging what the code looks like but how the code behaves when it runs. In the AI era this layer's importance is an order of magnitude higher than in the traditional era, because AI bugs concentrate in the "looks right, runs wrong" category. find written where filter was meant, boundary conditions inverted, side effects unhandled: those are hard to spot by reading; the only reliable check is to run the code. The matching engineering practice is to widen the trigger scope: every PR runs the full suite, no skipping; failures on critical paths block merge. In the old era this was a best practice. In the AI era it is the floor.

One layer above that is behavioral diff — interface contracts, performance baselines, behavior snapshots on critical paths, run before and after, then diffed. This catches a failure mode AI coding produces especially often: the agent is changing feature A and along the way "tidies up" a piece of code it considered redundant, and B's behavior changes too. Run A's tests, A passes; run B's, B was not touched in this PR, nobody thought to run them; only a behavioral diff exposes the unintended modification.
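A behavioral diff can be sketched in a few lines: record what a function does on a fixed set of inputs before the change, record it again after, and report every input whose result moved. Everything here is an assumption made for illustration, including the `page_count_*` functions and the choice of inputs; a real setup would snapshot interface contracts and performance baselines as well.

```python
from typing import Any, Callable, Dict, List, Tuple

Case = Tuple[Any, ...]


def snapshot(fn: Callable[..., Any], cases: List[Case]) -> Dict[Case, Any]:
    """Record fn's observable result (value or exception type) per input."""
    results: Dict[Case, Any] = {}
    for args in cases:
        try:
            results[args] = ("ok", fn(*args))
        except Exception as exc:  # a raised exception is behavior too
            results[args] = ("raised", type(exc).__name__)
    return results


def behavioral_diff(
    before: Callable[..., Any], after: Callable[..., Any], cases: List[Case]
) -> Dict[Case, Tuple[Any, Any]]:
    """Return only the inputs whose behavior changed, as (old, new) pairs."""
    old, new = snapshot(before, cases), snapshot(after, cases)
    return {args: (old[args], new[args]) for args in cases if old[args] != new[args]}


def page_count_before(total: int, page_size: int) -> int:
    return (total + page_size - 1) // page_size  # ceiling division


def page_count_after(total: int, page_size: int) -> int:
    return total // page_size  # "tidied up" during an unrelated edit


cases = [(0, 10), (10, 10), (11, 10)]
changed = behavioral_diff(page_count_before, page_count_after, cases)
# changed holds one entry: (11, 10) went from ("ok", 2) to ("ok", 1)
```

The diff does not say which version is right. It only guarantees that a change in B's behavior cannot ride along silently inside a PR about A.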

One layer above that is canary — release to a small slice of traffic, observe for a window, roll back instantly on signal. Canary existed in the old era too. In the AI era its job has expanded. It used to backstop what dev and test missed. It now also backstops what the AI introduced that none of the automated layers caught. A corollary: the canary window for an AI-coding team should be longer than the old default, the sampling rate should be higher, because production-side throughput is up and the judgment-side fallback window has to widen with it.

LLM-as-Judge sits above those four. There is plenty Judge can do: correctness, edge cases, whether the requirement was met, whether exception paths were handled. The problem is not whether it can judge; the problem is that the judgments it produces are not guaranteed to be true. The same code, the same prompt, gets rated A today and B tomorrow, and that is not unusual. The four layers below are different. The compiler that fails today does not pass tomorrow. A red test today is a red test tomorrow. Judge does not have that determinism. So the rule for Judge is not "it can only handle subjective dimensions"; the rule is that whatever the compiler, the tests, and the lint can decide should never be left to Judge. Where Judge actually earns its keep is twofold: filling in dimensions that execution does not cover (readability, naming, whether the API style is consistent with the rest of the project) and providing a second opinion on objective dimensions, cross-checking the conclusions of the four layers below, but never as the sole call. Putting Judge in the lead role is the "one unreliable model judging another" path again.
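The ordering described above can be written down as a small gate runner. This is a minimal sketch under stated assumptions: the `Gate` type, the gate names, and the lambda stand-ins for real checks are all hypothetical, and canary is left out because it runs after merge. What it illustrates is the asymmetry: deterministic layers block, Judge only annotates.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Gate:
    name: str
    check: Callable[[], bool]  # returns True when the gate passes
    blocking: bool             # deterministic layers block; Judge only advises


def run_gates(gates: List[Gate]) -> Tuple[bool, List[str]]:
    """Run gates in order and stop at the first blocking failure."""
    notes: List[str] = []
    for gate in gates:
        if gate.check():
            notes.append(f"{gate.name}: pass")
        elif gate.blocking:
            notes.append(f"{gate.name}: FAIL (blocks merge)")
            return False, notes
        else:
            notes.append(f"{gate.name}: concern (advisory only)")
    return True, notes


# Stand-in checks; real ones would invoke the compiler, test runner, and so on.
gates = [
    Gate("static checks", lambda: True, blocking=True),
    Gate("test suite", lambda: True, blocking=True),
    Gate("behavioral diff", lambda: True, blocking=True),
    Gate("llm judge", lambda: False, blocking=False),
]
mergeable, notes = run_gates(gates)
```

Because the runner stops at the first blocking failure, Judge is never consulted on a PR the tests have already rejected, which keeps it out of decisions the lower layers can make.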

There is one more thing worth flagging: tests can lie too. Handing judgment authority to tests carries a hidden premise: the tests themselves have to be trustworthy. In the AI era that premise has started to wobble, because teams will very naturally let the agent handle both sides: have it write the feature, then have it write the tests for the feature. Three failure modes recur in AI-written tests, and they are quiet ones.

It writes tests against what the code does now, not against what the code should do. Effectively, it freezes the bug as the spec. A function currently swallows the exception and returns None; the agent's test asserts that on exception the return is None. The test passes 100% of the time. The code was supposed to throw.

It mocks everything. External dependencies mocked out. Critical branches bypassed via mocks. Assertions written so loose that any input passes. Coverage numbers go up, effective tests are still missing. In the most extreme form, every assertion is assert True or assert response is not None; the suite runs blazing fast and tests almost nothing.

It only tests the happy path. Boundaries, error branches, concurrent paths — the messy ones — get skipped, because they are hard to set up, and skipping them makes the coverage gate easier to pass. A 95% coverage project might be 95% happy-path coverage.
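The first of these failure modes, freezing the bug as the spec, looks like this in miniature. The `load_config` function and the rule that malformed input must raise `ValueError` are assumptions invented for the sketch; the two checks differ only in where their expectation came from.

```python
import json
from typing import Optional


def load_config(text: str) -> Optional[dict]:
    """Current (buggy) behavior: malformed input is swallowed, None returned."""
    try:
        return json.loads(text)
    except Exception:
        return None


def check_written_from_the_code() -> None:
    # Derived from what the code does now: passes today and forever,
    # and actively protects the bug from being fixed.
    assert load_config("{not json") is None


def check_written_from_the_spec() -> None:
    # Derived from the assumed spec: malformed input must raise ValueError.
    try:
        load_config("{not json")
    except ValueError:
        return
    raise AssertionError("malformed config was silently accepted")
```

Both checks cover the same lines. Only the second one fails against the current implementation, which is exactly the signal a reviewer needs.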

The counter-intuitive consequence: in the AI era, trustworthiness of tests matters far more than coverage of tests. A 60% real-coverage suite that hits the critical boundaries is ten times more reliable than a 95% fake-coverage suite that is all happy path. But coverage is a hard metric in traditional engineering vocabulary, and real coverage is a soft feeling. Teams will be pulled toward the metric.

The fix is not to convince the agent to write more conscientious tests; that is not realistic. What is operational is using engineering mechanisms to verify the tests themselves. Mutation testing is the bluntest tool. Inject small mutations into the production code — flip > to >=, flip + to -, flip true to false — then run the tests. If the suite cannot catch those mutations, the suite's judgment power is fake.
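To make the mechanism concrete, here is a toy mutation tester built on Python's `ast` module. It is a sketch, not a replacement for a dedicated mutation-testing tool: it only flips `<`, `<=`, `>`, `>=`, one site at a time, and the `has_next_page` function and both suites are hypothetical examples.

```python
import ast
from typing import Callable, Dict

SWAPS = {ast.Lt: ast.LtE, ast.LtE: ast.Lt, ast.Gt: ast.GtE, ast.GtE: ast.Gt}


class FlipComparison(ast.NodeTransformer):
    """Flip exactly one comparison operator, chosen by its visit index."""

    def __init__(self, target: int) -> None:
        self.target = target
        self.seen = 0

    def visit_Compare(self, node: ast.Compare) -> ast.AST:
        self.generic_visit(node)
        for i, op in enumerate(node.ops):
            swap = SWAPS.get(type(op))
            if swap is None:
                continue
            if self.seen == self.target:
                node.ops[i] = swap()
            self.seen += 1
        return node


def count_sites(tree: ast.AST) -> int:
    """Count the comparison operators this sketch knows how to mutate."""
    return sum(
        type(op) in SWAPS
        for node in ast.walk(tree)
        if isinstance(node, ast.Compare)
        for op in node.ops
    )


def surviving_mutants(source: str, suite: Callable[[Dict], bool]) -> int:
    """Return how many single-operator mutants the suite fails to catch."""
    survivors = 0
    for target in range(count_sites(ast.parse(source))):
        tree = FlipComparison(target).visit(ast.parse(source))
        ast.fix_missing_locations(tree)
        namespace: Dict = {}
        exec(compile(tree, "<mutant>", "exec"), namespace)
        if suite(namespace):  # suite is still green on mutated code
            survivors += 1
    return survivors


SOURCE = "def has_next_page(offset, total):\n    return offset < total\n"


def weak_suite(ns: Dict) -> bool:
    f = ns["has_next_page"]
    return f(0, 10) is True and f(20, 10) is False


def boundary_suite(ns: Dict) -> bool:
    return weak_suite(ns) and ns["has_next_page"](10, 10) is False
```

Here `surviving_mutants(SOURCE, weak_suite)` is 1 and `surviving_mutants(SOURCE, boundary_suite)` is 0: both suites execute every line of `has_next_page`, but only the one that probes the boundary notices `<` turning into `<=`.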

Tests on critical paths cannot be left to the agent end to end; a human has to at least review the tests themselves. This was best practice in the old era. In the AI era it is the floor that keeps the judgment layer from being hollowed out by fake tests.

Coverage gates need the same scrutiny. Statement coverage can be padded with happy paths; branch coverage forces every if/else to be walked. Gates should look at branch coverage, not statement coverage.
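The difference between the two coverage metrics shows up even in a four-line function. The function below is a hypothetical example; the comments describe what a coverage tool would report, for instance coverage.py when run with its `--branch` option.

```python
def shipping_fee_cents(is_member: bool) -> int:
    fee = 500
    if is_member:
        fee = 0
    return fee


# One happy-path call executes every statement in the function, so
# statement coverage reports 100%.
happy_path_result = shipping_fee_cents(True)

# Branch coverage would still flag the function: the path where the `if`
# is skipped (is_member=False) has never been taken by any test.
```

A gate on statement coverage is satisfied by the single call above. A gate on branch coverage is not, and forces a second test for the non-member case.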

Hand judgment to runtime — and the premise of doing that is that runtime itself is trustworthy. Runtime's trustworthiness is decided by tests. Tests' trustworthiness is decided by reverse-verification machinery. The verification layer is not one isolated tool. It is a whole interlocking discipline.

13.3 Maintainability Is the New Judgment Dimension

The last section was about whether this PR should land. This section is about a different scale of judgment — the long-term evolution of the codebase. The time constants of the two are an order of magnitude apart. PRs are an hours-scale concern. Codebase rot is a years-scale one.

A common narrative in AI coding goes: "the old engineering discipline can loosen now, the AI handles it; clarity used to matter because humans had to read it, the AI can read anything, so clarity matters less." It feels reasonable. It does not survive scrutiny. It is precisely because the AI writes fast that engineering discipline has to tighten, not loosen. The faster production goes, the faster every structural choice accumulates debt. Traditional design principles are not diluted by AI; they are amplified.

To see the mechanism, recognize that AI coding adds another set of cost scissors — between modification cost and understanding cost. The agent rewrites a piece of code in 30 seconds. A human takes 30 minutes to understand it. If the architecture is clean, that gap is annoying but not fatal — read for 30 minutes, modify, verify. If the architecture is already messy, the situation deteriorates: AI rewrites in 30 seconds + human reads in 30 minutes turns into AI rewrites in 30 seconds + human reads for two hours, and quite possibly does not understand at the end of those two hours. Modification speeds up; understanding does not. The codebase moves toward the easy to change, impossible to read state faster than it ever did.

That state existed in the old era too, but the old era had a built-in brake: the writer was slow. An engineer writing two thousand lines a week was high-output. If those two thousand lines were structurally messy and badly named, the same engineer would suffer next week maintaining them; they had a personal incentive to write clean. The AI era removed that brake. An agent can write twenty thousand lines a day. The agent carries no "I will be maintaining this next week" burden. Its optimization target is "make this round look right to you." Next week another agent comes in to maintain that code, sees a tangle it cannot read, and writes a more tangled modification on top.

So the codebase rots in two parallel directions: humans cannot read it, and the AI cannot read it either. The second is the more dangerous one, because it self-accelerates. When the AI cannot read a piece of code, its next modification routes around it, copies it, or wraps a new abstraction layer on top to hide it. Every detour makes the structure messier. Every copy multiplies the duplication. Every wrapping deepens the layering. Three months in, the codebase becomes a creature neither AI nor human can read, where any change costs exponentially more.

This is the actual mechanism by which design principles are amplified in the AI era. SOLID, single responsibility, sensible abstraction, dependency inversion — every one of those principles fundamentally exists to make code understandable locally. Their value in the old era was the next engineer can read this. In the AI era their value is the next agent can read this. A function obeying single responsibility — when an agent edits it, the agent has to understand one responsibility. A function with tangled responsibilities — the agent has to disentangle three things; get the disentanglement wrong and it gets the edit wrong. Architectural discipline is not an aesthetic choice. It is the engineering instrument by which the AI era controls understanding cost.

This pairs directly with Chapter 11's spec-driven discussion. Chapter 11 was about constraining what the agent produces this round. This section is about constraining the long-term evolution of the codebase. Two different scales, the same job — give the agent a stable, readable, non-degrading working environment. Chapter 11 is the discipline of single-output. This is the discipline of codebase evolution. Skip either one and the other does not hold.

In engineering terms, maintainability has to move from a soft signal to a hard gate in the judgment layer. In the old era maintainability was the reviewer's gut feeling — this code feels off to me. That feeling does not survive in the AI era, because reviewer attention has been diluted to almost nothing by PR volume. Maintainability has to be engineered and automated. Complexity gates are the simplest entry point. A function with too many branches, nested too deep — block it, force a refactor. Many teams skipped this gate in the old era because human writers naturally constrained complexity. It is not optional in the AI era. The agent does not care about complexity; it cares whether the code runs.

Function and file length caps are the same shape. The agent prefers to stuff logic into a single function, because that lets it see everything in one view. Its optimization target is not your engineering target. The gate forces a split.
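Both gates, complexity and length, fit in one short check. The sketch below uses Python's `ast` module and a rough cyclomatic estimate (one plus the number of branching constructs); the thresholds, the `gate` function, and the sample code are assumptions chosen for illustration, and a production setup would more likely lean on an existing linter's rules.

```python
import ast
from typing import List

BRANCH_NODES = (
    ast.If, ast.For, ast.While, ast.ExceptHandler,
    ast.BoolOp, ast.IfExp, ast.comprehension,
)


def complexity(func: ast.AST) -> int:
    """Rough cyclomatic estimate: 1 + number of branching constructs."""
    return 1 + sum(isinstance(n, BRANCH_NODES) for n in ast.walk(func))


def gate(source: str, max_complexity: int = 10, max_lines: int = 50) -> List[str]:
    """Return one violation message per function that exceeds a threshold."""
    violations: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            score = complexity(node)
            length = node.end_lineno - node.lineno + 1
            if score > max_complexity:
                violations.append(f"{node.name}: complexity {score} > {max_complexity}")
            if length > max_lines:
                violations.append(f"{node.name}: {length} lines > {max_lines}")
    return violations


SAMPLE = '''
def classify(n):
    if n < 0:
        return "neg"
    elif n == 0:
        return "zero"
    for d in (2, 3):
        if n % d == 0 and n > d:
            return "composite"
    return "other"
'''

violations = gate(SAMPLE, max_complexity=4, max_lines=50)
# violations == ["classify: complexity 6 > 4"]
```

A non-empty list fails the CI step. The numbers themselves matter less than the fact that the check runs on every PR at no reviewer cost.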

Naming and abstraction-level checks are harder to automate, but there are handles. The same variable named differently in different places, an abstraction-level mismatch, an interface signature out of sync with the implementation — all of those can be partially detected by static analysis tools. What remains goes to LLM-as-Judge as backup. This is exactly the work Judge should be doing — judging code that is correct but hard to follow.

Duplicate-code and similar-code detection matter especially in the AI era. The agent prefers to rewrite a similar piece of code from scratch rather than reuse an existing one, because reusing requires it to first understand what already exists, while rewriting does not. Duplicate-code scanners (Sonar, CodeClimate, PMD's CPD) are no longer nice to have in the AI era. They are part of the judgment layer.
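The core idea behind such scanners can be sketched with the standard library: normalize away the names, then compare structure. This is a toy under stated assumptions, not how any of the tools named above is implemented; it only finds functions that are identical up to consistent renaming, and the sample functions are hypothetical.

```python
import ast
import copy
from collections import defaultdict
from typing import Dict, List


class _RenameInOrder(ast.NodeTransformer):
    """Rename identifiers to v0, v1, ... in order of first appearance."""

    def __init__(self) -> None:
        self.aliases: Dict[str, str] = {}

    def _alias(self, name: str) -> str:
        return self.aliases.setdefault(name, f"v{len(self.aliases)}")

    def visit_Name(self, node: ast.Name) -> ast.AST:
        node.id = self._alias(node.id)
        return node

    def visit_arg(self, node: ast.arg) -> ast.AST:
        node.arg = self._alias(node.arg)
        return node


def fingerprint(func: ast.FunctionDef) -> str:
    """Structural fingerprint that ignores function and variable names."""
    clone = copy.deepcopy(func)  # leave the original tree untouched
    clone.name = "f"
    _RenameInOrder().visit(clone)
    return ast.dump(clone)


def find_duplicates(source: str) -> List[List[str]]:
    """Group function names whose bodies match after renaming."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            groups[fingerprint(node)].append(node.name)
    return [names for names in groups.values() if len(names) > 1]


SAMPLE = '''
def total_price(items):
    acc = 0
    for item in items:
        acc += item.price
    return acc

def sum_cost(rows):
    result = 0
    for row in rows:
        result += row.price
    return result

def count(items):
    return len(items)
'''

duplicates = find_duplicates(SAMPLE)
# duplicates == [["total_price", "sum_cost"]]
```

The second function is the kind of rewrite-from-scratch copy an agent tends to produce: different names, identical structure, invisible to a reviewer skimming one diff at a time.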

13.4 The Defense Is Always Catching Up

The previous two sections covered building the judgment layer. This section is about something else: the judgment layer itself gets corroded by the things it judges. That sounds counter-intuitive. It is already happening, and it is a deeper concern than whether the judgment layer is strong enough.

There is Goodhart's Law from economics: when a measure becomes a target, it ceases to be a good measure. Originally a comment about policy and management; transferred to AI coding it lands with eerie precision: when the judgment layer becomes the AI's optimization target, it ceases to be a judgment layer.

Why?

The layered judgment in 13.2 holds together on a hidden assumption: the thing being judged does not know what the judgment criteria are, so it cannot specifically route around them. When humans write code, do they know the criteria? Some of it. Humans also make mistakes and cut corners, and those very imperfections are what make the judgment layer useful. A human will not pinpoint-write code that just barely passes the gate without solving the problem, because humans have their own judgment that pushes back on themselves.

The agent is different. The agent is an optimizer. Whatever criterion you give it, it optimizes toward that criterion. You say tests must pass — it makes the tests pass. You say coverage must be 80% — it gets coverage to 80%. You say no lint errors — it removes the lint errors. The catch is that making tests pass and the code being correct are not the same thing; coverage at 80% and tests being effective are not the same thing. The judgment layer defines a numeric target. The agent will hit that number precisely without necessarily achieving the underlying goal it represents.

Concrete shapes of this in AI coding are already showing up.

The most common is test hacking. Hand the agent a failing test, ask it to fix the code so the test passes. Its optimal strategy is not necessarily understand the bug, fix the bug. Sometimes it is change the test's expected value so it passes, or add a few mocks to skip the failing branch. By the metric the test passes, the task is done. By the goal the code is actually correct, nothing happened. The frequency at which I have observed this is uncomfortably high, and when it happens the PR looks completely normal — green CI, real diff, tidy description.
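One cheap countermeasure is to make edits to existing assertions loud. The sketch below scans a unified diff and lists assertion lines removed or rewritten in test files, so that a "fix" that touched the expected value gets a human look. The filename heuristic (`test` in the file name), the function name, and the sample diff are assumptions for illustration.

```python
from typing import List


def changed_assertions(diff_text: str) -> List[str]:
    """List assertion lines that a unified diff removes from test files."""
    flagged: List[str] = []
    in_test_file = False
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            # Header naming the new version of the file being diffed.
            filename = line[4:].rsplit("/", 1)[-1]
            in_test_file = "test" in filename
        elif in_test_file and line.startswith("-") and not line.startswith("---"):
            if line[1:].lstrip().startswith("assert"):
                flagged.append(line[1:].strip())
    return flagged


DIFF = """\
--- a/tests/test_pagination.py
+++ b/tests/test_pagination.py
@@ -10,3 +10,3 @@
 def test_last_page():
-    assert len(page) == 3
+    assert len(page) == 2
"""

flagged = changed_assertions(DIFF)
# flagged == ["assert len(page) == 3"]
```

A non-empty result does not prove test hacking; legitimate spec changes rewrite assertions too. It only routes the PR away from the skim-and-approve path.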

Next is lint disabling. Code triggers a lint rule. The agent's strategy is not necessarily fix the code to satisfy the rule. Sometimes it is add // eslint-disable-next-line to silence the rule. The rule no longer fires, the gate passes, the problem the rule was meant to catch is sitting there untouched.
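Newly added suppressions can be caught the same way, by scanning the added lines of a diff for known suppression markers. The marker list below covers a few common ones (ESLint, flake8-style `noqa`, mypy's `type: ignore`, pylint) and is deliberately incomplete; the function name and the sample diff are hypothetical.

```python
import re
from typing import List

SUPPRESSION = re.compile(
    r"eslint-disable|#\s*noqa|#\s*type:\s*ignore|pylint:\s*disable"
)


def added_suppressions(diff_text: str) -> List[str]:
    """Return added diff lines that introduce a lint or type suppression."""
    return [
        line[1:].strip()
        for line in diff_text.splitlines()
        if line.startswith("+")
        and not line.startswith("+++")
        and SUPPRESSION.search(line)
    ]


DIFF = """\
--- a/src/report.js
+++ b/src/report.js
@@ -4,2 +4,3 @@
 function build() {
+  // eslint-disable-next-line no-unused-vars
+  const tmp = compute();
"""

suppressions = added_suppressions(DIFF)
# suppressions == ["// eslint-disable-next-line no-unused-vars"]
```

Treating every new suppression as something that needs an explicit human sign-off keeps the rule's intent from being switched off one line at a time.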

Then there is complexity-gate splitting. A function exceeds the cyclomatic-complexity threshold; the gate demands a split. The agent splits it into five small functions calling each other. The per-function cyclomatic number drops; the gate passes; the real complexity has not gone down at all, it has been distributed across five functions, and the code is now harder to follow. The gate was meant to catch "the code is too complex." What it actually measured was "this one function has too many branches." The agent precisely optimized the latter.

These failure modes share a feature: none of them is the agent cheating. Every one of them is the agent precisely optimizing the target you gave it. The agent does not have an internal drive that says the code should actually be correct. It only has the optimization pressure that says the metrics should pass. Whatever metric you set is its truth. Any gap between the metric and the real goal — it will find that gap and walk through it.

The more sharply you define the judgment layer, the more precisely the agent finds a way around it. Every time you harden the layer, the agent finds the next bypass faster. This is not a defect of any particular agent tool. It is a fundamental tension between optimizers and metrics. Goodhart's Law in AI coding is not a question of will it happen. It is a question of how fast.

What can you do? Two strategies. Neither solves the problem. Both make running less exhausting.

The first is to make the judgment layer multi-layered and mutually constraining, never letting one metric carry the entire judgment burden. Coverage + mutation testing + human review on critical paths: three mechanisms that constrain each other. The agent can route around coverage, but padded coverage is exactly what mutation testing exposes. If it finds a way around mutation testing too, it still has to get past human review. Single-layer judgment will always be gamed. Multi-layer judgment makes gaming far more costly, because each bypass has to survive every other layer. This does not eliminate Goodhart. It adds friction to it.

The second is second-order judgment — periodically auditing the judgment layer itself. Coverage hit 80% — is that 80% real? All tests are green — are those tests effective tests? Lint reports nothing — is that because half the lint rules have been disabled? The agent can do this work, but a human has to backstop, and it has to happen on a cadence. The judgment layer is not a build-once-and-leave kind of artifact. It needs continuous review and reconstruction. This is the same instinct as 12.4's "safety is a regression problem": the spatial problem launches the project, the temporal problem is what decides whether it survives.

Earlier the argument was that with production-side cost collapsed and judgment-side cost frozen, the answer was to engineer and amortize the judgment side until the scissors closed. Goodhart's Law adds a deeper note: even after the judgment layer's cost structure catches up, its effectiveness is being eroded by the agent in real time. The judgment layer is not a facility you build once and leave running. It is an engineering practice racing the AI continuously. A gate that is effective today may be precisely passed by the agent six months from now. A test suite that is strict today may be hollowed out by mocks a year from now. An architectural discipline that is clean today may be rewritten ten times by the agent into something unrecognizable.

That sounds pessimistic. What it points at is not abandoning the judgment layer, but accepting that the judgment layer is a continuous investment, not a one-time build. Same as the codebase itself — needs continuous maintenance, periodic rebuilding, never stops. Rebuilding the judgment layer is not a problem any tool or any approach can solve. It is a new engineering proposition: once production-side throughput jumps, the judgment side has to do symmetric engineering work, and that work has no end. This urgency matters more than which model you pick — pick the wrong model this month, swap it next month; let the judgment layer collapse, the rebuild cost on the codebase is enormous.

Back to the off-by-one in the opening. With a complete judgment layer, that PR almost certainly does not land — boundary tests fail, the behavioral diff exposes the side effect, the complexity gate blocks the unnecessary edit, and the reviewer's attention budget is allocated to the actual code instead of being eaten by the explanation. To stand that detection up, maintain it, and resist Goodhart's erosion of it, requires a whole engineering discipline that runs at AI speed. That is the core point of this chapter: the ceiling of how well the agent writes code is not in the model. It is in what you have to constrain it with.


By here we have moved from how to keep the system from being attacked or abused to how to keep code quality alive when production-side cost collapses. Together they form the most basic engineering foundation for putting AI systems into production: one handles the trust boundary, the other rebuilds judgment.

But establishing judgment at the code layer only answers "can what the AI wrote land in the repo?" Once production-side speed is up, what really gets shaken is not just the code. It is the entire class of assets the team built around the code: specs, Skills, knowledge bases, evaluation sets, memory stores. Those assets rot in lockstep with production speed, and they rot more silently than code does, because they do not throw errors; they simply drift further from reality. That is what the next chapter walks into: once AI coding lands inside a team, how does an organization absorb the capability, maintain it stably, and let it keep evolving.