Two metrics define what good AI-built code looks like
Most teams grade their AI rollout on adoption and task-completion rates. Those measure activity, not the health of what's being built. Here are two structural dials that do.
The bottom line
- The problem: We're grading AI on the wrong metrics. Adoption and task-completion rates measure activity, not the health of what's being built.
- The insight: Left to optimize for speed, AI reaches for external dependencies and writes custom logic where a generator would do — adding attack surface and code to maintain, for no real speed gain.
- The action: Add two structural dials — Deterministic Coverage and Supply Chain Risk — so you can see whether your teams are building predictable assets or expensive liabilities.
The wrong dashboard
Most enterprises measure generative AI through developer adoption and task-completion rates. By those numbers, the rollouts look successful — and adoption is worth measuring well; I've argued for a better adoption metric elsewhere. But adoption tells you people are using AI. It says little about whether the code they produce is sound.
Without architectural guardrails, an AI takes the quickest path to a working feature. It pulls in an external package for a problem the standard library already solves, along with everything that package itself depends on. Instead of using a deterministic generator for something like API routing, it writes custom boilerplate.
The feature ships, and it looks productive on the dashboard. The cost shows up later: a wider dependency tree to audit and edge cases from guessed code to debug, often paid by a different team than the one that moved first.
The velocity that matters is how easily you can secure, audit, and change the system a year out. An adoption dashboard won't show you that.
Two kinds of code your team didn't write
I've used "generator" and "generated" a few times already, so let's pin the word down. The whole argument rests on it, and today the word mostly brings "AI-generated" to mind.
Code no one typed by hand comes from one of two generators. A deterministic generator builds it from a schema — a precise, machine-readable description of your data and interfaces, like an OpenAPI spec, a protobuf file, or a database schema. A probabilistic generator — a language model — guesses the code from a prompt, a piece at a time. Both are useful; the difference is whether you can predict what comes out.
What does "deterministic" mean?
Same input, same output, every time — no randomness. Give a deterministic generator the same schema and it produces exactly the same code on every run, so the result is predictable and easy to audit. In contrast, a probabilistic generator samples from many possible outputs, so the same prompt can give different code each time.
When this post says code is "generated," it means the deterministic kind unless noted — built from a schema, not guessed by a model.
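To make "same input, same output" concrete, here is a minimal sketch of a deterministic generator in Python. The schema, the `generate_models` function, and the type names are all hypothetical stand-ins for a real spec and a real generator. The only point is that two runs over the same schema produce byte-identical source.

```python
import hashlib

# Hypothetical schema: a tiny stand-in for an OpenAPI spec or a protobuf file.
SCHEMA = {
    "User": {"id": "int", "email": "str"},
    "Order": {"id": "int", "user_id": "int", "total_cents": "int"},
}


def generate_models(schema: dict) -> str:
    """Render dataclass source code from a schema, with no randomness involved."""
    lines = ["from dataclasses import dataclass", ""]
    # Sort type and field names so the output does not depend on key order.
    for type_name in sorted(schema):
        lines += ["", "@dataclass", f"class {type_name}:"]
        for field_name in sorted(schema[type_name]):
            lines.append(f"    {field_name}: {schema[type_name][field_name]}")
        lines.append("")
    return "\n".join(lines)


def fingerprint(source: str) -> str:
    """Hash the generated source so two runs can be compared at a glance."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


first = generate_models(SCHEMA)
second = generate_models(SCHEMA)
same_output = fingerprint(first) == fingerprint(second)
print(same_output)  # prints: True
```

Because the output is a pure function of the schema, an auditor can review the generator once and then diff the schema, rather than re-reading every generated file.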
Two dials worth tracking
You can describe the structural health of what's being built with two numbers.
Dial 1 — Deterministic Coverage. The share of your first-party code that is generated deterministically from a schema or contract, rather than hand-written or generated probabilistically by a model. Generating standard routing or database access with an LLM spends compute and context on problems that were solved decades ago. A healthy architecture uses strict generators for that foundation and reserves the model for the custom logic that differentiates the product.
One honest limit: it's a coverage proxy measured by line count, so a verbose generator can flatter it. Read it as a direction, not a target — at 100% you'd have no custom logic at all.
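As a rough sketch of the arithmetic, Deterministic Coverage is schema-generated lines divided by all first-party lines. The file inventory below is invented, and the `origin` labels ("schema", "hand", "model") are an assumption about how you might tag files. In practice you would need your own way to tell them apart, such as a header comment your generator writes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceFile:
    path: str
    lines: int
    origin: str  # hypothetical labels: "schema", "hand", or "model"


def deterministic_coverage(files: list[SourceFile]) -> float:
    """Share of first-party lines that were generated from a schema."""
    total = sum(f.lines for f in files)
    if total == 0:
        return 0.0
    generated = sum(f.lines for f in files if f.origin == "schema")
    return generated / total


# Invented inventory for one service; the numbers are illustrative only.
inventory = [
    SourceFile("api/routes_gen.py", 1200, "schema"),
    SourceFile("db/models_gen.py", 800, "schema"),
    SourceFile("billing/pricing.py", 600, "hand"),
    SourceFile("billing/discounts.py", 1400, "model"),
]

coverage = deterministic_coverage(inventory)
print(f"{coverage:.0%}")  # prints: 50%
```

Model-written and hand-written lines both sit in the denominator only, which is why a verbose generator can push the number up without the architecture getting any healthier.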
Dial 2 — Supply Chain Risk. How much of the code running in your environment comes from outside your team. Measure it in trust boundaries, not lines of code: the number of distinct external modules you depend on, directly and through their own dependencies. Size doesn't tell you the risk — a small package can do as much damage as a large one. What counts is how many pieces of code you have to scan, patch, and trust but don't control. Each one widens your attack surface. A healthy architecture keeps that number low by leaning on the language's standard library.
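Counting trust boundaries means walking the dependency graph, not reading the manifest. The sketch below uses an invented graph with made-up package names. A real count would come from your package manager's lockfile or resolver output, which this example does not parse.

```python
from collections import deque

# Hypothetical dependency graph: each package maps to its direct dependencies.
DEPENDENCIES = {
    "my-service": ["web-framework", "json-helper"],
    "web-framework": ["http-core", "template-engine", "json-helper"],
    "http-core": ["tls-shim"],
    "template-engine": [],
    "json-helper": [],
    "tls-shim": [],
}


def external_modules(graph: dict, root: str) -> set:
    """Return every distinct module reachable from root, direct or transitive."""
    seen = set()
    queue = deque(graph.get(root, []))
    while queue:
        name = queue.popleft()
        if name in seen or name == root:
            continue  # already counted, or a cycle back to the service itself
        seen.add(name)
        queue.extend(graph.get(name, []))
    return seen


direct = set(DEPENDENCIES["my-service"])
everything = external_modules(DEPENDENCIES, "my-service")
print(len(direct), len(everything))  # prints: 2 5
```

The manifest lists two dependencies, but the service actually trusts five. The second number is the one the dial tracks.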
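Either dial on its own is easy to game: a verbose generator inflates coverage, and replacing a library with model-written code lowers the dependency count while adding guessed logic to maintain. That is why they're only useful read together.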
The target is a generated core with few dependencies
The combination you want is high Deterministic Coverage with low Supply Chain Risk: a generated, predictable foundation on a small external surface you control.
The opposite default — letting the model build everything — leaves little generated and a dependency tree that keeps widening. Every release then costs more: external code to review, and edge cases the model guessed at to track down.
Building this way isn't slower. Generating a foundation from a schema is quick, and the output matches the schema by construction, so there is no model guesswork to debug and no third-party package to learn. Because you own that code, it stays cheap to audit and change as the product grows. The speed and the safety come from the same decision.
The architectural ceiling
One warning before you start moving these dials: you can only move them as far as your architecture allows. If your foundation needs a heavy framework to do something basic, telling your team to "use fewer dependencies" won't move the risk dial. If your foundation resists clean code generation, coverage stays low no matter how disciplined the team is.
The ceiling on both numbers is set by choices made before any code — or any model — was involved. That's its own conversation, and the subject of the next post.
Where this is headed
Expect "how much of our executing code do we actually own?" to move from an engineering curiosity to a standard question in technical due diligence — the kind of thing that shows up in acquisition reviews and enterprise procurement, next to your security posture. The teams that can answer it with a number will have an easier time than the ones discovering the answer mid-audit.
Two things to take away
If AI makes writing code effectively free, why is your engineering budget still going up? Sit with that one before your next planning cycle.
And a diagnostic you can run this week: take your most important service and count the distinct external modules in its dependency graph, direct and transitive. Then estimate how much of the code your team "wrote" was generated from a schema. You don't need a tool or a target to start — the two numbers, looked at honestly, will tell you where you stand.
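If the service happens to be a Python application running in its own virtual environment (an assumption, since nothing above depends on the language), one rough first count is the number of distributions installed alongside it. This sketch uses the standard library's `importlib.metadata`. It counts everything in the environment, build and packaging tooling included, so treat the result as an upper bound rather than a precise figure.

```python
from importlib import metadata


def installed_distribution_names() -> set:
    """Collect the distinct distribution names visible in this environment."""
    names = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries whose metadata is missing a name
            # Lowercase and unify separators so "Foo_Bar" and "foo-bar" match.
            names.add(name.lower().replace("_", "-"))
    return names


names = installed_distribution_names()
print(f"{len(names)} distinct third-party distributions installed")
```

Run it inside the service's environment, not your global interpreter, or the count will reflect your workstation rather than the service.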