Operational Excellence

Two metrics define what good AI-built code looks like

Most teams grade their AI rollout on adoption and task-completion rates. Those measure activity, not the health of what's being built. Here are two structural dials that do.

The bottom line

The problem: We're grading AI on the wrong metrics. Most engineering leaders track adoption and task-completion rates — numbers that measure activity, not the health of what's being built.
The insight: Left to optimize for speed, AI reaches for external dependencies and writes custom logic where a generator would do — adding attack surface and code to maintain, for no real speed gain.
The action: Add two structural dials — Deterministic Coverage and Supply Chain Risk — so you can see whether your teams are building predictable assets or expensive liabilities.

The wrong dashboard

Most enterprises measure generative AI through developer adoption and task-completion rates. By those numbers, the rollouts look successful — and adoption is worth measuring well; I've argued for a better adoption metric elsewhere. But adoption tells you people are using AI. It says little about whether the code they produce is sound.

Without architectural guardrails, an AI takes the quickest path to a working feature. It pulls in an external package for a problem the standard library already solves, along with everything that package itself depends on. Instead of using a deterministic generator for something like API routing, it writes custom boilerplate.

The feature ships, and it looks productive on the dashboard. The cost shows up later: a wider dependency tree to audit and edge cases from guessed code to debug, often paid by a different team than the one that moved first.

The velocity that matters is how easily you can secure, audit, and change the system a year out. An adoption dashboard won't show you that.

Two kinds of code your team didn't write

I've leaned on the word "generator" a couple of times already, so let's pin it down. The whole argument rests on it, and today "generated" mostly brings "AI-generated" to mind.

Code no one typed by hand comes from one of two generators. A deterministic generator builds it from a schema — a precise, machine-readable description of your data and interfaces, like an OpenAPI spec, a protobuf file, or a database schema. A probabilistic generator — a language model — guesses the code from a prompt, a piece at a time. Both are useful; the difference is whether you can predict what comes out.

What does "deterministic" mean?

Same input, same output, every time — no randomness. Give a deterministic generator the same schema and it produces exactly the same code on every run, so the result is predictable and easy to audit. In contrast, a probabilistic generator samples from many possible outputs, so the same prompt can give different code each time.
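To make the definition concrete, here is a minimal sketch of a deterministic generator in Python. The schema format, the `generate_dataclass` function, and the `fingerprint` helper are all hypothetical, invented for illustration; real generators for OpenAPI or protobuf are far more capable. The property on display is the only point: the same schema in gives byte-identical code out.

```python
import hashlib

# Hypothetical schema: a type name plus field names mapped to Python type names.
SCHEMA = {
    "name": "User",
    "fields": {"id": "int", "email": "str", "active": "bool"},
}


def generate_dataclass(schema: dict) -> str:
    """Render dataclass source code from a schema, using no randomness."""
    lines = [
        "from dataclasses import dataclass",
        "",
        "",
        "@dataclass",
        f"class {schema['name']}:",
    ]
    # Fields are emitted in the order the schema lists them.
    for field_name, type_name in schema["fields"].items():
        lines.append(f"    {field_name}: {type_name}")
    return "\n".join(lines) + "\n"


def fingerprint(source: str) -> str:
    """Hash the generated source so two runs can be compared cheaply."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


first = generate_dataclass(SCHEMA)
second = generate_dataclass(SCHEMA)
print(fingerprint(first) == fingerprint(second))  # True
```

Comparing fingerprints like this is also a cheap audit: if the checked-in code no longer matches what the generator produces from the current schema, someone edited it by hand.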

When this post says code is "generated," it means the deterministic kind unless noted — built from a schema, not guessed by a model.

Same schema, same code; same prompt, a new guess. Both are legitimate tools. Use the deterministic generator for the predictable foundation, and the probabilistic one for the custom logic that differentiates the product.

Two dials worth tracking

You can describe the health of an AI-built codebase with two numbers.

Dial 1 — Deterministic Coverage. The share of your first-party code that is generated deterministically from a schema or contract, rather than hand-written or generated probabilistically by a model. Generating standard routing or database access with an LLM spends compute and context on problems that were solved decades ago. A healthy architecture uses strict generators for that foundation and reserves the model for the custom logic that differentiates the product.

One honest limit: it's a coverage proxy measured by line count, so a verbose generator can flatter it. Read it as a direction, not a target — at 100% you'd have no custom logic at all.
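As a rough sketch of the arithmetic, assuming you can already sort your first-party lines into three buckets (schema-generated, hand-written, and model-written), the dial is a simple ratio. The function name and the line counts below are made up for illustration; how you attribute lines to each bucket is the hard part and is not shown.

```python
def deterministic_coverage(
    generated_lines: int, handwritten_lines: int, model_lines: int
) -> float:
    """Share of first-party lines that came from a deterministic generator."""
    total = generated_lines + handwritten_lines + model_lines
    if total == 0:
        raise ValueError("no first-party code to measure")
    return generated_lines / total


# Hypothetical service: 12,000 generated, 5,000 hand-written, 3,000 model-written lines.
coverage = deterministic_coverage(12_000, 5_000, 3_000)
print(f"{coverage:.0%}")  # 60%
```

Because the ratio counts lines, doubling the verbosity of the generator would raise the number without changing the architecture, which is the limit noted above.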

Dial 2 — Supply Chain Risk. How much of the code running in your environment comes from outside your team. Measure it in trust boundaries, not lines of code: the number of distinct external modules you depend on, directly and through their own dependencies. Size doesn't tell you the risk — a small package can do as much damage as a large one. What counts is how many pieces of code you have to scan, patch, and trust but don't control. Each one widens your attack surface. A healthy architecture keeps that number low by leaning on the language's standard library.
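Counting trust boundaries means walking the dependency graph, not reading the manifest. The sketch below assumes you have already exported that graph as a plain mapping from each module to its direct dependencies; the module names and the `external_modules` function are hypothetical, and in practice the graph would come from your package manager's lockfile or tooling.

```python
from collections import deque

# Hypothetical dependency graph: each module maps to its direct dependencies.
DEPENDENCIES = {
    "my-service": ["web-framework", "http-client"],
    "web-framework": ["template-engine", "http-parser"],
    "http-client": ["http-parser", "tls-helper"],
    "template-engine": [],
    "http-parser": [],
    "tls-helper": [],
}


def external_modules(graph: dict, root: str) -> set:
    """Return every distinct module reachable from root, direct or transitive."""
    seen = set()
    queue = deque(graph.get(root, []))
    while queue:
        module = queue.popleft()
        if module in seen:
            continue  # shared dependencies count once, and cycles terminate
        seen.add(module)
        queue.extend(graph.get(module, []))
    seen.discard(root)  # the service itself is not an external module
    return seen


modules = external_modules(DEPENDENCIES, "my-service")
print(len(DEPENDENCIES["my-service"]), len(modules))  # 2 5
```

In this toy graph, two declared dependencies turn into five modules to scan, patch, and trust. The gap between the first number and the second is the part an adoption dashboard never shows.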

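Either dial on its own is easy to game: a verbose generator inflates coverage, and a team can shrink its dependency count by having a model rewrite what a well-maintained package already does. That's why they're only useful read together.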

The target is a generated core with few dependencies

The combination you want is high Deterministic Coverage with low Supply Chain Risk: a generated, predictable foundation on a small external surface you control.

The opposite default — letting the model build everything — leaves little generated and a dependency tree that keeps widening. Every release then costs more: external code to review, and edge cases the model guessed at to track down.

It isn't slower to build the other way. Generating a foundation from a schema is quick, and the output matches the schema by construction: no model guesswork to debug, no third-party package to learn. Because you own that code, it stays cheap to audit and change as the product grows. The speed and the safety come from the same decision.

The architectural ceiling

One warning before you start moving these dials: you can only move them as far as your architecture allows. If your foundation needs a heavy framework to do something basic, telling your team to "use fewer dependencies" won't move the risk dial. If your foundation resists clean code generation, coverage stays low no matter how disciplined the team is.

The ceiling on both numbers is set by choices made before any code — or any model — was involved. That's its own conversation, and the subject of the next post.

Where this is headed

Expect "how much of our executing code do we actually own?" to move from an engineering curiosity to a standard question in technical due diligence — the kind of thing that shows up in acquisition reviews and enterprise procurement, next to your security posture. The teams that can answer it with a number will have an easier time than the ones discovering the answer mid-audit.

Two things to take away

If AI makes writing code effectively free, why is your engineering budget still going up? Sit with that one before your next planning cycle.

And a diagnostic you can run this week: take your most important service and count the distinct external modules in its dependency graph, direct and transitive. Then estimate how much of the code your team "wrote" was generated from a schema. You don't need a tool or a target to start — the two numbers, looked at honestly, will tell you where you stand.
