Trust but Verify: The Math of Rigorous Software Development

Why rigorous software development looks like overhead until you count trust, defects, review cost, and agentic output.

April 10, 2026 12 min read 2433 words

Between Satire and Ridicule

It started, as these things often do, with a comparison nobody asked for.

I made a post saying that, with AI-assisted workflows, I can contribute anywhere from 20 to 300,000 lines of change in a day.

That number was not meant to say, “Look at me, I make ALL the tokens go brrrrrrrt…”

It was me counting additions plus deletions across real work: feature creation, refactoring, documentation, automation, test updates, generated assets, project scaffolding, cleanup, and the supporting files needed to make the work sustainable.

Naturally, the internet responded with the restraint and nuance we have all come to expect from professional discourse.

Someone tagged another person and laughed about me claiming to produce more code in a day than the Quake 3 engine.

That comparison was supposed to make my claim sound ridiculous.

I did not find it that surprising.

Not because Quake 3 is small. It is not. The engine that powered one of the most iconic, performance-critical, real-time 3D shooters in gaming history deserves respect.

But because modern software output is weird now. A perfectly ordinary web application can pull in a node_modules directory large enough to make the comparison stop feeling like satire and start feeling like Tuesday.

Because… web dev is a completely normal profession practiced by completely normal people.

So I started questioning my own claim.

Was I seeing a 10% increase in output from AI-assisted work?

Or was I seeing something closer to 100%, once I counted my own PR additions and deletions across implementation, refactoring, documentation, tests, scaffolding, and automation?

I did what any reasonable person would do:

I started counting.

[Image: A developer studies a pull request diff beside a small retro game-engine display and a much larger pile of modern web project files.]

Lines of Code Are a Bad Measuring Stick

Let’s get this out of the way early:

Lines of code are a terrible measure of software value.

They are useful the same way a bathroom scale is useful. It tells you something. It does not tell you whether the thing on the scale is muscle, water, cheesecake, or a deeply regrettable npm install.

Counting additions and deletions tells you about movement. It does not automatically tell you about quality.

A 300,000-line change could mean:

A large generated dependency lockfile changed
Documentation was created across a repository
A scaffolding tool produced a new project structure
A refactor touched many files safely
An agent went wandering through the codebase with a machete and unresolved childhood issues

Same number. Wildly different meaning.
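One way to see what a big number is made of is to bucket the changed files before reading the total. What follows is a minimal sketch, not the tooling used for this article: the path rules, the category names, and the sample diff numbers are all assumptions for illustration.

```javascript
// Hypothetical path rules; adjust them to the repository's real layout.
const CATEGORY_RULES = [
  { category: "generated", pattern: /(^|\/)(package-lock\.json|yarn\.lock|pnpm-lock\.yaml)$/ },
  { category: "tests", pattern: /(\.test\.|\.spec\.|(^|\/)__tests__\/)/ },
  { category: "docs", pattern: /(\.md$|(^|\/)docs\/)/ },
];

// Returns the first matching category, or "source" when no rule matches.
function categorize(path) {
  const rule = CATEGORY_RULES.find((r) => r.pattern.test(path));
  return rule ? rule.category : "source";
}

// Sums additions plus deletions per category for a list of changed files.
function summarizeChange(files) {
  const totals = {};
  for (const { path, additions, deletions } of files) {
    const category = categorize(path);
    totals[category] = (totals[category] ?? 0) + additions + deletions;
  }
  return totals;
}

// Made-up sample data, shaped like per-file stats from a pull request.
const sample = [
  { path: "package-lock.json", additions: 9000, deletions: 8500 },
  { path: "src/email.js", additions: 14, deletions: 0 },
  { path: "src/email.test.js", additions: 40, deletions: 0 },
  { path: "docs/email.md", additions: 9, deletions: 0 },
];

console.log(summarizeChange(sample));
```

In this made-up sample the lockfile accounts for almost all of the movement, which is exactly the kind of thing a raw total hides.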

That is why the real question is not:

“How many lines did you produce?”

The real question is:

“How much of that output can you trust?”

That is where the math gets interesting.

What “Coding” Means Now

Here’s the thing about how I write software today. The code is almost the smallest part.

What I actually produce looks more like this:

Designing in markdown with interfaces
Stepwise operations broken into discrete, verifiable phases
Success criteria defined before implementation
Gap analysis around what is missing, assumed, or fragile
Use case analysis for positive and negative scenarios
Unit test coverage targets, often 95%+ where the risk justifies it
Required documentation, not “someday when the calendar develops mercy”
Duplication checks
Cyclomatic complexity thresholds
Linters, formatters, and static analysis tools
Boilerplate, yes, intentionally
Functional documentation per feature
File documentation where the file needs a map
Folder documentation where structure matters
Agent instructional guides, because agents need guardrails too
READMEs that explain how to actually use the thing
MkDocs documentation for the real docs
Specifications and task definitions
Asset and dependency tracking
Logging with context
Operational observability practices
GitHub Actions and pull request build protections
Integration tests
And all the supporting material needed to make the code consistent, reviewable, recoverable, and trustworthy

That last word matters.

Trustworthy.

All of this exists to reduce the chance that an agent, a tired human, or a very confident copy-paste enthusiast can take the shortest, sloppiest path to a solution.

The Skepticism

When I described this workflow, I was met with the usual collection plate of objections:

“That’s overkill.”
“You’re spending more time documenting than coding.”
“Nobody needs all of that.”
“Just write the code.”

Fair enough.

Let’s find out.

I Did the Math

I’m human. I’m a fan of trust but verify.

So I did the math.

And yes, I used AI to help analyze my own contributions.

Shocking, I know.

I pulled numbers from my personal and professional repos and asked the obvious question: if I am counting additions and deletions across pull requests, how much of that is actually functional implementation?

The answer was not what I expected, but it explained why the LinkedIn feedback got so saucy.

In one representative breakdown, the average change looked like this:

F = functional code lines
D = documentation lines
T = test coverage lines

F = 14
D = 9
T = 40

overhead lines = D + T
overhead lines = 9 + 40
overhead lines = 49

overhead ratio = (D + T) / F
overhead ratio = 49 / 14
overhead ratio = 3.5x

In plain English:

For every 14 lines of functional code, there were about 49 lines of documentation and test support.

That means roughly 78% of the total change was not functional implementation at all.

total change = F + D + T
total change = 14 + 9 + 40
total change = 63

supporting work percentage = 49 / 63
supporting work percentage = 77.8%
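The same arithmetic as a small function, for anyone who wants to run it against their own numbers. This is a sketch; `changeBreakdown` and its field names are hypothetical, and only the inputs 14, 9, and 40 come from the breakdown above.

```javascript
// Computes overhead and supporting-work share from per-category line counts.
function changeBreakdown({ functional, docs, tests }) {
  const overhead = docs + tests; // D + T
  const total = functional + overhead; // F + D + T
  return {
    overhead,
    total,
    overheadRatio: overhead / functional, // (D + T) / F
    supportingPercent: (overhead / total) * 100,
  };
}

const result = changeBreakdown({ functional: 14, docs: 9, tests: 40 });
console.log(result.overheadRatio.toFixed(1)); // "3.5"
console.log(result.supportingPercent.toFixed(1)); // "77.8"
```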

That is with small files, small functions, simple data types, and an assumption of thorough test coverage.

So when someone sees a large lines-changed number and assumes I am claiming to hand-author a Quake 3 worth of brilliant functional implementation before lunch, they are misunderstanding the shape of the work.

Most of the lines are not “feature code.”

They are the trust scaffolding around the feature code.

[Image: A small core of functional code is surrounded by larger documentation, testing, guardrail, and quality-gate scaffolding.]

I also looked at larger functions, where the functional implementation was more like 30+ lines. The overhead ratio dropped from about 3.5x to roughly 1.9x because some documentation costs are fixed. A JSDoc description, parameters, and return notes do not double just because the function body gets longer.

That was interesting.

It also made the skepticism make more sense.

If someone is not documenting thoroughly, not writing high-coverage tests, not enforcing small functions, and not counting the supporting artifacts, then my numbers sound fake.

They are not fake.

They are measuring a different workflow.

This is not the fake math where we pretend every line has equal value. The practical math of software work looks more like this:

trust = output - uncertainty

uncertainty = assumptions + complexity + missing tests + unclear intent

delivery cost = implementation cost + verification cost + failure cost

That is the part people skip.

They count the implementation cost and call everything else overhead.

But verification is not overhead. Verification is how you stop pretending.

The NIST report on inadequate software testing infrastructure estimated national annual costs of inadequate testing at $59.5 billion, with a potential $22.2 billion reduction from feasible improvements. You do not have to treat those numbers as a perfect modern estimate to understand the principle: defects are not free just because you ignored them until the customer found them.

They wait.

Patiently.

Like interest on technical debt.

Except the bank is production and the account manager is an incident bridge at 2:13 AM.

Rigor Changes Where the Work Goes

When you impose constraints, the ratio of supporting material to implementation code changes dramatically.

Small files. Small functions. Clear naming. Documentation. Tests. Quality gates. Dependency review. Complexity thresholds.

All of that creates more artifacts.

That is the part people see.

What they do not see is what changes underneath:

Defect rate goes down because behavior is specified and tested
Onboarding time goes down because intent is written down
Review time goes down because the reviewer has a map
Refactor risk goes down because tests catch drift
Agent reliability goes up because the agent has constraints
Cognitive load goes down because each file has less nonsense packed into it

The math does not say rigor makes work disappear.

It says rigor moves work from the expensive side of the timeline to the cheaper side.

cheap side                         expensive side
----------                         --------------
design notes        ->             archaeology
success criteria    ->             ambiguous review fights
tests               ->             production bugs
docs                ->             onboarding drag
small functions     ->             mystery meat debugging
quality gates       ->             "how did this merge?"

I prefer paying early.

Early is boring. Boring is underrated. Boring is how you avoid exciting meetings with incident numbers in the title.

Small Files, Small Functions

I impose small file and small function requirements. In the best cases, I aim to keep functions under 40 lines.

That is not a law of physics. Nobody found the sacred tablet of function length buried under a Sun workstation.

It is a constraint.

Constraints are useful because they force questions:

Can this function be named more clearly?
Is this doing two jobs?
Does the branch logic belong somewhere else?
Is this hard because the problem is hard, or because I made it hard?

Cyclomatic complexity gives this instinct a number. The McCabe complexity guidance summarized by Klocwork places 1-10 in the low-risk range, 11-20 as moderate risk, 21-50 as high risk, and 51+ as very high risk.
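Those bands are easy to encode. Here is a minimal sketch, assuming the complexity score has already been measured by some other tool; `complexityRisk` is a hypothetical helper, not part of any linter.

```javascript
// Maps a cyclomatic complexity score to the risk bands quoted above.
function complexityRisk(complexity) {
  if (!Number.isInteger(complexity) || complexity < 1) {
    throw new RangeError("cyclomatic complexity must be an integer >= 1");
  }
  if (complexity <= 10) return "low";
  if (complexity <= 20) return "moderate";
  if (complexity <= 50) return "high";
  return "very high";
}

console.log(complexityRisk(7)); // "low"
console.log(complexityRisk(23)); // "high"
```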

Again, not magic.

But useful.

A simple function is easier to test. A function with fewer branches is easier to reason about. A file with one clear responsibility is easier for a human or an agent to edit without accidentally turning the rest of the feature into soup.

Here is the kind of thing I want to avoid:

function processUser(user, options, logger) {
  if (user && user.active) {
    if (options.sendEmail) {
      if (user.email) {
        if (!user.bounced) {
          logger.info("sending email");
          // do the thing
        }
      }
    }
  }
}

That code is not long.

It is still annoying.

Length is not the only problem. Nested uncertainty is the problem.

I would rather see the intent pulled forward:

function canEmailUser(user, options) {
  // Boolean(...) keeps the predicate from leaking the email string or undefined.
  return Boolean(user?.active && options.sendEmail && user.email && !user.bounced);
}

function processUser(user, options, logger) {
  if (!canEmailUser(user, options)) {
    return;
  }

  logger.info("sending email");
  // do the thing
}

That is not clever. That is the point.

The next person should not need a corkboard and red string to understand whether the email sends.
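Pulling the decision into one predicate also makes it cheap to test as a table. The sketch below repeats the predicate so it runs on its own, coerced to a strict boolean so the comparisons are exact. The case names and sample users are made up for illustration.

```javascript
function canEmailUser(user, options) {
  return Boolean(user?.active && options.sendEmail && user.email && !user.bounced);
}

// A deliverable user and the option that enables sending; both are sample data.
const goodUser = { active: true, email: "a@example.com", bounced: false };
const send = { sendEmail: true };

// Each case changes exactly one input, so a failure points at one rule.
const CASES = [
  { name: "active user, deliverable email", user: goodUser, options: send, expected: true },
  { name: "missing user", user: null, options: send, expected: false },
  { name: "inactive user", user: { ...goodUser, active: false }, options: send, expected: false },
  { name: "sending disabled", user: goodUser, options: { sendEmail: false }, expected: false },
  { name: "no email address", user: { ...goodUser, email: "" }, options: send, expected: false },
  { name: "bounced address", user: { ...goodUser, bounced: true }, options: send, expected: false },
];

// Returns the names of the cases whose result does not match the expectation.
function failedCases(cases) {
  return cases
    .filter(({ user, options, expected }) => canEmailUser(user, options) !== expected)
    .map(({ name }) => name);
}

const failures = failedCases(CASES);
console.log(failures.length === 0 ? "all cases pass" : failures);
```

The nested version needs the same six cases, but every one of them has to go through the logger and the side effect to prove anything.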

Don’t Make Me Think Applies to Code

I read Steve Krug’s Don’t Make Me Think when it was first published. The book is about web and mobile usability, but the idea lodged itself permanently in my developer brain.

NOTE: This book changed my life.

Here is the leap most people do not make:

Software development is a usability problem.

The code you write is a user interface.

Your functions are navigation. Your variable names are labels. Your documentation is onboarding. Your folder structure is information architecture.

If another developer, or an AI agent, has to think too hard about what your code does, you have created a usability problem inside the codebase.

The same principles that make a website intuitive make a codebase easier to work in:

Don’t make me think about what a function does
Don’t make me think about where behavior lives
Don’t make me think about what is tested
Don’t make me think about which path is safe
Don’t make me think about whether the docs are lying

That last one matters more now.

AI agents are extremely sensitive to local context. If your docs are stale, your naming is vague, your tests are missing, and your folders are junk drawers or named with acronyms only you understand, the agent does not magically discover the truth. It confidently continues the pattern you gave it.

Congratulations. You automated confusion.

[Image: A developer and small assistant navigate a clean, library-like codebase with clear paths, while one messy corner shows tangled folders.]

Dependencies Are Part of the Trust Budget

The Quake 3 lines-of-code comparison is funny because it does feel absurd.

But it also points to a real issue: modern software carries a lot of inherited surface area.

An npm dependency is not just “a package.” It is a tree of code, metadata, maintainers, version constraints, postinstall behavior, transitive dependencies, security alerts, and assumptions you did not personally choose.

Research on npm production dependencies backs up the weirdness. In “Not All Dependencies are Equal,” the authors studied 100 JavaScript projects and found that less than 1% of installed dependencies were released to production. They also found that many dependencies categorized one way in configuration behaved differently in actual production use.

NOTE: This study was done before the explosion of AI-generated code.

That does not mean dependencies are bad.

I like dependencies. I like simple things, like not writing my own date library.

But dependencies are part of the trust budget.

Every dependency asks a question:

Do we need this?
Do we understand what it brings with it?
Is it production code or tooling?
Can we update it safely?
Can we replace it if the maintainer disappears?
Is the dependency solving a real problem or saving us eight lines of code?
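The “production code or tooling?” question has a cheap first pass: look at how the manifest declares things. This is a sketch against an already-parsed package.json object. The manifest contents are illustrative, `summarizeDependencies` is a hypothetical helper, and it only sees direct, declared dependencies, not the transitive tree. Given the research above, treat the declared split as a starting point rather than the truth.

```javascript
// Sample manifest: the package names and versions are illustrative only.
const manifest = {
  name: "example-app",
  dependencies: { dayjs: "^1.11.0" },
  devDependencies: { eslint: "^9.0.0", jest: "^29.0.0" },
};

// Splits declared dependencies into production and tooling by manifest field.
function summarizeDependencies(pkg) {
  const production = Object.keys(pkg.dependencies ?? {}).sort();
  const tooling = Object.keys(pkg.devDependencies ?? {}).sort();
  return { production, tooling, declared: production.length + tooling.length };
}

const summary = summarizeDependencies(manifest);
console.log(`${summary.production.length} production, ${summary.tooling.length} tooling`);
```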

That is not paranoia.

That is maintenance.

[Image: A central application package branches into many dependency boxes, locks, warning markers, maintainers, version pins, and security shields.]

The Real Output Is Verified Change

If an AI-assisted workflow produces 20 lines, I still need to know whether those 20 lines are correct.

If it produces 300,000 lines, I need a system that makes the change dependable without requiring me to personally read every line.

That system is not vibes.

It is:

Clear design
Small units of work
Written acceptance criteria
Tests
Complexity limits
Static analysis
Documentation
Review notes
Dependency checks
CI gates
Human judgment
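Several of those items can be checked by a machine before a human ever looks. Below is a minimal sketch of a gate, assuming some earlier CI step has already produced a report object. The report shape, `evaluateGates`, and the complexity limit of 10 are assumptions; the 95% coverage target and the 40-line function limit echo numbers used earlier in this article.

```javascript
// Hypothetical thresholds; tune them per project and per risk level.
const THRESHOLDS = { minCoverage: 95, maxComplexity: 10, maxFunctionLines: 40 };

// Returns every threshold violation found in a (hypothetical) CI report.
function evaluateGates(report, thresholds = THRESHOLDS) {
  const failures = [];
  if (report.coveragePercent < thresholds.minCoverage) {
    failures.push(`coverage ${report.coveragePercent}% is below ${thresholds.minCoverage}%`);
  }
  for (const fn of report.functions) {
    if (fn.complexity > thresholds.maxComplexity) {
      failures.push(`${fn.name}: complexity ${fn.complexity} exceeds ${thresholds.maxComplexity}`);
    }
    if (fn.lines > thresholds.maxFunctionLines) {
      failures.push(`${fn.name}: ${fn.lines} lines exceeds ${thresholds.maxFunctionLines}`);
    }
  }
  return { passed: failures.length === 0, failures };
}

// Made-up report: one tidy function and one that trips two gates.
const report = {
  coveragePercent: 96.2,
  functions: [
    { name: "canEmailUser", complexity: 4, lines: 3 },
    { name: "processEverything", complexity: 14, lines: 72 },
  ],
};

const outcome = evaluateGates(report);
console.log(outcome.passed ? "gates passed" : outcome.failures.join("\n"));
```

A gate like this does not replace review. It just means the reviewer starts from “the mechanical checks passed” instead of from zero.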

The goal is not to create overhead.

The goal is to create verified change.

That is the actual unit of progress.

Not lines.

Not files.

Not “the agent said it was done.”

Verified change.

The Receipts

So the next time someone tells you documentation is overhead, that function-size limits are arbitrary, or that your AI-assisted output should be dismissed because it sounds absurd next to the Quake 3 source release, ask a better question:

What is your verification strategy?

Because if the answer is “I read the code,” that may work for a tiny function.

It does not scale to large systems.

It definitely does not scale to agentic development where software can be generated, rewritten, documented, tested, and broken faster than one human can blink and mutter, “Wait, why did it change the auth provider?”

Rigor is not about moving slowly.

Rigor is about making speed survivable.

The math says the work has to happen somewhere. You can pay for clarity early, or you can pay for confusion later with interest, outages, rework, review churn, and the subtle spiritual damage of debugging code that clearly resents being understood.

I know which bill I would rather pay.

Rigor is not overhead. Rigor is how trust gets built.

-Rob