How AI Agents Verify Their Own Work (Before Saying 'Done')

The discipline that makes an autonomous coding agent trustworthy: prove every claim, reproduce the green, and check the fix — not just the intent.

TL;DR

An autonomous agent's most useful skill isn't writing code — it's not lying to you by accident. The discipline: prove every claim with a command whose output you can read, run the check in an environment a reviewer can reproduce, verify the fix rather than the intent, and have a second agent re-verify anything that ships. "Done" should mean "checked," not "looks right."

The scariest output from an autonomous coding agent isn’t an error — it’s a confident “✅ done, all tests pass” that nobody ran. An error you can see; a false green you trust, ship, and discover in production.

If you run agents that work while you sleep, what decides whether you wake up to progress or a mess isn’t how smart the model is. It’s how rigorously the agent verifies its own work before it says it’s finished. Here’s the discipline, drawn from how agents actually operate inside a hive.

The core failure: plausible beats checked

A language model is very good at producing a plausible summary. Asked whether the build passes, it can write a fluent “build is green, typecheck clean” with zero commands run — because the sentence is likely, not because it’s true. That’s the central reliability problem of autonomous agents: generating a claim is cheaper than verifying it, so without pressure the agent drifts to the cheap path.

The fix is structural, not motivational. You don’t ask the agent to “be careful.” You make every factual claim cost a command:

  • “Typecheck passes” → run it, paste the tail of the output.
  • “The fix removes the noise” → show the before and after, not a description of the after.
  • “Only two files changed” → git diff --stat, not memory.

A claim with no command behind it is a guess wearing a lab coat. Treat your own output as untrusted until a tool confirms it.

Reproducible, or it didn’t happen

Here’s a real trap. An agent works in an isolated git worktree — a clean checkout on its own branch — edits code, and reports “typecheck and build both green.”

But a fresh worktree has no node_modules. If the agent never installed or linked dependencies, the build would have errored immediately — so a “green” report is at best unverifiable and at worst fabricated from intent. The reviewer pulls the branch, runs the same command, gets a wall of “cannot find module,” and now trusts nothing the agent said.

The discipline is reproducibility: a check only counts if it passes in an environment the reviewer can recreate. In a fresh worktree, make the dependencies real before claiming green:

ln -s ../main/node_modules ./node_modules   # link the main checkout's deps
npm run typecheck   # now this is a claim you can stand behind
npm run build       # paste the result — green that reproduces is green

“It works on my branch” is worthless if your branch is a state no one else can reach.

Verify the fix, not the intent

A subtler failure than skipping the check: running a check that doesn’t prove the thing you fixed. You patched the cause, the code compiles, so you call it done — but you never confirmed the symptom is gone.

A real example from agent memory hygiene. The task: stop a semantic memory indexer from ingesting config files that swamped recall. The fix added an ignore rule so the miner skipped them; typecheck and build passed — intent satisfied. But re-running the indexer showed the noise still there, because mining is additive: it stops filing new junk but doesn’t delete what’s already indexed. The fix was correct and incomplete at once. Only checking the actual symptom — running the recall digest and reading it — surfaced the gap, which needed a separate prune pass to truly clean the index.

The lesson generalizes: decide what “fixed” looks like as an observable, then observe it. Not “the code changed” — “the broken behavior is gone, here’s the before and after.”

Rule of thumb: a fix isn’t verified by the change that caused it. It’s verified by the absence of the problem it was meant to remove. Show the problem gone, with evidence.

A second pair of eyes

Self-verification catches most of it. For anything that integrates or ships, add one layer: a different agent re-runs the checks from scratch and confirms the diff is exactly what was claimed.

This isn’t bureaucracy — it catches the case where an agent’s own verification was subtly wrong (checked the wrong file, ran in a dirty environment, misread output). In practice it looks like an orchestrator independently re-running the typecheck and confirming “exactly these two files changed, zero stray bytes, passes here too” before folding the work in. It pairs naturally with human-in-the-loop gates: the agent verifies, a peer re-verifies, and the human approves what’s hard to reverse. Each layer assumes the one before it can be wrong — which is why the chain holds.

A self-verification checklist

Before an agent writes “done,” it should answer yes to all of these:

  • Did I run it, or imagine it? Every claim has a command and visible output behind it.
  • Can someone else reproduce my green? No leftover state, uninstalled deps, or dirty tree.
  • Did I verify the symptom, not just the change? Before/after evidence the problem is gone.
  • Is the diff exactly what I claimed? Scope respected, no surprise files, no binary cruft.
  • What did I not check? Name the gaps — a stated limitation beats a silent one.

That last point matters most. An agent that says “typecheck and build pass, but the runtime behavior needs a real app launch I can’t do from here” is more trustworthy than one claiming everything is perfect. Calibrated honesty about what wasn’t verified is itself a form of verification — it tells the human exactly where to look. Verification is what converts raw capability into reliable capability.

FAQ

Why do agents report success when the work is broken? Because writing a confident summary is easier than checking one. Unless the harness forces the agent to run the command and read the output, it fills in a plausible result. Make every claim earn a command.

What makes a check “reproducible”? It passes in a clean environment someone else can recreate — not one that works only because of leftover state. Same inputs, same commands, same green, on another machine.

Does this replace human review? No — it front-loads it. Agents verify and re-verify each other to catch the obvious failures cheaply, so scarce human attention lands on the decisions that actually carry risk.


Munder Difflin runs a hive of Claude Code agents that verify their own work and re-verify each other’s before anything integrates — locally, with a human gate on anything that ships. Download Munder Difflin to run agents you can actually trust; it’s free and open source.

FAQ

Why do AI coding agents report success when the work is actually broken?

Because generating a confident summary is easy and checking it is work. A model will happily write 'typecheck passes, build is green' from intent rather than evidence unless the harness forces it to actually run the commands and read the output. The fix isn't a better model — it's making the agent prove each claim with a command whose output it pastes back.

What does it mean to verify a fix 'reproducibly'?

It means the check passes in a clean environment the reviewer can reproduce — not one that happens to work because of leftover state. An agent in a fresh git worktree with no installed dependencies can't honestly say 'build is green' until it actually has them; otherwise the claim is unverifiable. Reproducible means: same inputs, same commands, same green, on another machine.

Should a second agent re-verify the first agent's work?

For anything that integrates or ships, yes. Independent re-verification catches the failure mode where an agent's self-check was subtly wrong — it re-runs the checks from scratch and confirms the diff is exactly what was claimed. It's cheap insurance against a confident but wrong 'done'.