Correctness In An Agentic World
Agents are going to be writing most of the code we deploy. The days of writing code by hand are coming to an end, faster than most of us are ready for.
I'll argue that techniques we've historically treated as exotic - formal methods, property-based testing, mechanically-checked specifications - need to become table-stakes now, because the economics has flipped in their favor.
From Assembly to High-Level Languages
When higher-level languages came along, people were told to stop writing assembly. The transition took time. Compilers had bugs and sometimes generated incorrect code. Hand-crafted assembly was - and often still is - faster than what compilers emit. The transition still happened, over decades, with craftspeople threading the needle by keeping assembly skills while adopting higher-level languages. As compilers got better, eventually almost everyone switched. Higher-level languages started developing tricks of their own - JIT compilation, garbage collection, generational heaps - that were technically possible in assembly but only became practical at the higher level of abstraction. The pattern was roughly:
- Phase 0: Everyone writes assembly. A few funny people use high-level languages, play with various runtimes - Prolog, symbolic systems, things considered fringe.
- Phase 1: People start writing in high-level languages and are dismissed as unserious by others. Old-timers worry the newcomers won't know how to code. They argue we should spot-check compiler output for correctness, and that we should all maintain assembly skills.
- Phase 2: A lot of programming-language research effort goes into proving the code compilers generate is correct, or at least reasonably correct for most cases.
- Phase 3: People mostly write high-level languages. Compilers and interpreters are largely correct. Nobody believes anymore that writing in a high-level language is an odd choice. It's perfectly fine to not know assembly and be proficient as an engineer.
This took roughly fifty years, rate-limited by the supply of programming-language researchers, the commercial appetite to invest in that work, and the fact that proving these things is genuinely hard and slow. Each generation absorbed the previous generation's heretics1.
I argue we're at the same spot now, on a parallel arc - except this time it's compressed: fifty years squeezed into a few months.
Natural Language
We've moved up another level. We now write in EHL (even-higher-level) languages: natural language. We expect LLM agents to translate EHL into HL (Go, Rust, Python, TypeScript). And we don't really trust the output.
As craftspeople, we're behaving the same way we did during the last order-of-magnitude jump:
- we require people to verify the output of the LLM agents
- we don't trust the output by default
- we worry for the youngsters who won't have the opportunity to learn to code (in HL languages)
- the concept of "let LLMs write code" is transitioning from fringe to reasonably accepted
- frontier EHL purists do weird stuff that everyone else finds baffling2
The good news is that, the same way going from assembly to HL gave us spare mental cycles to develop new concepts (JIT, GC, generational heaps), going from HL to EHL gives us spare cycles — and incidentally, more man-power — to raise the bar on correctness. The question is: how do we trust that EHL compiles to valid HL? How do we solve the correctness worries from agentic coding?
Software Engineering Methods
In the last two to three decades, TDD and the surrounding discipline argued - in different registers - that testing is a first-class artifact in software generation.3 The zealots overshot and the pendulum swung back. It's suddenly relevant again, because if you don't pin agents down with a real test harness, they will write plausible nonsense and walk away whistling.
But TDD was never the end-all of testing. Since the 1940s, we've known that properties of programs can be proven. Tony Hoare published "An Axiomatic Basis for Computer Programming" in 1969. "Impractical for real software," people said, for decades.
The typical unit test walks one path through program state.4 Golden tests, table tests, unit tests, integration tests - variations of the same idea. They prove what is known is known. They never prove anything about unknowns.
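To make the contrast concrete before the property-based version below, here's a minimal sketch of that shape - an ordinary Go table test over sort.Ints (the package and case names are illustrative). Each case is one fixed walk; the test says nothing about inputs it doesn't list.

package sortexample

import (
    "reflect"
    "sort"
    "testing"
)

// One walk per case: hand-picked inputs, pinned outputs, nothing beyond them.
func TestSortKnownCases(t *testing.T) {
    cases := []struct {
        name string
        in   []int
        want []int
    }{
        {"already sorted", []int{1, 2, 3}, []int{1, 2, 3}},
        {"reverse order", []int{3, 2, 1}, []int{1, 2, 3}},
        {"duplicates", []int{2, 1, 2}, []int{1, 2, 2}},
    }
    for _, tc := range cases {
        t.Run(tc.name, func(t *testing.T) {
            got := append([]int(nil), tc.in...)
            sort.Ints(got)
            if !reflect.DeepEqual(got, tc.want) {
                t.Errorf("sort.Ints(%v) = %v, want %v", tc.in, got, tc.want)
            }
        })
    }
}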
Property-based testing5 (PBT) does random walks through the state space - generating plausible ranges of input values and verifying the program never violates specific properties. Expressible essentially identically in Go (rapid), Rust (proptest), TypeScript (fast-check).6 Each draws hundreds or thousands of random vectors and shrinks failures to minimal counterexamples. Strictly stronger than a unit test.
In Go, it looks like this:
package sortprops

import (
    "sort"
    "testing"
    "github.com/stretchr/testify/require"
    "pgregory.net/rapid"
)

// Property: sorting is idempotent - a second sort of already-sorted data is a no-op.
func TestSortIsIdempotent(t *testing.T) {
    rapid.Check(t, func(t *rapid.T) {
        v := rapid.SliceOf(rapid.Int()).Draw(t, "v")
        once := append([]int(nil), v...)
        sort.Ints(once)
        twice := append([]int(nil), once...)
        sort.Ints(twice)
        require.Equal(t, once, twice)
    })
}
The invariant: sorting is idempotent.
Sorting an already-sorted slice doesn't change it: sort(sort(v)) = sort(v). This test runs thousands of times with varied inputs and checks that the invariant always holds. When it fails, rapid shrinks the failing input to a minimal counterexample that still reproduces the violation.
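To see shrinking in action, here's a minimal sketch of a deliberately false property (reusing the rapid import from the test above; the 1000 threshold is an arbitrary illustration). rapid finds a violating draw and then shrinks it, reporting a counterexample at or near the smallest failing value:

func TestEveryIntBelowThousand(t *testing.T) {
    rapid.Check(t, func(t *rapid.T) {
        // False claim: every generated int is below 1000.
        x := rapid.Int().Draw(t, "x")
        if x >= 1000 {
            t.Fatalf("counterexample: %d", x)
        }
    })
}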
PBT has its own heretic history7. John Hughes and Koen Claessen published "QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs" at ICFP 2000. For years it was treated as Haskell-people-are-weird, until it spread to every mainstream language.
For real proof of correctness, you need formal methods. Historically reserved for the most critical systems only, because of the onerous cost. But for what is specified, you have a much stronger guarantee than "the program is probably correct, I guess." A TLA+ spec8 is a state machine, an invariant, and a theorem the model checker verifies across every reachable state - not just one walk, every walk. That's the difference between a test and a proof.
TLA+ has its own weirdos' history. Leslie Lamport developed it in the early 1990s. Niche academic exercise for two decades - until Newcombe et al., 2015 documented how AWS used TLA+ to find bugs in DynamoDB, S3, and EBS before deploying them. Twenty-year gap from publication to industrial breakthrough. The pattern keeps repeating.9
The pattern is not "weirdos are always vindicated." It's narrower and more useful: techniques that look wasteful, niche, or absurd at year zero often become rational once the surrounding abstraction level and economics have moved. GC was wasteful on a PDP-1; on a server with 64 GB of RAM it's not even a question. The technique didn't change. The cost of not using it did.
The compression is the new variable. The arc from Hoare's 1969 paper to AWS-on-TLA+ took roughly forty-five years. The next arc should not take forty-five - or twenty, or even five. Agents are running amok in HL code right now. Specs are the next safe haven10, and we should be speedrunning toward it.
The Economics Just Flipped
Three things just happened at once.
1. Productivity stepped up. Agents are an order-of-magnitude shift in how fast code gets written. If you've worked with a good agent loop for a week, you've felt this directly.
2. Quality risk stepped up with it. Agents hallucinate, drift, and write plausible but subtly wrong code. Without compensating controls, code quality regresses as agents take more of the keyboard.
3. The methods that compensate for #2 just got cheap.
Formal methods were never expensive in compute11. They were expensive in human attention. The Anvil paper reports a ~4.5x proof-to-code ratio - for a human to author the proofs and iterate them to closure. That's why "use TLA+ on everything" never won industrial adoption despite working in principle for thirty years.
That tax is paid by a different worker now. Verifiers are deterministic, machine-checked, unambiguously green or red. When the proof doesn't close, the verifier tells you exactly which conjecture fails on what state. This is the workflow agents are best at: tight feedback loop, clear signal, iterate until convergence.
And agents don't get bored.
Techniques previously gated by human patience are now gated by compute, which is cheap.
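Reduced to a sketch, the loop looks like this. The Agent and Verifier interfaces here are hypothetical stand-ins, not any real library's API; the point is the shape: propose, machine-check, feed the exact failure back, repeat until green or out of budget.

package agentloop

// Agent and Verifier are illustrative assumptions, not a real API.
type Agent interface {
    ProposePatch(spec, feedback string) (code string)
}

type VerifyResult struct {
    Green             bool
    FailingConjecture string // exactly which conjecture failed, on what state
}

type Verifier interface {
    Check(code string) VerifyResult
}

// Converge drives the loop: propose, machine-check, feed the precise failure
// back, and repeat until the proof closes or the budget runs out.
func Converge(a Agent, v Verifier, spec string, budget int) (string, bool) {
    feedback := ""
    for i := 0; i < budget; i++ {
        code := a.ProposePatch(spec, feedback)
        res := v.Check(code)
        if res.Green {
            return code, true
        }
        feedback = res.FailingConjecture
    }
    return "", false
}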
So the calculus inverts. We pay a fraction of our productivity gain to over-index on correctness, and we come out net positive - bigger reach, higher quality than before. Correctness needs to become table-stakes now, not because we're being purist, but because the math works out in our favor when we do it and against us when we don't.
Trusting Agentic Code
The techniques we need already exist - they were just waiting for a cost-model shift, and the shift just happened.12 So I posit, now:
- For the core logic of any non-trivial program - state machines, invariants that have to hold, bits where "almost right" isn't right - use formal methods. For software-level proofs with LLM-assist: Dafny, Verus, or Lean (Creusot and Kani are complementary Rust picks). For distributed-system design: P or TLA+.
- For the bulk of the rest - anywhere with a meaningful input space - use property-based testing as your default.
- For regression and happy-path persona coverage - keep traditional tests, but be honest: they're not your test of correctness. They're your test of presentation.
- For spec-driven development - also author a TLA+ spec for the load-bearing parts. Marginal cost small, payoff large.
This is no longer optional. The methods exist. The cost just dropped. The productivity gain from agents both pays for the investment and demands it.
Our Safe Haven - for now
The HL kingdom is being colonized by agents. We don't fight that on their turf - we retreat upward, to the layer above the code: natural language, product specs, formal specs, invariants, domain models, personas. The things that used to be flimsy (informal product specs, user stories) and the things that used to be costly (formal methods, advanced testing patterns) are both load-bearing now, for the same reason - they're how we stay legible to ourselves and keep enforcement leverage over what the agents are doing down in HL-land.
The skills that matter are the ones that let you operate that layer: writing user stories well enough that they actually constrain, deriving domain models, identifying invariants, deciding where to test vs. verify vs. prove, designing the spec tightly enough that an agent can't weasel around it. If you don't know how to do any of this - write user stories, derive domain models, find invariants, assemble the lot into layers of abstraction worth testing - your lunch is over, and someone ate your sandwich.
- 1Grace Hopper designed FLOW-MATIC (1955-59) and shipped a compiler that read English-shaped statements, over objections that "computers can't understand English". John Backus led IBM's FORTRAN (1957); the assembly priesthood said compiled code couldn't match hand-written performance; the 704 demonstrated otherwise. John McCarthy designed LISP (1958) and shipped garbage collection in production for the first time; GC was dismissed as impractical for "real" work for decades. Dijkstra, "Go To Statement Considered Harmful" (1968) - structured programming as fringe before it became table-stakes. Robin Milner designed ML (1973) with type inference, "well-typed programs can't go wrong" - academic exotica then, design ancestor of every typed language now. The arc - FORTRAN ('57) to C ('72) to C++ ('85) to Java ('95) to Go ('09) to Rust ('10) - was rate-limited by researcher supply, commercial appetite, and the fact that proving these things is hard.
- 2Steve Yegge (Gas Town, Gas City), Geoffrey Huntley (ralph loops), Simon Willison's running commentary.
- 3Kent Beck, TDD: By Example (2002); Fowler, Refactoring (1999); McConnell, Code Complete (2nd ed 2004); Dan North, "Introducing BDD" (2006); Gojko Adzic, Specification by Example (2011).
- 4Example-based testing samples the input space at one fixed point. Coverage metrics measure what was walked, not what's correct. Dijkstra's NATO 1969 line is canonical: "Program testing can be used to show the presence of bugs, but never to show their absence!" (report, elaborated in EWD249). The "every walk" guarantee comes from model checking - exhaustive reachable-state exploration up to a bound.
- 5Family members: metamorphic testing - relations between outputs of related inputs (ML, scientific code); differential testing - two implementations, same inputs, compare (compilers, refactors, ports); stateful/model-based PBT - sequences of operations against a model (rapid.StateMachine, proptest-state-machine, quickcheck-state-machine); concolic execution (KLEE, SAGE); mutation testing - measure test strength. Fuzzing is the adversarial-generator variant. None of them can prove a program correct except for the most trivial - combinatorics blows up fast.
- 6Where I think agentic coding consolidates. TypeScript: structural types make compile-time verification the cheapest correctness buy at the human-to-agent boundary. Rust: Verus and the type system make proof-carrying code feasible; the ergonomic tax that slowed humans is what agents pay easily. Go: fast compile/vet/test loop matters when driving an agent; vet/staticcheck/race-detector/fuzzing are mature.
- 7John Hughes and Koen Claessen, "QuickCheck" (ICFP 2000). For years treated as Haskell-people-are-weird, until it spread to every mainstream language: rapid (Go), proptest (Rust), fast-check (JS/TS), hypothesis (Python).
- 8A tiny TLA+ spec:
  ---- MODULE Counter ----
  VARIABLES count
  Init == count = 0
  Next == count' = count + 1
  Spec == Init /\ [][Next]_count
  NonNegative == count >= 0
  THEOREM Spec => []NonNegative
  ====
- 9Xavier Leroy / CompCert, mid-2000s - formally-verified C compiler, used today in aerospace and crypto. L4.verified / seL4 (2009) - first formally-verified OS microkernel. RustBelt (POPL 2018) - Rust's type system proved sound. Verus and Anvil (OSDI '24, Best Paper) - formally-verified Kubernetes controllers in Rust, currently "academic curiosity" phase.
- 10This safe haven doesn't last forever either. Same compression that ate HL will come for the spec layer - agents deriving specs from goals, verifying them against each other. By then we're closer to Iain M. Banks's Culture than today's software industry, and the question is "how do humans live their best life while AI toils away."
- 11Per-run cost can be nontrivial - TLC runs, Verus proof checks, fuzzing campaigns chew CPU. Economics still work: software is "compile/verify once, ship many times" - marginal distribution cost near zero, per Andreessen's "Why Software Is Eating the World". Throw hardware at verification, optimize for the scarce resource (human attention).
- 12Next order-of-magnitude shift isn't coming in 20-30 years; it's coming in 2-3. Don't sit on your laurels too long. Learn to live on the curve.