May 8, 2026 · updated May 8, 2026

The Author Moved Up the Stack

verbose
milestone
ai-authorship
reflection

I wrote a generator for verbose a few weeks ago. It takes a .intent file — numbered prose describing what the program is supposed to do — and asks a Claude model to translate it into .verbose, the formal source the compiler verifies. If the compiler rejects the result, the script feeds the diagnostic back and asks for a fix. Up to three rounds, then it gives up.
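The loop described above can be sketched in a few lines of Python. This is a minimal sketch, not the actual script: `translate` and `check` are hypothetical callables standing in for the model call and the compiler invocation, and it assumes "three rounds" means three correction attempts after the first try. The status labels are the ones used in the results further down.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumption: "up to three rounds" = three fix attempts after the first try.
MAX_CORRECTION_ROUNDS = 3


@dataclass
class Outcome:
    status: str            # "first_try", "after_corrections", or "failed"
    rounds_used: int       # correction rounds consumed
    source: Optional[str]  # accepted .verbose text, or None on failure


def generate(
    intent: str,
    translate: Callable[[str, Optional[str], Optional[str]], str],
    check: Callable[[str], Optional[str]],
    max_rounds: int = MAX_CORRECTION_ROUNDS,
) -> Outcome:
    """Translate an intent, feeding compiler diagnostics back on rejection.

    translate(intent, previous_source, diagnostic) returns candidate source.
    check(source) returns None if accepted, otherwise the diagnostic text.
    Both are hypothetical stand-ins for the model and the compiler.
    """
    candidate = translate(intent, None, None)
    diagnostic = check(candidate)
    if diagnostic is None:
        return Outcome("first_try", 0, candidate)

    for round_no in range(1, max_rounds + 1):
        # Hand the rejected source and the diagnostic back to the model.
        candidate = translate(intent, candidate, diagnostic)
        diagnostic = check(candidate)
        if diagnostic is None:
            return Outcome("after_corrections", round_no, candidate)

    # Budget exhausted: give up rather than loop forever.
    return Outcome("failed", max_rounds, None)
```

Passing the model and the compiler in as functions keeps the control flow testable without either one being present.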

The point wasn’t the generator. The point was the eval: a way to check whether the language I’d designed for AI authorship could in fact be authored by AI. I had hopes. I’d built around them. Hope isn’t a measurement.

First run on an 8-intent in-repo sample, Sonnet 4.6, came back 8/8 first_try. Encouraging. Also slightly suspicious — the few-shot examples in the prompt overlap stylistically with the eval set, and the model could plausibly be pattern-matching its way through.

So I wrote a hold-out. Ten new intents in domains the repo never touches: forum moderation, tournament prize pools, library stock, meter billing, racing laps, flight status. Each composes patterns the few-shot doesn’t show. The model has to assemble each from INTENT.md rather than syntactic-clone its way to a passing program.

Hold-out, Sonnet 4.6:

first_try         = 9/10
after_corrections = 1/10
failed            = 0/10

Same hold-out, Opus 4.7:

first_try         = 8/10
after_corrections = 2/10
failed            = 0/10
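Pooling the two tables is simple arithmetic. A short Python sketch, with the counts copied from the runs above and `pooled` as a purely illustrative helper name:

```python
from collections import Counter

# Counts copied from the two hold-out tables above.
runs = {
    "Sonnet 4.6": Counter(first_try=9, after_corrections=1, failed=0),
    "Opus 4.7": Counter(first_try=8, after_corrections=2, failed=0),
}


def pooled(results):
    """Sum per-model outcome counts into one tally."""
    total = Counter()
    for counts in results.values():
        total.update(counts)
    return total


totals = pooled(runs)
n = sum(totals.values())  # 20 model-runs in all
for status in ("first_try", "after_corrections", "failed"):
    print(f"{status:<17} = {totals[status]}/{n}")
# first_try         = 17/20
# after_corrections = 3/20
# failed            = 0/20
```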

Across those twenty model-runs, zero failures. Every run that needed correction converged after a single round. Every .verbose file produced was a real solution, not a stand-in and not a stub. The verifier checks declared reads against the AST, termination bounds against the actual operation count, and overflow ranges against interval arithmetic. It rejects what shouldn’t compile.
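For readers who haven't met interval arithmetic, here is the general idea in Python. This is a textbook sketch of the technique, not the verbose verifier's implementation. The `Interval` class, `fits_signed`, and the prize-pool numbers are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed integer range [lo, hi] that a value is known to lie in."""
    lo: int
    hi: int

    def __add__(self, other):
        # Sum of two ranges: add the lower bounds and the upper bounds.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        # With possible negatives, the extremes are among the four corners.
        corners = [
            self.lo * other.lo, self.lo * other.hi,
            self.hi * other.lo, self.hi * other.hi,
        ]
        return Interval(min(corners), max(corners))

    def fits_signed(self, bits):
        """True if every value in the range fits a signed integer of `bits`."""
        return self.lo >= -(1 << (bits - 1)) and self.hi <= (1 << (bits - 1)) - 1


# Hypothetical example: a prize pool computed as entry_fee * players.
entry_fee = Interval(0, 1_000)
players = Interval(0, 1_000)
print((entry_fee * players).fits_signed(32))  # True: at most 1,000,000

# Widen both bounds and the product can exceed 32 bits.
wide = Interval(0, 100_000)
print((wide * wide).fits_signed(32))  # False: up to 10,000,000,000
```

A checker built this way can reject a program on declared ranges alone, without running it.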

I let that result sit for a few days before I tried to interpret it.

The honest size of the signal

Twenty model-runs is not a study. Ten new intents per model is not a benchmark. The INTENT.md document was hand-tuned over months by someone who had the verifier’s diagnostics in front of them while writing it. The few-shot pairs were chosen to span the language surface, with full knowledge of which patterns the model tends to miss. None of that makes the result fake — all of it makes it contextual. The generator works because the rails were laid carefully and the model is good at running on rails.

I want to say that out loud before the rest, because the next part is easier to overstate.

What the result actually shows

Verbose was designed under the assumption that the AI is the author. Every architectural choice — every read declared, every termination bound named, every effect spelled out — was made with “an AI will produce this and a compiler will check it” as the load-bearing user story.

Until a few weeks ago that was a bet. The compiler worked, the verifier worked, hand-written .verbose parsed and checked out. But the loop — .intent written by a human, .verbose written by a model, binary verified by a compiler with no human in the middle — was theoretical. The hold-out closed it. For ten previously-unseen intents, a model produced a .verbose, the compiler accepted it, the binary did exactly what the prose had asked for. The chain held under mechanical pressure.

That’s the architectural claim becoming measurable for the first time. I’d told the language’s story in those terms for over a year. Now there are runs.

Where the human went

The thing I keep coming back to, after the result, is what’s left for me to do.

If the model writes the .verbose, the part of the workflow I used to spend most of my time on is gone. I used to write Rust; later I typed prompts and reviewed what came back. Keystrokes were sometimes mine, sometimes the AI’s. But the responsibility for what the program is supposed to do lived in the same place either way: in my head, partially formalized, half in the code, half in the comments. There was no clean place where the intention was written down.

In the verbose loop, the intention has a file. The .intent. It’s the thing I now spend the bulk of my time on.

This isn’t a smaller job than writing code. It’s a harder one. Code is forgiving — a confused statement compiles, a vague function does something. Prose specifying behavior is unforgiving in a different way. If the .intent is vague, the model produces a .verbose precisely as vague as the prose allowed, and the compiler accepts it, because the compiler doesn’t speak prose. The vagueness ships, and the auditor reading both files sees what was asked for, what was produced, and what wasn’t asked for at all.

Implementation compressed. Specification expanded. I spend less time thinking about how the program does what it does and more time deciding what I actually want it to do, in language careful enough that translation doesn’t have to fill in the gaps. Implementation has the property that the compiler eventually tells you when you got it wrong. Specification has no such backstop. The only person who can tell whether the .intent says what you meant is you, six months later, reading it again.

The double-layer question

There’s a sharper version of this discomfort, and I’m not going to pretend I’ve resolved it.

If the AI writes the .verbose from my .intent, who writes the .intent?

Right now, I do. Sentence by sentence, slow. The slowness is part of the value — the prose is what forces questions about edge cases, about what’s allowed, about the parts of the system where I had a vague notion and now have to commit to a position. The model can’t do this for me, because the prose is the position. If the model writes it, I haven’t taken a position, I’ve nodded at one the model proposed.

But there’s an obvious next step. The model can also write .intent. I describe a system in chat, the model produces numbered prose, I review it. Three minutes, not three hours. The compiler accepts the resulting .verbose. The binary runs.

Did I author this? I don’t know. I selected from candidates the model produced, edited some lines, rejected others, shipped the result. The chain is mechanically traceable to the binary — every claim verifiable, every read named, every effect declared — but the origin of the claim is no longer cleanly mine. It’s mine in the sense that I ratified it. It isn’t mine in the sense that I generated it.

I don’t have an answer. The structure of verbose makes the question askable — the .intent and the .verbose are both readable artifacts, an auditor can compare them, a future-me can read both and form an opinion about what was actually meant. That’s progress over the alternative, where the intention lives nowhere except the author’s head. But askable is not solved.

This is the part of the AI-and-software conversation that gets flattened when it collapses into either “AI replaces programmers” or “AI is just autocomplete.” The interesting case is the one where the human is still meaningfully in the loop but the loop has changed shape: the work has moved from production to direction, from typing to deciding. The contribution is real but harder to point at. The question of authorship gets correspondingly more delicate.

What the numbers don’t say

The result doesn’t show that AI replaces the programmer. The programmer wrote the .intent, the language the AI generates into, and the verifier the AI’s output is checked against. Translation is the fourth job, and the only one the AI does. It does that one competently, but the other three are upstream of it.

It doesn’t show verbose is the only safe AI codegen target. Plenty of code is fine being generated into Rust or Go and reviewed by humans. Verbose is for the slice where the gap between generated and trusted needs to be mechanical.

Ten intents don’t prove the language. Ten is a signal. The compiler is the part I trust most. The model is the part still moving.

What it does show is that the loop closes. You can write down what you meant, ask a model to express it formally, have a compiler verify the formal expression against axioms you control, and end up with a binary whose behavior is exactly what your prose asked for. That round-trip wasn’t possible last year. It’s possible this week. Whether it’s possible consistently is what the next runs are for.

A small admission

I built the language for this. Designing under the assumption that the AI would be the author was the bet from day one. The numbers are the first measurement suggesting that the bet was sound on its own terms.

I’m relieved. That relief is suspicious — it’s what you feel when something you wanted to be true turns out to be true on a small sample, and the next sample could go either way. So I’m watching it. The signal is there. The question of what it means for the human is bigger than the result that surfaced it, and I haven’t figured that part out.