May 12, 2026

Test Isolation Is a Feature

lithair
testing
debugging
vision

The other evening, I added a single line to the BDD harness — Stdio::piped() on the leader process — and for the first time in twenty-six days the leader had a voice. Up to that point every cluster scenario that involved more than one POST had been hanging, the client reading until it hit a 30-second timeout, the report coming back as “connection drops.” I’d been working around it. Filing it under flaky. Telling myself I’d get back to it.

The stderr came back and the first line said WAL replay: restored 40 entries, commit_index=0.

Forty entries. From a previous run. The thing I’d been chasing for nearly a month wasn’t a network problem. The leader wasn’t hanging on the request — it was hanging on entry index 19. Which had been written to disk weeks ago, by some other test run, never committed in this run, and would never be. The apply loop was waiting for index 19 to apply before index 20 could proceed. It would wait forever. The HTTP client gave up at 30 seconds and called it a connection drop.

The fix took four hours once the stderr was visible. The lithair WAL and snapshot manager had been hard-coded to write to ./data/raft/node_N/... regardless of what the test harness asked for. The harness was setting EXPERIMENT_DATA_BASE to a per-run tempdir, the way it does for every other piece of state: request bodies, snapshots, the lot. The Raft code wasn’t reading it. So each test run wrote into the same on-disk WAL the previous run had left behind, replayed forty entries that had nothing to do with the current scenario, assigned a fresh request to a fresh index, then waited for the ghost in front of it to apply. The ghost would never apply. The leader would never move.

Twenty-six days. A WAL file on disk that nobody had asked to keep.

The fix was small. A raft_base_dir() helper that checks LITHAIR_DATA_DIR first, then the harness’s EXPERIMENT_DATA_BASE, then falls back to ./data for production. WAL paths and snapshot paths route through it. Production behavior unchanged. Tests get the isolation the harness had been trying to give them all along. One file changed, a few dozen lines.
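A minimal sketch of that lookup order, using only the standard library. The function name raft_base_dir and the two environment variables come from the description above; resolve_base_dir, raft_node_dir, the treatment of empty values as unset, and the assumption that the raft/node_N layout is reused under any base are illustrative guesses, not the actual lithair code.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Fallback used when neither variable is set (the production default).
const DEFAULT_BASE: &str = "./data";

/// Pure resolution logic: try LITHAIR_DATA_DIR, then EXPERIMENT_DATA_BASE,
/// then fall back to the default. `lookup` is injected so the order can be
/// tested without touching the real process environment.
fn resolve_base_dir(lookup: impl Fn(&str) -> Option<String>) -> PathBuf {
    ["LITHAIR_DATA_DIR", "EXPERIMENT_DATA_BASE"]
        .iter()
        .filter_map(|key| lookup(*key))
        // Assumption: an empty value is treated the same as an unset one.
        .find(|value| !value.is_empty())
        .map(PathBuf::from)
        .unwrap_or_else(|| PathBuf::from(DEFAULT_BASE))
}

/// Reads the real environment.
fn raft_base_dir() -> PathBuf {
    resolve_base_dir(|key| env::var(key).ok())
}

/// Hypothetical helper: per-node directory under whichever base was chosen.
fn raft_node_dir(base: &Path, node_id: u64) -> PathBuf {
    base.join("raft").join(format!("node_{node_id}"))
}
```

Keeping the resolution pure and wrapping it with a thin env::var reader is a design choice for testability: the precedence can be checked without mutating process-wide environment variables, which would itself be a form of shared state between tests.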

The reason it had taken twenty-six days wasn’t the difficulty of the fix. It was that the harness had been silently swallowing the leader’s voice. The process was spawned with the default stdio, which is inherited, but cucumber’s runner doesn’t propagate child stderr in any useful way during the run. The leader could panic, write a clear error to stderr, and the test would just time out with no signal. We were debugging blind because we’d never asked the leader what was wrong.
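For illustration, here is roughly what that kind of capture can look like with std::process alone. This is a sketch, not the harness code: spawn_with_stderr and the channel-based forwarding are hypothetical choices, and the only part taken from the story is Stdio::piped() on the child.

```rust
use std::io::{self, BufRead, BufReader};
use std::process::{Child, Command, Stdio};
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Spawn `cmd` with stderr piped and forward every stderr line over a channel.
/// The receiver yields lines until the child closes its stderr.
fn spawn_with_stderr(cmd: &mut Command) -> io::Result<(Child, Receiver<String>)> {
    let mut child = cmd.stderr(Stdio::piped()).spawn()?;

    // `take()` moves the pipe out of the Child so the reader thread can own it.
    let stderr = child.stderr.take().expect("stderr was configured as piped");

    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Stop on the first read error (for example, invalid UTF-8) or at EOF.
        for line in BufReader::new(stderr).lines().map_while(Result::ok) {
            // The receiver may already be gone; in that case just stop reading.
            if tx.send(line).is_err() {
                break;
            }
        }
    });

    Ok((child, rx))
}
```

A harness could drain the receiver when a step times out and attach the lines to the failure report, so a panic in the child shows up next to the scenario that triggered it. Reading on a separate thread also keeps the child from blocking on a full pipe buffer.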

The stderr capture was forty minutes of work. The bug had been one stderr capture away the whole time.

The second one landed less than a day later, and the shape was identical even though the failure was completely different.

A separate set of BDD scenarios, the in-process mock cluster used by distribution_clustering.feature, had been red, fourteen for fourteen, for as long as the threaded HTTP server had existed. Every accepted connection died with what curl -v reported as “Empty reply from server.” TCP reset, no response, the worker thread silently gone.

The thing that made this one strange is that the external cluster tests, the ones that run lithair as a separate binary, had been green the entire time. Same code path. Same HttpServer::serve(). Same dispatch. Different test harness.

What was different: the external binary tests wrap their entry point in #[tokio::main]. The in-process mock cluster doesn’t — it spawns lithair from a regular sync test, no runtime in scope.

The bug was inside HttpServer::serve(). Each accepted connection got handed to a plain std::thread::spawn worker, and the worker called Handle::current() to get a tokio handle so it could block_on the async router. Handle::current() panics if there’s no tokio runtime in the thread’s context. On the #[tokio::main] path the worker reliably found a runtime context to use, by accident rather than by design. On the bare-test path there’s no runtime at all, so the std thread has no context to find, Handle::current() panics, the connection dies before the request is parsed, and the client sees a reset.

In production, every caller wraps the server in #[tokio::main]. The bug had been latent for as long as the threaded server had existed. Real binaries never saw it. The thing that exposed it was a test that didn’t behave like a real binary.

The fix was to stop calling Handle::current() from anywhere a std thread might run. serve() now captures a handle once, at startup — reusing the caller’s runtime if one is in scope, otherwise owning a fresh multi-thread runtime for the lifetime of the call — and passes that explicit handle into every worker. No more Handle::current(). No more invisible thread-local dependency. The same code now runs identically whether the caller is #[tokio::main] or not.

What strikes me, looking at the two of them next to each other, is that they have the same shape.

Both bugs were invisible in production. The first one didn’t surface because production never starts on top of a WAL left behind by an unrelated run: every prod deployment gets its own WAL, so the hard-coded path was fine. The second one didn’t surface because every production caller happens to wrap in #[tokio::main], and the runtime context happens to propagate. Production was working around both bugs by accident. The accidents were stable. They would have stayed stable for years.

Both bugs surfaced because the harness asked a different question than production asked. The harness ran the cluster scenarios from scratch, fresh state, no #[tokio::main] wrapper, expecting the code to behave the way the README said it would. Production behavior met those expectations approximately. Test behavior met them exactly, or not at all.

I’d been told for years that tests are about confirming what you believe. That framing keeps the tests aligned with the code — green when the code works, red when it doesn’t. It also keeps them aligned with the assumptions the code was written under. The assumption that there’d be a runtime in scope. The assumption that the WAL path didn’t matter because nobody else would write to it. Those are the assumptions the harness doesn’t share. The harness puts the code in a context where the assumption isn’t true, and watches what happens.

If the test isolation is real — every scenario from scratch, every directory fresh, no shared global state, no inherited runtime, no convenient inheritance from a parent process — the harness asks questions production never asks. It asks them mechanically, on every run. And when something has been quietly riding on an assumption that wasn’t written down, the harness finds it.

That’s a feature. The isolation itself, not the assertions. The assertions only check what you thought to check. The isolation checks what you didn’t.