The first time I restarted a node in my test cluster and watched it come back as a blank follower, I knew the design had a hole in it.
The node had been part of the cluster five seconds earlier. It had its own WAL on disk. Its term, its commit index, its applied index — all there, in files I’d written. The node restarted. And then it asked the leader, politely, to please replay the entire log from scratch.
As far as the protocol was concerned, it had never been part of the cluster. It had been a blank follower playing along.
The standard pattern
Most distributed databases store their consensus state somewhere — a WAL, a snapshot, sometimes both. On startup, they read that state and resume from where they were. This is well-trodden territory: Raft is widely implemented, the recovery semantics are documented, and crash-safe persistence is a 50-year-old problem.
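Concretely, "resume from where you were" looks something like this on startup. A minimal sketch, assuming a trivial one-value-per-line metadata file; the struct and on-disk layout are illustrative, not lithair's actual format:

```rust
use std::fs;
use std::path::Path;

// Illustrative names; a real WAL stores length-prefixed, checksummed
// records, not a three-line text file.
#[derive(Debug, Default)]
struct PersistedState {
    term: u64,          // last term this node saw
    commit_index: u64,  // highest log index known to be committed
    applied_index: u64, // highest log index applied to the state machine
}

// Read the persisted consensus state before serving any RPC.
fn load_state(dir: &Path) -> std::io::Result<PersistedState> {
    let raw = fs::read_to_string(dir.join("meta"))?;
    let mut fields = raw.lines().map(|l| l.trim().parse::<u64>().unwrap_or(0));
    Ok(PersistedState {
        term: fields.next().unwrap_or(0),
        commit_index: fields.next().unwrap_or(0),
        applied_index: fields.next().unwrap_or(0),
    })
}
```

The hard part isn't the read; it's refusing to answer any RPC until the read, and the WAL replay behind it, has finished.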
Building lithair — my Rust framework that includes a clustered model store — meant I’d written my own version of this. It was working in the happy path. Nodes joined a cluster, replicated writes, served reads. The benchmark numbers were respectable.
I just hadn’t actually restarted any of them yet.
What I noticed
When I did, three things broke:
A node that came back online had no memory of being in the cluster. It re-replicated everything from the leader. For a small log this was a few seconds. For a real one it would have been minutes of unavailability.
The leader, when it stayed quiet for a while with no writes, would silently get demoted. Followers, hearing nothing, assumed it had died. They held an election and picked a new leader. Same data, different timeline. Confusing.
And when one node fell behind, every subsequent write to the leader sent the entire log history to it. Replication cost grew with the log size — O(n) where it should have been O(1).
None of these were exotic. They were exactly the failure modes Raft is supposed to prevent. I’d implemented the parts that pass tests in isolation. I hadn’t implemented the parts that survive operation.
What I fixed
v0.1.3 closed the three gaps:
- WAL replay on startup — the node reads its persisted log and committed snapshot before serving any RPC. No more blank followers.
- Leader heartbeat — periodic empty `AppendEntries` during quiet periods. Followers know the leader is alive even when there’s no data flowing (see the first sketch after this list).
- Per-follower replication index — each new write sends one entry to up-to-date followers. Lagging ones get caught up by a background task using their `match_index` (see the second sketch after this list).
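The heartbeat is the least code of the three. Here's a minimal sketch of the shape, assuming a hypothetical `Transport` trait for the RPC layer and a plain thread for the loop; none of these names are lithair's:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Hypothetical RPC abstraction; only the empty-entries call matters here.
trait Transport: Send + Sync {
    // An AppendEntries with no entries is a pure heartbeat.
    fn append_entries(&self, follower: &str, term: u64, entries: &[Vec<u8>]);
}

// While this node is leader, ping every follower with an empty
// AppendEntries so their election timers never fire during quiet periods.
// The interval must be comfortably shorter than the election timeout.
fn spawn_heartbeat(
    transport: Arc<dyn Transport>,
    followers: Vec<String>,
    term: Arc<AtomicU64>,
    interval: Duration,
) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        let current_term = term.load(Ordering::SeqCst);
        for f in &followers {
            // Carries no data, only "I am alive, in this term".
            transport.append_entries(f, current_term, &[]);
        }
        thread::sleep(interval);
    })
}
```

A real version also stops when the node loses leadership; a shared flag checked at the top of the loop is enough.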
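The per-follower index is where the O(n) write cost goes away. Another illustrative sketch: `FollowerProgress` and `entries_for` are hypothetical names, but the `next_index`/`match_index` bookkeeping is standard Raft:

```rust
// Per-follower replication state, Raft's next_index / match_index pair.
struct FollowerProgress {
    next_index: u64,  // 1-based index of the first entry we still owe it
    match_index: u64, // highest index known to be replicated on it
}

struct Log {
    entries: Vec<Vec<u8>>, // entry at log index i lives at entries[i - 1]
}

impl Log {
    // Entries to ship on the next AppendEntries to one follower.
    // An up-to-date follower gets exactly the newest entry; a lagging
    // one gets a bounded batch from the catch-up task, never the whole
    // history on every write.
    fn entries_for(&self, p: &FollowerProgress, batch: usize) -> &[Vec<u8>] {
        let start = ((p.next_index.saturating_sub(1)) as usize).min(self.entries.len());
        let end = self.entries.len().min(start + batch);
        &self.entries[start..end]
    }
}
```

On each successful ack the leader bumps that follower's `match_index` to the last index it confirmed and sets `next_index` to `match_index + 1`; only followers whose `next_index` trails the log tail get handed to the background catch-up task.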
Plus the rest of the release cycle: PATCH semantics on the model API, an auto-generated OpenAPI 3.1 spec, and server-sent events for change subscriptions. And a cleanup: a few thousand lines of code paths that nobody had touched in months, gone.
Why this matters to me
A cluster that lies about its state is a cluster you can’t trust at 3am. The lie isn’t malicious; it’s just the gap between what the implementation claims and what it actually preserves. Most of the time the gap stays hidden because nothing crashes.
I wanted lithair’s clustering to be observable, not asserted. Not “we say it persists state.” Verifiable: kill a node, bring it back, and watch it remember where it was.
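That verification is cheap to script. Here's a sketch of the loop I mean, with everything hypothetical: the `lithair-node` binary name and the `--data-dir` flag are stand-ins, not lithair's actual CLI, and the final assertion depends on your deployment:

```rust
use std::process::{Child, Command};
use std::thread;
use std::time::Duration;

// Hypothetical binary and flag, for illustration only.
fn start_node(data_dir: &str) -> std::io::Result<Child> {
    Command::new("lithair-node").args(["--data-dir", data_dir]).spawn()
}

fn main() -> std::io::Result<()> {
    let mut node = start_node("/tmp/lithair-n1")?;
    thread::sleep(Duration::from_secs(5)); // let it join and replicate

    node.kill()?; // on Unix this is SIGKILL: no flush, no clean shutdown
    node.wait()?;

    let mut node = start_node("/tmp/lithair-n1")?;
    thread::sleep(Duration::from_secs(5));
    // The property under test: the restarted node comes back with its
    // persisted term and commit index and does not re-replicate the log
    // from scratch. Checking that means inspecting its status or logs,
    // which is deployment-specific, so it is omitted here.
    node.kill()?;
    node.wait()?;
    Ok(())
}
```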
The version of lithair I trust is the one that survives a `kill -9` and tells you the truth on restart.
What this is not
It’s not “Raft is hard.” Raft is well understood, the papers are clear, and many implementations exist.
It’s not “you should write your own.” For most workloads, etcd or FoundationDB or one of a dozen mature options is the right answer.
It’s: when you build something with persistence semantics, the gap between “the test passed” and “the node remembers” is where the real work lives.