I think I’ve identified a workflow that causes a problem when you use Nix substituters and periodically GC them. Here’s how it works:
I have a substituter machine S and a build machine B, and we’re trying to build an output path P.
- At first, S has already built P and has it on disk.
- B tries to build P, and successfully fetches it from S over the network, and imports it into the Nix store. All good.
- Now some time passes, and the hard disks on both machines start to fill up, so I run GC on both machines. P is not connected to a GC root so it’s deleted on both machines.
- Now S runs another build which realises P. (But this time P has slightly different bytes on disk!)
- B tries to build P again, and fetches it over the network onto disk. But then, when B tries to import the path into the Nix store, it fails with “hash mismatch importing path.” (A rough shell sketch of the whole sequence follows.)
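The expression file, attribute name, and signature handling in the sketch are placeholders rather than my exact setup:

```
# 1. On S: build P.
nix-build release.nix -A P              # realises /nix/store/<hash>-P

# 2. On B: substitute P from S instead of building it.
#    (--option settings may require a trusted user, or put them in nix.conf.)
nix-build release.nix -A P \
  --option substituters ssh://S \
  --option require-sigs false           # or sign S's store and trust its key

# 3. Later, on both S and B: nothing roots P, so GC deletes it.
nix-collect-garbage

# 4. On S: rebuild P. Same store path, but the build isn't bit-for-bit
#    reproducible, so the NAR contents (and hence the NAR hash) differ.
nix-build release.nix -A P

# 5. On B: substitute P again. The download succeeds, but the import fails
#    with "hash mismatch importing path ..." because B checks the data against
#    the hash it cached back in step 2.
nix-build release.nix -A P \
  --option substituters ssh://S \
  --option require-sigs false
```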
I know that Nix doesn’t generally verify output hashes, but I think I’m running into an exception here. I went poking around in the SQLite database in `~/.cache/nix/binary-cache-v6.sqlite` on B and found that there is a cached relation between P and the initial version of the hash. I think this was inserted in step 2 above, and it matches the old hash I’m seeing in the error message.
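In case it helps anyone reproduce this, here’s roughly how I poked at it. The table and column names are what I see in my copy of the database and may differ between Nix versions; `<hashpart>` stands for the 32-character hash part of P’s store path:

```
sqlite3 ~/.cache/nix/binary-cache-v6.sqlite '.schema NARs'
sqlite3 ~/.cache/nix/binary-cache-v6.sqlite \
  "SELECT hashPart, narHash, timestamp FROM NARs WHERE hashPart = '<hashpart>';"
```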
Deleting `~/.cache/nix/binary-cache-v6.sqlite` and restarting the Nix daemon on B seems to solve the problem.
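Concretely, what I did on B (the restart command assumes a systemd-based setup such as NixOS):

```
rm ~/.cache/nix/binary-cache-v6.sqlite*   # also removes any -wal/-shm journal files
sudo systemctl restart nix-daemon
```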
This hash caching seems to be a major blocker when it comes to running GC on my own substituters. Or am I misunderstanding what’s going on?
You’ve pretty much got it. Nix caches the paths it knows about from different substituters in `~/.cache/nix/binary-cache-v6.sqlite`. And Nix kind of assumes that a substituter only ever adds paths and never loses them.
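For what it’s worth, those cached answers do have TTLs, and I believe lowering the positive TTL is one way to limit how long a stale hash can stick around (at the cost of re-querying the substituter more often). The values below are illustrative, not recommendations:

```
# Inspect the current TTLs (may need the nix-command experimental feature;
# it's `nix config show` on newer Nix).
nix show-config | grep narinfo-cache

# Lower the positive TTL for one invocation, or set it in nix.conf.
nix-build release.nix -A P --option narinfo-cache-positive-ttl 3600
```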
Got it, thanks @ElvishJerricco. I guess Nixpkgs must have pretty good byte-for-byte reproducibility for most paths, because this hasn’t bitten me until now. The path in question is a Haskell program so there must be some nondeterminism there.
As a design matter, it makes me sad whenever I see Nix putting additional state under `~/.cache/nix`. Currently mine has `binary-cache-v6.sqlite`, `fetcher-cache-v1.sqlite`, `eval-cache-v4`, `eval-cache-v5`, `flake-registry.json`, and `gitv3`. Not only do these not do any good for other users in a multi-user environment, they each seem to offer their own possibilities for cache staleness and other weirdness. And they violate my simple mental model of Nix’s state, which is that we have the on-disk `/nix/store` directory and a SQLite metadata DB in `/nix/var/nix/db/db.sqlite`.
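By that mental model, everything worth inspecting lives in those two places, e.g. (the store database’s table layout here is what I see locally and may vary across Nix versions):

```
ls /nix/store | head
# (reading the DB may need sudo depending on permissions)
sqlite3 /nix/var/nix/db/db.sqlite '.tables'
sqlite3 /nix/var/nix/db/db.sqlite 'SELECT path, hash FROM ValidPaths LIMIT 5;'
```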
Anyway, would it make sense to open an issue in the Nix repo to change how the binary fetcher cache works? It seems to me that the only purpose of the cached hash is to make sure the path was transmitted properly over the network, and it doesn’t really need to be kept around after the path has been successfully retrieved and imported. I would think that the binary fetching logic could just ask the substituter for the expected hash when it downloads the path, check it, import the path into the local store, and then discard the hash.
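To make that concrete, the manual equivalent would be to ask the substituter what it currently advertises for P at download time and compare that with what actually lands in the local store, instead of comparing against a value cached long ago. Something like the following, where `ssh://S` and the store path are placeholders and `nix path-info` needs the nix-command feature enabled:

```
# What S advertises for P right now (compare the narHash fields):
nix path-info --json --store ssh://S /nix/store/<hash>-P

# What B has after importing it:
nix path-info --json /nix/store/<hash>-P
```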
Importantly, this cache database significantly reduces the number of queries made to the binary cache to check for the presence of paths. On top of making the user experience quicker in some scenarios, it also reduces the load on the CDN.