We are running Nix build jobs as part of our CI and noticed that builds sometimes break transiently when they coincide with garbage collection of the Nix store. The typical pattern is this:
Build job:
$ nix-build some.nix -A someAttribute
...
error: getting status of '/nix/store/bwvcs82ic9f6j2md1zdxzp5sgkkq86nl-nixpkgs-src/pkgs/development/compilers/llvm/9/clang': No such file or directory
System Log:
May 19 03:15:20 build17 nix-gc-start[6127]: deleting '/nix/store/bwvcs82ic9f6j2md1zdxzp5sgkkq86nl-nixpkgs-src'
Restarting the build fixes the issue. It seems that the Nixpkgs source was garbage collected while Nix was evaluating it. If that is true, how can we fix this permanently? So far we have only come up with workarounds, such as moving the GC to a time when no jobs are running.
Where does this path come from? Whatever process creates this path has to make sure that it’s registered as a GC root. (E.g. nix-build -o ./foo registers the symlink ./foo as a GC root.)
If you are pinning Nixpkgs (bound here to a variable named pinnedNixpkgs), then you can hold on to the pinned version tarball by, say, using its store path as the value of an environment variable (i.e. an attribute of your derivation):
NIXPKGS_SRC = "${pinnedNixpkgs}";
This way the Nixpkgs source becomes an explicit dependency of your derivation and will not be GC’ed from under it.
You could consider setting nix.gc.options = "--delete-older-than 1h"; which would only remove garbage that is older than 1 hour. You can tweak that window to your liking. This may not be the reason for the breakage, so we’ll have to investigate more, but it would be a good place to start looking for a workaround.
Note that --delete-older-than only affects the deletion of old profile generations, not arbitrary store paths. So if you do something like nix-prefetch-url to download something, it will still be deleted if it’s younger than the specified interval, unless it happens to be reachable from a root.
Once upon a time Nix had an option to use the access time of store paths to delete unused paths, but atime is no longer widely maintained by filesystems so we removed it.
After thinking some more about the GC process, we believe we have a fundamental problem with our approach. We execute our Nix builds from Docker containers. The approach is pretty much identical to the one outlined in the NixOS Wiki.
These containers have access to the host Nix daemon and store, because they are mapped into the container. Now the problem is that a GC root from the container will actually be a non-existent path in the host system. That means if the build from the container tries to register a path, it will do so with its local chroot path name. The Nix daemon then presumably ignores/removes GC roots that point to non-existent paths.
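One way to check this hypothesis on the host is to look for GC root symlinks whose targets do not resolve there. The following is only a sketch: it assumes that indirect roots are symlinks under /nix/var/nix/gcroots/auto (the usual location, but worth verifying on your installation), and the function name list_dangling_roots is made up for illustration.

```shell
#!/bin/sh
# list_dangling_roots DIR
# Prints every symlink directly inside DIR whose target does not exist
# when resolved from the filesystem view of the calling process.
# (Hypothetical helper; not part of Nix.)
list_dangling_roots() {
    dir=$1
    for link in "$dir"/*; do
        # -L: the entry is a symlink; ! -e: following it leads nowhere.
        if [ -L "$link" ] && [ ! -e "$link" ]; then
            printf '%s -> %s\n' "$link" "$(readlink "$link")"
        fi
    done
}
```

Running list_dangling_roots /nix/var/nix/gcroots/auto on the host while a containerized build is in progress should show entries pointing at container-only paths if the explanation above is right. Those are the roots the collector cannot follow.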
So the solution is to not use Nix from inside Docker. Or at least try to configure the GC to not coincide with build jobs.
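If moving the GC schedule is not reliable enough, the "do not coincide" workaround can also be enforced with a lock. This is a rough sketch, not a tested setup: it assumes flock(1) is available, that the lock file (the path below is a hypothetical choice) is the same file for the host and every container, e.g. via a bind mount, and that both the CI jobs and the GC job are started through these made-up wrapper functions.

```shell
#!/bin/sh
# Hypothetical lock file; it must be the same file for all build jobs
# and for the GC job, so bind-mount it into the containers.
LOCK_FILE=${LOCK_FILE:-/tmp/nix-ci.lock}

# Builds take a shared lock, so any number of them can run concurrently.
run_build() {
    flock -s "$LOCK_FILE" "$@"
}

# The GC takes an exclusive lock, so it blocks until no build holds the
# shared lock, and new builds block until the GC has finished.
run_gc() {
    flock -x "$LOCK_FILE" "$@"
}
```

Usage would look like run_build nix-build some.nix -A someAttribute in the CI job and run_gc nix-collect-garbage in the GC job. Note that this only keeps the two from overlapping; it does not make the container's GC roots visible to the host, and a steady stream of overlapping builds could keep the GC waiting for a long time.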