Nix-heuristic-gc: A more discerning cousin of nix-collect-garbage

nix-heuristic-gc

A more discerning cousin of nix-collect-garbage, mostly intended as a testbed to allow experimentation with more advanced selection processes.

Developers and users who use a lot of packages ephemerally end up with a large Nix store full of packages that may or may not be used again in the near future but aren’t referenced from a GC root. While nix-collect-garbage’s --max-freed option is handy here, it still chooses which paths to delete at random.

nix-heuristic-gc uses a greedy algorithm to prefer deletion of less-recently accessed (or more easily replaceable) store paths.
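As a rough illustration of the greedy idea (not the tool’s actual implementation), the sketch below repeatedly picks the least-recently-accessed unrooted path that no other remaining garbage path refers to, stopping once a byte target is reached. The `StorePath` record, its fields, and `select_for_deletion` are hypothetical names made up for this example; it also assumes the input contains only paths that are already unreachable from any GC root.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class StorePath:
    """Hypothetical record for one unrooted store path."""
    name: str
    atime: float  # last access time, seconds since epoch
    size: int     # bytes freed if this path is deleted
    references: set = field(default_factory=set)  # names of paths it depends on


def select_for_deletion(paths, max_freed):
    """Greedily choose paths to delete, oldest atime first.

    A path only becomes eligible once no remaining garbage path refers
    to it, so a dependency is never chosen before its dependents.
    """
    by_name = {p.name: p for p in paths}

    # Count how many garbage paths refer to each garbage path.
    referrers = {p.name: 0 for p in paths}
    for p in paths:
        for ref in p.references:
            if ref in referrers and ref != p.name:  # skip self-references
                referrers[ref] += 1

    # Min-heap keyed on atime, seeded with paths nothing refers to.
    heap = [(p.atime, p.name) for p in paths if referrers[p.name] == 0]
    heapq.heapify(heap)

    chosen, freed = [], 0
    while heap and freed < max_freed:
        _, name = heapq.heappop(heap)
        path = by_name[name]
        chosen.append(name)
        freed += path.size
        # Removing this path may make its dependencies eligible.
        for ref in path.references:
            if ref in referrers and ref != name:
                referrers[ref] -= 1
                if referrers[ref] == 0:
                    heapq.heappush(heap, (by_name[ref].atime, ref))
    return chosen, freed
```

Swapping the heap key from `atime` to some other score (for example, one that favours paths that are cheap to re-fetch) would change the heuristic without touching the rest of the loop.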

Nice idea. Do you think it would be possible to integrate it as tightly as the nix.gc options?

nix.gc = {
  automatic = true;
  dates = "weekly";
  options = "--delete-older-than 30d";
};

In its current form, no. As it is, it’s mostly designed for hackability. If someone comes up with a sensible set of heuristics that are straightforward to implement in minimal-dependency C++ (or Rust?), perhaps tighter integration could be proposed.

Note that it’s trying to solve a different problem from the --delete-older-than flag. That flag works with profile generations, all of which I consider to be “not garbage”. Many people barely use profiles at all, and therefore have little better than random deletion available out of the box.

FYI, somewhat related: NixOS/nix issue #7803, “nix build: options to help with disk space management / out-links for partially built derivations”.

Nice idea, and there is definitely plenty of scope to play around with better logic, but it should be clear that the actual goal is to delete less garbage.

A couple of initial thoughts:

  • Consider related versions of the same package. Packages from cache that have other, newer versions installed as not-garbage are more likely to be real garbage (because they’ve been updated). Packages that have no current equivalent under GC roots are more likely to be things pulled in by other mechanisms (devshells, ad-hoc nix-shell commands, build deps, etc.) that may be called on again despite currently dangling.

  • I feel like there’s a struggle here to get good data, which is unsurprising: see caveats in the usage notes like the reliance on atimes. How much could this be improved with assistance from the nix daemon, using either data already kept in the store db, or enhancements to track more (like package install times, or even whether a package has already been downloaded more than once)?

I guess the main point of the latter item is that a standalone gc program may not be the best place to experiment, even if it is the easiest and most sensible place to start.
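To make the first suggestion (related versions) a bit more concrete, here is a minimal sketch that flags garbage paths whose package name also appears among live paths under a different version. Everything here is an assumption for illustration: the helper names are made up, the name/version split is a simplified regex (version starts at the first dash followed by a digit) rather than Nix’s own parsing, and “different version” stands in for “newer version” because properly ordering version strings is a separate problem.

```python
import re

# Simplified pattern for a store path basename "<hash>-<name>[-<version>]".
# Assumption: the version starts at the first dash followed by a digit.
_STORE_NAME = re.compile(
    r"^[0-9a-z]{32}-(?P<name>.+?)(?:-(?P<version>\d.*))?$"
)


def split_store_name(basename):
    """Return (name, version) for a store path basename; version may be None."""
    m = _STORE_NAME.match(basename)
    if m is None:
        return basename, None
    return m.group("name"), m.group("version")


def likely_superseded(garbage, live):
    """Return garbage basenames whose package name is live under another version."""
    live_versions = {}
    for basename in live:
        name, version = split_store_name(basename)
        if version is not None:
            live_versions.setdefault(name, set()).add(version)

    superseded = []
    for basename in garbage:
        name, version = split_store_name(basename)
        if version is None:
            continue  # nothing to compare against
        # Any live version other than this path's own counts as "other".
        if live_versions.get(name, set()) - {version}:
            superseded.append(basename)
    return superseded
```

Paths flagged this way could be given a higher deletion priority, while garbage with no live counterpart (the devshell and nix-shell case) could be kept around longer.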

Certainly in my common use-case this isn’t an assumption that can really be made. I do a lot of nixpkgs-review, with multiple different target branches (staging builds are a killer), build for multiple architectures, build different variants of a package… and then sometimes I need the great big build of staging-22.11 I did last week back to review something new.

The main drawback of atime use as I see it is that using a package as a dependency doesn’t necessarily touch it in a way that will update its atime. I toyed with the idea of making referenced packages inherit the max atime of their referring packages, but this adds complexity and a couple of quirky corner-cases.
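For anyone curious what the “inherit the max atime” idea might look like, here is a minimal sketch under stated assumptions: `effective_atimes` and its two dict arguments are hypothetical, and the reference graph is assumed to be acyclic apart from self-references (which are skipped). It is not something nix-heuristic-gc is known to do.

```python
def effective_atimes(atimes, references):
    """Raise each path's atime to the max atime of anything depending on it.

    atimes:     {path: the path's own atime}
    references: {path: iterable of paths that it depends on}
    Assumes the reference graph has no cycles other than self-references.
    """
    # Invert the graph: for each path, which known paths refer to it?
    referrers = {path: set() for path in atimes}
    for path, refs in references.items():
        if path not in atimes:
            continue  # no atime known for this referrer, so ignore it
        for ref in refs:
            if ref in referrers and ref != path:  # skip self-references
                referrers[ref].add(path)

    result = {}

    def visit(path):
        # Memoised walk up the referrer chain.
        if path not in result:
            best = atimes[path]
            for referrer in referrers[path]:
                best = max(best, visit(referrer))
            result[path] = best
        return result[path]

    for path in atimes:
        visit(path)
    return result
```

The recursion is fine for a sketch, but a very deep dependency chain could hit Python’s recursion limit, so a real version would probably walk the graph iteratively.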

Yes, the main problem being that by the time a package is garbage, a lot of the metadata about it is gone. Though of course, if you made nix keep extensive information on things that are already supposed to have gone, it may rather defeat the point of trying to free up storage space.

This is very cool!

I’d be very interested in seeing whether some of these turn out to be generic enough to be included upstream.
