The NixOS Foundation's Call to Action: S3 Costs Require Community Support

sternenseemann · June 4, 2023, 2:30pm

I want to suggest that we leave the possibility of garbage collecting the binary cache out of the discussion, at least as far as the short-term solution is concerned. Garbage collecting the binary cache is a big problem in two ways:

It is a question of policy: Do we want to garbage collect the cache? What do we want to keep? What can be deleted? This should be a community decision which can’t be achieved in a month’s time.
It is an engineering problem: How do we determine which store paths are to be deleted according to our established policy?

Additionally, garbage collecting the binary cache is risky, as we may wind up intentionally or erroneously deleting data we may miss—be it in 2 months or 10 years.

For the policy discussion, one problem is that there is not a lot of information about the cache available (although this has recently improved). It would also invariably involve looking to the future, i.e. how will our cache grow, how will storage costs develop etc.

For the engineering side, the big problem is that the store paths correspond to build recipes dynamically generated by a Turing-complete language, making it anything but trivial to determine all store paths in the cache stemming from a specific nixpkgs revision. Assuming all store paths in the binary cache have been created by hydra.nixos.org (is this true for all time?), we have the /eval/<id>/store-paths endpoint available (from which store-paths.xz is generated) as a starting point. That will of course never contain build-time-only artifacts or intermediate derivations that never show up as a job in Hydra. Among those, though, are the most valuable store paths in the binary cache, i.e. patches, sources etc. Even though we have tools to (reliably?) track those down, it becomes more difficult to do so for every historical nixpkgs revision. Additionally, there is the question of how complete Hydra's data is (what happens to evals associated with deleted jobsets?). (If we were to garbage collect the binary cache, I think we should probably try figuring out what is safe to delete rather than determining what to keep.)
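To make the gap concrete, here is a minimal sketch in Python. It assumes we already have three things in hand: the set of paths in the cache, the output paths listed for an eval, and a map from each path to its runtime references. The function names and the toy data are hypothetical, and real store paths carry a hash prefix that is left out here. It only illustrates that paths never listed as job outputs, such as sources and patches, fall outside the runtime closure. It is not a proposal for how to decide what to delete.

```python
from collections import deque


def runtime_closure(roots, references):
    """Return every path reachable from `roots` via runtime references."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        path = queue.popleft()
        for ref in references.get(path, ()):
            if ref not in seen:
                seen.add(ref)
                queue.append(ref)
    return seen


def unaccounted_paths(cache_paths, roots, references):
    """Paths in the cache that no listed eval output reaches at runtime."""
    return set(cache_paths) - runtime_closure(roots, references)


# Toy data (made-up names): the tarball and the patch are needed at
# build time only, so nothing refers to them at runtime.
references = {
    "hello-2.12": ["glibc-2.37"],
    "glibc-2.37": [],
    "hello-2.12.tar.gz": [],
    "fix-build.patch": [],
}
cache_paths = set(references)
eval_outputs = ["hello-2.12"]

orphans = unaccounted_paths(cache_paths, eval_outputs, references)
print(sorted(orphans))
```

In this toy case the "unaccounted" set is exactly the source tarball and the patch, i.e. the kind of path we would least want to lose. A naive "delete everything the eval does not reach" rule would remove them first.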

Garbage collection is a long-term solution, if it is one at all.

It seems to me that finding a way to deduplicate the (cold) storage or to archive little-used data more cost-effectively is a similar amount of engineering effort, but less risky and comparatively uncontroversial, while still offering a sizeable cost reduction.
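As a rough illustration of what deduplication buys, the following Python sketch stores each distinct chunk only once and compares total bytes against unique bytes. Everything here is an assumption for the sake of the example: the function name is made up, the chunking is fixed-size with a tiny chunk size, and the two byte strings stand in for similar archives. A real system would more likely use content-defined chunking on much larger chunks, and I am not claiming this is how the cache would actually be deduplicated.

```python
import hashlib


def dedup_stats(blobs, chunk_size=4):
    """Return (total_bytes, unique_bytes) if each distinct chunk is stored once."""
    seen = set()
    total = 0
    unique = 0
    for blob in blobs:
        for i in range(0, len(blob), chunk_size):
            chunk = blob[i:i + chunk_size]
            total += len(chunk)
            digest = hashlib.sha256(chunk).digest()
            if digest not in seen:
                # First time we see this chunk: it has to be stored.
                seen.add(digest)
                unique += len(chunk)
    return total, unique


# Two toy "archives" that share their first eight bytes.
blobs = [b"AAAABBBBCCCC", b"AAAABBBBDDDD"]
total, unique = dedup_stats(blobs)
print(total, unique)
```

Here 24 bytes of input need only 16 bytes of chunk storage. No data is thrown away, which is why this route is less risky than garbage collection.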