This is a follow-up to 2023-10-24 re: Long-term S3 cache solutions meeting minutes #1 and 2023-10-30 re: Long-term S3 cache solutions meeting minutes #2.
Full details can be found in NixOS Cache GC Meeting - HedgeDoc.
Quick Recap
- The bucket is now Requester Pays, thank you @delroth!
- Bucket logs are now available in the Archeology S3 bucket, thank you @zimbatm!
- The SQLite database from @edolstra has been uploaded to the Archeology S3 bucket, thank you @edolstra! (See the next steps of the last meeting.)
- All narinfo files have been downloaded as Parquet files into the Archeology bucket; the data format is being extended by @edef to allow easier manipulation in ClickHouse (e.g. to allow joins), using a turbofetch tool: feat(tvix/tools/turbofetch): init · Gerrit Code Review. (A query sketch of this kind of join follows the recap list below.)
- Based on prior experiments with Nix and casync, without any knowledge of NARs and their boundaries, we could deduplicate down to 20% of the original size. With the new model, which does know about the proper boundaries, we expect an even better ratio: see Nix-casync, a more efficient way to store and substitute Nix store paths - #3 by rickynils for the prior experiment. (A minimal chunking/dedup sketch also follows the recap list below.)
- "When can we start?" asked @edolstra: the code to read exists, the code to write still has to be written.
- Data analysis is still ongoing, aiming to answer:
  - What is the request rate for the paths we are planning to deduplicate, and will the reassembler (the so-called nar-bridge) be able to sustain the load for those paths? In either case, for paths we planned to delete, any performance is better than not having the path at all. (See the log-analysis sketch at the end of this post.)
- The sum of all ISOs in the cache that are copied to releases.nixos.org anyway (so we store them twice!) is ~60 TiB, i.e. about 12% of the current cache size.
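
To make the ClickHouse point above more concrete, here is a minimal sketch of the kind of join the extended narinfo dump should make easy. It is not the actual pipeline: DuckDB stands in for ClickHouse so it runs locally, and the file and column names (`narinfo.parquet`, `references.parquet`, `store_path_hash`, `reference_hash`, `nar_size`) are assumptions, not the real schema.

```python
# Hypothetical sketch: join the narinfo dump against a references table to
# get the total size of each path's direct references. DuckDB stands in for
# ClickHouse; file and column names are assumed, not the real schema.
import duckdb

con = duckdb.connect()
rows = con.execute(
    """
    SELECT n.store_path_hash,
           n.nar_size,
           SUM(ref.nar_size) AS direct_refs_size
    FROM 'narinfo.parquet'    AS n
    JOIN 'references.parquet' AS r   ON r.store_path_hash = n.store_path_hash
    JOIN 'narinfo.parquet'    AS ref ON ref.store_path_hash = r.reference_hash
    GROUP BY n.store_path_hash, n.nar_size
    ORDER BY direct_refs_size DESC
    LIMIT 10
    """
).fetchall()

for store_path_hash, nar_size, direct_refs_size in rows:
    print(store_path_hash, nar_size, direct_refs_size)
```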
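Similarly, a minimal sketch of how a deduplication ratio like the ~20% figure is measured: cut each NAR into content-defined chunks, hash the chunks, and compare unique bytes to total bytes. The rolling hash below is deliberately naive and the parameters are arbitrary; casync and the new NAR-aware model use proper chunkers, this only illustrates the measurement itself.

```python
# Naive content-defined chunking plus dedup-ratio measurement.
# Illustrative only: real chunkers (casync, FastCDC-style) use stronger
# rolling hashes and tuned min/avg/max chunk sizes.
import hashlib

def chunks(data: bytes, mask: int = 0x0FFF, min_size: int = 4096, max_size: int = 65536):
    """Yield chunks, cutting wherever the low bits of a toy rolling hash are zero."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) ^ byte) & 0xFFFFFFFF  # low bits depend only on recent bytes
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

def dedup_ratio(blobs):
    """Unique bytes after chunk-level dedup, divided by total bytes stored today."""
    unique, total = {}, 0
    for blob in blobs:
        for chunk in chunks(blob):
            total += len(chunk)
            unique[hashlib.sha256(chunk).digest()] = len(chunk)
    return sum(unique.values()) / total

# dedup_ratio([nar_v1, nar_v2, ...]) ~= 0.2 would match the prior casync experiment
```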
Next steps
- @zimbatm is checking whether our Fastly plan includes Fastly Compute@Edge, so that we could run the reassembly component at the edge to improve performance.
- @edef and @flokli are working on:
  - taking 2 or 3 channel bumps home, deduplicating them on disk, and playing around with chunking parameters.
  - looking into the request rate to S3 and narrowing it down to cold paths vs. hot paths (see the sketch below).
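
To illustrate the request-rate questions mentioned above, here is a rough sketch of the kind of analysis meant, run over a local copy of the S3 access logs that now live in the Archeology bucket. The log location, the regex, and the hot/cold cutoff are all assumptions for illustration, not the actual analysis.

```python
# Rough sketch: per-NAR request counts from S3 access logs, split into
# "hot" and "cold" paths. Log location, regex and cutoff are assumptions.
import glob
import re
from collections import Counter

# Pull the nar/... object key out of the quoted request field of a log line,
# e.g. "GET /nix-cache/nar/<hash>.nar.xz HTTP/1.1" (layout assumed).
REQUEST_RE = re.compile(r'"(?:GET|HEAD) \S*?(nar/[A-Za-z0-9._-]+)')

counts = Counter()
for log_file in glob.glob("logs/*"):  # hypothetical local copy of the bucket logs
    with open(log_file, errors="replace") as f:
        for line in f:
            match = REQUEST_RE.search(line)
            if match:
                counts[match.group(1)] += 1

HOT_CUTOFF = 100  # arbitrary threshold, just to split the two populations
hot = {k: v for k, v in counts.items() if v >= HOT_CUTOFF}
cold = {k: v for k, v in counts.items() if v < HOT_CUTOFF}
print(f"{len(hot)} hot paths, {len(cold)} cold paths")
print("busiest:", counts.most_common(10))
```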