2023-11-07 re: Long-term S3 cache solutions meeting minutes #3

This is a follow-up to 2023-10-24 re: Long-term S3 cache solutions meeting minutes #1 and 2023-10-30 re: Long-term S3 cache solutions meeting minutes #2.

Full details can be found in NixOS Cache GC Meeting - HedgeDoc.

Quick Recap

  • Bucket is now Requester Pays, thank you @delroth ! :tada:
  • Bucket logs are available in the Archeology S3 now, thank you @zimbatm !
  • The SQLite database from @edolstra has been uploaded to the Archeology S3, thank you @edolstra ! (See the next steps of the last meeting.)
  • All narinfo files have been downloaded as Parquet files into the Archeology bucket: the data format is being extended to allow easier manipulation in Clickhouse (e.g. to allow joins) by @edef (with a turbofetch tool: feat(tvix/tools/turbofetch): init · Gerrit Code Review)
  • Based on prior experience with nix and casync, without any knowledge of the NAR format or its boundaries, we could deduplicate down to 20 % of the original size. With this new model, which knows about the proper boundaries, we expect a better ratio: see Nix-casync, a more efficient way to store and substitute Nix store paths - #3 by rickynils for the prior experiment.
  • "When can we start?" asked @edolstra. The code to read exists; the code to write still has to be written.
  • Data analysis is still ongoing, aiming to answer:
    • What is the request rate for the paths we are planning to deduplicate, and will the reassembler (the so-called nar-bridge) be able to sustain the load for those paths?
    • In either case, for the paths we planned to delete, any performance is better than not having the path at all.
  • The ISOs in the cache, which are also copied to releases.nixos.org (so we store them twice!), add up to ~60 TiB, thus 12 % of the current cache size.

Next steps

  • @zimbatm is checking whether our Fastly plan includes Fastly Compute Edge, so that we could run the reassembly component at the edge to improve performance.
  • @edef and @flokli are working on:
    • taking two or three channel bumps home, deduplicating them on disk and playing around with chunking parameters.
    • looking into the request rate to S3 and breaking it down into cold paths and hot paths.
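
To give a rough idea of what "playing around with chunking parameters" can look like, here is a minimal Python sketch of content-defined chunking with a gear-style rolling hash, plus a helper that reports how much data remains after deduplication. This is not the code used by @edef and @flokli: the function names `chunks` and `dedup_ratio`, the gear table and the default parameters (`avg_bits`, `min_size`, `max_size`) are assumptions made up for illustration, and the sketch knows nothing about NAR boundaries.

```python
import hashlib
import random

# Hypothetical gear table: 256 pseudo-random 32-bit values, fixed seed so
# that chunk boundaries are reproducible between runs.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunks(data, avg_bits=12, min_size=1024, max_size=16384):
    """Split `data` into content-defined chunks.

    A boundary is placed where the low `avg_bits` bits of the rolling hash
    are zero, but never before `min_size` bytes and always at `max_size`.
    """
    mask = (1 << avg_bits) - 1
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Gear-style update: older bytes are shifted out of the 32-bit state.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if size >= max_size or (size >= min_size and (h & mask) == 0):
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]  # trailing partial chunk


def dedup_ratio(blobs, **params):
    """Return stored_bytes / total_bytes after chunk-level deduplication."""
    unique = {}  # chunk digest -> chunk length
    total = 0
    for blob in blobs:
        total += len(blob)
        for chunk in chunks(blob, **params):
            unique.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    return sum(unique.values()) / total if total else 1.0
```

Calling `dedup_ratio(list_of_nar_bytes, avg_bits=14)` with different values of `avg_bits`, `min_size` and `max_size` gives a quick way to compare settings on a sample; a result of 0.2 would correspond to the "down to 20 %" figure mentioned in the recap.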

cc @ron @delroth @lheckemann @zhaofengli


2023-11-14 re: Long-term S3 cache solutions meeting minutes #4 is out now.
