2023-11-21 re: Long-term S3 cache solutions meeting minutes #5

This is a follow-up of 2023-11-14 re: Long-term S3 cache solutions meeting minutes #4.

Present: @RaitoBezarius @flokli @edef @zimbatm @edolstra.

Quick recap

  • zimbatm discovered we may have some credits for Fastly Compute Edge, but he still has to discuss this further with Fastly to understand what it means to use that credit and for how long we can use it.
  • Right now, edef & flokli are working on figuring out the right FastCDC parameters for the chunking, to balance the deduplication ratio against the compression ratio.
  • An unexpected setback was discovering that most of the ingestion time is spent running unxz -d on the narfiles, which raised the question of whether we should rewrite the narinfos to change the compression algorithm from xz to zstd (see the illustrative example after this list).
  • A test run over 3 channel bumps, 1 month apart, has been running, and 2 channel bumps have been ingested so far. edef and flokli have observed that deduplication + compression (with zstd) gives a 25 % reduction (over ~100ish TB worth of data in the cache). This is the first attempt, using a small chunk size (smaller chunks give a better deduplication ratio but worse compression, because they shrink the compression context window).
  • edef has good reasons to believe this constitutes a lower bound, a worst-case scenario, for the deduplication we can hope to achieve relative to the compressed narfiles.
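
For readers less familiar with the narinfo format, here is a purely illustrative sketch of what such a rewrite would touch; all hashes, sizes and paths are placeholders, this is not a real narinfo:

```diff
 StorePath: /nix/store/<hash>-example-1.0
-URL: nar/<filehash>.nar.xz
-Compression: xz
-FileHash: sha256:<hash of the xz-compressed file>
-FileSize: <size of the xz-compressed file>
+URL: nar/<filehash>.nar.zst
+Compression: zstd
+FileHash: sha256:<hash of the zstd-compressed file>
+FileSize: <size of the zstd-compressed file>
 NarHash: sha256:<hash of the uncompressed NAR>
 NarSize: <size of the uncompressed NAR>
 References: <references of the store path>
 Sig: cache.nixos.org-1:<signature>
```

Only the file-level fields and the URL change; NarHash, NarSize and References, which are what the narinfo signature is computed over (together with the store path), stay the same, which is why such a rewrite does not break any hashes.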

Next steps

Due to the unexpected time it takes to ingest channel bumps because of the xz decompression, tests are still running with different chunking parameters to figure out the right balance between good deduplication and good compression.

Full notes (old pad available here: https://pad.lassul.us/nixos-cache-gc?both):

## Day **2023-11-21**

Present: zimbatm, flokli, eelco, raitobezarius

Queries from previous call

  • @zimbatm is looking into whether our Fastly plan includes Fastly Compute Edge, so we could run the reassembly component at the edge to improve performance.

    • zimbatm: we have some credits for Fastly Compute Edge but it’s unclear how much. I need to get back to you on this.

@edef and @flokli are working:

  • on taking 2 or 3 channel bumps home, deduplicating them on disk and playing around with chunking parameters.
    • flokli: we wrote code that ingests 3 channel bumps (3 months apart), trying to get a worst case (a lower bound) while keeping the store paths realistic
    • flokli: a lot of the time spent on this is just running unxz -d on the nar files
      • in some cases, the compressed narfile is larger than the uncompressed one, but we still waste CPU time decompressing it, e.g. debug info for chromium… (~2 minutes to decompress)
    • we go through store-paths.xz from releases.nixos.org (containing the closure of store paths) and ingest these: decompressing the narfiles, parsing the nars, feeding every blob to the FastCDC chunker, and writing the individual chunks to /dev/null while keeping track of which chunks we have already seen (a rough sketch of this loop follows after this list)
    • you don’t need to keep the data, but you want to know the statistics: compression ratio and deduplication ratio
    • we are looking for the right parameters for good deduplication, without actually having to store the chunks on disk
    • edef: the tradeoff is that the smaller the chunks, the more chunk metadata we generate for the same amount of data, but we also get better deduplication ratios: smaller chunks are more likely to be shared with other nars (or within the same nar). On the other hand, we shred the compression algorithm’s context window by doing so, so there are fewer opportunities for the compression algorithm to be effective
    • flokli: the script edef wrote zstd-compresses the chunks to keep track of how much space the data would take on S3
      • it’s not yet entirely clear whether we should use zstd or xz
      • if we don’t want to touch the narinfo file, we would need to serve it back in xz
      • and doing unzstd → re-xz → serve might not be a good idea
      • and this would not run that well in an edge computing service
      • the nice point about using zstd, if we are fine with rewriting the narinfo (i.e. modifying the compression algorithm), is that reassembly is then just concatenating the zstd chunks, which is a neat feature
      • zimbatm: do we need to change the hashes?
      • flokli: no, we only change the compression information inside the narinfo; this may also change the URL field inside the narinfo, but it doesn’t break any hashes. Nix doesn’t care about the file hash, the file size, or the suffix in the URL; the only thing that matters is the Compression field inside the narinfo
      • zimbatm: so we cannot use only the 404 trick?
      • flokli: only if we don’t want to re-xz again and if we choose zstd; we may have multiple options if we really want to use xz when rendering (lightweight xz settings, etc.)
    • edef: we ran a test ingestion of two closures (the third one is still running; it takes a decent amount of CPU, we have 8 cores, and it takes a couple of hours)
      • our worst case is that we do 25 % better than how things are currently stored (~128 TB of the current cache)
      • it is not enough to hit our target, but it’s pretty plausible as a worst case
      • this is done with a very, very small chunk size on the FastCDC side (64kb/128kb/?)
      • it doesn’t really rely on zstd doing anything
    • flokli: I want to run this with a chunk size of a megabyte; this is mostly about balancing compression and deduplication and coming up with a number that makes sense to us
      • the fact that it takes (5-10) hours × 3 to run this on a dataset is a bit annoying
      • adding more CPUs would be pretty expensive, as we don’t need them all the time
  • raitobezarius: how do we maintain those numbers in the future?
    • edef: I’m interested in exploring adaptive chunking later
      • basically, we chunk at the small chunk size, and then run a slightly more dynamic algorithm over the resulting chunks to look for runs of common chunks, or unique chunks that don’t benefit from being deduplicated or split up further.
    • flokli: it would be a pass after the whole dataset was chunked with the small chunk size, right?
      • edef: correct.
      • flokli: after this, you would essentially repack some of those chunks that have little sharing, and check whether the bigger blocks compress better.
    • flokli: I don’t expect the numbers we figure out to change that much; software doesn’t change that fast.
  • Discussion around what closures to pick to get a more representative sample (pick staging-next vs. version before the merge)
    • minor python package bump?
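
For readers following along, here is a minimal sketch of the kind of chunk-and-measure loop described above. It is not the actual ingestion code; it assumes the fastcdc (v2020), blake3 and zstd Rust crates, and all chunk-size parameters are placeholders:

```rust
use std::collections::HashSet;

/// Running totals across all NARs fed into the experiment.
#[derive(Default)]
struct Stats {
    raw_bytes: u64,        // uncompressed NAR bytes seen
    unique_bytes: u64,     // bytes of chunks seen for the first time
    compressed_bytes: u64, // zstd-compressed size of those unique chunks
}

/// Chunk one uncompressed NAR, deduplicate against everything seen so far,
/// and record how large the unique chunks would be once zstd-compressed.
/// Nothing is written to disk; only the statistics are kept.
fn ingest_nar(
    nar: &[u8],
    (min, avg, max): (u32, u32, u32), // FastCDC chunk-size bounds (placeholders)
    seen: &mut HashSet<[u8; 32]>,
    stats: &mut Stats,
) -> std::io::Result<()> {
    stats.raw_bytes += nar.len() as u64;
    for chunk in fastcdc::v2020::FastCDC::new(nar, min, avg, max) {
        let data = &nar[chunk.offset..chunk.offset + chunk.length];
        let digest = *blake3::hash(data).as_bytes();
        // Only chunks we have never seen before would actually be stored.
        if seen.insert(digest) {
            stats.unique_bytes += data.len() as u64;
            // Compress each unique chunk as its own zstd frame; only its size matters here.
            stats.compressed_bytes += zstd::encode_all(data, 19)?.len() as u64;
        }
    }
    Ok(())
}
```

Since each stored chunk would be an independent zstd frame, reassembling a NAR from its chunks amounts to concatenating those frames in order (zstd decoders accept multi-frame input), which is the property flokli refers to above.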

Recap:

Right now, you (i.e. edef & flokli) are working on figuring out the right FastCDC parameters for the chunking, to balance the deduplication ratio against the compression ratio. An unexpected setback was discovering that most of the ingestion time is spent running unxz -d on the narfiles, which raised the question of whether we should rewrite the narinfos to change the compression algorithm from xz to zstd.

Regarding the test run, on the two closures ingested so far we have observed that deduplication + compression (with zstd) gives a 25 % reduction. This is our first attempt, using a small chunk size (smaller chunks give a better deduplication ratio but worse compression, because they shrink the compression context window). We have good reasons to believe this constitutes a lower bound, a worst-case scenario, for the deduplication we can hope to achieve relative to the compressed narfiles.

Next steps:

Running the same thing with different chunk sizes and comparing the numbers we come up with.
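
As a rough illustration of what such a sweep could look like, reusing the hypothetical ingest_nar/Stats helpers from the earlier sketch (the candidate chunk sizes below are placeholders, not decisions):

```rust
/// Hedged sketch of the parameter sweep: run the same ingestion once per
/// candidate average chunk size and print the deduplication and
/// dedup+zstd ratios, so the trade-off can be compared directly.
fn sweep(nars: &[Vec<u8>]) -> std::io::Result<()> {
    // Candidate average chunk sizes, from "very small" up to the ~1 MiB mentioned above.
    for avg in [64 * 1024u32, 256 * 1024, 1024 * 1024] {
        let bounds = (avg / 2, avg, avg * 4); // min/avg/max, again placeholders
        let mut seen = std::collections::HashSet::new();
        let mut stats = Stats::default();
        for nar in nars {
            ingest_nar(nar, bounds, &mut seen, &mut stats)?;
        }
        println!(
            "avg {:>7} B: unique {:.1} % of raw, unique+zstd {:.1} % of raw",
            avg,
            100.0 * stats.unique_bytes as f64 / stats.raw_bytes as f64,
            100.0 * stats.compressed_bytes as f64 / stats.raw_bytes as f64,
        );
    }
    Ok(())
}
```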

cc @ron
