Switch cache.nixos.org to ZSTD to fix slow NixOS updates / nix downloads?

Yes, to be clear, I don’t think it is useful to convert old packages to zstd.

Given that Nix rebuilds everything when a fundamental library like the libc is upgraded, we’d get there a lot faster than “slowly” anyway.

4 Likes

“vocal”? I found a single comment complaining about unhandled PRs and unhelpful comments, without linking those issues/PRs.

I’m totally with Sandro… Just merge it…

4 Likes

There are currently 4 thumbs-ups on andir’s post, and more coming in, so it doesn’t look like something one should just discard.

I also buy the argument that some stuff doesn’t work in newer nix. I’ve hit workflow-breaking regressions myself as well, and they directly impacted the ability of my company to build stuff.

I do agree that it would help productivity and decision making if issues were stated directly (e.g. with a link to the issue), so that they can be worked on and the acceptance criteria for advancing the minimum version are known; that would give a better chance of literally everyone advancing past e.g. Nix 2.3. Vague references are more difficult to work with and result in subjective opinions instead of clear steps forward.

However, I still think that

How difficult is it to backport zstd to Nix 2.3?

should be answered if possible since that could solve the problem straight away.

8 Likes

Here is the naive backport without much manual testing (in Nix tradition):

Note: this is incomplete; there have been many changes to how sinks and sources work in Nix. It’s feasible, but requires a bit more work.
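One rough way to exercise such a backport by hand (my sketch, with several assumptions: a newer Nix, 2.4 or later, is used to populate the cache with its compression=zstd binary-cache parameter; the patched 2.3 provides the nix-store on PATH; and the store path is a placeholder that is not yet present in the store you test against):

    # Populate a local file:// cache with zstd-compressed NARs using a newer Nix
    # (may need --extra-experimental-features nix-command there):
    nix copy --to 'file:///tmp/zstd-cache?compression=zstd' /nix/store/<some-path>

    # Then try to substitute the path with the patched Nix 2.3:
    nix-store --realise /nix/store/<some-path> \
      --option substituters 'file:///tmp/zstd-cache' \
      --option require-sigs false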

8 Likes

I don’t know if there is some history behind this passive-aggressive comment, but in my experience Sandro has always been very responsive and thorough when reviewing PRs, providing good and thoughtful suggestions, and is undoubtedly one of the main contributors to nixpkgs.

14 Likes

These numbers make it look like the difference in compression ratio between lzma and zstd is insignificant, but I’m not so sure of this.

cache.nixos.org seems to use xz at the highest level (-9).

Fedora used xz at level -2, which is significantly less effective, so for them there was no loss in switching.

The article for Arch Linux does not state which level was used for xz, but as they claim they only lost 0.8%, I assume they did not use -9 either.
The source for Ubuntu doesn’t reveal the xz level used either.

I made a little benchmark with qemu, which is quite a big nar file. The xz coming from cache.nixos.org is 117M in size and the zstd-compressed one (-19, the highest regular level) is 129M.
That is a difference of around 10%.

Therefore by optimizing for the people sitting behind a fast connection, we will make the experience worse for the people who have to work with limited bandwidth. As a frequent traveler, I’m often in that position. That’s why I care.

Though, I have to admit, a 10% loss sounds acceptable to gain a 10x speedup elsewhere. Still, it would be nice to have actual numbers here and to see what the difference would be for the compressed closure size of a whole NixOS system, for example.
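For anyone who wants to reproduce the qemu comparison above, a rough sketch (the store path depends on your nixpkgs revision, and xz -9 is my assumption of what the cache uses):

    p=$(nix-build '<nixpkgs>' -A qemu --no-out-link)   # realise qemu, ideally from the cache
    h=$(basename "$p" | cut -c1-32)                    # narinfo name = the 32-char store hash
    curl -s "https://cache.nixos.org/$h.narinfo"       # shows URL: nar/….nar.xz and Compression: xz
    nix-store --dump "$p" > qemu.nar                   # the uncompressed NAR
    xz -9 -k qemu.nar                                  # roughly what cache.nixos.org serves today
    zstd -19 -k qemu.nar                               # the proposed replacement
    ls -lh qemu.nar qemu.nar.xz qemu.nar.zst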

10 Likes

Another option is something like GitHub - vasi/pixz: Parallel, indexed xz compressor. A quick benchmark, tarring up some version of CrossOver I had lying around on my Mac (969MiB of executables):

Tool       CTime   DTime   Size
xz -9      5min    10s     207M
zstd -19   40s     1s      254M
pixz -9    1min    2s      217M

Not sure how easy that’d be to integrate into a library but this shows the potential.

Would it be feasible for Hydra to upload multiple files with different compression algorithms and allow the user to choose? Retains compat with 2.3 without a backport if it defaults to xz, but requires double the space for Hydra.
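For context on what “allow the user to choose” would involve: each narinfo on cache.nixos.org currently points at exactly one compressed NAR via its URL and Compression fields (the values below are placeholders), so offering several formats would presumably mean multiple narinfo variants or some negotiation on top of the current protocol:

    StorePath: /nix/store/<hash>-example-1.0
    URL: nar/<filehash>.nar.xz
    Compression: xz
    FileHash: sha256:<filehash>
    FileSize: <compressed size>
    NarHash: sha256:<narhash>
    NarSize: <uncompressed size>
    References: <store path basenames>
    Sig: cache.nixos.org-1:<signature>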

2 Likes

I doubt that this is doable. We don’t even use separateDebugInfo = true; on each package because that’d imply a ~30% (IIRC) increase in storage size of a single nixpkgs build.

1 Like

FWIW that is kind of different. Increasing a single derivation’s output file size burdens all users with increased storage requirements, regardless of the NAR compression used. Increasing the space required to store more than one compressed variant of a NAR only burdens the cache’s storage. Might still be too much of a burden, but I think your comparison doesn’t apply.

1 Like

No, the debug outputs aren’t normally downloaded. EDIT: I believe the concerns really were mainly about the AWS expenses.

3 Likes

I believe the concerns really were mainly about the AWS expenses.

In case that was a response to my comment, then that’s correct. I’m aware that downloads of debug-outputs only happen with e.g. environment.enableDebugInfo = true;, but we still have to store it in S3.

1 Like

Regarding pixz:

Any parallelism speedup it brings is also available to zstd (with pzstd from the same package). That is, pzstd remains ~13x faster than pixz.

For example, I benchmarked this on a 6-core/12-thread Xeon E5-1650, with invocations like pzstd -v file --stdout > /dev/null:

               user(s)   system(s)    CPU   total(s)   throughput
10 GiB with zstd 1.5.0
  compression
      zstd     18.61s    3.05s       128%     16.792    607 MB/s
      pzstd    37.91s    2.87s      1067%      3.821   2668 MB/s
  decompression
      zstd      5.37s    0.10s        99%      5.482   1859 MB/s
      pzstd    11.24s    2.20s      1035%      1.299   7848 MB/s

So parallelism brings > 2 GB/s compression and > 8 GB/s decompression, making it suitable even for 10/100 Gbit/s networking and a single fast SSD.

I used zstd -T8 in my testing. It was already as parallel as reasonably possible on this machine.
Sorry for not being explicit about that. Didn’t know about the pzstd alias, else I would’ve just written that for clarity.

>2 GB/s compression sounds very high, too high. Are you using /dev/zero as the source here? Extremely low or extremely high entropy test data isn’t very useful for evaluating performance on medium- or mixed-entropy data, IME, as programs like zstd tend to have specific fast paths for those kinds of data.

As far as I know, pzstd is not an alias, and it is different from zstd -T8.

Details:
pzstd splits the input into chunks to compress in parallel, and adds headers to the output to allow parallel decompression.
Only the combination of “compress with pzstd → decompress with pzstd” allows parallel decompression. The normal zstd binary cannot decompress in parallel, nor can pzstd decompress files made with the normal zstd binary in parallel (for now), see:
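A quick way to check both directions yourself (zstd/pzstd 1.5.x; -p sets the pzstd thread count, and 8 is just an example value):

    pzstd -p 8 -k bigfile                           # parallel compression -> bigfile.zst
    zstd -t bigfile.zst                             # plain zstd accepts the frames…
    time zstd -d -c bigfile.zst > /dev/null         # …but decompresses them single-threaded
    time pzstd -d -p 8 -c bigfile.zst > /dev/null   # pzstd decompresses its own output in parallel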

No, for my test above I compressed a Ceph log file (plain text) – it does compress well though, 10 GiB → 680 MB (14x).

You’re right; here we have numbers for a cudatoolkit store path that compresses only by a factor of 2x:

               user(s)   system(s)    CPU   total(s)   throughput
3.6 GiB cudatoolkit nix store path tar'ed, with zstd 1.5.0, compresses to 1.9 GiB
  compression
      zstd     19.70s    1.12s       110%     18.795    193 MB/s
      pzstd    57.50s    1.40s      1057%      5.570    651 MB/s
  decompression
      zstd      3.22s    0.25s        99%      3.468   1045 MB/s
      pzstd     6.29s    0.83s       851%      0.836   4339 MB/s

So for such cases, the decompression numbers change by about 2x.

2 Likes

Re-ran the benchmark with -T10 and pzstd:

Tool            CTime   DTime   Size
xz -9           5min    10s     207M
zstd -19 -T10   34s     1s      254M
pzstd -19       30s     0.4s    257M
pixz -9         1min    2s      217M

pzstd decompresses ~4x as fast as pixz. However, pixz already decompresses at speeds on the order of 100MB/s compressed / 500MB/s uncompressed on my M1 Pro which is the class of hardware you are likely to use if you have access to 1Gb internet.

Also keep in mind that this decision would have an effect on every Nix user, not just those with fast pipes.
For a user who is unfortunate enough to live in an internet-desolate 3rd world country like, for example, Germany, download speed is almost always the bottleneck. A 25% increase in output size means a 25% increase in download time for them.

I’m a fan of zstd too and strongly believe it’s the best general purpose compression tool but it can’t really compete on the extreme high or low end of the compress ratio/speed spectrum. With a Linux distribution we have to consider all kinds of users, not just the fortunate ones.

A compromise I’d be comfortable with would be to offer recent output paths via zstd and xz but drop the zstd version after a certain time. One month sounds good to me but it could be even shorter. At that point we could even consider the likes of lz4 but I don’t think that’s necessary with currently available home internet speeds as even pixz is able to saturate 1Gb/s with moderately high-end HW and zstd should manage to saturate a 10Gb/s one with true high-end machines.

For the Nix store paths I tested, the difference seems to be around 10%, similar to the numbers @DavHau found above.

For example, some detail benchmarks for a recent Chromium store path:

                             |--------------------- compression ---------------------------| |--- decompression ---|
                                                                      per-core   total        total
                       size  user(s)   system(s)    CPU    total(s)   throughput throughput   total(s)   throughput   comments
chromium: tar c /nix/store/620lqprbzy4pgd2x4zkg7n19rfd59ap7-chromium-unwrapped-108.0.5359.98
  uncompressed         473Mi
  compression (note `--ultra` is given to enable zstd levels > 19), zstd v1.5.0, XZ utils v5.2.5
    xz -9              102M  216.34     0.54s       99%    3:37.07    2.29 MB/s  2.29 MB/s     6.227       79 MB/s
    zstd -19           113M  176.42     0.56s      100%    2:56.66    2.81 MB/s  2.81 MB/s     0.624      794 MB/s
    zstd -19  --long   111M  200.84     0.52s      100%    3:21.07    2.46 MB/s  2.46 MB/s     0.686      722 MB/s
    zstd -22           108M  210.77     0.74s      100%    3:31.44    2.35 MB/s  2.35 MB/s     0.716      692 MB/s
    zstd -22  --long   108M  214.96     0.64s      100%    3:35.53    2.30 MB/s  2.30 MB/s     0.716      692 MB/s    bit-identical to above for this input
    pzstd -19          114M  270.05     1.20s     1064%      25.47    1.83 MB/s 19.83 MB/s     0.244     2032 MB/s
    pzstd -22          108M  224.17     0.66s      100%    3:44.80    2.21 MB/s  2.21 MB/s     0.721      687 MB/s    single-threaded comp/decomp!

(Edit 2022-12-19 14:26: I’ve split the compression throughput into per-core and total; before I had only per-core called throughput which was confusing. Now one can see that pzstd trades a bit of per-core throughput for higher total throughput.)

In here I also found that:

  • pzstd doesn’t support --long (issue says that’s by design)
  • pzstd -22 couldn’t multi-thread on this input, maybe it was too small for use with -22?

That is true; the compression ratio of xz -9 cannot currently be beaten by zstd. The faster decompression speed would have to be bought with a percentage size increase (10% if chromium above is representative). For Arch it was only 0.8%, and that is because they weren’t using xz -9 (their benchmark ahead of the switch suggests they used no level argument).

So I agree that if we want to trade roughly 10x more CPU for roughly 10% less storage, switching to pixz makes sense.


Judging the tradeoff:

I am also from, and often have to download from, the above-mentioned Internet 3rd world country over a 12 Mbit/s connection, so from that perspective, 1.1x growth wouldn’t be great.

At the same time though I have more machines in data centers and on my LAN, for which a 10x speedup would be awesome.

In any case, I think sticking with single-threaded xz would be pretty bad.


Is pixz available as a library?

I guess pixz would also be a breaking change for older nix versions, like zstd, is that correct?

1 Like

There is an idea I’ve not had time to fully develop or RFC (anyone want to team up?), but there is a way to make zstd allow for deduplication via a mechanism similar to rsync/casync rolling hashes. This would also be backwards compatible such that normal zstd libraries and clients would not be impacted. By storing NARs like this one can get pretty nice dedup by comparing blocks with similar NARs.

The Nix client would maintain a store of blocks and fetch the index from the zstd NAR, then only fetch the required bytes via Range requests (supported by S3). This will need some heuristics, as it is often not worth it, but in some situations the dedup is impressive.

I’ve done some testing showing that older Nix versions can still process and use these cloud-optimized NARs transparently, and that a smarter client achieves dedup.
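To make that concrete, a toy illustration (my own sketch, not the actual implementation: chunk boundaries would come from a rolling hash rather than the fixed-size split used here, and the chunk index could live in a zstd skippable frame, which ordinary decompressors ignore):

    nix-store --dump /nix/store/<some-path> > out.nar
    mkdir chunks && split -b 1M out.nar chunks/c.                     # stand-in for content-defined chunking
    for c in chunks/c.*; do zstd -q -19 -c "$c"; done > out.nar.zst   # one independent zstd frame per chunk
    zstd -d -c out.nar.zst | cmp - out.nar                            # a plain zstd client still sees one valid file
    # A dedup-aware client would keep a local (chunk hash -> bytes) store, read the index,
    # and fetch only the frames it is missing, e.g. via HTTP Range requests:
    #   curl -r <frame-start>-<frame-end> https://cache.example.org/nar/<filehash>.nar.zst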

2 Likes

Inevitably this rings a bell, but without forming a symphony, yet:

@tomberek Can you explain again in a little more layman’s terms your zst proposal?

A burning question I have would be how frames should/could be well-crafted to maximize deduplication?

And I guess, on the other extreme (like zstd:chunked) we have file awareness. But that doesn’t seem like a good fit for the nix model (where file-wise dedup is already quite reasonably approximated by the input hash). Yet maybe I’m missing something?

I really like this compromise. Offering a choice of compression algorithm seems great, and pruning old data this way seems like it would easily solve the space requirement problem. I think the most logical choice for the pruning timeframe would be one that covers the latest NixOS release, i.e. anything that’s in any revision of the current NixOS release should not be pruned. This means up to 6 months, but frankly, in comparison to the amount of data cached by NixOS, I’m sure 6 months of extra zstd files is not that big a deal.