Switch cache.nixos.org to ZSTD to fix slow NixOS updates / nix downloads?

I wasn’t aware that zstd supports compression levels higher than -19.
In your benchmark, zstd -22 already comes pretty close to xz -9 in terms of compressed size: the output is only 5.9% larger.

Those are pretty good results. Given the dramatic performance benefits, zstd appears to be the better choice overall to me.

Concerning parallelism: optimizing the tooling to parallelize single-file extraction would be nice, but there may be a simpler approach.

Nix usually has to fetch many store paths, so why not parallelize the requests up to the number of CPUs?
Extracting one file per CPU would solve the problem as well, and it is a client-side-only optimization that we could implement right now without breaking anything.
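
A rough sketch of the idea (hypothetical; assumes a newline-separated list of wanted store paths in paths.txt):

    # Realise each store path in a separate client process, one per CPU;
    # the daemon serialises store/database access, so concurrent clients are safe:
    xargs -P"$(nproc)" -n1 nix-store --realise < paths.txt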

2 Likes

Substitution is already parallelized. It shares the max-jobs limit with build jobs (which is not ideal as the IO-bound fetching part can benefit from far more parallelism than the CPU-bound decompression part).
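
In practice, that means the only knob today is max-jobs; a sketch (nix.conf, noting it also governs concurrent builds):

    # nix.conf: raising this raises substitution parallelism,
    # but also allows more concurrent local builds:
    max-jobs = 16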

1 Like

https://github.com/NixOS/nix/issues/3379

1 Like

Chromium is a good sample.

We should probably use a larger, more mixed set of packages though. A good candidate might be the tar’d closure of the x86 installer ISOs.

Totally forgot zstd had levels above 19. At least for chromium though, your numbers put it in a similar enough ballpark to xz -9 to consider it IMO.

As for CrossOver:

Tool                   CTime     DTime  Size
xz -9                  5min      10s    207M
pixz -9                1min      2s     217M
zstd -19 -T10          34s       1s     254M
pzstd -19              30s       0.4s   257M
zstd --ultra -22 -T10  2min 40s  1s     199M

So, yeah. That’s a bit unexpected.

We might need to evaluate memory usage at this point though as IIRC the extremely high zstd compression levels do require substantial amounts of memory; especially multi-threaded.

OTOH, Nix users already need quite a bit of memory for eval anyways, so that might not be as big of a problem.

According to my data, we’d be trading ~4x the total CPU time with pixz, not 10x.

I don’t see how this would be beneficial on your LAN? You’re free to choose whatever compression you like there (Nix already supports zstd AFAIK); this is about what cache.nixos.org should do.
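
For a private cache, the compression is just a store-URI parameter; a sketch (local file:// cache, path assumed):

    # Copy a closure to a local binary cache, compressing NARs with zstd:
    nix copy --to 'file:///var/nix-cache?compression=zstd' /run/current-system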

That I’m unsure about. It doesn’t seem like it.

It also doesn’t smell very production-ready. For example, I just noticed that it does special tarball handling that is silently backwards-incompatible with xz (it produces different tarballs than what comes in). That can be turned off, but it’s on by default and has already caught a certain distribution by surprise.

Actually, no. With the -t flag, pixz-compressed data can be decompressed by xz at the same speed as data compressed by regular xz.

It’d be backwards compatible with any Nix version that understands xz.
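
A sketch of that invocation (file names hypothetical):

    # -t disables pixz’s tarball-aware indexing; the output is a plain .xz stream:
    pixz -t -9 < chromium.tar > chromium.tar.xz
    # ...which stock xz can decompress:
    xz -d < chromium.tar.xz > chromium-roundtrip.tar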

https://blog.cachix.org/posts/2022-12-19-zstd-compression/

5 Likes

I encountered 2 issues with pixz, one about compression ratio, and one about using all available cores:

Does anybody know how I can get pixz’s compression side to use all cores? On my 6-core/12-thread machine, it uses only 400% CPU for the chromium store path, which drops to 300% and then to 200% over the course of the compression. The average usage according to time is then 270%.

Passing -p6 or -p12 doesn’t seem to change anything about that.


I have now added benchmarks of pixz 1.0.7 to the table (including the above-mentioned problem).

I have also added maxres outputs from command time, showing the maximum RAM in MB needed for compression and decompression:

                             |--------------------- compression ------------------------------------| |------ decompression ------|
                                                                      per-core   total                          total
                       size  user(s) system(s)  CPU  total(m)  throughput throughput  maxres  total(s) throughput maxres  comments
chromium: tar c /nix/store/620lqprbzy4pgd2x4zkg7n19rfd59ap7-chromium-unwrapped-108.0.5359.98     
  uncompressed         473Mi
  compression (note `--ultra` is given to enable zstd levels > 19), zstd v1.5.0, XZ utils v5.2.5
    xz -9              102M  216.34   0.54s     99%  3:37.07   2.29 MB/s  2.29 MB/s   691 MB   6.227     79 MB/s   67 MB
    pixz -9            137M  216.98   1.71s    271%  1:20.57   2.28 MB/s  6.19 MB/s  2951 MB   2.551    194 MB/s  657 MB  did not use all cores consistently, for both compression and decompression
    zstd -19           113M  176.42   0.56s    100%  2:56.66   2.81 MB/s  2.81 MB/s   241 MB   0.624    794 MB/s   10 MB
    zstd -19  --long   111M  200.84   0.52s    100%  3:21.07   2.46 MB/s  2.46 MB/s   454 MB   0.686    722 MB/s  133 MB
    zstd -22           108M  210.77   0.74s    100%  3:31.44   2.35 MB/s  2.35 MB/s  1263 MB   0.716    692 MB/s  133 MB
    zstd -22  --long   108M  214.96   0.64s    100%  3:35.53   2.30 MB/s  2.30 MB/s  1263 MB   0.716    692 MB/s  133 MB  bit-identical to above for this input
    pzstd -19          114M  270.05   1.20s   1064%    25.47   1.83 MB/s 19.83 MB/s  1641 MB   0.244   2032 MB/s  564 MB
    pzstd -22          108M  224.17   0.66s    100%  3:44.80   2.21 MB/s  2.21 MB/s  1392 MB   0.721    687 MB/s  245 MB  single-threaded comp/decomp!
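
For reference, each row was gathered roughly like this (a sketch; exact flags assumed, GNU time’s -v maxres reporting):

    # Compression (zstd -19 row; the other rows swap out the compressor):
    tar c /nix/store/620lqprbzy4pgd2x4zkg7n19rfd59ap7-chromium-unwrapped-108.0.5359.98 > chromium.tar
    command time -v zstd -19 -f -o chromium.tar.zst chromium.tar
    # Decompression; "Maximum resident set size" is the maxres column:
    command time -v zstd -d -f -o /dev/null chromium.tar.zst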

Oddly, pixz produces a much worse compression ratio than any of the other approaches.

xz/pixz need 5x more RAM for decompression

pzstd needs disproportionately much RAM for decompression. I suspect this is because, with the invocation pzstd -d ... > /dev/null, outputs from the various threads need to be buffered in memory so they can be written in order to the output pipe.
However, pzstd does this even when writing the output to a regular file with -o.

I also checked how much RAM plain zstd needs to decompress the pzstd outputs; there is no difference compared to decompressing the zstd outputs.
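
The invocations compared, for concreteness (a sketch; flags and file names assumed):

    pzstd -d -c chromium.tar.zst > /dev/null   # high maxres: thread outputs buffered in RAM
    pzstd -d -f -o out.tar chromium.tar.zst    # still high maxres, despite the seekable output
    zstd  -d -f -o out.tar chromium.tar.zst    # low maxres, but single-threaded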


For the chromium derivation in my table above, single-threaded zstd has 10x higher decompression throughput than single-threaded xz, and for pzstd vs pixz it’s also 10x.

Summary of my findings so far

On this chromium tar:

  • single-threaded:
    • zstd -19 wins against xz -9 on decompression speed (10x faster) and decompression memory usage (5x less)
    • zstd -22 still wins against xz -9 on decompression speed (the same 10x) but loses on decompression memory usage (~2x the RAM)
    • xz -9 wins on compression ratio: by 10% against zstd -19 and 5% against zstd -22
  • multi-threaded:
    • pixz -9 loses against pzstd -19 on all metrics
    • decompression memory usage can apparently be reduced 10x by decompressing 6 zstd -19 archives independently, rather than using pzstd -d to decompress 1 zstd -19 archive (see the sketch below). That seems wrong, at least when writing to regular files. I filed a zstd issue.
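
A sketch of that independent-decompression pattern (file names hypothetical):

    # Six independent single-threaded decompressions in parallel keep
    # per-process maxres at plain-zstd levels:
    ls chunk-*.tar.zst | xargs -P6 -n1 zstd -d -f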

Please point out if you spot any mistakes!

1 Like

I found a detailed description of the difference between pzstd and multi-threaded plain zstd on the zstd issue tracker.

It confirms that zstd -T and pzstd are very different:

zstd -T# produces a single compact frame, as opposed to pzstd and mcmilk variants which produce multiple independent payloads. Decompressing in parallel multiple independent payloads can be done fairly easily, while untangling dependencies within a single frame is more complex.

and also states:

Multi-threaded [single-frame] decompression is in our task list, although there is no release date set yet.
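
The structural difference is easy to see for yourself (a sketch; zstd -l lists the frames in a .zst file, flags assumed):

    zstd -19 -T4 -f -o single.zst chromium.tar   # one compact frame
    pzstd -19 -f -o multi.zst chromium.tar       # multiple independent frames
    zstd -l single.zst multi.zst                 # compare the frame counts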

1 Like

Hello everyone,

Given the current state of the discussion, the compromise mentioned in Switch cache.nixos.org to ZSTD to fix slow NixOS updates / nix downloads? - #23 by Atemu seems like a good way forward for this issue.

On the one hand, Nix 2.3 still has no support for zstd, though discussion with the TVL group showed an interest in tackling this and working on a patch they would either submit to the Nix 2.3 tree or keep as a patch provided to the nixpkgs tree. They will dogfood it on their infrastructure first and improve it. There is an incentive for them to do this, as Cachix has enabled zstd compression and, from my understanding, there are no xz files lying around as a fallback for now (?).
BTW, there will be a need for reviewers once the patch (based on @andir’s work) reaches a certain point. @domenkozar, @tomberek, would you be interested in reviewing such a patch?

On the other hand, there are definitely users out there in “an internet-desolate 3rd world country”; @tazjin, for example, reported to me that updating a full NixOS system can cost up to 3–4 USD in Egypt. I feel this is a compelling argument for considering compression ratio more important than decompression speed in the long term, and it would send a good signal that NixOS cares about this type of Internet infrastructure too, while we work on things like deltas, etc.

In the meantime, adopting two tarball formats for the NixOS cache requires considering the cost incurred on the existing S3 store. For this, I would ask the NixOS Foundation (cc @ron @edolstra) to weigh in: Is it expensive? Is there any cost optimization that could be achieved to enable this use case? If the data can be provided, I can also try to write a report in this thread to offload this from the Foundation.

Also, IMHO, going with the compromise removes any need to patch Nix 2.3 right away and enables other Nix users to benefit from zstd now.

10 Likes

Will work together with @edolstra to verify this and revert. Thanks for bringing it up!

1 Like

I don’t expect that additional S3 expenses would be significant if the additional archives were removed after a short-ish period of not being needed (like one month), but I’m not sure if someone can easily produce infra code that does deletion like this.

Checking that a path is alive in this model seems difficult, as being referenced from a different path has AFAIK been a no-op. In particular, fixed-output derivations might remain alive for much longer than the chosen period, and on stable branches we may not do a full rebuild every month. Well, maybe one month after upload could be considered a good-enough approximation (and hopefully not hard to implement?), given that we’ll have the xz fallback anyway :man_shrugging:
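
If the zstd copies lived under their own key prefix, expiry could be a plain S3 lifecycle rule rather than custom infra code; a hypothetical sketch (bucket and prefix names invented):

    # Expire objects under nar-zstd/ 30 days after upload:
    aws s3api put-bucket-lifecycle-configuration --bucket cache-nixos-org \
      --lifecycle-configuration '{"Rules":[{"ID":"expire-zstd-nars",
        "Filter":{"Prefix":"nar-zstd/"},"Status":"Enabled",
        "Expiration":{"Days":30}}]}'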

1 Like

It could be helpful to collect some statistics on compression used in substitutions. It’d be great to get insight into how many substitutions still happen via Nix client versions which don’t request zstd and how many substitutions would have liked zstd but it wasn’t available (and why).

FODs are an interesting point. On the one hand, we could simply offer them as xz-only and that’d probably be fine. That could even prove beneficial, as for our use-cases they’re source code archives most of the time, where xz will likely achieve greater compression and decompression is more often bottlenecked by drive speed.
OTOH we could also offer them as zstd-only and let users without zstd support fetch the FODs themselves. In the case of FODs, our cache isn’t really much of a cache but rather a mirror.

Side note about the OP of large downloads: I expect there’s still some lower-hanging fruit in closure reductions. Dependencies or files that are rarely or even never used. It just seems that most contributors don’t mind these costs too much.

3 Likes

Nix does not support multiple compression methods per .narinfo file. So we cannot offer store paths using both xz and zstd compression.
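
For context, a .narinfo names exactly one NAR file and one compression method; a sketch of its shape (hashes and sizes elided):

    StorePath: /nix/store/620lqprbzy4pgd2x4zkg7n19rfd59ap7-chromium-unwrapped-108.0.5359.98
    URL: nar/….nar.xz
    Compression: xz
    FileHash: sha256:…
    FileSize: …
    NarHash: sha256:…
    NarSize: …
    References: …

There is a single URL/Compression pair, so offering a second format would require either a second narinfo or a format extension.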

BTW, if we really care about download speeds, then the real focus should be on closure size optimisation. My desktop NixOS 18.09 closure was 4.6 GB; the mostly equivalent 22.11 configuration is 13.7 GB, including 5 versions of ffmpeg, 2 versions of qtwebengine, a gdb that has ballooned to 672 MB, something named mbrola that takes up 676 MB, and 121 -dev outputs.
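
A quick way to hunt for such offenders (a sketch; nix path-info can print NAR sizes per path):

    # The 20 largest store paths in the current system closure:
    nix path-info -rsh /run/current-system | sort -hk2 | tail -20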

21 Likes

Can you add zchunk to the comparison?
Just out of curiosity. One would still have to solve the problem of finding a suitable local reference to benefit from the chunking.

@wamserma That doesn’t look too useful for our case. We probably need something tailored towards Nix’ use-cases for the case of chunking. My CrossOver binary tarball from before came out over 100 MiB larger than any of the other options, and it took a minute to compress with no parallelism.

@Atemu I had no idea how it would perform, hence I asked for a test run. Nix- (or rather nar-)specific chunking has been discussed a few times, e.g. in the Attic-Thread.

btw: mbrola is a TTS package and probably pulled in due to this: okular pulls in mbrola worth > 600 mb · Issue #207204 · NixOS/nixpkgs · GitHub

1 Like

zstd is now backported: [2.3-maintenance] libutil: add ZstdDecompressionSink by edef1c · Pull Request #9221 · NixOS/nix · GitHub thanks to @edef !

8 Likes

I opened Tag 2.3.17 from `2.3-maintenance` branch · Issue #9244 · NixOS/nix · GitHub, so this can ideally trickle into a new version number, so (smart) HTTP caches can detect if zstd support is available.

2 Likes

Nix does not support multiple compression methods per .narinfo file. So we cannot offer store paths using both xz and zstd compression.

So it’s not possible to introduce this narinfo extension as a non-breaking change, is it?

I have another issue which may or may not be related:

nix flake update --commit-lock-file

given that nixpkgs is ± the only flake input, it takes at least 1 min (MBA M1, macOS) after the download progress has stopped, which seems like a bit of forever.

not sure if nix-channel --update is any faster, but that’s irrelevant for me as I’ve gone flakes-only due to #10247