Switch cache.nixos.org to ZSTD to fix slow NixOS updates / nix downloads?

No, the debug outputs aren’t normally downloaded. EDIT: I believe the concerns really were mainly about the AWS expenses.

3 Likes

I believe the concerns really were mainly about the AWS expenses.

In case that was a response to my comment, then that’s correct. I’m aware that downloads of debug-outputs only happen with e.g. environment.enableDebugInfo = true;, but we still have to store it in S3.

1 Like

Regarding pixz:

Any parallelism speedup it brings is also available to zstd (with pzstd from the same package). That is, pzstd remains ~13x faster than pixz.

For example, I benchmarked this on a 6-core/12-threads Xeon E5-1650, with invocations like pzstd -v file --stdout > /dev/null:

               user(s)   system(s)    CPU   total(s)   througput
10 GiB with zstd 1.5.0
  compression
      zstd     18.61s    3.05s       128%     16.792    607 MB/s
      pzstd    37.91s    2.87s      1067%      3.821   2668 MB/s
  decompression
      zstd      5.37s    0.10s        99%      5.482   1859 MB/s
      pzstd    11.24s    2.20s      1035%      1.299   7848 MB/s

So parallelism brings > 2 GB/s compression and > 8 GB/s decompression, thus suitable also for 10/100 Gbit/s networking and a single fast SSD.

I used zstd -T8 in my testing. It was already as parallel as reasonably possible on this machine.
Sorry for not being explicit about that. Didn’t know about the pzstd alias, else I would’ve just written that for clarity.

>2GB/s compression sounds very high, too high. Are you using /dev/zero as a source here? Extremely low or extremely high entropy test data aren’t very useful for evaluating performance of medium or mixed entropy data IME as programs like zstd like to have specific fastpaths for those kinds of data.

As far as I know, pzstd is not an alias, and different from zstd -T8.

Details:
pzstd splits the input into chunks to compress in parallel, and adds headers into the output to allow parallel decompression.
Only the combination of “compress with pzstd → decompress with pzstd” allows parallel decompression. The normal zstd binary cannot decompress parallelly, nor can pzstd parallelly decompress files made with the normal zstd binary (for now), see:

No, for my test above I compressed a Ceph log file (plain text) – it does compress well though, 10 GiB → 680 MB (14x).

You’re right; here we have numbers for a cudadoolkit store path that compresses only factor 2x:

               user(s)   system(s)    CPU   total(s)   througput
3.6 GiB cutatoolkit nix store tar'ed, with zstd 1.5.0, compresses to 1.9 GiB
  compression
      zstd     19.70s    1.12s       110%     18.795    193 MB/s
      pzstd    57.50s    1.40s      1057%      5.570    651 MB/s
  decompression
      zstd      3.22s    0.25s        99%      3.468   1045 MB/s
      pzstd     6.29s    0.83s       851%      0.836   4339 MB/s

So for such cases, decompression cases change by 2x.

2 Likes

Re-ran the benchmark with -T10 and pzstd:

Tool CTime Dtime Size
xz -9 5min 10s 207M
zstd -19 -T10 34s 1s 254M
pzstd -19 30s 0.4s 257M
pixz -9 1min 2s 217M

pzstd decompresses ~4x as fast as pixz. However, pixz already decompresses at speeds on the order of 100MB/s compressed / 500MB/s uncompressed on my M1 Pro which is the class of hardware you are likely to use if you have access to 1Gb internet.

Also keep in mind that this decision would have an effect on every Nix user, not just those with fast pipes.
For a user who is unfortunate enough to live in an internet-desolate 3rd world country like, for example, Germany, download speed is almost always the bottleneck. A 25% size increase in output size means a 25% increase in download time for them.

I’m a fan of zstd too and strongly believe it’s the best general purpose compression tool but it can’t really compete on the extreme high or low end of the compress ratio/speed spectrum. With a Linux distribution we have to consider all kinds of users, not just the fortunate ones.

A compromise I’d be comfortable with would be to offer recent output paths via zstd and xz but drop the zstd version after a certain time. One month sounds good to me but it could be even shorter. At that point we could even consider the likes of lz4 but I don’t think that’s necessary with currently available home internet speeds as even pixz is able to saturate 1Gb/s with moderately high-end HW and zstd should manage to saturate a 10Gb/s one with true high-end machines.

For the Nix store paths I tested, the difference seems to be around 10%, similar to the numbers @DavHau found above.

For example, some detail benchmarks for a recent Chromium store path:

                             |--------------------- compression ---------------------------| |--- decompression ---|
                                                                      per-core   total        total
                       size  user(s)   system(s)    CPU    total(s)   throughput throughput   total(s)   throughput   comments
chromium: tar c /nix/store/620lqprbzy4pgd2x4zkg7n19rfd59ap7-chromium-unwrapped-108.0.5359.98
  uncompressed         473Mi
  compression (note `--ultra` is given to enable zstd levels > 19), zstd v1.5.0, XZ utils v5.2.5
    xz -9              102M  216.34     0.54s       99%    3:37.07    2.29 MB/s  2.29 MB/s     6.227       79 MB/s
    zstd -19           113M  176.42     0.56s      100%    2:56.66    2.81 MB/s  2.81 MB/s     0.624      794 MB/s
    zstd -19  --long   111M  200.84     0.52s      100%    3:21.07    2.46 MB/s  2.46 MB/s     0.686      722 MB/s
    zstd -22           108M  210.77     0.74s      100%    3:31.44    2.35 MB/s  2.35 MB/s     0.716      692 MB/s
    zstd -22  --long   108M  214.96     0.64s      100%    3:35.53    2.30 MB/s  2.30 MB/s     0.716      692 MB/s    bit-identical to above for this input
    pzstd -19          114M  270.05     1.20s     1064%      25.47    1.83 MB/s 19.83 MB/s     0.244     2032 MB/s
    pzstd -22          108M  224.17     0.66s      100%    3:44.80    2.21 MB/s  2.21 MB/s     0.721      687 MB/s    single-threaded comp/decomp!

(Edit 2022-12-19 14:26: I’ve split the compression throughput into per-core and total; before I had only per-core called throughput which was confusing. Now one can see that pzstd trades a bit of per-core throughput for higher total throughput.)

In here I also found that:

  • pzstd doesn’t support --long (issue says that’s by design)
  • pzstd -22 couldn’t multi-thread on this input, maybe it was too small for use with -22?

That is true, the compress ratio of xz -9 cannot currently be beaten by zstd. The faster decompress speed would have to be bought with a percentage size increase (10% if chromium above is representative). For Arch that was only 0.8%, that is because they weren’t using xz -9 (their benchmark ahead of the switched suggests they used no argument).

So I agree that if we want to trade 10x CPU for 10% storage size, switching to pixz makes sense.


Judging the tradeoff:

I am also from, and often have to download from, the above-mentioned Internet 3rd world country over a 12 Mbit/s connection, so from that perspective, 1.1x growth wouldn’t be great.

At the same time though I have more machines in data centers and on my LAN, for which a 10x speedup would be awesome.

In any case, I think sticking with single-threaded xz would be pretty bad.


Is pixz available as a library?

I guess pixz would also be a breaking change for older nix versions, like zstd, is that correct?

1 Like

There is an idea I’ve not had time to fully develop or RFC (anyone want to team up?), but there is a way to make zstd allow for deduplication via a mechanism similar to rsync/casync rolling hashes. This would also be backwards compatible such that normal zstd libraries and clients would not be impacted. By storing NARs like this one can get pretty nice dedup by comparing blocks with similar NARs.

The Nix client would maintain a store of blocks and fetch the index from the zstd NAR, then only fetch the required bytes via Range requests (supported by S3). This will need some heuristics as often it is not worth it, but in some situations the dedup is impressive.

I’ve done some testing that older Nix versions could still process and use these cloud-optimized NARs transparently and that a smarter client achieves dedup.

3 Likes

Inevitably this rings a bell, but without forming a symphony, yet:

@tomberek Can you explain again in a little more layman’s terms your zst proposal?

A burning question I have would be how frames should/could be well-crafted to maximize deduplication?

And I guess, on the other extreme (like zst:chunked) we have file awareness. But that doesn’t seem like a good fit for the nix model (where file-wise dedup is already quite reasonably approximated by the input hash). Yet maybe I’m missing something?

I really like this compromise. Offering a choice for which compression algorithm you want to use seems great, and pruning old data this way seems like it would easily solve the space requirement problem. I think the most logical choice for the pruning timeframe would be one that covers the latest NixOS release. i.e. Anything that’s in any revision of the current NixOS release should not be pruned. This means up to 6 months, but frankly in comparison to the amount of data cached by NixOS I’m sure 6 months of extra zstd files is not that big a deal.

I wasn’t aware that zstd supports compression ratios higher than -19.
In your benchmark, using zstd -22 comes already pretty close to xz -9 in terms of compressed size. It only increases by 5.9%.

Those are pretty good results. Given the dramatic performance benefits, zstd appears like the better choice overall to me.

Concerning the parallelism, optimizing the tooling on parallelizing single file extraction would be nice but there can be a simpler approach.

Nix usually has to fetch many store paths. Why not parallelize the requests to the number of CPUs?
Extracting one file per CPU would solve the problem as well, and it is a client side only optimization which we could implement right now without breaking anything.

3 Likes

Substitution is already parallelized. It shares the max-jobs limit with build jobs (which is not ideal as the IO-bound fetching part can benefit from far more parallelism than the CPU-bound decompression part).

1 Like

https://github.com/NixOS/nix/issues/3379

1 Like

Chromium is a good sample.

I think we should probably use a larger set of more mixed packages though. I think a good candidate might be the tar’d closure of the x86 installer ISOs.

Totally forgot zstd had levels above 19. At least for chromium though, your numbers put it in a similar enough ballpark to xz -9 to consider it IMO.

As for CrossOver:

Tool CTime Dtime Size
xz -9 5min 10s 207M
pixz -9 1min 2s 217M
zstd -19 -T10 34s 1s 254M
pzstd -19 30s 0.4s 257M
zstd --ultra -22 -T10 2min 40s 1s 199M

So, yeah. That’s a bit unexpected.

We might need to evaluate memory usage at this point though as IIRC the extremely high zstd compression levels do require substantial amounts of memory; especially multi-threaded.

OTOH, Nix users already need quite a bit of memory for eval anyways, so that might not be as big of a problem.

According to my data, we’d be trading ~4x the total CPU time with pixz, not 10x.

I don’t see how this would be beneficial in your LAN? You’re free to choose whatever compression you like there (Nix already supports zstd AFAIK); this is about what cache.nixos.org should do.

That I’m unsure about. It doesn’t seem like it.

It also doesn’t smell very production-ready. For example, I just noticed that it does special tarball handling that is silently backwards-incompatible with xz (produces different tarballs that what comes in). That can be turned off but it’s on by default and already caught a certain distribution by surprise.

Actually, no. With the -t flag, pixz-compressed data can be decompressed by xz at the same speed as data compressed by regular xz.

It’d be backwards compatible with any Nix version that understands xz.

https://blog.cachix.org/posts/2022-12-19-zstd-compression/

5 Likes

I encountered 2 issues with pixz, one about compression ratio, and one about using all available cores:

Anybody know how can I get pixz’s compression side to use all cores? On my 6-core/12-thread machine, it uses only 400% CPU for the chromium store path, which over time of the compression drops to 300% and then to 200%. The averace usage according to time is then 270%.

Passing -p6 or -p12 doesn’t seem to change anything about that.


I have now added benchmarks of pixz 1.0.7 to the table (including the above-mentioned problem).

I have also added maxres outputs from command time, showing how many MB maximum RAM were needed for compression and decompression:

                             |--------------------- compression ------------------------------------| |------ decompression ------|
                                                                      per-core   total                          total
                       size  user(s) system(s)  CPU  total(m)  throughput throughput  maxres  total(s) throughput maxres  comments
chromium: tar c /nix/store/620lqprbzy4pgd2x4zkg7n19rfd59ap7-chromium-unwrapped-108.0.5359.98     
  uncompressed         473Mi
  compression (note `--ultra` is given to enable zstd levels > 19), zstd v1.5.0, XZ utils v5.2.5
    xz -9              102M  216.34   0.54s     99%  3:37.07   2.29 MB/s  2.29 MB/s   691 MB   6.227     79 MB/s   67 MB
    pixz -9            137M  216.98   1.71s    271%  1:20.57   2.28 MB/s  6.19 MB/s  2951 MB   2.551    194 MB/s  657 MB  did not use all cores consistently, for both compression and decompression
    zstd -19           113M  176.42   0.56s    100%  2:56.66   2.81 MB/s  2.81 MB/s   241 MB   0.624    794 MB/s   10 MB
    zstd -19  --long   111M  200.84   0.52s    100%  3:21.07   2.46 MB/s  2.46 MB/s   454 MB   0.686    722 MB/s  133 MB
    zstd -22           108M  210.77   0.74s    100%  3:31.44   2.35 MB/s  2.35 MB/s  1263 MB   0.716    692 MB/s  133 MB
    zstd -22  --long   108M  214.96   0.64s    100%  3:35.53   2.30 MB/s  2.30 MB/s  1263 MB   0.716    692 MB/s  133 MB  bit-identical to above for this input
    pzstd -19          114M  270.05   1.20s   1064%    25.47   1.83 MB/s 19.83 MB/s  1641 MB   0.244   2032 MB/s  564 MB
    pzstd -22          108M  224.17   0.66s    100%  3:44.80   2.21 MB/s  2.21 MB/s  1392 MB   0.721    687 MB/s  245 MB  single-threaded comp/decomp!

Oddly, pixz produces a much worse compresison ratio than any of the other approaches.

xz/pixz need 5x more RAM for decompression

pzstd needs disproportionately much RAM for decompression. I suspect this is because with the invocation pzstd -d ... > /dev/null, outputs from the various threads need to be bufferend in-memory to write them in order into the output pipe.
However, pzstd does this even when writing outputs to a regular file with -o.

I also checked how much RAM plain zstd needs to decompress the pzstd outputs; there is no difference compared to decompressing the zstd outputs.


For the chromium derivation in my table above, single-threaded zstd has 10x higher decompression throughput than single-threaded xz, and for pzstd vs pixz it’s also 10x.

Summary of my findings so far

On this chromium tar:

  • single-threaded:
    • zstd -19 wins against xz -9 on decompression speed (10x) and decompression memory usage (5x)
    • zstd -22 still wins against xz -9 on decompression speed (same 10x) but loses against decompression memory usage (0.5x)
    • xz -9 wins on best compression ratio: By 10% against zstd -19 and 5% against zstd -22
  • multi-threaded:
    • pixz -9 loses against pzstd -19 on all metrics
    • decompression memory usage can apparently be reduced 10x by decompressing 6 zstd -19 archives independently, rather than using pzstd -d to decompress 1 zstd -19 archive. That seems wrong, at least when writing to regular files. I filed a zstd issue.

Please point out if you spot any mistakes!

1 Like

I found on the zstd issue tracker a detail description of the difference between pzstd and multi-threaded normal zstd.

It confirms that zstd -T and pzstd are very different:

zstd -T# produces a single compact frame, as opposed to pzstd and mcmilk variants which produce multiple independent payloads. Decompressing in parallel multiple independent payloads can be done fairly easily, while untangling dependencies within a single frame is more complex.

and also states:

Multi-threaded [single-frame] decompression is in our task list, although there is no release date set yet.

1 Like

Hello everyone,

Given the current state of discussion, it seems like the compromise mentioned in Switch cache.nixos.org to ZSTD to fix slow NixOS updates / nix downloads? - #23 by Atemu seems a good way forward for this issue.

On the one hand, Nix 2.3 still has no support for zstd, though discussion with the TVL group showed an interest to tackle this and work on a patch they would submit to the Nix 2.3 tree or keep as a patch and provide it to nixpkgs tree. They will dogfood it first to their infrastructure and improve it. There is an incentive for them to do this as Cachix has enabled zstd compression and from my understanding, there is no xz files laying around for now (?) as a fallback.
BTW, there would be a need for reviewers once the patch (based on @andir’s work) each a certain point, @domenkozar @tomberek would you be interested to be reviewers of such a patch?

On the other hand, there are definitely users out there in “an internet-desolate 3rd world country”, @tazjin, for example, reported to me that updating a full NixOS system can cost up to 3-4USD in Egypt. I feel like this is a compelling argument to consider that compression ratio are more important than decompression speed on the long term and it would be a good signal to send saying that NixOS cares about this type of Internet infrastructure too while we improve having things like deltas, etc.

In the mean time, the consequence of adopting two tarballs format for NixOS cache require to consider the cost incurred on the existing S3 store, for this, I would ask the NixOS Foundation (cc @ron @edolstra) to weigh on this. Is it expensive? Is there any cost optimization that can be achieved to enable this usecase? If the data could be provided, I can also try to do a report in this post to offload this from the Foundation.

Also, IMHO, going with the compromise remove any need of patching Nix 2.3 right away and enable other Nix users to benefit from zstd now.

10 Likes

Will work together with @edolstra to verify this and revert. Thanks for bringing it up!

1 Like

I don’t expect that additional S3 expenses would be significant if the additional archives were removed after a short-ish period of not being needed (like one month), but I’m not sure if someone can easily produce infra code that does deletion like this.

Checking that a path is alive in this model seems difficult, as referencing from a different path has been noop AFAIK. In particular, fixed output derivation might remain alive for much longer than the chosen period, and on stable branches we may not do a full rebuild every month. Well, maybe one month after upload could be considered a good enough approximation (and not hard hopefully?), given that we’ll have xz fallback anyway :man_shrugging:

1 Like