Yes, to be clear, I don’t think it is useful to convert old packages to zstd.
Given that Nix rebuilds everything when a fundamental library like the libc is upgraded, we’d get there a lot faster than “slowly” anyway.
“vocal”? I found a single comment complaining about unhandled PRs and unhelpful comments, without linking those issues/PRs.
I’m totally with Sandro… Just merge it…
There are currently 4 thumbs-ups on andir’s post, and more coming in, so it doesn’t look like something one should just discard.
I also buy the argument that some stuff doesn’t work in newer nix. I’ve hit workflow-breaking regressions myself as well, and they directly impacted the ability of my company to build stuff.
I do agree that it would help productivity and decision making if issues were stated directly (e.g. with a link to the issue), so that they can be worked on and the acceptance criteria for advancing the minimum version are known, giving a better chance that literally everyone advances past e.g. Nix 2.3. Vague references are more difficult to work with and result in subjective opinions instead of clear steps forward.
However, I still think that

> How difficult is it to backport zstd to Nix 2.3?

should be answered if possible, since that could solve the problem straight away.
Here is the naive backport without much manual testing (in Nix tradition):
Note: this is incomplete; there have been many changes to how sinks and sources work in Nix. It isn’t infeasible, but it requires a bit more work.
I don’t know if there is some history behind this passive-aggressive comment, but in my experience Sandro has always been very responsive and thorough when reviewing PRs, providing good and thoughtful suggestions, and he is undoubtedly one of the main contributors to nixpkgs.
These numbers make it look like the difference in compression ratio between lzma and zstd is insignificant, but I’m not so sure of this.
cache.nixos.org seems to use xz at the highest level (-9).
Fedora used xz with level -2, which is significantly less effective, so there was no loss for them in switching.
The article for Arch Linux does not state which xz level was used, but since they claim to have lost only 0.8%, I assume they did not use -9 either.
The source for Ubuntu also doesn’t reveal the xz level used.
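The preset level matters a great deal here. A rough illustration using Python’s stdlib `lzma` bindings (whose presets mirror xz’s `-2`/`-9` flags; the sample data below is synthetic and only stands in for typical package contents):

```python
import lzma

# Synthetic, medium-repetition sample data (illustrative only).
data = b"".join(b"some shared library symbol %d\n" % (i % 997)
                for i in range(200_000))

fast = lzma.compress(data, preset=2)  # roughly `xz -2`
best = lzma.compress(data, preset=9)  # roughly `xz -9`

print(len(data), len(fast), len(best))
```

The gap between presets depends heavily on the input, which is why comparisons against distributions that used low xz levels don’t transfer to cache.nixos.org.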
I made a little benchmark with qemu, which is quite a big NAR file. The xz coming from cache.nixos.org is 117M in size, and the zstd-compressed one (-19 / highest level) is 129M.
That is a difference of around 10%.
Therefore by optimizing for the people sitting behind a fast connection, we will make the experience worse for the people who have to work with limited bandwidth. As a frequent traveler, I’m often in that position. That’s why I care.
Though, I have to admit, 10% loss sounds acceptable to gain 10X speeds elsewhere. Still, it would be nice to have actual numbers here and see what would be the difference on the compressed closure size of a whole nixos system, for example.
Another option is something like pixz (GitHub: vasi/pixz, a parallel, indexed xz compressor). A quick benchmark, tarring up some version of CrossOver I had lying around on the Mac (969MiB of executables):
| Tool | CTime | DTime | Size |
|---|---|---|---|
| xz -9 | 5min | 10s | 207M |
| zstd -19 | 40s | 1s | 254M |
| pixz -9 | 1min | 2s | 217M |
Not sure how easy that’d be to integrate into a library but this shows the potential.
Would it be feasible for Hydra to upload multiple files with different compression algorithms and allow the user to choose? Retains compat with 2.3 without a backport if it defaults to xz, but requires double the space for Hydra.
I doubt that this is doable. We don’t even use separateDebugInfo = true; on each package, because that’d imply a ~30% (IIRC) increase in the storage size of a single nixpkgs build.
FWIW that is kind of different. Increasing a single derivation’s output file size burdens all users with increased storage requirements, regardless of the NAR compression used. Increasing the space required to store more than one compressed variant of a NAR only burdens the cache’s storage. Might still be too much of a burden, but I think your comparison doesn’t apply.
No, the debug outputs aren’t normally downloaded. EDIT: I believe the concerns really were mainly about the AWS expenses.
> I believe the concerns really were mainly about the AWS expenses.
In case that was a response to my comment, then that’s correct. I’m aware that downloads of debug outputs only happen with e.g. environment.enableDebugInfo = true;, but we still have to store them in S3.
Regarding pixz: any parallelism speedup it brings is also available to zstd (with pzstd from the same package). That is, pzstd remains ~13x faster than pixz.
For example, I benchmarked this on a 6-core/12-thread Xeon E5-1650, with invocations like pzstd -v file --stdout > /dev/null:
```
               user(s)  system(s)  CPU    total(s)  throughput
10 GiB with zstd 1.5.0
compression
  zstd         18.61s   3.05s       128%  16.792     607 MB/s
  pzstd        37.91s   2.87s      1067%   3.821    2668 MB/s
decompression
  zstd          5.37s   0.10s        99%   5.482    1859 MB/s
  pzstd        11.24s   2.20s      1035%   1.299    7848 MB/s
```
So parallelism brings >2 GB/s compression and >8 GB/s decompression, making it suitable even for 10/100 Gbit/s networking and a single fast SSD.
I used zstd -T8 in my testing. It was already as parallel as reasonably possible on this machine. Sorry for not being explicit about that. I didn’t know about the pzstd alias, else I would’ve just written that for clarity.
>2 GB/s compression sounds very high, too high. Are you using /dev/zero as a source here? Extremely low or extremely high entropy test data isn’t very useful for evaluating performance on medium or mixed entropy data IME, as programs like zstd have specific fast paths for those kinds of data.
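The entropy effect is easy to demonstrate; a quick sketch using Python’s stdlib `zlib` as a stand-in for zstd (the exact ratios differ per codec, but the shape of the result is the same):

```python
import os
import zlib

size = 1_000_000
low = b"\x00" * size    # extremely low entropy (like /dev/zero)
high = os.urandom(size)  # extremely high entropy (random bytes)
# Medium entropy: repetitive but varied, text-like content.
mid = b"".join(b"log line %d: status=ok\n" % i for i in range(40_000))

for name, data in [("zeros", low), ("random", high), ("text", mid)]:
    ratio = len(data) / len(zlib.compress(data, 6))
    print(f"{name}: {ratio:.1f}x")
```

Zeros compress to almost nothing, random bytes don’t compress at all, and only the text-like input gives a number that says anything about real workloads.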
As far as I know, pzstd is not an alias, and is different from zstd -T8.

Details:

- pzstd splits the input into chunks to compress in parallel, and adds headers into the output to allow parallel decompression.
- Only the combination “compress with pzstd → decompress with pzstd” allows parallel decompression. The normal zstd binary cannot decompress in parallel, nor can pzstd decompress files made with the normal zstd binary in parallel (for now), see:
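That chunked framing can be sketched roughly like this, using Python’s `zlib` as a stand-in for zstd frames (the chunk size and thread pool here are illustrative choices, not pzstd’s actual on-disk format):

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 64 * 1024  # illustrative chunk size

def pcompress(data: bytes) -> list[bytes]:
    # Split the input into fixed-size chunks and compress each one
    # independently, so every chunk can later be decompressed
    # without needing any of the others.
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(zlib.compress, chunks))

def pdecompress(frames: list[bytes]) -> bytes:
    # Each frame is self-contained, so decompression parallelizes too.
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(zlib.decompress, frames))

data = b"example payload " * 50_000  # ~800 KiB, several chunks
assert pdecompress(pcompress(data)) == data
```

The trade-off is visible in the benchmarks above: independent chunks lose cross-chunk matches, which is why pzstd’s ratio is slightly worse than single-stream zstd’s.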
No, for my test above I compressed a Ceph log file (plain text) – it does compress well though, 10 GiB → 680 MB (14x).

You’re right; here we have numbers for a cudatoolkit store path that compresses only by a factor of 2x:
```
               user(s)  system(s)  CPU    total(s)  throughput
3.6 GiB cudatoolkit nix store path tar'ed, with zstd 1.5.0, compresses to 1.9 GiB
compression
  zstd         19.70s   1.12s       110%  18.795     193 MB/s
  pzstd        57.50s   1.40s      1057%   5.570     651 MB/s
decompression
  zstd          3.22s   0.25s        99%   3.468    1045 MB/s
  pzstd         6.29s   0.83s       851%   0.836    4339 MB/s
```
So for such cases, the decompression speeds change by about 2x.
Re-ran the benchmark with -T10 and pzstd:

| Tool | CTime | DTime | Size |
|---|---|---|---|
| xz -9 | 5min | 10s | 207M |
| zstd -19 -T10 | 34s | 1s | 254M |
| pzstd -19 | 30s | 0.4s | 257M |
| pixz -9 | 1min | 2s | 217M |
pzstd decompresses ~4x as fast as pixz. However, pixz already decompresses at speeds on the order of 100 MB/s compressed / 500 MB/s uncompressed on my M1 Pro, which is the class of hardware you are likely to use if you have access to 1Gb internet.
Also keep in mind that this decision would have an effect on every Nix user, not just those with fast pipes.
For a user unfortunate enough to live in an internet-desolate 3rd world country like, for example, Germany, download speed is almost always the bottleneck. A 25% increase in output size means a 25% increase in download time for them.
I’m a fan of zstd too and strongly believe it’s the best general purpose compression tool but it can’t really compete on the extreme high or low end of the compress ratio/speed spectrum. With a Linux distribution we have to consider all kinds of users, not just the fortunate ones.
A compromise I’d be comfortable with would be to offer recent output paths via zstd and xz but drop the zstd version after a certain time. One month sounds good to me but it could be even shorter. At that point we could even consider the likes of lz4 but I don’t think that’s necessary with currently available home internet speeds as even pixz is able to saturate 1Gb/s with moderately high-end HW and zstd should manage to saturate a 10Gb/s one with true high-end machines.
For the Nix store paths I tested, the difference seems to be around 10%, similar to the numbers @DavHau found above.
For example, some detail benchmarks for a recent Chromium store path:
```
                                |-------------------- compression ---------------------|  |-- decompression --|
                                                          per-core     total                 total
                 size  user(s)   system(s)  CPU   total(s) throughput  throughput  total(s) throughput  comments

chromium: tar c /nix/store/620lqprbzy4pgd2x4zkg7n19rfd59ap7-chromium-unwrapped-108.0.5359.98
uncompressed     473Mi

compression (note `--ultra` is given to enable zstd levels > 19), zstd v1.5.0, XZ utils v5.2.5
xz -9            102M  216.34    0.54s       99%  3:37.07  2.29 MB/s   2.29 MB/s   6.227    79 MB/s
zstd -19         113M  176.42    0.56s      100%  2:56.66  2.81 MB/s   2.81 MB/s   0.624   794 MB/s
zstd -19 --long  111M  200.84    0.52s      100%  3:21.07  2.46 MB/s   2.46 MB/s   0.686   722 MB/s
zstd -22         108M  210.77    0.74s      100%  3:31.44  2.35 MB/s   2.35 MB/s   0.716   692 MB/s
zstd -22 --long  108M  214.96    0.64s      100%  3:35.53  2.30 MB/s   2.30 MB/s   0.716   692 MB/s  bit-identical to above for this input
pzstd -19        114M  270.05    1.20s     1064%    25.47  1.83 MB/s  19.83 MB/s   0.244  2032 MB/s
pzstd -22        108M  224.17    0.66s      100%  3:44.80  2.21 MB/s   2.21 MB/s   0.721   687 MB/s  single-threaded comp/decomp!
```
(Edit 2022-12-19 14:26: I’ve split the compression throughput into per-core and total; before, I had only per-core, labelled throughput, which was confusing. Now one can see that pzstd trades a bit of per-core throughput for higher total throughput.)
In here I also found that:

- pzstd doesn’t support --long (the issue says that’s by design)
- pzstd -22 couldn’t multi-thread on this input; maybe it was too small for use with -22?

That is true, the compression ratio of xz -9 cannot currently be beaten by zstd. The faster decompression speed would have to be bought with a percentage size increase (10%, if chromium above is representative). For Arch the increase was only 0.8% because they weren’t using xz -9 (their benchmark ahead of the switch suggests they used no level argument).
So I agree that if we want to trade 10x CPU for 10% storage size, switching to pixz makes sense.
Judging the tradeoff:
I am also from, and often have to download from, the above-mentioned Internet 3rd world country over a 12 Mbit/s connection, so from that perspective, 1.1x growth wouldn’t be great.
At the same time though I have more machines in data centers and on my LAN, for which a 10x speedup would be awesome.
In any case, I think sticking with single-threaded xz would be pretty bad.
Is pixz available as a library?

I guess pixz would also be a breaking change for older Nix versions, like zstd, is that correct?
There is an idea I’ve not had time to fully develop or RFC (anyone want to team up?), but there is a way to make zstd allow for deduplication via a mechanism similar to rsync/casync rolling hashes. This would also be backwards compatible such that normal zstd libraries and clients would not be impacted. By storing NARs like this one can get pretty nice dedup by comparing blocks with similar NARs.
The Nix client would maintain a store of blocks and fetch the index from the zstd NAR, then only fetch the required bytes via Range requests (supported by S3). This will need some heuristics as often it is not worth it, but in some situations the dedup is impressive.
I’ve done some testing that older Nix versions could still process and use these cloud-optimized NARs transparently and that a smarter client achieves dedup.
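A minimal sketch of the rolling-hash idea (content-defined chunking as in rsync/casync; the window size, mask, and polynomial hash here are illustrative choices, not the actual proposal):

```python
import hashlib
import random

WINDOW = 48           # rolling-window size (illustrative)
MASK = (1 << 12) - 1  # average chunk size ~4 KiB (illustrative)
B, M = 31, 1 << 32
BW = pow(B, WINDOW, M)

def chunk_ids(data: bytes) -> list[str]:
    """Split data at content-defined boundaries and return
    content-addressed chunk IDs (equal chunks dedupe across NARs)."""
    cuts, h = [0], 0
    for i, byte in enumerate(data):
        h = (h * B + byte) % M
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * BW) % M  # drop the oldest byte
            # Boundary wherever the rolling hash of the last WINDOW bytes
            # matches the mask, independent of absolute file offset.
            if (h & MASK) == 0:
                cuts.append(i + 1)
    cuts.append(len(data))
    return [hashlib.sha256(data[a:b]).hexdigest()
            for a, b in zip(cuts, cuts[1:])]

random.seed(0)
base = bytes(random.randrange(256) for _ in range(200_000))
shifted = b"NEW HEADER" + base  # insert bytes at the front

common = set(chunk_ids(base)) & set(chunk_ids(shifted))
print(f"{len(common)} of {len(chunk_ids(base))} chunks shared")
```

Because boundaries depend only on content, inserting bytes at the front only changes the first chunk or two; everything after still hashes to the same IDs, which is exactly what lets the client skip fetching blocks it already has.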
Inevitably this rings a bell, but without forming a symphony, yet:
@tomberek Can you explain your zstd proposal again in slightly more layman’s terms?
A burning question I have would be how frames should/could be well-crafted to maximize deduplication?
And I guess, on the other extreme (like zstd:chunked) we have file awareness. But that doesn’t seem like a good fit for the Nix model (where file-wise dedup is already quite reasonably approximated by the input hash). Yet maybe I’m missing something?
I really like this compromise. Offering a choice of compression algorithm seems great, and pruning old data this way seems like it would easily solve the space requirement problem. I think the most logical choice for the pruning timeframe would be one that covers the latest NixOS release, i.e. anything that’s in any revision of the current NixOS release should not be pruned. This means up to 6 months, but frankly, in comparison to the amount of data cached by NixOS, I’m sure 6 months of extra zstd files is not that big a deal.