No, the debug outputs aren’t normally downloaded. EDIT: I believe the concerns really were mainly about the AWS expenses.
> I believe the concerns really were mainly about the AWS expenses.
In case that was a response to my comment, then that’s correct. I’m aware that downloads of debug outputs only happen with e.g. `environment.enableDebugInfo = true;`, but we still have to store them in S3.
Regarding `pixz`:

Any parallelism speedup it brings is also available to zstd (with `pzstd` from the same package). That is, `pzstd` remains ~13x faster than `pixz`.
For example, I benchmarked this on a 6-core/12-thread Xeon E5-1650, with invocations like `pzstd -v file --stdout > /dev/null`:
```
10 GiB with zstd 1.5.0
               user(s)  system(s)   CPU    total(s)  throughput
compression
  zstd         18.61    3.05         128%  16.792     607 MB/s
  pzstd        37.91    2.87        1067%   3.821    2668 MB/s
decompression
  zstd          5.37    0.10          99%   5.482    1859 MB/s
  pzstd        11.24    2.20        1035%   1.299    7848 MB/s
```
So parallelism brings > 2 GB/s compression and > 8 GB/s decompression, making it suitable also for 10/100 Gbit/s networking and a single fast SSD.
I used `zstd -T8` in my testing. It was already as parallel as reasonably possible on this machine. Sorry for not being explicit about that. I didn’t know about the `pzstd` alias, or I would have just written that for clarity.
>2 GB/s compression sounds very high, too high. Are you using `/dev/zero` as a source here? Extremely low or extremely high entropy test data isn’t very useful for evaluating performance on medium or mixed entropy data, IME, as programs like zstd tend to have specific fast paths for those kinds of data.
As far as I know, `pzstd` is not an alias, and is different from `zstd -T8`.

Details: `pzstd` splits the input into chunks to compress in parallel, and adds headers to the output to allow parallel decompression. Only the combination “compress with `pzstd` → decompress with `pzstd`” allows parallel decompression. The normal `zstd` binary cannot decompress in parallel, nor can `pzstd` decompress files made with the normal `zstd` binary in parallel (for now), see:
- We have many terabytes of large files that are currently compressed using xz (we... | Hacker News
- pzstd compression ratios vs zstd · Issue #517 · facebook/zstd · GitHub
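For intuition, here is a rough sketch of that chunking scheme, with zlib standing in for the zstd codec (real `pzstd` emits zstd frames plus skippable-frame metadata; the chunk size and thread-pool usage here are purely illustrative):

```python
# Conceptual sketch of what pzstd does: split the input into independent
# chunks, compress each on its own, and (because every frame is
# self-contained) decompress them in parallel too. zlib stands in for the
# zstd codec; the 1 MiB chunk size is illustrative.
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1 << 20

def compress_chunked(data: bytes, level: int = 6) -> list[bytes]:
    parts = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    with ThreadPoolExecutor() as pool:  # zlib releases the GIL on big buffers
        return list(pool.map(lambda p: zlib.compress(p, level), parts))

def decompress_chunked(frames: list[bytes]) -> bytes:
    # This parallel step is exactly what plain `zstd -d` cannot do with the
    # single compact frame that `zstd -T#` produces.
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(zlib.decompress, frames))

payload = b"some moderately compressible payload " * 100_000
frames = compress_chunked(payload)
assert len(frames) > 1                        # multiple independent frames
assert decompress_chunked(frames) == payload  # lossless round trip
```

The price of this scheme is a slightly worse ratio (chunks can’t reference each other), which matches the small size differences between `zstd` and `pzstd` in the benchmarks below.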
No, for my test above I compressed a Ceph log file (plain text). It does compress well though: 10 GiB → 680 MB (14x).
You’re right; here are numbers for a `cudatoolkit` store path that compresses by only a factor of 2x:
```
3.6 GiB cudatoolkit nix store path tar'ed, with zstd 1.5.0, compresses to 1.9 GiB
               user(s)  system(s)   CPU    total(s)  throughput
compression
  zstd         19.70    1.12         110%  18.795     193 MB/s
  pzstd        57.50    1.40        1057%   5.570     651 MB/s
decompression
  zstd          3.22    0.25          99%   3.468    1045 MB/s
  pzstd         6.29    0.83         851%   0.836    4339 MB/s
```
So for such cases, the decompression numbers change by about 2x.
Re-ran the benchmark with `-T10` and `pzstd`:
| Tool          | CTime | DTime | Size |
|---------------|-------|-------|------|
| xz -9         | 5min  | 10s   | 207M |
| zstd -19 -T10 | 34s   | 1s    | 254M |
| pzstd -19     | 30s   | 0.4s  | 257M |
| pixz -9       | 1min  | 2s    | 217M |
pzstd decompresses ~4x as fast as pixz. However, pixz already decompresses at speeds on the order of 100 MB/s compressed / 500 MB/s uncompressed on my M1 Pro, which is the class of hardware you are likely to use if you have access to 1 Gb internet.
Also keep in mind that this decision would have an effect on every Nix user, not just those with fast pipes.
For a user who is unfortunate enough to live in an internet-desolate 3rd world country like, for example, Germany, download speed is almost always the bottleneck. A 25% increase in output size means a 25% increase in download time for them.
I’m a fan of zstd too and strongly believe it’s the best general-purpose compression tool, but it can’t really compete on the extreme high or low end of the compression ratio/speed spectrum. With a Linux distribution we have to consider all kinds of users, not just the fortunate ones.
A compromise I’d be comfortable with would be to offer recent output paths via zstd and xz but drop the zstd version after a certain time. One month sounds good to me but it could be even shorter. At that point we could even consider the likes of lz4 but I don’t think that’s necessary with currently available home internet speeds as even pixz is able to saturate 1Gb/s with moderately high-end HW and zstd should manage to saturate a 10Gb/s one with true high-end machines.
For the Nix store paths I tested, the difference seems to be around 10%, similar to the numbers @DavHau found above.
For example, some detail benchmarks for a recent Chromium store path:
```
chromium: tar c /nix/store/620lqprbzy4pgd2x4zkg7n19rfd59ap7-chromium-unwrapped-108.0.5359.98
uncompressed 473Mi
compression (note `--ultra` is given to enable zstd levels > 19), zstd v1.5.0, XZ utils v5.2.5

                         |--------------------- compression ---------------------|  |--- decompression ---|
                                                            per-core    total                 total
                 size   user(s)  system(s)  CPU   total    throughput  throughput  total(s)  throughput  comments
xz -9            102M   216.34   0.54s       99%  3:37.07  2.29 MB/s    2.29 MB/s  6.227       79 MB/s
zstd -19         113M   176.42   0.56s      100%  2:56.66  2.81 MB/s    2.81 MB/s  0.624      794 MB/s
zstd -19 --long  111M   200.84   0.52s      100%  3:21.07  2.46 MB/s    2.46 MB/s  0.686      722 MB/s
zstd -22         108M   210.77   0.74s      100%  3:31.44  2.35 MB/s    2.35 MB/s  0.716      692 MB/s
zstd -22 --long  108M   214.96   0.64s      100%  3:35.53  2.30 MB/s    2.30 MB/s  0.716      692 MB/s   bit-identical to above for this input
pzstd -19        114M   270.05   1.20s     1064%  25.47    1.83 MB/s   19.83 MB/s  0.244     2032 MB/s
pzstd -22        108M   224.17   0.66s      100%  3:44.80  2.21 MB/s    2.21 MB/s  0.721      687 MB/s   single-threaded comp/decomp!
```
(Edit 2022-12-19 14:26: I’ve split the compression throughput into *per-core* and *total*; before I had only *per-core*, labelled *throughput*, which was confusing. Now one can see that `pzstd` trades a bit of per-core throughput for higher total throughput.)
Here I also found that:

- `pzstd` doesn’t support `--long` (the issue says that’s by design)
- `pzstd -22` couldn’t multi-thread on this input; maybe it was too small for use with `-22`?
That is true, the compression ratio of `xz -9` cannot currently be beaten by `zstd`. The faster decompression speed would have to be bought with a percentage size increase (10% if `chromium` above is representative). For Arch it was only 0.8%, but that is because they weren’t using `xz -9` (their benchmark ahead of the switch suggests they used no argument).

So I agree that if we want to trade 10x CPU for 10% storage size, switching to `pixz` makes sense.
Judging the tradeoff: I am also from, and often have to download from, the above-mentioned Internet 3rd world country over a 12 Mbit/s connection, so from that perspective, 1.1x growth wouldn’t be great. At the same time, though, I have more machines in data centers and on my LAN, for which a 10x speedup would be awesome.

In any case, I think sticking with single-threaded `xz` would be pretty bad.
Is `pixz` available as a library?

I guess `pixz` would also be a breaking change for older Nix versions, like `zstd` — is that correct?
There is an idea I’ve not had time to fully develop or RFC (anyone want to team up?): there is a way to make zstd allow for deduplication via a mechanism similar to rsync/casync rolling hashes. This would also be backwards compatible, so that normal zstd libraries and clients would not be impacted. By storing NARs like this, one can get pretty nice dedup by comparing blocks with similar NARs.

The Nix client would maintain a store of blocks and fetch the index from the zstd NAR, then fetch only the required bytes via Range requests (supported by S3). This will need some heuristics, as often it is not worth it, but in some situations the dedup is impressive.

I’ve done some testing showing that older Nix versions can still process and use these cloud-optimized NARs transparently and that a smarter client achieves dedup.
- Inspired by: Cloud optimized GeoTIFF https://www.cogeo.org/
- Similar idea adopted in Fedora: https://fedoraproject.org/wiki/Changes/Zchunk_Metadata
- Zstd proposal to add to spec: [WIP] [RFC] add cryptographic hash to seekable format by tomberek · Pull Request #2737 · facebook/zstd · GitHub
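For intuition, here is a minimal content-defined-chunking sketch in the rsync/casync style the idea builds on. The toy rolling hash, all constants, and the in-memory “store” are illustrative assumptions, not part of the seekable-zstd proposal itself:

```python
# Content-defined chunking: boundaries depend only on local content, so an
# insertion in one NAR doesn't shift the chunk boundaries of the rest, and
# identical chunks across similar NARs can be stored (and fetched) once.
import hashlib
import random

WINDOW = 48           # minimum chunk length; illustrative
MASK = (1 << 13) - 1  # boundary condition -> roughly 8 KiB average chunks

def chunks(data: bytes):
    """Yield content-defined chunks using a toy rolling hash."""
    h, start = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF  # ~32-byte influence window
        if i - start >= WINDOW and (h & MASK) == MASK:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

def dedup_store(blobs):
    """Index chunks by content hash; identical chunks are stored once."""
    store, indexes = {}, []
    for blob in blobs:
        index = []
        for c in chunks(blob):
            key = hashlib.sha256(c).hexdigest()
            store.setdefault(key, c)   # shared chunks land here only once
            index.append(key)
        indexes.append(index)
    return store, indexes

rng = random.Random(0)
a = bytes(rng.randrange(256) for _ in range(200_000))
b = b"shifted!" + a  # a "similar NAR": same content after an insertion

store, (idx_a, idx_b) = dedup_store([a, b])
# Boundaries resynchronize shortly after the insertion, so most chunks of
# idx_a and idx_b coincide; either blob reconstructs losslessly from the store.
assert b"".join(store[k] for k in idx_a) == a
assert b"".join(store[k] for k in idx_b) == b
```

In the proposal, the “store” would be local blocks already on the client, and missing chunks would be fetched from S3 via HTTP Range requests against the seekable-zstd frame index.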
Inevitably this rings a bell, but without forming a symphony, yet:

@tomberek Can you explain your zstd proposal again in slightly more layman’s terms? A burning question I have is how frames should/could be crafted to maximize deduplication.

And I guess on the other extreme (like `zstd:chunked`) we have file awareness. But that doesn’t seem like a good fit for the Nix model (where file-wise dedup is already quite reasonably approximated by the input hash). Yet maybe I’m missing something?
I really like this compromise. Offering a choice for which compression algorithm you want to use seems great, and pruning old data this way seems like it would easily solve the space requirement problem. I think the most logical choice for the pruning timeframe would be one that covers the latest NixOS release. i.e. Anything that’s in any revision of the current NixOS release should not be pruned. This means up to 6 months, but frankly in comparison to the amount of data cached by NixOS I’m sure 6 months of extra zstd files is not that big a deal.
I wasn’t aware that `zstd` supports compression levels higher than `-19`.

In your benchmark, `zstd -22` already comes pretty close to `xz -9` in terms of compressed size; it only increases by 5.9%.

Those are pretty good results. Given the dramatic performance benefits, `zstd` appears to be the better choice overall to me.
Concerning the parallelism: optimizing the tooling to parallelize single-file extraction would be nice, but there is a simpler approach. Nix usually has to fetch many store paths, so why not parallelize the requests up to the number of CPUs? Extracting one file per CPU would solve the problem as well, and it is a client-side-only optimization that we could implement right now without breaking anything.
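A sketch of that client-side idea, with hypothetical `fetch`/`decompress` stand-ins (zlib in place of the real HTTP download and single-threaded zstd/xz step — none of this is Nix’s actual API):

```python
# Substitute several store paths concurrently so that one single-threaded
# decompression runs per CPU. fetch() and decompress() are stand-ins.
import os
import zlib
from concurrent.futures import ThreadPoolExecutor

def fetch(store_path: str) -> bytes:
    # stand-in for downloading a compressed NAR from the binary cache
    return zlib.compress(f"contents of {store_path}".encode() * 1000)

def decompress(blob: bytes) -> bytes:
    # stand-in for single-threaded zstd/xz decompression of one NAR;
    # zlib releases the GIL, so threads give real parallelism here
    return zlib.decompress(blob)

def substitute_all(paths: list[str]) -> list[bytes]:
    # one worker per CPU: each worker handles fetch + extraction of one path
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        return list(pool.map(lambda p: decompress(fetch(p)), paths))

nars = substitute_all([f"/nix/store/example-{i}" for i in range(8)])
assert len(nars) == 8
assert nars[0].startswith(b"contents of /nix/store/example-0")
```

A real client would likely want separate concurrency limits for the IO-bound fetches and the CPU-bound extractions, which is exactly the `max-jobs` caveat raised in the next reply.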
Substitution is already parallelized. It shares the `max-jobs` limit with build jobs (which is not ideal, as the IO-bound fetching part can benefit from far more parallelism than the CPU-bound decompression part).
Chromium is a good sample.
I think we should probably use a larger set of more mixed packages though. I think a good candidate might be the tar’d closure of the x86 installer ISOs.
Totally forgot zstd had levels above 19. At least for chromium though, your numbers put it in a similar enough ballpark to xz -9 to consider it IMO.
As for CrossOver:
| Tool                  | CTime    | DTime | Size |
|-----------------------|----------|-------|------|
| xz -9                 | 5min     | 10s   | 207M |
| pixz -9               | 1min     | 2s    | 217M |
| zstd -19 -T10         | 34s      | 1s    | 254M |
| pzstd -19             | 30s      | 0.4s  | 257M |
| zstd --ultra -22 -T10 | 2min 40s | 1s    | 199M |
So, yeah. That’s a bit unexpected.
We might need to evaluate memory usage at this point though as IIRC the extremely high zstd compression levels do require substantial amounts of memory; especially multi-threaded.
OTOH, Nix users already need quite a bit of memory for eval anyways, so that might not be as big of a problem.
According to my data, we’d be trading ~4x the total CPU time with pixz, not 10x.
I don’t see how this would be beneficial in your LAN? You’re free to choose whatever compression you like there (Nix already supports zstd AFAIK); this is about what cache.nixos.org should do.
That I’m unsure about. It doesn’t seem like it.

It also doesn’t smell very production-ready. For example, I just noticed that it does special tarball handling that is silently backwards-incompatible with `xz` (it produces different tarballs than what comes in). That can be turned off, but it’s on by default and has already caught a certain distribution by surprise.
Actually, no. With the `-t` flag, `pixz`-compressed data can be decompressed by `xz` at the same speed as data compressed by regular `xz`. It’d be backwards compatible with any Nix version that understands `xz`.
I encountered two issues with `pixz`, one about compression ratio, and one about using all available cores:

Does anybody know how I can get `pixz`’s compression side to use all cores? On my 6-core/12-thread machine, it uses only 400% CPU for the `chromium` store path, which over the course of the compression drops to 300% and then to 200%. The average usage according to `time` is then 270%. Passing `-p6` or `-p12` doesn’t seem to change anything about that.
I have now added benchmarks of `pixz 1.0.7` to the table (including the above-mentioned problem). I have also added `maxres` outputs from `command time`, showing how many MB of RAM were needed at maximum for compression and decompression:
```
chromium: tar c /nix/store/620lqprbzy4pgd2x4zkg7n19rfd59ap7-chromium-unwrapped-108.0.5359.98
uncompressed 473Mi
compression (note `--ultra` is given to enable zstd levels > 19), zstd v1.5.0, XZ utils v5.2.5

                         |--------------------------- compression ---------------------------|  |------- decompression -------|
                                                            per-core    total                                total
                 size   user(s)  system(s)  CPU   total    throughput  throughput  maxres     total(s)  throughput  maxres   comments
xz -9            102M   216.34   0.54s       99%  3:37.07  2.29 MB/s    2.29 MB/s   691 MB    6.227       79 MB/s    67 MB
pixz -9          137M   216.98   1.71s      271%  1:20.57  2.28 MB/s    6.19 MB/s  2951 MB    2.551      194 MB/s   657 MB   did not use all cores consistently, for both compression and decompression
zstd -19         113M   176.42   0.56s      100%  2:56.66  2.81 MB/s    2.81 MB/s   241 MB    0.624      794 MB/s    10 MB
zstd -19 --long  111M   200.84   0.52s      100%  3:21.07  2.46 MB/s    2.46 MB/s   454 MB    0.686      722 MB/s   133 MB
zstd -22         108M   210.77   0.74s      100%  3:31.44  2.35 MB/s    2.35 MB/s  1263 MB    0.716      692 MB/s   133 MB
zstd -22 --long  108M   214.96   0.64s      100%  3:35.53  2.30 MB/s    2.30 MB/s  1263 MB    0.716      692 MB/s   133 MB   bit-identical to above for this input
pzstd -19        114M   270.05   1.20s     1064%  25.47    1.83 MB/s   19.83 MB/s  1641 MB    0.244     2032 MB/s   564 MB
pzstd -22        108M   224.17   0.66s      100%  3:44.80  2.21 MB/s    2.21 MB/s  1392 MB    0.721      687 MB/s   245 MB   single-threaded comp/decomp!
```
Oddly, `pixz` produces a much worse compression ratio than any of the other approaches.

`xz`/`pixz` need 5x more RAM for decompression.

`pzstd` needs disproportionately much RAM for decompression. I suspect this is because with the invocation `pzstd -d ... > /dev/null`, outputs from the various threads need to be buffered in memory to write them in order into the output pipe. However, `pzstd` does this even when writing outputs to a regular file with `-o`. I also checked how much RAM plain `zstd` needs to decompress the `pzstd` outputs; there is no difference compared to decompressing the `zstd` outputs.
For the chromium derivation in my table above, single-threaded `zstd` has 10x higher decompression throughput than single-threaded `xz`, and for `pzstd` vs `pixz` it’s also 10x.
Summary of my findings so far
On this `chromium` tar:

- single-threaded:
  - `zstd -19` wins against `xz -9` on decompression speed (10x) and decompression memory usage (5x)
  - `zstd -22` still wins against `xz -9` on decompression speed (same 10x) but loses on decompression memory usage (0.5x)
  - `xz -9` wins on best compression ratio: by 10% against `zstd -19` and 5% against `zstd -22`
- multi-threaded:
  - `pixz -9` loses against `pzstd -19` on all metrics
  - decompression memory usage can apparently be reduced 10x by decompressing 6 `zstd -19` archives independently, rather than using `pzstd -d` to decompress 1 `zstd -19` archive. That seems wrong, at least when writing to regular files. I filed a zstd issue.

Please point out if you spot any mistakes!
I found a detailed description of the difference between `pzstd` and multi-threaded normal `zstd` on the zstd issue tracker. It confirms that `zstd -T` and `pzstd` are very different:

> `zstd -T#` produces a single compact frame, as opposed to `pzstd` and `mcmilk` variants which produce multiple independent payloads. Decompressing in parallel multiple independent payloads can be done fairly easily, while untangling dependencies within a single frame is more complex.

and also states:

> Multi-threaded [single-frame] decompression is in our task list, although there is no release date set yet.
Hello everyone,
Given the current state of discussion, the compromise mentioned in Switch cache.nixos.org to ZSTD to fix slow NixOS updates / nix downloads? - #23 by Atemu seems a good way forward for this issue.

On the one hand, Nix 2.3 still has no support for zstd, though discussion with the TVL group showed an interest in tackling this and working on a patch they would either submit to the Nix 2.3 tree or keep as a patch and provide to the nixpkgs tree. They will dogfood it on their infrastructure first and improve it. There is an incentive for them to do this, as Cachix has enabled zstd compression and, from my understanding, there are no xz files lying around for now (?) as a fallback.

BTW, there will be a need for reviewers once the patch (based on @andir’s work) reaches a certain point. @domenkozar @tomberek, would you be interested in reviewing such a patch?

On the other hand, there are definitely users out there in “an internet-desolate 3rd world country”: @tazjin, for example, reported to me that updating a full NixOS system can cost up to 3-4 USD in Egypt. I feel like this is a compelling argument for considering compression ratio more important than decompression speed in the long term, and it would be a good signal to send that NixOS cares about this type of Internet infrastructure too while we work on things like deltas, etc.

In the meantime, adopting two tarball formats for the NixOS cache requires considering the cost incurred on the existing S3 store; for this, I would ask the NixOS Foundation (cc @ron @edolstra) to weigh in. Is it expensive? Is there any cost optimization that could be achieved to enable this use case? If the data can be provided, I can also try to write a report in this post to offload this from the Foundation.

Also, IMHO, going with the compromise removes any need to patch Nix 2.3 right away and enables other Nix users to benefit from zstd now.
I don’t expect that the additional S3 expenses would be significant if the additional archives were removed after a short-ish period of not being needed (like one month), but I’m not sure whether someone can easily produce infra code that does deletion like this.

Checking that a path is alive in this model seems difficult, as referencing from a different path has been a no-op AFAIK. In particular, fixed-output derivations might remain alive for much longer than the chosen period, and on stable branches we may not do a full rebuild every month. Well, maybe one month after upload could be considered a good enough approximation (and hopefully not hard to implement?), given that we’ll have the xz fallback anyway.