Nix-sandwich: differential compression for Nix

Several months ago I had an idea for an experiment to make Nix downloads faster: transfer only differences between versions of a package. It took some effort to make it all work out nicely, but I’ve been using it on my personal machines for a while and have been happy with the results. It often (but not always) does actually make system updates faster on my slow-ish home connection. You can read all the details and how to try it yourself here:

A quick summary:

  • Acts as a local substituter
  • Keeps track of packages present on the system
  • To get a narinfo: Finds a suitable “base” package that we guess will have a small delta to the requested one (a sketch of this heuristic follows the list)
  • To get a nar: Asks a remote “differ” for that binary delta, then reconstructs the requested nar from the base package and the delta
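
For a rough idea of what the “find a base” step involves, here is a minimal sketch. It is illustrative only: the helper names and the version-stripping regex are made up, and the real heuristic in nix-sandwich may well differ.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"regexp"
	"strings"
)

// Strip a trailing version from a store path name, e.g.
// "firefox-121.0" -> "firefox". This regex is only an illustration.
var versionSuffix = regexp.MustCompile(`-[0-9][0-9A-Za-z.+_]*$`)

func pname(name string) string {
	return versionSuffix.ReplaceAllString(name, "")
}

// findLocalBase scans the local store for a path whose pname matches the
// requested one, on the theory that another version of the same package
// will usually produce a small delta against it.
func findLocalBase(requestedName string) (string, bool) {
	want := pname(requestedName)
	entries, err := os.ReadDir("/nix/store")
	if err != nil {
		return "", false
	}
	for _, e := range entries {
		// Store entries look like "<hash>-<name>".
		parts := strings.SplitN(e.Name(), "-", 2)
		if len(parts) != 2 || parts[1] == requestedName {
			continue
		}
		if pname(parts[1]) == want {
			return filepath.Join("/nix/store", e.Name()), true
		}
	}
	return "", false
}

func main() {
	if base, ok := findLocalBase("firefox-121.0"); ok {
		fmt.Println("candidate base:", base)
	} else {
		fmt.Println("no local base found; fall back to the upstream cache")
	}
}
```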

Where does it run? The remote differ runs in Lambda so it’s close to the public binary cache. It’s currently well under the free tier, but it wouldn’t be if it were open to everyone, so you can run one yourself, or send me a DM and a few people can use mine for a while.

How well does it work? I generally get a 5–15× reduction in data transferred for routine system updates. It often takes less overall time, but not always, since the deltas and reconstruction can be slow (especially for kernels), and the internet is pretty fast. This will be most interesting to people on slow/metered connections.

How does this relate to other store-related efforts that use content-defined chunking? It’s both better and worse in different ways, but mostly worse, since it can’t be scaled up as easily and ends up being more expensive. Still, it was a fun experiment and I plan to continue using it and maybe make a few extensions.

40 Likes

wow, that’s amazing work!

The cost of computing the deltas might actually make sense if the binary cache is currently burning lots of bandwidth.

As a fully paid-up member of the internet bandwidth preservation society, anything that reduces the load on those poor overworked BGP routers gets a thumbs up from me.

Can the deltas that are created also be cached? If a user upgrades from package A to package B and the binary diff is created dynamically, could that diff be stored so another user can benefit from the now-precomputed diff?

The actual binary diff data should be pretty small.

When the package is reassembled, does the hash of the resulting nar match the .narinfo metadata for the package you’re upgrading to?

If so, the delta data doesn’t have to be treated as particularly ‘sensitive’, because you’re comparing the output hash of your reassembled package against what the normal binary cache would serve.
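
In other words, the check I have in mind would be something like the sketch below (not actual nix-sandwich code; it assumes the NarHash from the narinfo has already been decoded from Nix’s base32 form into raw bytes):

```go
package sandwich

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"io"
)

// verifyNar hashes a reconstructed nar and compares it to the digest
// advertised in the upstream .narinfo. If they match, the delta and the
// differ never have to be trusted: a bad or malicious delta simply fails
// this check. expectedSHA256 is assumed to be the NarHash already decoded
// to raw bytes (narinfo files carry it as "sha256:<nix-base32>").
func verifyNar(nar io.Reader, expectedSHA256 []byte) error {
	h := sha256.New()
	if _, err := io.Copy(h, nar); err != nil {
		return err
	}
	if !bytes.Equal(h.Sum(nil), expectedSHA256) {
		return fmt.Errorf("reconstructed nar does not match upstream NarHash")
	}
	return nil
}
```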

There are other mechanisms for doing Lambda-style compute on the fly; I wonder if you could put it in a Cloudflare Worker, for instance, or use a dedicated machine with lots of CPU to handle the spikes. A single 128-core machine should be able to handle it if it’s backed by a diff cache.

To keep the amount of diff data that gets generated under control, caching the per-channel differences (one single step at a time) would be easy, and generating that ‘diff cache’ is just a single compute task.

However, a user would need to update their channel one commit at a time rather than jumping over N commits at once, so there would be more steps. But because you’re not pumping gigabytes of data, it may actually be faster to roll through the channel than to jump straight to the end; we only know the diff between channel commit N and N+1, not between commit N and some arbitrary later commit.

Whatever this is, it’s tech straight from 2038, and I love it.

Respect.

3 Likes

This is wonderful! I had been toying with the same idea for a while (and learned two days ago that so had Eelco, two decades ago: https://edolstra.github.io/talks/eupfcdm-colloquium.pdf). I spoke to zimbatm earlier today about it again, and a few hours later he pointed me to this thread!

I had done experiments with bsdiff. The diff files were absolutely tiny! But it required a huge amount of memory (17× the file size, IIRC). hdiffpatch looks like a more promising tradeoff, but I haven’t tried it yet.

I’m working on garnix, a Nix CI and cache, and there the considerations weigh heavily in favor of a solution like this (though precomputed), because we have a lot more insight into what we should be diffing (repos tend to have only a few packages, and diff size correlates with commit order), the diffs tend to be smaller, and downloads tend to be very recency-biased (so we can keep the most recent paths fully materialized and then only diffs going backwards).

Again, congratulations, and thanks for putting this into the world!

4 Likes

Thanks! A few responses:

So, my understanding of the cache situation (just from reading threads on here) is that the cache is currently burning lots of bandwidth, but that’s being paid for so it doesn’t really matter. The more pressing issue is storage space, and while binary deltas can theoretically help there, it’s a hard problem. There’s lots of prior art (git, for example), but nothing that’s a drop-in solution at the scale required.

They can definitely be cached, and I’ll probably add that. But I also have doubts about how well it can work. The fundamental problem is that a delta is between two specific versions, so if you let clients request any pair of versions, you have a quadratic number of things to cache (for N versions of a package, that’s N(N-1)/2 possible deltas, versus N-1 if everyone steps one bump at a time). As you said, you can force clients to only jump a single channel bump at a time so you’ll get more cache hits. You’ll download less, but blow up local storage space (though I suppose you can gc immediately after each step).

It would be nice to just download a chain of deltas and compose them. It turns out that vcdiff deltas (generated by xdelta3 among others) can be composed without constructing intermediate results, but zstd deltas (the current default) cannot.
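
For concreteness, this is roughly how the two formats are driven on the command line (wrapped in Go only for illustration; the file names are placeholders, and for nar-sized inputs zstd may also need its --long window option on both sides):

```go
package main

import (
	"log"
	"os/exec"
)

// run executes an external command and aborts on failure, printing its output.
func run(name string, args ...string) {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		log.Fatalf("%s: %v\n%s", name, err, out)
	}
}

func main() {
	// zstd delta: very fast to generate, but deltas can't be composed,
	// so applying a chain means reconstructing every intermediate nar.
	run("zstd", "--patch-from=base.nar", "new.nar", "-o", "delta.zst")
	run("zstd", "-d", "--patch-from=base.nar", "delta.zst", "-o", "rebuilt.nar")

	// vcdiff via xdelta3: slower to generate, but patches are composable
	// (xdelta3 also ships a `merge` subcommand for combining vcdiff
	// patches without materializing the intermediate files).
	run("xdelta3", "-e", "-s", "base.nar", "new.nar", "delta.vcdiff")
	run("xdelta3", "-d", "-s", "base.nar", "delta.vcdiff", "rebuilt2.nar")
}
```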

There’s also the problem of enumerating exactly where the channel bumps are. Last I checked, that information was surprisingly hard to find.

I don’t think Cloudflare Workers can work for this: you’d need to build the whole thing to wasm (a big effort), and even then the runtime limits are just too small (128 MB of RAM isn’t going to fly).

I actually tried this first with a big vm in ec2, but I realized that the usage patterns are so bursty that lambda is a better fit, even though it’s more expensive per cpu/gb-second.

However, if you have a 128 core machine you’d like to offer, I’m happy to help you run this service on it :slight_smile:

Yup, binary deltas have been around for a long time, and none of these ideas are that new. I’m curious whether the scheme described in that presentation was ever implemented for Nix.

That presentation talks about patch chaining, which is one way to make this work at scale, but it’s not clear it’s practical (for nixpkgs): We have to come up with a scheme for which deltas to compute, and then on the client side, evaluate the system closure at each intermediate channel bump to figure out the store path hashes to request deltas between. Though if we precompute all the deltas we might want (this is very expensive for nixpkgs), then maybe we could index enough information so that we don’t need to do the intermediate evaluations on the client side.
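
To make the “index enough information” idea concrete, one hypothetical shape for such an index is sketched below. Nothing like this exists today, the field names are invented, and it ignores signing and garbage collection entirely:

```go
package sandwich

// DeltaEdge describes one precomputed delta between the store paths of a
// package at two consecutive channel bumps.
type DeltaEdge struct {
	FromStorePath string `json:"from"` // base store path at bump N
	ToStorePath   string `json:"to"`   // store path at bump N+1
	DeltaURL      string `json:"url"`  // where the precomputed delta lives
	DeltaSize     int64  `json:"size"` // bytes, for deciding delta vs. full download
}

// ChannelBump collects all the edges for a single channel bump, so a
// client could walk a chain of bumps without evaluating the intermediate
// closures itself.
type ChannelBump struct {
	FromRevision string      `json:"fromRev"` // nixpkgs commit at bump N
	ToRevision   string      `json:"toRev"`   // nixpkgs commit at bump N+1
	Edges        []DeltaEdge `json:"edges"`   // one entry per changed store path
}
```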

Then there’s actually applying the patches: as I wrote above, vcdiff deltas are composable, but they’re relatively slow to generate (using xdelta3). In my experiments, zstd was much faster, so much faster that it was a clear win even though the deltas were often bigger. So for effective patch chaining you’d have to switch back to xdelta3 and precompute, and that makes the pre-computation even more expensive.

I didn’t try hdiff or any others, but based on my survey I would be very surprised if anything could even approach zstd on speed + memory required, or significantly beat xdelta3 on delta size.

I think the main surprising result from this experiment is that zstd deltas are so fast that it is actually feasible to compute them on demand instead of pre-computing. In fact, zstd is so fast that decompressing the upstream xz nars is a significant part of the time and resource usage. On a cache that served zstd-compressed nars, with their much faster decompression, this would get even better.

Also, you can charge users directly for the resources required, so it definitely makes more sense in that context. I focused on making it work with nixpkgs because that’s what I use, but I’m interested in thinking about the problem in a private cache context too. Actually you should be able to use the existing code with a private cache by setting the nix_sandwich_upstream env variable, though if it requires authorization you might have to add code to handle that.

Both storage and bandwidth are a problem. The CDN egress is sponsored by Fastly and we currently don’t pay for it, but the S3 → Fastly bandwidth is about a quarter of our current AWS bill, and it’s probably worth optimizing.

Of course, the real question is what trade-offs that induces (compute resource requirements, availability and reliability requirements, etc.). That would need careful evaluation, but I suspect it would still be a net positive for the project.

1 Like