How to make Nixpkgs more eco-friendly / use fewer resources

Hello,

On this page we can find information such as:

  • cache.nixos.org contains 120 TB of packages for 3 architectures
  • Hydra executes over 350 000 builds a week, but many packages are very small

As a nixos-unstable user, I notice that the amount of data to download to update a workstation is ridiculous: a few gigabytes per week for my Plasma workstation.

I’m convinced that as a community we should strive to make NixOS more sustainable with regard to resource usage.

The first idea I have in mind would be deduplication in the cache repository at the package level, exactly like nix-store --optimise does for the local store; maybe it’s already working this way?
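
For what it’s worth, the local side of that deduplication can already be switched on today. Here is a minimal sketch of the NixOS option (assuming a release where `nix.settings` exists; older releases call it `nix.autoOptimiseStore`); note that this only hardlinks identical files in the local /nix/store and does nothing for cache.nixos.org itself:

```nix
# configuration.nix: hardlink identical files in the local /nix/store.
# This maps to `auto-optimise-store = true` in nix.conf; it does not
# deduplicate anything on the binary cache side.
{
  nix.settings.auto-optimise-store = true;
}
```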

The other idea would be to download deltas of packages; this would drastically reduce bandwidth usage and download time at the same time.

Because of reproducibility, it’s hard to skip rebuilds. Guix has a “graft” system for package updates with really minor changes, to avoid recompiling the whole dependency graph, but I suppose that kills reproducibility?

However, the number of weekly builds could be reduced / optimized by only building package sets that are really used by most people. I don’t exactly understand how it works, but when I update my nixpkgs flake, it downloads a JSON file from the registry, and there is a new nixpkgs tarball available every few days.

So, I don’t really have game-changing ideas, but this is an important topic for me, and I always feel a bit guilty contributing to nixpkgs because of this. I’m opening the idea to the community; maybe we can find a solution together to improve the current state of Nixpkgs. :slight_smile:

32 Likes

Getting content-addressed derivations to be more prevalent should theoretically help a lot with mass rebuilds.

12 Likes

If this is the same content-addressed feature described in this blog post, that would be a huge win on Hydra :open_mouth:

I’ll join the test party

1 Like

What are the (current) trade-offs for content-addressed derivations versus input-addressed derivations?

I think you still need the input address to know whether a package is available in the cache, but content-addressed derivations will prevent packages from being rebuilt when a dependency has a different input hash (is that the right name?) but the same content as before.
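
For concreteness, here is a minimal sketch of what opting a package into content addressing looks like. `__contentAddressed`, `outputHashAlgo` and `outputHashMode` are the actual attribute names, but this still needs `experimental-features = ca-derivations` in nix.conf, and the behaviour may change between Nix versions:

```nix
# Sketch: a content-addressed variant of an existing package.
# Its output path is computed from the build result, so two different
# .drv files can end up pointing at the same $out if they produce
# bit-identical output.
{ pkgs ? import <nixpkgs> { } }:

pkgs.hello.overrideAttrs (_: {
  __contentAddressed = true;
  outputHashAlgo = "sha256";
  outputHashMode = "recursive";
})
```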

2 Likes

Related:

Okay. So like… before I pinned it, it was literally making an HTTP request to GitHub every time I ran search ? Is that… that’s insane, right?

from How to Learn Nix, Part 44: More flakes, unfortunately

Disclaimer: Not being snide here. This post does the job of being snide, but I mention it because it outlines the problem from the user’s perspective.
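
For reference, the pinning mentioned in the quote can be done imperatively with `nix registry pin nixpkgs`, or declaratively. A minimal sketch of the declarative form for a flake-based NixOS configuration (assuming `inputs.nixpkgs` is passed into the module, e.g. via specialArgs):

```nix
# Pin the global "nixpkgs" registry entry to the exact nixpkgs this
# system was built from, so flake lookups resolve locally instead of
# making an HTTP request to GitHub on every `nix search` / `nix run`.
{ inputs, ... }:
{
  nix.registry.nixpkgs.flake = inputs.nixpkgs;
}
```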

4 Likes

The CA-store will not bring the benefit everyone is hoping for!

If a library changes, then its path changes, be it input- or content-addressed. So whatever depends on it has to change its own content as well, to point at the new path, and everything that depends on that changes too, and so on.

The CA store will show the benefits everyone is waiting for if, and only if, a compiler or other compile-time-only input changes and, despite that change, the same binary output is produced.
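
To make that concrete, here is a hypothetical sketch (package names and sources are made up) of the one case where CA does pay off: `mylib`’s .drv changes because a compile-time-only input changed, but if its output comes out bit-identical, its $out path stays the same and the cached build of `myapp` can be reused:

```nix
# Hypothetical early-cut-off illustration with CA derivations
# (needs `experimental-features = ca-derivations`).
{ pkgs ? import <nixpkgs> { } }:

rec {
  # Imagine only a compile-time-only input changed here (say, a newer
  # documentation generator) and the compiled library is bit-identical.
  mylib = pkgs.stdenv.mkDerivation {
    pname = "mylib";
    version = "1.0";
    src = ./mylib;                 # placeholder source
    __contentAddressed = true;
    outputHashAlgo = "sha256";
    outputHashMode = "recursive";
  };

  # mylib's .drv changed, so myapp's .drv changes too; but because
  # mylib's *output path* is unchanged under CA, the existing build of
  # myapp is still valid and nothing downstream has to be rebuilt.
  myapp = pkgs.stdenv.mkDerivation {
    pname = "myapp";
    version = "1.0";
    src = ./myapp;                 # placeholder source
    buildInputs = [ mylib ];
    __contentAddressed = true;
    outputHashAlgo = "sha256";
    outputHashMode = "recursive";
  };
}
```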


Re “package deltas”: in general a good idea, though binary data is hard (not impossible) to diff properly. It would also require additional support in the protocol, and creating the diffs based on the negotiated “left” and “right” would tie up server resources to actually compute the diff. I’m not sure whether we could then still use the current CDN solution, or whether it would be necessary to put up many cache mirrors around the world.

6 Likes

CA primarily, and then multi-derivation packages. Splitting building, installing and testing into separate derivations, when reasonably possible, could help, though there is a tiny overhead for creating more derivations. For example, with Python this means building a wheel in one derivation, installing and “linking” it to other Python libraries in a second derivation, and running the test suite in a third.
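
A rough sketch of how such a split could look if written by hand with plain stdenv derivations (the package name and the ./example source tree are placeholders, and the real buildPythonPackage machinery in nixpkgs does not currently split its phases like this):

```nix
# Hypothetical split of one Python package into three derivations, so that
# e.g. a test-only change does not invalidate the already-built wheel.
{ pkgs ? import <nixpkgs> { } }:

let
  python = pkgs.python3;

  # 1. Build only the wheel.
  wheel = pkgs.stdenv.mkDerivation {
    pname = "example-wheel";
    version = "1.0";
    src = ./example;               # placeholder source tree
    nativeBuildInputs = [
      python
      python.pkgs.build
      python.pkgs.setuptools
      python.pkgs.wheel
    ];
    buildPhase = "python -m build --wheel --no-isolation --outdir $out .";
    dontInstall = true;
  };

  # 2. Install the wheel into its own prefix ("linking" against other
  #    Python libraries would also happen at this stage).
  example = pkgs.stdenv.mkDerivation {
    pname = "example";
    version = "1.0";
    dontUnpack = true;
    nativeBuildInputs = [ python python.pkgs.pip ];
    installPhase = ''
      pip install --no-deps --no-index --prefix=$out ${wheel}/*.whl
    '';
  };

  # 3. Run the test suite separately; a failure here blocks the package
  #    without forcing the wheel or the install step to be rebuilt.
  tests = pkgs.runCommand "example-tests"
    { nativeBuildInputs = [ python python.pkgs.pytest ]; }
    ''
      export PYTHONPATH=${example}/lib/${python.libPrefix}/site-packages
      pytest ${./example}/tests
      touch $out
    '';
in
{ inherit wheel example tests; }
```

The point of the split is where the rebuild boundaries land: a change that only touches the tests rebuilds `tests`, not `wheel` or `example`.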

1 Like

AFAIK with CA, if the output doesn’t change, the path doesn’t change either.

You will have two different .drv paths generating the same $out path.

As builds are already reproducible, the compiler should produce the same binaries if the change doesn’t affect the compiler (a change of flags would obviously affect the result).

@Solene note that not all compilers are deterministic (for instance GHC with -j n for n greater than 1).
If you are interested in using Hydra with CA derivations you can try this; it seems to work if you don’t use remote builders. Also, any help to make it work with remote builders is welcome :grinning:

Also, CA derivations and Hydra could lead to some unpleasant situations where you build more stuff than without CA derivations: https://github.com/NixOS/rfcs/blob/8bb86f8bddd98deb3c03c5698d5eff0b9072d0a7/rfcs/0062-content-addressed-paths.md#caching-of-non-deterministic-paths

After running NixOS unstable for years, I’m intending to switch to stable at the next release for that reason: I’m tired of downloading gigabytes[^1] per week. I wonder whether I’ll end up finding stable only slightly better, though.

[^1]: more than a few, and I use i3wm

3 Likes

Yes, but the point @NobbZ was making is that this won’t help if, e.g., a library you depend on changes. Even if my code and my compiler are identical to what they were before the library changed, my binary still has to be different to point to the new location of the library, since its code changed. So we can’t just patch a library and avoid rebuilding everything that depends on it, because everything has to point to the new library binary.

3 Likes

Unfortunately the binary cache format can’t really do something like this: since it packs everything into NARs, there would be no way to hardlink files inside of those archives.

We could probably do the delta thing if we had an attr -> store path mapping (or store path -> attr) somewhere for all Hydra evaluations, but it would make Nix dependent on Hydra, which is not great.

It also has the problem that most users won’t be upgrading between the same versions as Hydra is. The diff between builds in Hydra is between very nearby commits, whereas users are likely to be upgrading between commits that are weeks apart. We’d have to keep diffs between HEAD and more than one previous commit, which would only exacerbate the storage problem.

EDIT: Plus, the binary cache protocol would then get more complicated. User machines would have to ask the cache “I need this path, which paths do you have diffs against for it?”, the cache would answer “I have diffs against these paths”, and the client would reply “OK, I have that one, please send the diff”.

Is the current Nix store design a real dead end with regard to reducing resource usage? :confused:

Not necessarily.

We probably need to increase the substitution granularity, though. I wrote an article evaluating the efficiency of some potential solutions to this problem earlier this year: Nix Substitution: the Way Forward

The TL;DR being: adding an output-addressed chunk/file store would greatly reduce the amount of data we’d have to download in most use cases.

3 Likes

Has anyone ever considered using something like BitTorrent or IPFS to share the distribution load?

I’ve written GitHub - input-output-hk/spongix: Proxy for Nix Caching, which splits NARs into chunks using GitHub - folbricht/desync: Alternative casync implementation and saves roughly 80% of space (this always depends on the contents of the cache, of course).
The remaining issue is that Nix won’t down-/upload those chunks directly, so the transfer is still as inefficient as always.
Implementing chunking in Nix may work for uploading, but downloading would need a separate chunk store, and the end result would be about 20% additional disk overhead because the /nix/store still needs to be uncompressed.
One alternative for that would be FUSE, but that’s going to be slow and probably buggy, as well as restricted to a few platforms; in theory, though, you’d trade time for space if that’s critical.

6 Likes