Improve deduplication with late binding?

Problem statement:

If/when Nix switches to content-addressed storage (CAS), binaries that reference different dependencies but are otherwise identical will not deduplicate.

Example:
curl uses openssl. openssl gets a patch release which results in a small change in the .so file but nothing else.
This results in curl rebuilding. The resulting binary will (should) be the same, except for the reference to openssl being updated.
The checksum is different and everybody needs to install the new version; the unchanged supporting files will be hardlinked, but the binaries will have a few different bytes.

In “regular” Linux distros, simply updating the openssl package would be sufficient. On low-powered devices, that is preferable.

Solutions

If curl were somehow able to decide which openssl to use at runtime, the curl package would be unchanged and its dependants would not need rebuilding. There could be a separate wrapper package that wraps binaries and libraries, perhaps by patching ld.so. This wrapper would be tiny; all dependent packages could then skip a rebuild by only updating their own wrappers, while the bigger wrapped packages remain unchanged and unbuilt.
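As a minimal sketch of that wrapper idea (using a plain shell script and LD_LIBRARY_PATH rather than actually patching ld.so, and with made-up stand-ins for the real store paths):

```shell
root=$(mktemp -d)
wrapped=/bin/echo                  # stands in for the unwrapped curl binary
libdir=$root/openssl-1.0.1/lib     # stands in for the upgraded openssl
mkdir -p "$libdir"

# Only this tiny wrapper changes on an openssl upgrade; the wrapped
# binary keeps its bytes (and, under CAS, its hash).
cat > "$root/curl" <<EOF
#!/bin/sh
export LD_LIBRARY_PATH=$libdir\${LD_LIBRARY_PATH:+:\$LD_LIBRARY_PATH}
exec $wrapped "\$@"
EOF
chmod +x "$root/curl"

"$root/curl" hello                 # behaves exactly like the wrapped binary
```

A real implementation would presumably generate one such wrapper per binary in a tiny separate derivation, so an openssl bump only rebuilds the wrappers.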

Another option might be to generate diffs for all builds with the same version. It wouldn’t help with skipping rebuilds, and local diskspace would still be impacted, but there would be less to download.
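To make the “few different bytes” concrete, here is a toy sketch: two stand-in “builds” that differ only in the embedded dependency reference (the hashes are invented), and a count of how many bytes a delta would actually need to carry:

```shell
# Two stand-in builds of the same curl, differing only in the embedded
# openssl reference.
printf 'curl -> /nix/store/aaaa-openssl-1.0.0\n' > build-a
printf 'curl -> /nix/store/bbbb-openssl-1.0.1\n' > build-b

# cmp -l prints one line per differing byte: only the hash characters and
# the last version digit differ, so a delta download would be tiny.
cmp -l build-a build-b | wc -l     # 5 differing bytes
```

A real implementation would run a binary-diff tool over the store outputs instead, but the ratio is the point: a few bytes of delta versus re-downloading the whole binary.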


Would this wrapper query the derivation or the store in some way to discover the information it needs? In a sense, this is an impurity, though perhaps a useful one. Or is there a way to make this pure?

Is this basically overwriting or making LD_LIBRARY_PATH (or ld.so) “Nix-aware”?

Well, each executable could be wrapped by something compiled that loads the wrappee and applies the hardcoded dependencies. Then it wouldn’t need to query anything, it would be like a “ld.so script” (or dyld).

These wrappers would be in a separate package, which would need to be recreated, but the wrapped packages can remain unchanged. I imagine a similar thing is possible for libraries.

As I understand it, it would be like:

openssl 1.0.0 is built. We build curl 1.0.0 with openssl 1.0.0. We then (later) build openssl 1.0.1, with no API incompatibility. We then replace pkgs.curl: from curl 1.0.0 to curl 1.0.0 built with openssl 1.0.0 but wrapped to use openssl 1.0.1.

This seems doable while keeping purity, but the override would need to be specified manually, and someone who has the incompatible openssl 0.9.0 compiled would need to build both openssl 1.0.0 and openssl 1.0.1. This still looks like an interesting idea for the most commonly used libraries.

If I understand right, this building involves downloading from the binary cache, which is, for most users, much easier than building from source.

Indeed, it will be downloaded from the binary cache if available.

Not quite - openssl 1.0.1 leads to curl 1.0.0 being rebuilt, but since the only references are runtime bindings, the build results in the same output hash.

I’ve had more of a think, and the benefits I’d like to see are about anything but runtime, and therefore it seems wrong somehow to complicate runtime behavior.

I wrote down how to deduplicate the Nix store with a git repo, which unfortunately results in more disk space use vs only having a store, but has great benefits for building and distribution.

Plus, if we can make a filesystem that uses this git repo as a backing store, it will actually take less disk space than the current store.

I am not sure how relevant this is, but OSTree is a content-addressed object store.


Did not know that! It looks pretty similar in concept, but it doesn’t use git, so the comparison still applies, especially regarding diff packing and diff downloading.

Furthermore, what I’m proposing removes store references before storing a file, so that there’s a lot more deduplication than simply hardlinking (which nix-store already does), and it stores data for the Nix store.
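A toy illustration of that reference-rewriting step (the paths are invented; a real scheme would normalize actual /nix/store hashes before content-hashing the file):

```shell
# Two outputs that differ only in a dependency's store hash.
printf 'ref: /nix/store/aaaa-openssl/lib/libssl.so\n' > out-a
printf 'ref: /nix/store/bbbb-openssl/lib/libssl.so\n' > out-b

# Rewrite store hashes to a fixed placeholder before storing the file:
# both normalized files hash identically, so only one object is kept.
sed -E 's|/nix/store/[a-z0-9]+-|/nix/store/????-|g' out-a | sha256sum
sed -E 's|/nix/store/[a-z0-9]+-|/nix/store/????-|g' out-b | sha256sum
```

The rewritten references would of course have to be restored (or resolved) when materializing the file back out of the store.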

So thanks for the pointer, this shows me that I’m on the right path 🙂

Thought about this some more - runtime binding is not really helpful.

Suppose you implemented runtime binding with a separate file that explains which libraries to load. So then if a library dependency causes a rebuild, and only that one file changes, you know that you could patch the store path of this build in all its dependants. However, that’s no guarantee that the builds are correct. To check that, you’d still need to build everything.
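For concreteness, a sketch of what that separate file might look like and the one-file patch an upgrade would make (the layout and names are invented for illustration):

```shell
# Hypothetical sidecar file listing a binary's runtime library bindings.
mkdir -p pkg/curl
printf 'libssl -> /nix/store/aaaa-openssl-1.0.0/lib\n' > pkg/curl/bindings

# An openssl upgrade rewrites only this small file in each dependant;
# the binaries themselves are untouched, so their hashes survive.
sed -i 's|aaaa-openssl-1.0.0|bbbb-openssl-1.0.1|' pkg/curl/bindings
cat pkg/curl/bindings
```

Which is exactly the problem: the patch is cheap, but nothing about it proves the dependant still works against the new library.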

So then you’d only get some benefit of having smaller delta changes to download and some better hardlinking, but that would already be solved generally and safely with the Nix store in a git repo.

> However, that’s no guarantee that the builds are correct. To check that, you’d still need to build everything.

Would lazy binding require package maintainers to denote when soname changes happen? Gentoo has something like this: a package’s “subslot” field indicates ABI compatibility, and when it changes, Portage knows to rebuild reverse dependencies. If a library’s subslot doesn’t change on upgrade, no further rebuilds are needed. This largely just works, and personally I’ve never had issues with it, despite the lack of correctness guarantees. But it does require the maintainer to record the correct metadata. I feel that lazy binding on the whole would not be pure enough for the Nix crowd (or making it so would be too complex), as much as I would love to see a feature like this.

Myeah, but that would require maintainers to carefully maintain ABI records and packages to always be correct.

And even then there could be packages that statically include code and they would need recompiling anyway.

Just recompiling the dependants tree is safest.