I built a Nix binary cache backed by Git (82% storage reduction)

I recently explored the structural similarities between Nix and Git. This led me to build Gachix, a decentralized binary cache that uses Git internals as the backend.

I wrote a blog post detailing the design, the mapping of Nix stores to Git objects, and benchmarks against tools like harmonia and nix-serve.

https://www.ephraimsiegfried.ch/posts/nix-binary-cache-backed-by-git

Some key results:

  • Storage: Achieved an ~82% reduction in size compared to a standard Nix store due to Git’s deduplication and compression.
  • Latency: Achieved the lowest median latency for retrieval, though average performance lags behind due to some outliers with large files.
  • Decentralization: Because it’s Git, you get a replication protocol for free.

I’d love to hear your thoughts on this!

57 Likes

Interesting idea!

If I understand correctly, Gachix serves .narinfo files. Does it also support serving debuginfo/{buildid} files, like binary caches created with the ?index-debug-info=true option do?

Also, did you compare size to a “live” Nix store or to a compressed file:// binary cache? The latter is compressed with xz, which I expect would offset the size benefits you see.

2 Likes

Is the reduction due to compression? You can compare a compressed binary cache store (using xz or zstd at various levels, etc.) against a local store fully expanded into usable form.

That would help distinguish the benefits between compression and deduplication.

It would also be worth comparing against a local store that has done the hard-linking optimization.

3 Likes

That ratio is better than just compression. I compress my stores with zstd:3 and get ratios of ~2.25-2.5, which is about half of what he achieved. I’m curious about this one as well, and interested to see more.

5 Likes

This is very exciting to me. As you say, Git gives us a lot of the complicated DAG stuff and APIs for free. It looks like the foundations for a really solid distributed cache. I could also see symbolic references serving a purpose similar to profiles or gc roots. How might package signing work, is there a way to integrate it with commit signing? I also wonder, how would Gachix handle an input addressed package with different content outputs on two machines?

1 Like

Can’t wait to store a binary cache in Forgejo.

2 Likes

Thank you!

No, it does not support serving debug files by Build ID yet, but that is a great idea. Since Gachix relies on Git references for lookups, this could definitely be implemented by adding a secondary index that maps Build IDs to commits (e.g. refs/debuginfo/<build-id>).
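As a rough sketch of what that index could look like (throwaway repo, hypothetical Build ID and ref layout):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email gachix@example.com
git config user.name gachix
# Store a "package" as a commit, as Gachix does
echo demo > file
git add file
git commit -qm "store pkg"
commit=$(git rev-parse HEAD)
# Hypothetical ELF Build ID; in practice it would be extracted
# from the package's binaries
buildid=deadbeefcafebabe
# Secondary index: a ref under refs/debuginfo/ pointing at the commit
git update-ref "refs/debuginfo/$buildid" "$commit"
# Lookup by Build ID resolves straight back to the package commit
git rev-parse "refs/debuginfo/$buildid"
```

Since refs are just pointers, the index adds no object storage overhead and replicates with the rest of the repository.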

1 Like

I used a local store without compression and without hard-linking optimization. But I also didn’t optimize the storage for the Git database (with git gc).

But that’s a good point. I will also compare an optimized Nix store to an optimized Gachix store.
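For the Git side, the optimization step is essentially a repack; `nix store optimise` would be the counterpart on the Nix side. A minimal sketch of the Git effect (throwaway repo):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email gachix@example.com
git config user.name gachix
# Two commits with overlapping content, stored as loose objects
seq 1 100 > data; git add data; git commit -qm c1
seq 1 200 > data; git add data; git commit -qm c2
echo "before:"; git count-objects
# Repack everything into a single delta-compressed packfile
git gc -q --aggressive --prune=now
echo "after:"; git count-objects
```

After `git gc`, the loose object count drops to zero because all objects live in one packfile, where Git can also delta-compress similar blobs against each other.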

I’m wondering how well it scales with the overall cache size - in terms of latency and CPU usage for the git operations and (de)compression. The benchmarks I see in the blog and paper are only for latency and not CPU, and for constant size of the cache at 13 GB (compressed).

Yes, I also thought about using symbolic references to mimic user profiles. I wondered whether we could do something similar to a Nix store by merging all the commits a user wanted, but that would not work because of merge conflicts. Instead, we would have to construct a super tree pointing to all the packages (represented as trees) that a user wants.
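Such a super tree can be built with plain Git plumbing, no merging involved. A sketch (throwaway repo, hypothetical profile ref name):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email gachix@example.com
git config user.name gachix
# Two "packages", each represented by its own tree
mkdir pkgA pkgB
echo hello > pkgA/bin
echo world > pkgB/bin
git add .
git commit -qm "two packages"
treeA=$(git rev-parse HEAD:pkgA)
treeB=$(git rev-parse HEAD:pkgB)
# "Profile" super tree that points at both package trees
super=$(printf '040000 tree %s\thello-1.0\n040000 tree %s\tworld-2.0\n' \
        "$treeA" "$treeB" | git mktree)
# Track the profile with a ref, like a gc root
git update-ref refs/profiles/alice "$(git commit-tree -m profile "$super")"
git ls-tree refs/profiles/alice
```

Updating the profile is then just writing a new super tree and moving the ref, which mirrors how Nix profile generations work.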

The problem with commit signing is that it changes the commit hash. We would lose the property that every package is globally associated with exactly one commit hash. But we might be able to use a detached signature mechanism, like signed tags, to verify trust.
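A quick way to see why the detached mechanism helps: a tag is a separate object, so attaching one leaves the package's commit hash untouched (sketch in a throwaway repo; with a GPG key you would use `git tag -s` instead of `-a`):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email gachix@example.com
git config user.name gachix
echo demo > file
git add file
git commit -qm pkg
before=$(git rev-parse HEAD)
# Annotated tag stands in for a signed one; the tag is its own
# object, so the commit it points at keeps its hash
git tag -a -m "trusted by cache key" pkg-sig
[ "$(git rev-parse 'pkg-sig^{commit}')" = "$before" ] && echo unchanged
```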

That’s a tough problem I admittedly have not considered. I didn’t know there could be different content outputs for the same input-addressed package. If that is the case, we could probably only index Nix packages in the Gachix database using Nix content-addressed hashes. Or do you have a suggestion for a working solution to this problem?

1 Like

The use case I was thinking of was akin to Cachix Deploy, where clients could query their symbolic reference and get back a system closure. You can of course do this with a normal Nix store and a database of symbolic references in a var directory, but this seems very clean.

It really shouldn’t happen for the vast majority of input-addressed store objects. [Edit: I guess it’s more common than I thought] Per RFC0062:

Unresolved questions

Caching of non-deterministic paths

A big question is about mixing remote-caching and non-determinism. As Eelco’s phd thesis states, caching CA paths raises a number of questions when building that path is non-deterministic (because two different stores can have two different outputs for the same path, which might lead to some dependencies being duplicated in the closure of a dependency).

The current implementation has a naive approach that just forbids fetching a path if the local system has a different realisation for the same drv output. This approach is simple and correct, but it’s possible that it might not be good-enough in practice as it can result in a totally useless binary cache in some pathological cases.

There exist some better solutions to this problem (including one presented in Eelco’s thesis), but they are much more complex, so it’s probably not worth investing in them until we’re sure that they are needed.

So I largely don’t think it is an issue, but it is also true that the same derivation can produce different outputs in some particularly difficult cases, making it non-deterministic.

The Nix manual, as far as I’ve skimmed it, doesn’t explicitly state that input-addressed outputs aren’t necessarily reproducible, but it doesn’t say they are either. It does say that fixed-output content-addressed outputs are reproducible, which makes sense, because that’s what makes them fixed-output.

The draft for RFC0017 suggests that outputs from derivations are not fixed, and that it’s okay to have one output produce multiple content-addressed store objects:

A note on reproducibility

There is no need for a given $out to always generate the same $cas. It allows better resource use, but doesn’t change anything about this RFC. There is no obligation that a single $out only stores a single $cas entry.

This suggests to me that input-addressed store objects are often, but not necessarily always, reproducible.

As for solutions, I think you’d have to see how much this happens in practice and if any issues occur when it does. It might be a non-issue for 99% of packages. I’m sure someone on here has a better idea for properly managing it.

1 Like

Most packages in nixpkgs are probably not reproducible. Reproducibility is a stated but rarely-realised (ha) goal.

1 Like

«Most» depends on weighting. Of the build dependencies of the Gnome installation ISO, <5% were irreproducible the last time graphical-iso-build-closure - Lila checked.

2 Likes

This really looks very exciting!

I was just looking into content-addressed binary caches again, because my Minio S3 cache (yes, I’ll have to migrate it to Garage at some point) that my CI pushes to already eats up over 1.5 TB of disk space. I didn’t find anything new though, so I gave up and resigned myself to just browsing this discourse for a bit. And then your post comes up, lol.

Anyway, the blog post compares Gachix to established solutions like nix-serve and harmonia, which are, as you point out, not very efficient because they basically just serve local files (OK, they pack them up in NARs, but still).

But are you aware of the other existing binary cache implementations that have a content-addressed storage backend? These are:

Attic, a multi-tenant binary cache:

It uses S3 and (optionally?) PostgreSQL for storage and can deduplicate via its content-addressing. It doesn’t content-address individual store files, though, but chunks NARs using a chunking algorithm: FAQs - Attic.
Unfortunately, I never got it to work reliably for me.

Snix store, part of the Snix reimplementation of Nix in Rust (fka/forked from Tvix):

It provides content-addressing with a per file granularity.

The general architecture of Snix is very modular: all components communicate using gRPC APIs for which both Rust types and Protobuf files exist.

For the store part, there’s a completely Nix-agnostic snix-castore serving Blobs (files) and Directories (=Git trees), while the actual snix-store provides only a PathInfo service on top of that, which translates Nix store paths to their content addresses.

What’s interesting about the last part is that users only need to trust the PathInfo service, because once they’ve got the content hash of a store path, they can securely substitute it from anyone that has it.
I guess the same would also apply to Gachix. While I’m not sure SHA1 hashes should be trusted for that use case, using SHA256 should theoretically be possible (IIRC).

To that point, despite the obvious similarities to Git (Merkle DAG, file storage, etc), the developers decided not to use it as their CA store mainly because of the hash function: snix/web/content/docs/components/castore/why-not-git.md at canon - snix/snix - Snix Project.

One advantage of their chosen hash function, Blake3, is that it supports verified streaming, which is helpful for scenarios where the store is mounted into a system (Virtio and FUSE are supported), but the files should only be fetched once they’re accessed.

4 Likes

I didn’t know about Attic. It’s very interesting that it deduplicates chunks of NARs. I wonder whether this is more efficient than deduplicating blobs and trees.

If a single file is added to a package, Gachix simply creates one new blob and updates the directory tree, leaving the existing file blobs untouched. In contrast, a chunking strategy might produce new chunks not just for the new file, but also for the boundaries where the data stream shifted. This could create more storage overhead and processing work than necessary.

On the other hand, Gachix is inefficient with small changes to large files, as it requires re-uploading the whole blob, while Attic only stores the new chunks. Implementing chunking for large files in Gachix would not only improve deduplication but also likely fix the large file latency issues mentioned in my blog post.
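The blob-level deduplication is visible directly in Git: identical file contents become a single object regardless of path or package (throwaway repo sketch):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# The same unchanged file appears in two package versions...
mkdir pkg-1.0 pkg-1.1
echo "unchanged library code" > pkg-1.0/lib.sh
cp pkg-1.0/lib.sh pkg-1.1/lib.sh
git add .
# ...but Git stores it as one blob: both paths resolve to the
# same object ID in the index
git rev-parse :pkg-1.0/lib.sh
git rev-parse :pkg-1.1/lib.sh
```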

Shortly after I started working on my bachelor thesis, I read about Snix and told my supervisor that I would give up on Gachix because there already exists a Nix CA store with Git-like features. But he told me not to, because he’s an avid Git enthusiast. :laughing:

I really like the modular approach of Snix, that they use BLAKE3 and standardized serialization methods. That’s a big plus. On the other hand, the main benefit of Gachix is that it builds on top of Git’s well tested framework and established synchronization protocols.

4 Likes

Attic worked pretty well for me behind Caddy for TLS termination. My homelab has had some reliability issues, so I got rid of Attic, but I might bring it back later when I rearchitect things. Unfortunately, I don’t use Nix heavily enough (or with good enough metrics) to say how much of an effect Attic’s chunking had.

In related systems, there’s also Styx, which is essentially a deduplicated binary cache, plus a custom substitution protocol to preserve the deduplication across the network while fetching, plus a “filesystem” to preserve the deduplication on local disk. Plus binary diffs and some on-demand stuff.

The backing store can be local disk or S3. The deduplication is chunk-based with file-aligned chunks. In my experiments, content-defined chunking strategies (i.e. ones that can handle “shifted” data) did not provide enough benefit to justify the complexity.

The catch is that currently it supports only its custom substitution protocol and can’t act as a standard binary cache, but that’s actually very easy to add. Let me know if you’re interested in using it that way, I can try to prioritize it.

4 Likes

@EphraimSiegfried : I am curious, was this idea inspired by the bup backup system? Here’s a short description of it:

Very efficient backup system based on the git packfile format, providing fast incremental saves and global deduplication (among and within files, including virtual machine images).

At first glance, this looks highly related to your strategy.

No, I had not heard of bup. But thanks for sharing; it looks interesting (especially how it chunks large files). I’ll look into it in more detail.

1 Like

Regarding non-deterministic paths, does anyone know how other binary caches handle those types of packages? If there are multiple artifacts associated with the same input-addressed path, which artifact is sent to the client?