Distributed Nix Cache Server with Cachix Compatibility

Hello! I’d like to share my first Nix-related project with you and kindly ask for your feedback.

Introduction
This project, currently named cache-server, was developed as part of my bachelor’s thesis at Brno University of Technology. It is a fork of an earlier project by Marek Križan, which was likewise developed as a thesis.

Problem Being Solved
The goal of the project is to offer a flexible and scalable cache management solution. It enables the creation of a network of interconnected cache servers with multiple cache instances, offering different storage options and distribution strategies. Currently, it supports only local and S3 storage, but it’s easy to extend to support more storage backends.

A Nix Python package for the OpenDHT library was also created during development.

Key Features

  • Uses OpenDHT for decentralized access to cache servers across multiple nodes.
  • Supports the Cachix client for pushing, fetching, and deploying binaries.
  • Supports local and S3 storage backends with optional storage strategies for distributing data across multiple storage instances.
  • Simple YAML configuration with per-cache settings, supporting multiple cache instances on a single server.
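To illustrate what “per-cache settings with multiple cache instances” could look like, here is a purely hypothetical configuration sketch (all key names are made up for illustration; the README contains the project’s real schema and examples):

```yaml
# Hypothetical sketch only -- key names are illustrative,
# not the project's actual schema (see the README for real examples).
caches:
  team-cache:
    retention-days: 30
    storages:
      - type: local
        path: /var/cache-server/store-a
      - type: s3
        bucket: nix-cache-overflow
        endpoint: https://s3.example.com
    strategy: fill-sequentially   # fall through to S3 once local is full
  public-cache:
    retention-days: 90
    storages:
      - type: local
        path: /var/cache-server/store-b
```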

Inspiration from Attic
The project was inspired by Attic.
Key differences in this implementation are:

  • Support for distributed cache nodes
  • Built-in Cachix compatibility
  • YAML configuration for ease of use

How to Get Started
To try out the project, follow the installation instructions. Next, create the server configuration file using the examples provided in the README. Start the server to generate the authentication keys. After that, set up your Cachix client. It’s pretty straightforward, but if you encounter any issues, feel free to reach out!

Current Status
The project is functional but still has some bugs to work out. I know there’s a lot of room for improvement, and ongoing work is focused on stabilizing the system.

Feedback
As this project is part of my bachelor’s thesis, any feedback or suggestions you provide could contribute to both the development of the project and the content of my thesis. I would be grateful for your insights, as they may help shape the project. Feel free to share your thoughts, and if you see any areas for improvement, don’t hesitate to let me know!

Additionally, there is a section in the README that outlines the known limitations of the project, so take a look if you’re interested.

Contributions are welcome.
Thank you for checking out my project!

31 Likes

Does it support deduplication like attic?

1 Like

Cool work! Hope the thesis went well. 🙂

1 Like

It doesn’t support global deduplication like attic does.

Each cache tenant can have multiple storage backends, and public tenants (or even whole nodes) are mutually trusting. Some duplication can occur at the tenant level, but not at the storage level.

Thanks for the comment — improving deduplication is definitely something the project could benefit from.

7 Likes

Great to see something with cachix-deploy capability! I’d love to see caches take a more central role in building and deploying NixOS machines in the future, and having a FOSS implementation of cachix deploy seems like a good step in that direction.

The distributed part of this seems interesting as well, but I’m not sure I understand the benefits. It can have a local filesystem store, an s3 store, or some mix of both? S3 is already highly available, so what are we gaining here? I’d be tempted to run multiple Attic instances on the same S3 and an HA postgres deployment instead. I guess the point here is that OpenDHT replaces postgres, so this is friendlier to a K8S type environment?

I’m also quite hesitant about mixing a permanent store with a volatile db. From my limited experience, it seems like having a distributed key value database that can live in persistent storage, such as TiKV, would be more appropriate. Since OpenDHT is in memory, what happens after complete failure of all nodes? Is there any way to recover the lost information? If the hash table’s destroyed in such a way, can we at least perform garbage collection to clean up the orphaned paths, or will it forever be stuck in S3 without manual intervention? And on the subject of garbage collection, are there plans to add garbage collection strategies such as LRU?

It’s always nice to see more cache solutions around Nix! Especially ones like this that bring in some novel ideas!

1 Like

Yes, you can freely combine different storage options as needed. It’s possible to use multiple local or S3 backends per tenant, or even share the same storage backend across multiple tenants. This flexibility is especially useful for local storage setups, where you can configure multiple storage locations or devices and define how data is distributed—for example, by filling them sequentially. You can also configure an S3 backend as a fallback, which will only be used once all local options are full. Naturally, if you’re only using S3, this setup won’t offer much additional benefit.
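To make the “fill sequentially, S3 as fallback” idea concrete, here is a minimal sketch of how such a distribution strategy could be expressed. The class and function names are hypothetical stand-ins, not the project’s actual API:

```python
# Illustrative sketch of a "fill sequentially" strategy with a
# fallback backend; names are hypothetical, not the real code.
class Storage:
    """Minimal stand-in for a storage backend with a capacity limit."""
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.used = 0

    def has_room(self, size):
        return self.used + size <= self.capacity

    def store(self, size):
        self.used += size


def pick_storage(storages, size):
    """Return the first backend (in configured order) with room left.

    With local backends listed first and an S3 backend last, this
    fills local storage sequentially and only falls back to S3 once
    every local option is full.
    """
    for storage in storages:
        if storage.has_room(size):
            return storage
    raise RuntimeError("all storage backends are full")
```

The ordering of the configured list is what encodes the strategy: local backends first, the S3 fallback last.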

I guess the point here is that OpenDHT replaces postgres, so this is friendlier to a K8S type environment?

I’m not very familiar with Kubernetes, so if this design happens to be a better fit there, that’s more of a happy accident than a deliberate choice.

It’s true that this approach mixes permanent storage with the more volatile nature of OpenDHT. However, each cache server also includes a SQLite database, while OpenDHT is only used for distribution and lookup across nodes. Even if all nodes go offline, the system can automatically reload all data from the local database. If I understand correctly, this is quite similar to what you proposed with TiKV. Moreover, data in the DHT network persists while nodes are running, and after a restart, they can be reinserted. Replication in OpenDHT also helps keep the data in the network for a while, even if some nodes temporarily go down.
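The recovery idea boils down to: SQLite is the durable source of truth, and after a restart the node re-announces everything it knows into the DHT. A rough sketch, with a hypothetical schema and an injected `announce` callback standing in for the actual OpenDHT put:

```python
import sqlite3

# Sketch of the recovery path described above. The `paths` table and
# the `announce` callback are hypothetical stand-ins for the real code;
# in practice `announce` would wrap an OpenDHT put.
def reannounce_paths(db_path, announce):
    """Reinsert all locally known store paths into the DHT."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("SELECT store_hash, nar_url FROM paths")
        count = 0
        for store_hash, nar_url in rows:
            announce(store_hash, nar_url)
            count += 1
        return count
    finally:
        conn.close()
```

Because the DHT only ever holds a redundant copy of what SQLite already stores, losing every DHT node costs availability of the lookup layer, not data.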

Garbage collection runs periodically (every 12 hours by default) and removes orphaned paths as well as any data that has exceeded its retention period (which is defined for all caches in the configuration). At the moment, there are no additional garbage collection strategies planned, but I agree it would be a good idea to make this more configurable in the future.
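The retention part of that periodic pass is simple to sketch. This is an illustration only (the field names are hypothetical, and orphan detection is omitted): each entry older than the configured retention period is marked for deletion.

```python
import time

# Sketch of the retention-based sweep described above; not the
# project's actual implementation. `entries` maps a store hash to
# its timestamp (epoch seconds); orphan detection is omitted.
def sweep_expired(entries, retention_seconds, now=None):
    """Return (kept, expired) partitions of cache entries."""
    now = time.time() if now is None else now
    kept, expired = {}, []
    for store_hash, timestamp in entries.items():
        if now - timestamp > retention_seconds:
            expired.append(store_hash)
        else:
            kept[store_hash] = timestamp
    return kept, expired
```

An LRU-style strategy would be a natural extension: sort by timestamp and evict oldest-first until the storage backend is under a size threshold, rather than using a fixed age cutoff.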

I hope this clarifies things a bit.
Thanks for the feedback, it’s much appreciated!

4 Likes

Yes this makes a lot more sense now! The sqlite db was the main piece I was missing.

Regarding K8S, distributed applications tend to work pretty well with it out of the box; usually you’d need some glue to connect peers together. Although with DHT’s bootstrap mechanism you may not even need that: it looks like you could just load-balance the bootstrap endpoint and connect to that to get things up and running. That being said, there are a few steps that may need to be taken to make Kubernetes integration “magical”:

  • Add liveness, readiness, and startup endpoints
    • liveness indicates the service is running
    • readiness indicates the service is willing to receive traffic
    • startup blocks liveness and readiness until after it returns truthy
    • truthy is usually 2xx, falsey is any of the error codes. Usually I like 201 and 503.
    • so in this case, we’d want to make sure readiness isn’t true until after dht bootstrap, so pods can’t bootstrap to themselves and end up in a split-brain scenario.
  • there would need to be some sort of contingency for what to do if no other services are ready for bootstrapping (i.e. the service is the first node), I’m not sure how to do this.
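The probe gating above can be sketched as a tiny state machine; names here are hypothetical. The key point is that readiness stays falsey (503) until the DHT bootstrap completes, so a load balancer never routes a fresh pod to a peer set that includes only itself:

```python
# Hypothetical sketch of liveness/readiness/startup gating; the real
# service would expose these as HTTP endpoints for Kubernetes probes.
class HealthState:
    def __init__(self):
        self.process_up = False      # set once the service has started
        self.bootstrapped = False    # set after a successful DHT bootstrap

    def startup(self):
        # Blocks the other probes until the process is up at all.
        return 200 if self.process_up else 503

    def liveness(self):
        # Alive as soon as the process is running.
        return 200 if self.process_up else 503

    def readiness(self):
        # Ready (willing to receive traffic) only after bootstrap,
        # which prevents the split-brain scenario described above.
        return 200 if self.process_up and self.bootstrapped else 503
```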

This is analogous to having multiple instances of the cache server all behind a proxy. You want to be able to flip them all on and point the bootstrap to the load balancer and have it “just work.” There’s two problems:

  1. Split brain from bootstraps creating partitions
    • can be solved by only marking the server as ready (via health check endpoint) after bootstrapping
  2. The first server will be configured to bootstrap just like all the others, but there will be no reachable servers to bootstrap from
    • maybe add a retry count before it gracefully gives up?
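Point 2 could look roughly like this: retry the bootstrap a bounded number of times, and if nobody is reachable, give up gracefully and proceed as the first node. `try_bootstrap` is a hypothetical callable returning True on success:

```python
import time

# Sketch of a "first node" contingency; names are hypothetical.
def bootstrap_with_retry(try_bootstrap, attempts=5, delay=1.0,
                         sleep=time.sleep):
    for _ in range(attempts):
        if try_bootstrap():
            return True    # joined an existing DHT
        sleep(delay)
    return False           # nobody reachable: proceed as the first node
```

On `False`, the server would simply start serving and let later pods bootstrap against it via the load balancer.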

I’m just learning K8s so maybe there’s a better way? I’m also not sure how likely it is that anyone in the Nix community will use this with K8s. But I do think it could be useful for the analogous load-balancer scenario, so long as that first server knows what to do.

1 Like

Happy to see that it supports garbage collection, which is often an afterthought in other implementations: cache-server/cache_server_app/src/cache/base.py at 28e431a27f6df27f2fc7f5a7ea839134de70d5de · mifka01/cache-server · GitHub

Quickly skimming over the code, does this actually check for references, i.e. so it doesn’t delete nars still referenced by other live nars? Nix doesn’t really fail nicely if those are dangling.
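A reference-aware sweep would essentially be mark-and-sweep over the narinfo reference graph: walk the references from the paths you intend to keep, and only delete what is unreachable. A rough sketch, where `refs` is a hypothetical stand-in for parsed `.narinfo` References fields:

```python
# Sketch of reference-aware GC; `refs` maps a store path to the
# paths it references (as parsed from .narinfo files).
def live_closure(roots, refs):
    """Return every path reachable from `roots` via references."""
    live, stack = set(), list(roots)
    while stack:
        path = stack.pop()
        if path in live:
            continue
        live.add(path)
        stack.extend(refs.get(path, ()))
    return live


def deletable(candidates, roots, refs):
    """Candidates that no live path still references."""
    return set(candidates) - live_closure(roots, refs)
```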

8 Likes

You’re right, it doesn’t check for references. Thanks for pointing it out, I’ll look into fixing that.

3 Likes