NixOS S3 Long Term Resolution - Phase 1

ron · December 5, 2023, 5:07pm

Quick Recap:

As we need to expedite our timelines in optimizing our S3 Cache, the NixOS Foundation is supporting a “Phase 1” effort via initial funding of 10,000 EUR.

The total estimated project funds are 30,000 EUR and we are also announcing the Open Collective Project for those that wish to donate and support the effort!

We hope that this effort also helps us learn how to go about allocating funds to the community. If you’re interested in partaking in the wider community discussion about that, please do reach out and/or visit NixCon Governance Workshop - Announcements - NixOS Discourse.

Initial Background & Prior Recaps

After we reached a short term resolution for our S3 binary cache situation (NixOS S3 Short Term Resolution! - Announcements - NixOS Discourse), a number of awesome folks across the community stepped up to begin researching a long term solution.

The community members involved have been sharing updates on progress and discussion.

About a month ago, an ad-hoc working group was formed following the creation of the issue Garbage-collect cache.nixos.org · Issue #282 · NixOS/infra · GitHub. The first meeting happened on the 24th October (notes and further notes can be found in: 2023-10-24 re: Long-term S3 cache solutions meeting minutes #1).

A team composed of @zimbatm, @edolstra, @RaitoBezarius , @flokli and @edef quickly formed to answer multiple questions, notably the cache’s bucket growth per year and our need to reduce our cache footprint in AWS.

From this team’s activities, two solutions emerged:

Garbage collection of historical data
Deduplication of historical data

The team found that the cache’s bucket growth per year was increasing, implying that a brutal garbage collection would be a short term solution, one that causes us to lose historical data and buys us an unknown amount of time. Our estimates suggest it would buy us only 6 months to 2 years, most likely around 1 year, depending on how Nixpkgs needs evolve.

After this, there was an initial agreement to prioritize the deduplication solution and only perform the garbage collection as a last-resort measure, if needed at all.

During the following month, @flokli and @edef, in charge of the deduplication solution, worked in their free time to build a bespoke set of tools to analyze cache.nixos.org. For example, they created a fast .narinfo parser that imports all of the data into ClickHouse to perform various analytics and guide the solution. Many more examples of the team’s tooling can be found in the detailed notes.
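To give a feel for what such a parser deals with, here is a minimal Python sketch that reads the "Key: value" lines of a .narinfo file into a record. It is not the team's tool (which is much faster and feeds ClickHouse); the NarInfo class, the parse_narinfo function and the sample values below are made up for illustration, and only a subset of fields is kept.

```python
from dataclasses import dataclass, field


@dataclass
class NarInfo:
    """Subset of .narinfo fields that matter for size analytics (illustrative)."""
    store_path: str
    url: str
    compression: str
    file_size: int   # size of the compressed NAR in bytes
    nar_size: int    # size of the uncompressed NAR in bytes
    references: list = field(default_factory=list)


def parse_narinfo(text: str) -> NarInfo:
    """Parse 'Key: value' lines; values may themselves contain colons."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if not sep:
            continue  # skip blank or malformed lines
        fields[key.strip()] = value.strip()
    return NarInfo(
        store_path=fields["StorePath"],
        url=fields["URL"],
        compression=fields.get("Compression", ""),
        file_size=int(fields.get("FileSize") or 0),
        nar_size=int(fields["NarSize"]),
        references=fields.get("References", "").split(),
    )


# Made-up sample; hashes and sizes are placeholders, not real cache entries.
SAMPLE = """\
StorePath: /nix/store/00000000000000000000000000000000-hello-2.12.1
URL: nar/aaaa.nar.xz
Compression: xz
FileHash: sha256:aaaa
FileSize: 50000
NarHash: sha256:bbbb
NarSize: 226000
References: 11111111111111111111111111111111-glibc-2.38 00000000000000000000000000000000-hello-2.12.1
"""

info = parse_narinfo(SAMPLE)
print(info.store_path, info.file_size, info.nar_size, len(info.references))
```

Records like this, one per store path, are the kind of rows one would load into an analytics database to ask questions about sizes and references.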

The team looked at:

Fastly logs
S3 bucket logs
A SQLite database that @edolstra provided which contains a mapping of channel bumps to store paths (and their size)

Data analysis is still ongoing and is focusing on questions like “what would be the request rate to cold paths that would be deduplicated?” that might shed light on the scalability of the reassembly component, i.e. the piece of software responsible for reassembling the deduplicated pieces into a full NAR that gets cached by the Fastly CDN.

In the meantime, we discussed the potential gains we might see with deduplication, which are hard to estimate. In the past, projects like nix-casync provided deduplication down to 20% of the original uncompressed size.

In the past weeks, @flokli and @edef have started tuning the fast content-defined chunker parameters to determine the optimal parameters with respect to the cache data.

To do so, they took multiple channel bumps, ingested them on a server in the same AWS region as our bucket, read through all NARs of a given channel bump, uncompressed and decomposed the NARs, fed all blobs into a content-defined chunker, and recorded the deduplication metadata: chunk length (compressed and uncompressed) and digest. The chunk data itself was not stored, as this process was mostly there to find good parameters that balance chunk size against compression (i.e. when chunks are smaller, the compression context window is smaller and compression performs worse, but we get a better chance to deduplicate).
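The process described above can be sketched roughly as follows. This is a toy content-defined chunker in Python, assuming a simple gear-style rolling hash, SHA-256 digests and zlib as a stand-in compressor; the announcement does not name the hash, digest or compression actually used, and the function names and parameters (avg_bits, min_size, max_size) are hypothetical.

```python
import hashlib
import random
import zlib

# Deterministic table of random 64-bit values for the rolling hash (assumption).
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def chunk_boundaries(data: bytes, avg_bits: int = 12,
                     min_size: int = 1024, max_size: int = 16384):
    """Yield (start, end) offsets; cuts depend on content, not on position."""
    mask = (1 << avg_bits) - 1
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # The low avg_bits bits of h depend only on the last avg_bits bytes.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        if length < min_size:
            continue
        if (h & mask) == 0 or length >= max_size:
            yield start, i + 1
            start = i + 1
            h = 0
    if start < len(data):
        yield start, len(data)


def record_chunks(blob: bytes, seen: dict) -> None:
    """Record digest -> (uncompressed length, compressed length); drop the data."""
    for start, end in chunk_boundaries(blob):
        chunk = blob[start:end]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in seen:
            seen[digest] = (len(chunk), len(zlib.compress(chunk)))


# Two made-up blobs: the second is the first with a few bytes prepended,
# which would defeat fixed-size chunking but not content-defined chunking.
data_rng = random.Random(1)
blob_a = data_rng.randbytes(200_000)
blob_b = b"a small header change" + blob_a

seen = {}
record_chunks(blob_a, seen)
record_chunks(blob_b, seen)

total_input = len(blob_a) + len(blob_b)
unique_bytes = sum(length for length, _ in seen.values())
print(f"unique chunk bytes: {unique_bytes / total_input:.0%} of the input")
```

Lowering avg_bits gives smaller chunks and more chances to deduplicate, at the price of more metadata and worse per-chunk compression, which is the trade-off being explored.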

During that process, they uncovered that xz decompression is a big bottleneck while ingesting the existing NARs. After ingesting 3 channel bumps, each separated by months, the recorded total size of the data was 71% of the original compressed size (including metadata). Note that these numbers come from small samples of the entire dataset, using one picked chunk size and store paths further apart in time than we usually have in channel bumps, which constructs a bad-case scenario. Adding another two channel bumps, each two weeks apart, brought the recorded size down to 65% of the original compressed size. One of the next steps is to further explore the parameter space to figure out which parameters make sense for cache.nixos.org as a whole.
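For readers wondering how a figure like "X% of the original compressed size (including metadata)" can be derived from the recorded chunk metadata, here is a small Python sketch. All numbers are invented toy values, not the measured data, and the metadata cost per chunk reference is an assumed constant.

```python
def cumulative_recorded_ratio(bumps, metadata_bytes_per_ref=10):
    """Return the recorded/original size ratio after each ingested bump.

    bumps: list of (original_compressed_size, [(digest, compressed_len), ...]).
    Each unique chunk is stored once; every chunk reference costs metadata.
    """
    seen = set()
    stored = 0     # compressed bytes of unique chunks
    refs = 0       # chunk references, each needing a metadata entry
    original = 0   # size of the NARs as currently stored (compressed)
    ratios = []
    for original_size, chunks in bumps:
        original += original_size
        for digest, compressed_len in chunks:
            refs += 1
            if digest not in seen:
                seen.add(digest)
                stored += compressed_len
        ratios.append((stored + refs * metadata_bytes_per_ref) / original)
    return ratios


# Toy example: the second bump shares chunks "a" and "b" with the first.
toy_bumps = [
    (1000, [("a", 350), ("b", 450), ("c", 300)]),
    (1000, [("a", 350), ("b", 450), ("d", 320)]),
]
ratios = cumulative_recorded_ratio(toy_bumps)
print(ratios)
```

In this toy case a single bump costs more than the original (per-chunk compression is worse than compressing the whole NAR), and the ratio only drops below 1 once shared chunks start to accumulate, which is why bumps that are closer in time improve the result.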

Next Steps & Initial Plan

As we approach the halfway mark of our 12-month AWS funding (9,000 USD/month), the urgency for a sustainable solution is higher, especially given the significant growth in our cache expenses (November charge: 13,728 USD, split between S3 storage at 8,696.88 USD and data transfer at 4,776.67 USD).

We’re excited to announce a major step forward: an initiative led by @flokli and @edef, targeting an accelerated milestone for the long term resolution plan:

Cache analytics: support technical decision-making on where to deploy things with respect to our needs for performance (latency, parallel requests, etc.)
Deduplication analytics: explore the parameter space of the chunker and data structures for metadata.
NAR reassembly: extend “nar-bridge” to support operating where NAR reassembly would happen (AWS Lambda or Fastly Compute@Edge), and to support the storage model used
Enablement: extend Fastly 404 handler to reroute historical data to this new S3 store and delete the old data from the main S3 binary cache
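To illustrate the NAR reassembly idea from the list above, here is a minimal Python sketch: a store keeps each unique chunk once, a manifest lists the chunk digests of one NAR in order, and reassembly streams the chunks back out. This is not nar-bridge's actual API or storage model; ChunkStore, reassemble, the SHA-256 digests and the zlib compression are all assumptions made for the example.

```python
import hashlib
import zlib
from typing import Iterator, List


class ChunkStore:
    """Toy in-memory chunk store keyed by the digest of the uncompressed chunk."""

    def __init__(self) -> None:
        self._chunks = {}

    def put(self, chunk: bytes) -> str:
        digest = hashlib.sha256(chunk).hexdigest()
        # Identical chunks are stored only once.
        self._chunks.setdefault(digest, zlib.compress(chunk))
        return digest

    def get(self, digest: str) -> bytes:
        chunk = zlib.decompress(self._chunks[digest])
        # Verify integrity before handing data to the client.
        if hashlib.sha256(chunk).hexdigest() != digest:
            raise ValueError(f"corrupt chunk {digest}")
        return chunk

    def __len__(self) -> int:
        return len(self._chunks)


def reassemble(store: ChunkStore, manifest: List[str]) -> Iterator[bytes]:
    """Stream the chunks of one NAR in manifest order."""
    for digest in manifest:
        yield store.get(digest)


# Made-up pieces standing in for the chunks of a NAR; "same" occurs twice.
pieces = [b"head", b"same", b"same", b"tail"]
store = ChunkStore()
manifest = [store.put(piece) for piece in pieces]

nar = b"".join(reassemble(store, manifest))
print(len(store), "unique chunks for", len(manifest), "references")
```

Because the output is produced as a stream, a component like this could sit behind a CDN, which would then cache the fully reassembled NAR.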

As timing is crucial on this project, the NixOS Foundation will fund the first 10,000 EUR to enable us to get going. In parallel, we plan to open up an Open Collective project to raise the remaining ca. 20,000 EUR for those who want to take part in helping make progress on our S3 cache needs.

If you want to support the project please visit the Open Collective project page or reach out to us at the foundation (foundation@nixos.org).

Background on Funding/AWS and Additional Context

As cache.nixos.org serves as a critical resource to the community, it is essential for the Foundation to empower active contributors to expedite work on critical areas via funding. In this case, deduplication efforts can significantly enhance the efficiency of cache.nixos.org. This, in turn, can lead to additional benefits such as supporting contributors relying more on the cache, e.g. debug symbols. This task is difficult and requires prior expertise with how Nix has been storing things in the past and knowledge of state-of-the-art solutions that are better at storing things without compromising on performance. Furthermore, all of this has to be done in a tight timeline without compromising the integrity of the data manipulated.

In this instance, the Foundation will be investing into the deduplication group to provide a sustainable solution to the cache’s size. We considered multiple alternatives, such as:

Performing garbage collection
Removing staging data more aggressively
Removing the copies of NixOS ISOs published on releases.nixos.org
Using off-the-shelf community software such as GitHub - zhaofengli/attic: Multi-tenant Nix Binary Cache

As mentioned above, some of those solutions may only buy us a short amount of time and will require another intervention. Other solutions may significantly degrade the contributing experience in Nixpkgs, since active contributors rely on the presence of the staging data to perform large-scale bisections and root cause analysis of the whole ecosystem ( https://git.qyliss.net/hydrasect/tree/README is an example of such a tool). Finally, in regard to existing tooling, we uncovered that such tools would not necessarily have an optimal out-of-the-box performance, and we have not dug into them to gain familiarity with all the failure modes they could exhibit.

Why does this matter? A wider recap can be found in the NixOS Discourse (The NixOS Foundation’s Call to Action: S3 Costs Require Community Support - Announcements - NixOS Discourse)

Cost Efficiency: Reducing our cache size directly impacts our ongoing expenses, making our operations more economical.
Future-Proofing: As Nix keeps growing, we need to deploy long term strategies that can keep matters as sustainable as possible. This will also help in demonstrating our commitment to sustainability which strengthens our case for continued AWS support or other potential partnerships.
Community Involvement: We’re inviting more hands and minds to join in. Your participation, be it through volunteering, funding, or sharing ideas, is critical.

We want to thank everyone again for rallying around the S3 and Cache needs. Please reach out, participate, and share feedback whenever you can. https://matrix.to/#/#binary-cache-selfhosting:nixos.org

We’d like to extend a heartfelt thanks to the Open Source teams at AWS for their support and funding. Their assistance has provided us with the essential time and space needed to thoroughly explore and address these challenges at a manageable pace.

This announcement was written by @edolstra @ron and @zimbatm with the help of @RaitoBezarius and @fricklerhandwerk.

rjpc · December 5, 2023, 8:52pm

Donated what I could at the moment. I encourage all users/hobbyists/supporters to donate what they can. Thanks so much for this hard work and thinking this stuff out.

adamcstephens · December 6, 2023, 1:58am

Will the Foundation also be using the OpenCollective or is this just for outside donations? They provide some nice transparency mechanisms, which I think is helpful for the community.

ron · December 6, 2023, 4:50am

We’ve looked at the general topic a few times in the past and I’m super open to finding a way that works to make things as transparent as possible. With OpenCollective the main issue of transferring from the foundation to the collective and back is that we would probably lose a minimum of 5% ( Fees - Open Collective Foundation). Could be worth it but probably something we should consider.

enobayram · December 6, 2023, 7:10pm

I’m not sure if this is the best place to mention this, but I’ve been thinking for a while that building and caching derivations for third parties could be a very good revenue stream for NixOS. This wouldn’t just be a generic CI service: since NixOS infrastructure would be running these builds, it could trust the integrity of the output, sign it, and upload it to the NixOS cache. This would make it very attractive for companies, because their binaries would then be available to Nix users without any configuration change.

Majiir · December 6, 2023, 8:15pm

There were discussions and even some development work in the past to enable distributing the cache across volunteer/donated storage, using technologies like IPFS, Bittorrent, or something else. In my opinion, that type of solution gives the NixOS ecosystem a better chance of surviving a rapid increase in users, and helps ensure a long future for NixOS even if it instead wanes in popularity.

Is there anybody organizing around those kinds of solutions? It sounds like this could be a parallel effort to the CDN optimizations, which have a better shot at helping in the ~1-year horizon. I know from past threads that there are people who would like to contribute to a distributed cache solution, but many would-be contributors may not have enough knowledge about the current obstacles or could use guidance on where they can focus their efforts.

I don’t want to pollute the S3 thread with a different cache strategy altogether, just looking for the right place to discuss it. Thanks!

delroth · December 8, 2023, 6:37pm

This is a multi-year research project with unknown chances of success. To the best of my knowledge, there is ~nobody that’s ever productionized this kind of volunteer-distributed storage. IPFS keeps being brought up, but that project has always overpromised and underdelivered. They’ve engaged with several communities that have this kind of storage need (e.g. Archive Team) and nothing ever came out of it.

There are solutions to get out of AWS that aren’t major research projects with uncertain outcome. It wouldn’t be very hard technically to just run a small storage cluster, either on-prem or at some provider like Hetzner. It comes with reliability tradeoffs (though compared to single-region AWS, maybe not that much, especially in us-east-1…). It would likely be competitive in costs even with our current heavily subsidized AWS bill. But the biggest issue is the lack of leadership / decision making capacity to actually commit to such a project. As long as nobody is empowered to make that kind of decisions for NixOS, I don’t think we’ll ever get anything else than incremental improvements on the status quo.

(I also believe that at least for the next 1-3 years we can get away with “just” those incremental improvements, though.)

nixinator · December 8, 2023, 7:08pm

Yeah, I have a solution…

I made a prototype that distributes NARs over a P2P BitTorrent-like protocol, but it needs more work to become production ready. I believe changes to the nix store layer may help.

The proposal was put into NGI, however at the time it was not successful.

So here we are. I am a firm believer that the cache can be securely distributed by its users, for its users who want to donate bandwidth to Nix/OS, reducing the reliance on centralised CDNs.

I’m a believer in IPFS for ‘certain’ types of data… but the size of the DHT will never scale for locating and publishing content, unless the project can somehow break the laws of physics and computer science all in one fell swoop.

Unfortunately, I’ve heard NGI are no longer directly funding Nix projects, so here we are, a bit dead in the water with the idea.

There are some other internet architecture things that need to change for P2P to work; one of those is IPv6 and computers being able to connect end to end without NAT.

By the looks of it, this will never happen… … If the funding had come in all those years ago…, we would be in a much stronger position to do it, that was 3 years ago.

There are some changes to the store layer that need to happen, the ability to keep the original nar, while ‘using’ the nar, would reduce the storage requirements on the peers by a large magnitude…

interesting stuff!

APCodes · December 8, 2023, 7:18pm

I am a firm believer that the cache can be securely distributed by its users, for its users who want to donate bandwidth to Nix/OS, reducing the reliance on centralised CDNs.

If this were to work, it would surely be awesome!

In the meantime however:

Seeing this thread and the underlying problem now for the first time, I might actually consider donating some money for this! I am not that wealthy so it won’t be much. But still, I am using this service regularly and it is very reliable. I believe many users might not even be very much aware of this sadly.

Edit: I also have to say given the sheer size of the opening post and my tendency to just glance at it, I was lucky to even see that I actually could donate any money for this. Maybe it would be a good idea to make people somehow more aware of this?

Echo51 · December 11, 2023, 9:28pm

I am unsure if it has been mentioned before, I could not find it with a skim of the discourse and matrix, but wouldn’t Sippy (beta) · Cloudflare R2 docs solve the bandwidth cost issue of migrating off of S3 to R2 if that is still a migration path you are considering?

Latency would be a bit higher for the first few reads, but the popular packages should become quickly cached.

nixinator · December 13, 2023, 5:52pm

This may just be robbing Peter to pay Paul. You’re still reliant on, and at the whim and control of, third-party CDNs.

The whole object of the exercise, as I see it, is to remove third-party distribution mechanisms or keep them to an absolute minimum.

I’d rather the NixOS Foundation buy actual skin with the donations they receive, not bandwidth or tin, rather than give it to companies with very dubious ethics.

bouk · December 14, 2023, 7:48am

Have you considered enabling intelligent tiering? I’ve used it myself for our internal nix cache to great success, easy to set up and it’ll lower storage costs for things that are accessed infrequently.

adam248 · December 15, 2023, 1:14am

I would appreciate knowing how the testing of this software went.
If you lacked the time to do a deep test, why not enable a nix config option like nix.experimental.p2p-caching.enable = true; and let the community help test an official p2p solution? If that software doesn’t work, then you could swap the p2p-caching option for a different solution. (p2p-caching.package = attic;)

I believe that if you wish to explore a P2P option then you should tell the community to focus on testing a particular solution. We just need the Foundation to tell us which solution we should focus on to test first.

This way you can continue to focus on the other solutions and let the community work on the p2p solution.
But we do need @ron or someone with some authority to tell the community what p2p solution you would like us to focus on testing and tell us what and how we should report bugs and the like.

One of the main issues with the P2P solution right now is we lack focus. Many different ideas of how to do it. But for a P2P solution to have any chance of working, we need leadership to tell us what to focus on testing.

RaitoBezarius · December 18, 2023, 7:32pm

Hi there, I will speak for myself (i.e. not the Foundation) as a person who coordinated the efforts on the cache. I don’t think the Foundation can offer focus; this is really up to the P2P experts to build a plan, build a solution and showcase it at interesting scales. This takes time, effort, energy and sometimes even resources.

I think a lot of folks did say that P2P solutions are not really realistic for the time being at the nixpkgs scale for storage. For distribution, they could work but as you can see, this is a storage problem, not a distribution one.

Otherwise, I would recommend starting with Tahoe-LAFS, building a cache of 100TB or more with it, and seeing how that goes w.r.t. all the classical properties a cache may require.

adam248 · January 1, 2024, 2:42am

I understand your points and can see that you understand the problem very well.
However, my main point is that the only real way to test a P2P solution properly is to have an official test case.
If we have multiple different P2P solutions that are all testing in a fragmented way then even when a solution shows real promise people will still say that they are not sure if it can scale properly.

I propose that the foundation sit down just for an hour together, pick the best P2P option that they believe has a chance and add a nix service nix.experimental.p2p-cache.enable; then we can test that solution at scale right now. If it breaks, we will know exactly why and either say once and for all that it just can’t scale, or find the main cause and solve it, either by switching to a different P2P solution or patching the existing one.

Otherwise, I believe we will be forever stuck in small isolated testing environments and be forever waiting for “will it scale” question to be answered.

If P2P is going to be a win, then the Foundation will have to pick a winner at some point anyway.

Based on the amount of community interest in the original The NixOS Foundation's Call to Action: S3 Costs Require Community Support - #171 by Federico post, I believe the community really wants to be able to have a chance at helping more than just giving more money or finding the cheapest hosting solution. While interest in this issue is still high, it is the best time to test a P2P solution. If we wait a few years, then the interest in this issue could drop, resulting in less desire to work on a community-run solution.

Also, the nix community is full of really smart people who can handle working through an experimental caching test.

Nix is mostly used by devs who are very invested in this ecosystem and are more than willing to work on making it last for 100 years.

APCodes · January 1, 2024, 11:26am

I propose that the foundation sit down just for an hour together, pick the best P2P option that they believe has a chance and add a nix service nix.experimental.p2p-cache.enable then we can test that solution at scale right now.

I have to admit that does sound like a nice and pragmatic way forward. I have no idea about the underlying technology though. I’d love to try whatever possible solution might be found!

RaitoBezarius · January 1, 2024, 3:36pm

The Foundation does not make technical decisions, so you can sit down for an hour together but I fear this may not lead to the outcome you are looking for.

All the data about the scale and whatnot is public, and someone in the P2P interest group has to drive an effort to build a proof of concept which can answer a bunch of questions.

Adding a nix.experimental.p2p-cache.enable is something that you can already do today in nixpkgs or in anything out of tree. But someone has to build it.

For this, it is necessary that the P2P working group come forward with a working example given the public data, again.

Definitely, but for this, there’s a need for the P2P working group to work on an implementation and such a thing has not been happening so far. No matter how much interest there is in X, if no group can build it, we will have trouble having X.

You do not need an official test case; you seem to be misunderstanding that no amount of official backing can build the test case in your stead. Build the technology and the code, send it to nixpkgs, convince a working group to join the experimentation, collect the data and publish it: that is what I would have done if I were interested in P2P (which I am not).

Solene · January 1, 2024, 4:08pm

What’s your plan to handle data persistence? Someone has to keep all this expensive storage available.

In my opinion, P2P would only help for content distribution, but there is already a deal with a CDN provider and it costs almost nothing for the NixOS project

misuzu · January 1, 2024, 6:33pm

It would be nice to have the ability to “mirror” the nixpkgs cache, similar to other Linux distributions. For example, having a tool that copies all NARs for a given nixpkgs commit or something like that.

adam248 · January 2, 2024, 3:39am

@Solene Firstly, I have discussed my views on the benefits of a global P2P cache at length in the previous posts:

But to summarize:

First we need to separate caching from archiving, as these two have different optimization requirements. Archives require large storage but are not so concerned with bandwidth. Caches, however, care more about bandwidth than storage, as they are updated and pruned regularly. For example: a package that is being used by fewer than 10 people should probably not be in the cache, as those 10 people can build it on their own machines or host a local cache if it is needed for a large set of machines. In such a case there should be a Network Admin who is running the show anyway…

It is my understanding that the only thing that needs data persistence is the tarball archive for software where the source code is not freely available for compiling from source. All other data can be rebuilt if it is missing from the P2P network when or if needed.

Yes, I know that the data persistence problem has been solved for now via the deal with a CDN.
However, bandwidth costs are also an ongoing concern if the Nix community continues to grow. My concern is that the current model seems to have the cache system merged with the archive system; a few people have been asking me how P2P solves data persistence, and I feel like they are missing the point I am trying to make. P2P solves the cache (distribution) side of the problem, not the archive (persistence) side. But if Nix is reproducible (I know there are some limits to this), then persistence is not the main problem in the first place; it is the cache (distribution) side that is the main ongoing problem of concern.

If P2P is successful, then we can have Hydra be the system that seeds the P2P network as the “trusted-node”, and if there are a large number of nodes hosting a certain piece of data then the CDN can be pruned without fear of a loss of cache performance (as the P2P network is supplying the bandwidth). This pruning can be done organically instead of by necessity (if this crisis arises again in the future, it would require a serious manual prune all at once, which could lead to real headaches for many people).

I know there are people who are not keen on P2P for various reasons, but one of the main reasons why I am so strongly for it is that I believe it will make the Nix ecosystem future proof and build redundancy into the system as a whole. (I am thinking 100 years future proof here)

@RaitoBezarius I see that you said someone just needs to make it and add nix.experimental.p2p-cache.enable? But who approves that if not the Foundation? I seem to be missing something here…

If that is the case, then there are a couple of working packages already in nixpkgs from what I have seen. If they can be ported under a main Nix config option, then we can all test together globally and work out the bugs as we find them. For P2P to work, it requires everyone to be on the same network and same protocol to concentrate the collective bandwidth; otherwise P2P won’t work at scale.

There is a chance (10 years in the future) that a successful P2P network can replace most of what is on the CDN, which will save a serious amount of money for someone, which can then be used to pay more Nix devs at the Foundation. Money spent on talent is better than money spent on infrastructure IMHO.

Thanks for the comments I appreciate them very much.