I think we should recognise that we have multiple different kinds of data with very different requirements. A solution that works for one might not work well for others.
Moving forward, perhaps we should discuss solutions for each individually, rather than finding an all-encompassing solution similar to what we currently have.
This is how I’d differentiate:
Another aspect I’d like to see explored is NAR size vs. latency. Some closures contain lots of rather small paths while others have fewer but larger paths. It might be beneficial to also handle these separately.
That’s impressive deduplication, and I’m sure you could improve it even further by stripping store paths (obviously keeping a tiny record to put them back), but I have yet to come across a deduplicating archiver that’s fast. I highly doubt something like bup would be fast enough to serve data in real time, which is a requirement for any binary cache.
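The store-path-stripping idea could look something like the following sketch (hypothetical code, not an existing tool): rewrite each `/nix/store/<hash>-` reference to a fixed placeholder before chunking, and keep a side record of the original hashes so the file can be reconstructed exactly.

```python
import re

# Store-path hashes make otherwise-identical files differ, hurting dedup.
# Sketch: replace each 32-char Nix base32 hash with a fixed placeholder and
# record the originals so the data can be reconstructed byte-for-byte.
STORE_HASH = re.compile(rb"/nix/store/([0-9a-df-np-sv-z]{32})-")

def strip_store_hashes(data: bytes):
    hashes = STORE_HASH.findall(data)
    stripped = STORE_HASH.sub(b"/nix/store/" + b"0" * 32 + b"-", data)
    return stripped, hashes

def restore_store_hashes(stripped: bytes, hashes):
    it = iter(hashes)
    return STORE_HASH.sub(lambda m: b"/nix/store/" + next(it) + b"-", stripped)

blob = b"exec /nix/store/abcd0123abcd0123abcd0123abcd0123-bash/bin/sh"
stripped, hashes = strip_store_hashes(blob)
# restore_store_hashes(stripped, hashes) gives back the original blob
```

The point is that the "tiny record" is just the ordered list of hashes per file, which is negligible next to the dedup savings it unlocks.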
I think we can postpone the migration process and instead have all historical data archived in S3 Glacier Deep Archive for now; it would cost us less than USD 500/mo for 500 TB of data. Meanwhile we can have Hydra pushing new paths to R2 or other cheaper alternatives and call it a day. This would indeed cause a loss of access to historical data for a brief period, but given the timeline, that still seems reasonable.
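The Deep Archive math roughly checks out, assuming the published us-east-1 rate of about $0.00099/GB-month (pricing varies by region, and this excludes request and retrieval fees):

```python
# S3 Glacier Deep Archive storage cost at ~$0.00099/GB-month (us-east-1).
# Excludes request and retrieval fees, which are charged separately.
price_per_gb_month = 0.00099
size_gb = 500_000  # 500 TB
monthly_cost = size_gb * price_per_gb_month
print(round(monthly_cost))  # ≈ $495/month
```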
Edit: the retrieval fee for Glacier would still be massive, but that’s how AWS works ¯\_(ツ)_/¯
And speaking of historical data and its research potential, we could contact research facilities that have the motivation and ability to store and serve it for us—namely CERN, who used Nix for a while with LHCb. And they are by all means experts in handling HUGE amounts of data over an extended time span.
I want to suggest that we leave the possibility of garbage collecting the binary cache out of the discussion—at least as far as the short term solution is concerned. Garbage collecting the binary cache is a big problem in two ways:
1. It is a question of policy: do we want to garbage collect the cache? What do we want to keep? What can be deleted? This should be a community decision, which can’t be achieved in a month’s time.
2. It is an engineering problem: how do we determine which store paths are to be deleted according to our established policy?
Additionally, garbage collecting the binary cache is risky, as we may wind up deleting—intentionally or erroneously—data we will later miss, be it in 2 months or 10 years.
For the policy discussion, one problem is that there is not a lot of information about the cache available (although this has recently improved). It would also invariably involve looking to the future: how will our cache grow, how will storage costs develop, etc.
On the engineering side, the big problem is that the store paths correspond to build recipes dynamically generated by a Turing-complete language, making it anything but trivial to determine all store paths in the cache stemming from a specific Nixpkgs revision. Assuming all store paths in the binary cache have been created by hydra.nixos.org (is this true for all time?), we have the /eval/<id>/store-paths endpoint available (from which store-paths.xz is generated) as a starting point. That will of course never contain build-time-only artifacts or intermediate derivations that never show up as a job in Hydra—among those, though, are the most valuable store paths in the binary cache, i.e. patches, sources etc. Even though we have tools to (reliably?) track those down, it becomes more difficult to do so for every historical Nixpkgs revision. Additionally, there is the question of the completeness of Hydra’s data (what happens to evals associated with deleted jobsets?). (If we were to garbage collect the binary cache, I think we should probably try to figure out what is safe to delete rather than determining what to keep.)
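In set terms, the conservative "figure out what is safe to delete" approach might look like this toy sketch, where the reachable set is the union of closures from every known Hydra root (the function and data here are illustrative, not a real tool):

```python
# Toy sketch: a path is only safe to delete if no known root reaches it
# AND we actually have its reference data on record. Paths of unknown
# provenance (deleted jobsets, untracked sources) stay in the keep set.
def safe_to_delete(all_paths: set, known_roots: set, refs: dict) -> set:
    reachable = set()
    stack = list(known_roots)
    while stack:
        p = stack.pop()
        if p in reachable:
            continue
        reachable.add(p)
        stack.extend(refs.get(p, ()))
    # Conservative: delete only what is definitely unreachable and known.
    return {p for p in all_paths - reachable if p in refs}

refs = {"a": ["b"], "b": [], "c": ["b"], "d": []}
print(sorted(safe_to_delete({"a", "b", "c", "d"}, {"a"}, refs)))  # ['c', 'd']
```

Note how "b" survives even though it is not a root, because root "a" references it; the hard part in practice is assembling `known_roots` and `refs` for every historical revision, which is exactly the difficulty described above.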
Garbage collection is a long-term solution, if it is one at all.
It seems to me that finding a way to deduplicate the (cold) storage, or archiving little-used data more cost-effectively, is a similar amount of engineering effort, but less risky and comparatively uncontroversial—while still offering a sizeable cost reduction.
It is far from complete or polished, but I made nix-how/marsnix (“Taking Nix Offline”) on GitHub prior to NixCon 2022 to work on Nixpkgs offline, and it worked great. It fetches all of the inputs (FODs) so that it’s possible to recompute the outputs (input-addressed derivations) entirely offline.
The bandwidth is a smaller part of the cost thanks to the CDN, but there’s engineering work in the Nix client that can be done to help reduce it.
Improving Nix’s ability to handle multiple binary caches would make it easier for users to use a local or alternative cache. Right now companies need to set up their own caching server using external tools and explicitly work around issues with Nix to make it usable. For example: Setting up a private Nix cache for fun and profit
There is also cachecache, made by @cleverca22, which acts as a transparent HTTP caching proxy. Each time a request is made to cache.nixos.org, that output gets cached on the LAN on the server running cachecache. Now imagine if everyone ran a cachecache on 127.0.0.1: that’d be a cachecachecache, and it’d be almost peer-to-peer. Now imagine if people recursively looked up each other’s cachecaches, the way dnsmasq does recursive resolving.
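For reference, pointing Nix at a local caching proxy like cachecache is just a substituters entry in nix.conf (the address and port here are an example, not cachecache’s documented default):

```ini
# nix.conf: try the local proxy first, fall back to the official cache.
# 127.0.0.1:8080 is an example address for a locally running cachecache.
substituters = http://127.0.0.1:8080 https://cache.nixos.org
```

Since the proxy serves the same signed narinfo files, trusted-public-keys should not need to change.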
> I’m yet to come across a deduplicating archiver that’s fast
From my link, bupstash deduplicates at 500 MB/s. It’s multi-threaded Rust, whereas bup is Python.
This means 500 TB would take about 11.6 days.
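For the record, the arithmetic behind that estimate (single stream, ignoring parallelism and read latency):

```python
# Time to push 500 TB through a deduplicator running at 500 MB/s.
size_bytes = 500 * 10**12  # 500 TB
rate = 500 * 10**6         # 500 MB/s, bupstash's claimed throughput
days = size_bytes / rate / 86400
print(round(days, 1))  # ≈ 11.6 days
```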
> I highly doubt something like bup would be fast enough to serve data in real-time which is a requirement for any binary cache.
Note this is not what I’m suggesting in this thread. While I’m interested in using deduplication for binary cache storing in general at a later time, here I am only suggesting to use it to reduce the amount of one-time data egress from S3.
- S3 charges per request: $0.40/1M requests in Standard and $1/1M for Infrequent Access.
- S3 request IOPS can be slower than with a regular disk.
If deduping makes multiple requests to the same S3 objects, the costs could quickly add up. There are currently 667M objects, meaning that just reading every object once would likely cost in the vicinity of $500–$1000.
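A back-of-the-envelope version of that request-cost estimate, using the GET prices quoted above: a single pass lands at roughly $267–$667 depending on storage class, and LIST requests, retries, and re-reads push it toward the quoted range.

```python
# Cost of one GET per object for all 667M objects in the bucket.
objects = 667_000_000
standard = objects / 1_000_000 * 0.40    # $0.40 per 1M GETs (Standard)
infrequent = objects / 1_000_000 * 1.00  # $1.00 per 1M GETs (Infrequent Access)
print(round(standard), round(infrequent))  # 267 667
```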
So both to minimize the cost and time, it would probably be beneficial to do it on a large instance with tons of EBS storage, which would “cache the cache” for all subsequent computation.
NAR reconstruction can be simple to do (streaming the concatenation of each chunk one by one) and the resulting NAR can be cached as well so subsequent accesses can be as fast as the current setup.
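A streaming reconstruction really is just ordered concatenation; a minimal sketch, with the chunk fetch stubbed out by a dict standing in for an object-store GET:

```python
# Minimal sketch of NAR reconstruction from an ordered chunk list.
# fetch_chunk is a stand-in for an object-store GET keyed by chunk hash.
def fetch_chunk(store: dict, chunk_hash: str) -> bytes:
    return store[chunk_hash]

def reconstruct_nar(store: dict, chunk_hashes: list):
    """Yield the NAR as a stream, one chunk at a time."""
    for h in chunk_hashes:
        yield fetch_chunk(store, h)

store = {"c1": b"nix-archive-1", "c2": b"(", "c3": b"type"}
nar = b"".join(reconstruct_nar(store, ["c1", "c2", "c3"]))
```

Because the generator yields chunks as they arrive, the response can start streaming before the whole NAR is assembled, and the concatenated result can be cached for subsequent requests.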
I’m currently doing this on my personal Attic setup, with a Cloudflare Worker controlling the caching of the reconstructed NARs. The NAR reconstruction itself can even be done on those edge FaaS services with CDN integration like Cloudflare Workers or Fastly Compute@Edge, so there are fewer potential bottlenecks.
While we keep the general discussion here and in the Matrix channel, we are starting to collect all ideas/options/etc. in a more structured format for review in the GitHub issue below, splitting them between near-term and longer-term options.
I haven’t had the time to read every comment, but it seems clear that there will need to be some trimming of the working set. IMHO the easiest candidates for deletion are historical install/live ISOs: they’re big, trivial to rebuild from the cache, and almost never substituted.
Reading through the options, a lot of the suggestions strike me as wildly unproven. The main considerations for a volunteer organization like this should be that when you move, you move for a good period of time, and that the place you move to is as easy to operate as possible (within your budget).
With the financial reserves, the short-term solution (3–6 months) should probably be to stay on S3 with some tweaks and eat the ~$9k/month. If that $9k can be significantly reduced with some of the measures discussed here, that becomes even more attractive.
For longer term I would suggest doing a lot of due diligence before actually moving anywhere.