The NixOS Foundation's Call to Action: S3 Costs Require Community Support

An alternative to this idea is to go for tape-based storage.

An LTO-8 autoloader, say a PowerVault TL1000 (~5K EUR new), gives you 9*30TB = 270TB of “active tapes” with a 15-30 year lifetime if stored and maintained properly.

A 30TB RW LTO-8 tape costs ~80-100EUR. Storing all the current cache twice would cost 3K EUR + 5K EUR for the autoloader + electricity costs or colocation costs.
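As a rough check of these figures, the tape count and media cost can be sketched as below. The 500 TB cache size is borrowed from elsewhere in this thread; it, the per-tape capacity, and the price range are assumptions, and the function name is purely illustrative.

```python
import math

def tape_media_cost(data_tb, copies, tape_tb, price_low_eur, price_high_eur):
    """Return (tapes needed, low cost, high cost) for storing `copies` copies."""
    tapes = math.ceil(data_tb * copies / tape_tb)
    return tapes, tapes * price_low_eur, tapes * price_high_eur

# Assumptions: ~500 TB of cache, 30 TB per tape, 80-100 EUR per tape.
tapes, low, high = tape_media_cost(500, 2, 30, 80, 100)
print(f"{tapes} tapes, {low}-{high} EUR for media")
```

Under these assumptions the two copies need more tapes than the nine slots of a single autoloader, so some manual tape swapping would be involved. Note also that 30 TB is the compressed rating of LTO-8 (12 TB native), so already-compressed data may need more tapes than this.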

Also, this is very relevant for Hydra’s store path writes, because ultimately they are really sequential, aren’t they?

It is also a trivially extensible solution because we can just pile up more tapes as we move forward in the future.

(Though someone needs to change the tapes if we outgrow the autoloader’s capacity, or more autoloaders are needed.)

2 Likes

I think we should recognise that we have multiple different kinds of data with very different requirements. A solution that works for one might not work well for others.
Moving forward, perhaps we should discuss solutions for each individually, rather than finding an all-encompassing solution similar to what we currently have.

This is how I’d differentiate:

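| Kind    | Size   | Latency Requirements | Throughput Requirements |
|---------|--------|----------------------|-------------------------|
| narinfo | Small  | High                 | Small                   |
| nar     | Large  | Medium               | High                    |
| source  | Medium | Low                  | Medium                  |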

Another aspect I’d like to see explored is nar size vs. latency. Some closures contain lots of rather small paths while others have fewer but larger paths. It might be beneficial to also handle these separately.

4 Likes

That’s impressive deduplication, and I’m sure you could improve it even further by stripping store paths (obviously keeping a tiny record to put them back), but I’m yet to come across a deduplicating archiver that’s fast. I highly doubt something like bup would be fast enough to serve data in real-time which is a requirement for any binary cache.

3 Likes

I think we can postpone the migration process and instead archive all historical data in S3 Glacier Deep Archive for now; it would cost us less than 500USD/mo for 500TB of data. Meanwhile, we can have Hydra push new paths to R2 or other cheaper alternatives and call it a day. This would indeed cause a loss of access to historical data for a brief period, but given the timeline, that still seems reasonable.

Edit: the retrieval fee for Glacier would still be massive, but that’s how AWS works ¯_(ツ)_/¯
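The storage figure can be sanity-checked with a one-line calculation. The rate of $0.00099 per GB-month is an assumption (it matches the Deep Archive list price I am aware of for some regions, but should be checked against current AWS pricing), and retrieval fees are deliberately left out.

```python
def monthly_storage_cost_usd(data_tb, usd_per_gb_month):
    """Monthly storage cost, using decimal units (1 TB = 1000 GB)."""
    return data_tb * 1000 * usd_per_gb_month

# Assumption: $0.00099 per GB-month for Glacier Deep Archive.
cost = monthly_storage_cost_usd(500, 0.00099)
print(f"~{cost:.0f} USD per month for 500 TB")
```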

4 Likes

And speaking of historical data and its research potential, we could contact research facilities that have the motivation and ability to store and serve it for us, namely CERN; they used Nix for a while with LHCb[1]. And they are by all means experts in handling HUGE amounts of data over an extended time span.

[1] lhcb-nix · GitLab

2 Likes

I want to suggest that we leave the possibility of garbage collecting the binary cache out of the discussion, at least as far as the short-term solution is concerned. Garbage collecting the binary cache is a big problem in two ways:

  1. It is a question of policy: Do we want to garbage collect the cache? What do we want to keep? What can be deleted? This should be a community decision which can’t be achieved in a month’s time.
  2. It is an engineering problem, i.e. how do we determine what store paths are to be deleted according to our established policy.

Additionally, garbage collecting the binary cache is risky, as we may wind up intentionally or erroneously deleting data we may miss—be it in 2 months or 10 years.

For the policy discussion, one problem is that there is not a lot of information about the cache available (although this has recently improved). Also, it’d invariably involve looking to the future, i.e. how will our cache grow, how will storage costs develop etc.

For the engineering side, the big problem is that the store paths correspond to build recipes dynamically generated by a Turing-complete language, making it anything but trivial to determine all store paths in the cache stemming from a specific nixpkgs revision. Assuming all store paths in the binary cache have been created by hydra.nixos.org (is this true for all time?), we have the /eval/<id>/store-paths endpoint available (from which store-paths.xz is generated) as a starting point. That will of course never contain build-time-only artifacts or intermediate derivations that never show up as a job in Hydra. Among those, though, are the most valuable store paths in the binary cache, i.e. patches, sources etc. Even though we have tools to (reliably?) track those down, it becomes more difficult to do so for every historical nixpkgs revision. Additionally, there is the question of the completeness of the data Hydra has (what happens to evals associated with deleted jobsets?). (If we were to garbage collect the binary cache, I think we should probably try figuring out what is safe to delete rather than determining what to keep.)
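To illustrate why the store-paths lists alone are not enough, here is a minimal sketch of the naive approach: treat everything listed by any eval as "keep" and everything else as a deletion candidate. The function name and the example store paths are hypothetical, and the inputs are plain Python collections rather than real Hydra responses.

```python
from typing import Iterable, Set

def deletion_candidates(cache_paths: Iterable[str],
                        eval_store_paths: Iterable[Iterable[str]]) -> Set[str]:
    """Paths in the cache that no eval's store-paths list mentions."""
    keep: Set[str] = set()
    for paths in eval_store_paths:
        keep.update(paths)
    return set(cache_paths) - keep

# Hypothetical data: one job output, one patch used only at build time,
# and one output from an old eval that is no longer listed.
cache = {
    "/nix/store/aaa-hello-2.12",
    "/nix/store/bbb-fix-build.patch",
    "/nix/store/ccc-hello-2.10",
}
evals = [["/nix/store/aaa-hello-2.12"]]
candidates = deletion_candidates(cache, evals)
print(sorted(candidates))
```

In this toy data the build-time-only patch ends up among the candidates next to the genuinely stale output, which is exactly the failure mode described above.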

Garbage collection is a long-term solution, if at all.


It seems to me that finding a way to deduplicate the (cold) storage or to archive little-used data more cost-effectively is a similar amount of engineering effort, but less risky and comparatively uncontroversial, while still offering a sizeable cost reduction.

12 Likes

See also the apparent difficulty of getting an offline version of nixpkgs for a given version, or whatever was in this thread exactly: Using NixOS in an isolated environment.

2 Likes

It is far from complete, or good, but I made marsnix (GitHub - nix-how/marsnix: Taking Nix Offline) prior to NixCon 2022 to work on Nixpkgs offline, and it worked great. It fetches all of the inputs (FODs) so that it’s possible to recompute the outputs (input-addressed derivations) entirely offline.

4 Likes

The bandwidth is a smaller part of the cost thanks to the CDN, but there’s engineering work in the Nix client that can be done to help reduce it.

Improving Nix’s ability to handle multiple binary caches would make it easier for users to use a local/alternative cache. Right now companies need to set up their own caching server using external tools, and explicitly work around issues with Nix to make it usable. For example: Setting up a private Nix cache for fun and profit

I started some of that work in https://github.com/NixOS/nix/pull/7188, maybe I’ll go pick it back up.

3 Likes

There is also cachecache made by @cleverca22, which acts as a transparent HTTP caching proxy. Each time a request is made to cache.nixos.org, the response is cached on the LAN, on the server running cachecache. Now imagine if everyone ran a cachecache on 127.0.0.1: that’d be a cachecachecache, and it’d be almost peer to peer. Now imagine if people recursively looked up each other’s cachecaches, like dnsmasq recursive resolving.
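The read-through idea can be sketched in a few lines. This is not how cachecache itself is implemented; it is an in-memory illustration, and the class name and the injected `fetch_upstream` callable are hypothetical stand-ins for an HTTP request to the upstream cache.

```python
from typing import Callable, Dict

class ReadThroughCache:
    """Serve from local storage; ask upstream only on a miss."""

    def __init__(self, fetch_upstream: Callable[[str], bytes]):
        self._fetch = fetch_upstream
        self._store: Dict[str, bytes] = {}
        self.upstream_requests = 0

    def get(self, path: str) -> bytes:
        if path not in self._store:
            # Miss: fetch once from upstream and remember the response.
            self.upstream_requests += 1
            self._store[path] = self._fetch(path)
        return self._store[path]

# Fake upstream for demonstration; a real proxy would issue an HTTP GET.
proxy = ReadThroughCache(lambda path: f"contents of {path}".encode())
proxy.get("abc.narinfo")
proxy.get("abc.narinfo")
print(proxy.upstream_requests)
```

Chaining such caches (a local one whose upstream is the LAN one) gives the layered lookup described above.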

1 Like

This only helps the distribution part, which is already solved by the free CDN from Fastly.

You still need to store all the stuff upstream.

4 Likes

> I’m yet to come across a deduplicating archiver that’s fast

From my link, bupstash deduplicates at 500 MB/s. It’s multi-threaded Rust, vs bup being Python.

This means 500 TB would take about 11.6 days (500 TB at 500 MB/s is 1,000,000 seconds).
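The arithmetic behind that estimate, as a small sketch; it assumes decimal units and a sustained rate with no other bottleneck, and the function name is illustrative.

```python
def processing_days(data_tb, rate_mb_per_s):
    """Days needed to process `data_tb` at a sustained `rate_mb_per_s`."""
    seconds = data_tb * 1_000_000 / rate_mb_per_s  # 1 TB = 1,000,000 MB
    return seconds / 86_400  # seconds per day

days = processing_days(500, 500)
print(f"{days:.1f} days")
```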

> I highly doubt something like bup would be fast enough to serve data in real-time which is a requirement for any binary cache.

Note this is not what I’m suggesting in this thread. While I’m interested in using deduplication for binary cache storage in general at a later time, here I am only suggesting to use it to reduce the amount of one-time data egress from S3.

4 Likes

I think we should also keep in mind that

  1. S3 charges per request, at $0.40/1M requests in Standard and $1/1M in Infrequent Access, and
  2. S3 request IOPS can be slower than with a regular disk.

If deduping makes multiple requests to the same S3 objects, that could quickly add up costs. Currently there are 667M objects, so reading every object once would cost about $270 at the Standard rate and about $670 at the Infrequent Access rate in request fees alone; the real total depends on the storage-class mix.
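The request-fee part of that estimate can be checked against the per-request rates listed above. This sketch only covers request fees for a single pass over every object; data transfer and retrieval charges are not modelled, and the function name is illustrative.

```python
def request_cost_usd(num_objects, usd_per_million_requests):
    """Cost of issuing one request per object."""
    return num_objects / 1_000_000 * usd_per_million_requests

objects = 667_000_000
standard = request_cost_usd(objects, 0.40)          # all objects in Standard
infrequent_access = request_cost_usd(objects, 1.00)  # all objects in IA
print(f"Standard: ~${standard:.0f}, Infrequent Access: ~${infrequent_access:.0f}")
```

Every additional pass over the bucket multiplies these figures, which is the argument for reading each object once and working from a local copy.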

So both to minimize the cost and time, it would probably be beneficial to do it on a large instance with tons of EBS storage, which would “cache the cache” for all subsequent computation.

4 Likes

Yes.

This should not be necessary; modern dedup backup tools can do the dedup directly against a remote server.

2 Likes

I think we agree on that; my point is more about the time and cost saved, though, especially if the dedup tools (like bupstash or even restic) can access the same object multiple times.

2 Likes

NAR reconstruction can be simple to do (streaming the concatenation of each chunk one by one) and the resulting NAR can be cached as well so subsequent accesses can be as fast as the current setup.

I’m currently doing this on my personal Attic setup, with a Cloudflare Worker controlling the caching of the reconstructed NARs. The NAR reconstruction itself can even be done on those edge FaaS services with CDN integration like Cloudflare Workers or Fastly Compute@Edge, so there are fewer potential bottlenecks.
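A minimal sketch of the "stream the concatenation of each chunk" idea follows. It is not Attic's actual storage format: the fixed-size chunking, the SHA-256 chunk IDs, the in-memory dictionary, and all function names are simplifying assumptions for illustration.

```python
import hashlib
from typing import Dict, Iterable, Iterator, List

def split_into_chunks(data: bytes, size: int, store: Dict[str, bytes]) -> List[str]:
    """Store fixed-size chunks keyed by SHA-256; return the ordered chunk IDs."""
    manifest = []
    for offset in range(0, len(data), size):
        chunk = data[offset:offset + size]
        chunk_id = hashlib.sha256(chunk).hexdigest()
        store.setdefault(chunk_id, chunk)  # identical chunks are stored once
        manifest.append(chunk_id)
    return manifest

def reconstruct_nar(manifest: Iterable[str], store: Dict[str, bytes]) -> Iterator[bytes]:
    """Yield chunks in manifest order; a server can stream these to the client."""
    for chunk_id in manifest:
        yield store[chunk_id]

store: Dict[str, bytes] = {}
nar = b"ABCD" * 4 + b"tail"          # stand-in for real NAR bytes
manifest = split_into_chunks(nar, 4, store)
rebuilt = b"".join(reconstruct_nar(manifest, store))
print(len(manifest), "chunks referenced,", len(store), "chunks stored")
```

Because reconstruction is a plain generator over the manifest, the output can be streamed without holding the whole NAR in memory, and the reconstructed result can be cached like any other response.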

5 Likes

While we keep the general discussion here and on the Matrix channel, we are starting to collect all ideas/options/etc. in a more structured format for review in the GitHub issues below, split between near-term and longer-term options.

[Short Term Strategy and Priorities] Migration of S3 Bucket Payments to Foundation · Issue #82 · NixOS/foundation (github.com)

[Long Term Strategy and Priorities] Migration of S3 Bucket Payments to Foundation · Issue #86 · NixOS/foundation (github.com)

10 Likes

I haven’t had the time to read every comment, but it seems clear that there will need to be some trimming of the working set. IMHO the easiest to delete would be historical install/live ISOs. They’re big, trivial to rebuild from the cache, and almost never substituted from cache.

3 Likes

Backblaze is excellent. I’ve used B2 at work for storage of over 100 TB of data, but I’ve not tested actually serving those files to the public from the buckets.

I forgot that they have a specific offer for this kind of transfer:

> Backblaze covers data transfer and any legacy vendor egress fees on >50TB committed contracts and all sizes of B2 Reserve bundles.

See also How to Migrate From Block to Object Storage: Solving SaaS What-Ifs.

5 Likes

Reading through the options, a lot of the suggestions are wildly unproven. The main considerations with a volunteer organization like this should be that when you move, you move for a good period of time and the place you move to is as easy to operate as possible (within your budget).

With the financial reserves, the short term solution (3-6 months) should probably be to stay on S3 with some tweaks and eat the 9k/month. If that 9k can be significantly reduced with some of the measures discussed here, that becomes even more attractive.

For the longer term, I would suggest doing a lot of due diligence before actually moving anywhere.

7 Likes