The NixOS Foundation's Call to Action: S3 Costs Require Community Support

It hasn’t really been said explicitly, and this is as good a prompt as any:

  • The problems under discussion have arisen because there has not been any pressure on the growth of storage until now. Huge thanks, again, to the existing sponsors, but it was always going to reach a point where unsustainable growth hit a cliff.
  • The corollary of this is that there likely is a lot of low-hanging fruit in terms of storage reduction, via dedup / smarter gc / hierarchical migration and others discussed above.
  • The efforts and strategy the project puts in place now, in both the short and the medium term, will be crucial for attracting new sponsors, because they demonstrate that the project can keep the problem bounded. Among other things, it’s no fun for a sponsor to end up in the awkward position of being a critical SPoF when they have to withdraw. I’m sure a number of potential sponsors will be more interested if the burden and the risk can be shared around.
10 Likes

Do we have some numbers for fixed-output derivation (FOD) outputs? A mix of «old release channels» + «all FODs» + «everything since the last release» might give 95% of the utility and permit slow 100% recovery (at a cost in compute, sure).

2 Likes

In fact, one of the main reasons the NixOS Foundation was created was to ensure continuity of the infrastructure by having the financial reserves to deal with sponsoring of the binary cache ending. The foundation currently has ~€230K in the bank, so we are prepared for this.

Moreover, that $32K is a worst case that only happens if we decide to move all of the binary cache out of S3 and we have to pay for it. There are much cheaper scenarios, e.g. we stay on S3 and garbage-collect the binary cache. (As I described here, keeping all NixOS release closures ever made and deleting everything else shrinks the binary cache to about 6% of its current size.)

16 Likes

ca-derivations have the potential to reduce the growth of new store entries, but the feature is not receiving much love and has been mostly dead over the last year. It would also increase the effectiveness of alternative distribution methods like torrents.

6 Likes

A lot of bug fixing has been done on CA over the last few months.

7 Likes

I think staying in such a financially hostile environment is misguided, irrespective of whether we reduce our overall storage requirements or not. AWS S3’s egress pricing is egregious.

In general, I think we should pay for our mission-critical infrastructure, and that means finding a sustainable partner.

The best idea for tackling the immediate problem, I find, is Backblaze B2 and its egress-fee waiver, which applies if you transfer more than 10 TB and stay for at least 12 months. Their overall storage cost is also much lower, and egress costs to Fastly don’t exist, because both of them are part of the Bandwidth Alliance.

Also, I really wish we had two threads, one for solving the immediate problem, and one for discussing long-term solutions. This thread is really long and noisy, and lacks coherence because of that.

25 Likes

Suggestion:

Have cache.nixos.org backed by 2 infrastructures:

  1. “Unsponsored-storage” – There should be a backup storage for cache.nixos.org that stores all store paths in a cheap way, which does not rely on benevolent sponsors or potentially-temporary cost waivers. It need not serve cache.nixos.org’s daily load, but its hosting and egress should be cheap, such that it can be copied to whatever the current daily-load-storage is without large transfer costs.
    • Example: Ceph on Hetzner, at ~1.5 $/TB storage, 0.15 $/TB egress.
    • Also serves as a disaster-recovery backup in case the daily-load-storage disappears, e.g. because it ceases service.
    • All Hydra outputs would be copied onto it.
    • Ideally paid for by the NixOS foundation / covered by donations?
  2. “Daily-load-storage” – Serves cache.nixos.org’s daily load, using whatever good partnerships and sponsorships we can get.

By having our own fallback, we could confidently take special offers or partnerships, without creating single points of failure that are difficult to migrate off of.
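
As a sketch of what “all Hydra outputs would be copied onto it” could look like, here is a minimal mirroring loop between the primary bucket and an S3-compatible fallback (e.g. a Ceph RADOS Gateway). The endpoint URL, fallback bucket name and credentials below are hypothetical placeholders, not the real infrastructure:

```python
# Hedged sketch: mirror cache objects from the primary bucket to an S3-compatible
# fallback store. Endpoint, credentials and the fallback bucket name are made up.
import boto3

primary = boto3.client("s3")  # assumes credentials for the main cache bucket
fallback = boto3.client(
    "s3",
    endpoint_url="https://ceph-fallback.example.org",  # hypothetical RGW endpoint
    aws_access_key_id="...",
    aws_secret_access_key="...",
)

SRC_BUCKET = "nix-cache"           # primary bucket
DST_BUCKET = "nix-cache-fallback"  # hypothetical fallback bucket

def mirror(prefix: str = "") -> None:
    """Copy any object present in the primary bucket but missing from the fallback."""
    paginator = primary.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            try:
                fallback.head_object(Bucket=DST_BUCKET, Key=key)
                continue  # already mirrored
            except fallback.exceptions.ClientError:
                pass  # missing on the fallback, copy it over
            body = primary.get_object(Bucket=SRC_BUCKET, Key=key)["Body"]
            fallback.upload_fileobj(body, DST_BUCKET, key)

if __name__ == "__main__":
    mirror()
```

In practice you would hook this into the upload path (or simply run nix copy against both stores) rather than polling bucket listings, but the point is that keeping a fallback in sync is not a lot of machinery.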


Is “unsponsored-storage” feasible?

As a data point:

  • My company runs Ceph-on-NixOS, as two ~500 TB clusters, with 3 Hetzner SX 134 machines each.
  • Each machine costs ~250 $/month excl. VAT, making the total cost of a 500 TB raw cluster around 750 $/month.
  • For durability and high availability:
    We use 3x replication, which reduces usable storage to 33% of raw. Ceph also supports erasure coding, e.g. K+M=6 with M=2 (4 data + 2 parity chunks), which gives ~66% storage efficiency. You can check this in e.g. MinIO's EC calculator by entering the values 3, 1, 10, 16, 6, 2 into its text boxes for the mentioned Hetzner setup; a quick sanity check of the arithmetic follows this list.
  • The Ceph cluster is low maintenance. It requires work mainly for NixOS / Ceph version upgrades, so every 6 months.
    • When a disk fails (which eventually happens with many disks), we email Hetzner and it gets replaced in 15 minutes on average, for no additional cost.
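
To make the replication-vs-erasure-coding numbers above easy to verify, here is the arithmetic spelled out (purely illustrative, using the ~500 TB raw cluster size mentioned above):

```python
# Usable capacity under 3x replication vs. erasure coding, for a 500 TB raw cluster.
RAW_TB = 500

def replicated(raw_tb: float, copies: int = 3) -> float:
    return raw_tb / copies            # 3x replication keeps 1/3 of raw capacity

def erasure_coded(raw_tb: float, k: int = 4, m: int = 2) -> float:
    return raw_tb * k / (k + m)       # K data + M parity chunks -> K/(K+M) efficiency

print(round(replicated(RAW_TB)))      # ~167 TB usable (~33 %)
print(round(erasure_coded(RAW_TB)))   # ~333 TB usable (~66 %)
```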

I believe this is something that an infrastructure team could do, even on a volunteer basis; certainly on a paid-less-than-$9k/month basis.

12 Likes

Another concrete suggestion on how to reduce the $32k cost:

Do deduplication using deduplicating backup software such as bup or bupstash, before the egress.

I’ve investigated this a bit in Investigate deduplication to reduce storage and transfer · Issue #89380 · NixOS/nixpkgs · GitHub

In the latest post from today, I found that a dedup factor of 3.5x (and thus the same factor of egress cost reduction) seems immediately achievable.

(I know of Nix-specific dedup solutions such as https://tvix.dev from @flokli linked above, but I haven’t had time to compare those yet; so far I’ve only looked into general-purpose software that’s immediately available and that I’m already familiar with.)
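
For anyone who wants to reproduce a rough dedup estimate themselves, a toy version of the idea looks like the sketch below. This is not bup or bupstash (those use content-defined chunking plus compression); it only illustrates how a dedup factor is measured, here with naive fixed-size chunks over uncompressed NARs you supply yourself:

```python
# Toy dedup-factor estimator: hash fixed-size chunks of the given files and compare
# total bytes against unique bytes. Real tools use content-defined chunk boundaries.
import hashlib
import sys

CHUNK = 128 * 1024  # 128 KiB chunks, an arbitrary choice for this sketch

def dedup_factor(paths):
    seen, total, unique = set(), 0, 0
    for path in paths:
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK):
                total += len(chunk)
                digest = hashlib.sha256(chunk).digest()
                if digest not in seen:
                    seen.add(digest)
                    unique += len(chunk)
    return total / unique if unique else 1.0

if __name__ == "__main__":
    # Usage: python dedup_estimate.py *.nar   (uncompressed NARs)
    print(f"estimated dedup factor: {dedup_factor(sys.argv[1:]):.2f}x")
```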

1 Like

An alternative to this idea is to go for tape-based storage.

An LTO-8 autoloader, let’s say a PowerVault TL1000 (~5K EUR for a new one), gives you 9 × 30 TB = 270 TB of “active tapes” with a 15-30 year lifetime if stored/maintained properly.

A 30 TB RW LTO-8 tape costs ~80-100 EUR. Storing all the current cache twice would cost ~3K EUR in tapes, plus 5K EUR for the autoloader, plus electricity or colocation costs.
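
For transparency, here is roughly how those numbers add up, assuming ~500 TB of cache data (the figure used elsewhere in this thread) and the per-tape and autoloader prices above:

```python
# Rough tape cost check; the 500 TB cache size is an assumption, not an official figure.
import math

CACHE_TB, COPIES = 500, 2
TAPE_TB, TAPE_EUR, AUTOLOADER_EUR = 30, 90, 5000

tapes = math.ceil(CACHE_TB * COPIES / TAPE_TB)
print(tapes, "tapes,", tapes * TAPE_EUR + AUTOLOADER_EUR, "EUR upfront")
# -> 34 tapes, 8060 EUR upfront (plus electricity / colocation)
```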

Also, this is very relevant for Hydra store-path writes, because ultimately they really are sequential, aren’t they?

It is also a trivially extensible solution, because we can just pile up more tapes as we grow.

(Though someone needs to change the tapes if we outgrow the autoloader’s capacity or more autoloaders are needed.)

3 Likes

I think we should recognise that we have multiple different kinds of data with very different requirements. A solution that works for one might not work well for others.
Moving forward, perhaps we should discuss solutions for each individually, rather than finding an all-encompassing solution similar to what we currently have.

This is how I’d differentiate:

Kind       Size     Latency requirements   Throughput requirements
narinfo    Small    High                   Small
nar        Large    Medium                 High
source     Medium   Low                    Medium

Another aspect I’d like to see explored is nar size vs. latency. Some closures contain lots of rather small paths while others have fewer but larger paths. It might be beneficial to also handle these separately.

3 Likes

That’s impressive deduplication, and I’m sure you could improve it even further by stripping store paths (obviously keeping a tiny record to put them back), but I have yet to come across a deduplicating archiver that’s fast. I highly doubt something like bup would be fast enough to serve data in real time, which is a requirement for any binary cache.

3 Likes

I think we can postpone the migration process and instead archive all historical data in S3 Glacier Deep Archive for now; that would cost us less than 500 USD/month for 500 TB of data. Meanwhile we can have Hydra push new paths to R2 or another cheaper alternative and call it a day. This would indeed cause a loss of access to historical data for a brief period, but given the timeline, that is still reasonable.

Edit: the retrieval fee for Glacier would still be massive, but that’s how AWS works ¯\_(ツ)_/¯
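
The <500 USD/month figure is easy to sanity-check against the commonly cited Deep Archive storage rate of roughly 0.00099 USD per GB-month in us-east-1 (check the current pricing page before relying on it):

```python
# Glacier Deep Archive storage cost estimate; the rate is an assumption, and
# retrieval, request and early-deletion fees are deliberately ignored here.
DATA_GB = 500 * 1000           # ~500 TB
PRICE_PER_GB_MONTH = 0.00099   # USD, assumed Deep Archive rate

print(DATA_GB * PRICE_PER_GB_MONTH)  # ~495 USD/month for storage alone
```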

4 Likes

And speaking of historical data and its research potential: we could contact research facilities that have the motivation and ability to store and serve it for us, namely CERN; they used Nix for a while with LHCb [1], and they are by all means experts in handling HUGE amounts of data over an extended time span.

[1] lhcb-nix · GitLab

2 Likes

I want to suggest that we leave the possibility of garbage collecting the binary cache out of the discussion, at least as far as the short-term solution is concerned. Garbage collecting the binary cache is a big problem in two ways:

  1. It is a question of policy: Do we want to garbage collect the cache? What do we want to keep? What can be deleted? This should be a community decision which can’t be achieved in a month’s time.
  2. It is an engineering problem, i.e. how do we determine which store paths are to be deleted according to our established policy?

Additionally, garbage collecting the binary cache is risky, as we may wind up intentionally or erroneously deleting data we will come to miss, be it in 2 months or in 10 years.

For the policy discussion, one problem is that there is not a lot of information about the cache available (although this has recently improved). It would also invariably involve looking to the future: how will our cache grow, how will storage costs develop, etc.

For the engineering side, the big problem is that store paths correspond to build recipes dynamically generated by a Turing-complete language, making it anything but trivial to determine all store paths in the cache stemming from a specific nixpkgs revision. Assuming all store paths in the binary cache have been created by hydra.nixos.org (has this been true for all time?), we have the /eval/<id>/store-paths endpoint available (from which store-paths.xz is generated) as a starting point. That will of course never contain build-time-only artifacts or intermediate derivations that never show up as a job in Hydra; among those, though, are the most valuable store paths in the binary cache, i.e. patches, sources etc. Even though we have tools to (reliably?) track those down, it becomes more difficult to do so for every historical nixpkgs revision. Additionally, there is the question of the completeness of the data Hydra has (what happens to evals associated with deleted jobsets?). (If we were to garbage collect the binary cache, I think we should probably try figuring out what is safe to delete rather than determining what to keep.)
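
To make the engineering side a bit more concrete, here is a rough sketch of the “what to keep” direction: walk the References: field of .narinfo files to compute the closure of a set of roots (e.g. taken from Hydra’s /eval/<id>/store-paths). The local narinfo mirror directory is hypothetical; the References: format is the one served by cache.nixos.org:

```python
# Sketch: compute the closure of a set of root store paths over narinfo References.
# Assumes a local directory containing all <hash>.narinfo files (hypothetical).
import os

NARINFO_DIR = "/data/narinfo-mirror"  # hypothetical local copy of the .narinfo files

def narinfo_references(store_hash: str) -> list[str]:
    """Return the referenced store-path base names listed in <hash>.narinfo."""
    with open(os.path.join(NARINFO_DIR, store_hash + ".narinfo")) as f:
        for line in f:
            if line.startswith("References:"):
                return line.split()[1:]
    return []

def closure(root_hashes: set[str]) -> set[str]:
    """Roots plus everything reachable through References; the complement of this
    set over all narinfos would be the deletion candidates."""
    keep, todo = set(), list(root_hashes)
    while todo:
        h = todo.pop()
        if h in keep:
            continue
        keep.add(h)
        # References entries look like "<hash>-<name>"; narinfo files are keyed by hash.
        todo.extend(ref.split("-", 1)[0] for ref in narinfo_references(h))
    return keep
```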

Garbage collection is a long-term solution, if it is one at all.


It seems to me that finding a way to deduplicate the (cold) storage, or to archive little-used data more cost-effectively, is a similar amount of engineering effort, but less risky and comparatively uncontroversial, while still offering a sizeable cost reduction.

11 Likes

See also the apparent difficulty of getting an offline version of nixpkgs for a given version, or whatever exactly was discussed in this thread: Using NixOS in an isolated environment.

2 Likes

It is far from complete, or good, but prior to NixCon 2022 I made GitHub - nix-how/marsnix: Taking Nix Offline so that I could work on Nixpkgs offline, and it worked great. It fetches all of the inputs (FODs) so that it’s possible to recompute the outputs (input-addressed derivations) entirely offline.

4 Likes

The bandwidth is a smaller part of the cost thanks to the CDN, but there is engineering work that can be done in the Nix client to help reduce it further.

Improving Nix’s ability to handle multiple binary caches would make it easier for users to use a local/alternative cache. Right now companies need to set up their own caching server using external tools, and explicitly work around issues in Nix to make it usable. For example: Setting up a private Nix cache for fun and profit

I started some of that work in Allow missing binary caches by default by arcuru · Pull Request #7188 · NixOS/nix · GitHub, maybe I’ll go pick it back up.

3 Likes

There is also cachecache made by @cleverca22, which acts as a transparent HTTP caching proxy. Each time a request is made through it to cache.nixos.org, that output gets cached on the LAN, on the server running cachecache. Now imagine if everyone ran a cachecache on 127.0.0.1; that would be a cachecachecache, and it would be almost peer-to-peer. Now imagine if people recursively looked up each other’s cachecaches, like dnsmasq’s recursive resolving.
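
For illustration, a toy version of such a transparent proxy fits in a few dozen lines. This is not cachecache itself, just a sketch of the idea; the port and cache directory are arbitrary, and headers/content types are deliberately simplified:

```python
# Minimal caching proxy sketch in the spirit of cachecache: serve from a local
# directory if present, otherwise fetch from cache.nixos.org and store a copy.
import hashlib
import os
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

UPSTREAM = "https://cache.nixos.org"
CACHE_DIR = "./nix-proxy-cache"  # arbitrary local cache directory

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        key = hashlib.sha256(self.path.encode()).hexdigest()
        local = os.path.join(CACHE_DIR, key)
        if not os.path.exists(local):
            try:
                with urllib.request.urlopen(UPSTREAM + self.path) as resp:
                    data = resp.read()
            except urllib.error.HTTPError as e:
                self.send_error(e.code)
                return
            with open(local, "wb") as f:
                f.write(data)
        with open(local, "rb") as f:
            data = f.read()
        self.send_response(200)
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    os.makedirs(CACHE_DIR, exist_ok=True)
    ThreadingHTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
```

Clients would then point their substituters setting at http://127.0.0.1:8080 instead of hitting cache.nixos.org directly.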

1 Like

This only helps with the distribution part, which is already solved by the free CDN from Fastly.

You still need to store all the stuff upstream.

4 Likes

I’m yet to come across a deduplicating archiver that’s fast

From my link, bupstash deduplicates at 500 MB/s. It’s multi-threaded Rust, vs bup being Python.

This means 500 TB would take 11 days.

I highly doubt something like bup would be fast enough to serve data in real-time which is a requirement for any binary cache.

Note this is not what I’m suggesting in this thread. While I’m interested in using deduplication for binary cache storing in general at a later time, here I am only suggesting to use it to reduce the amount of one-time data egress from S3.

4 Likes