The NixOS Foundation's Call to Action: S3 Costs Require Community Support

The general storage model of Tahoe looks something like this (IIRC):

  • You specify an N-of-M encoding; these are the erasure coding parameters. The storage overhead is (M / N), and you can lose (M - N) shares without the data becoming irrecoverable (see the sketch after this list). Optionally you can specify a ‘happy’ parameter that sets the minimum number of distinct nodes the shares are distributed across, but realistically we probably just want that set to M, so I will assume it to be equal to M.
  • Storage nodes are polled for having sufficient free space to store a share, and the nodes are selected in response latency order. A share is pushed to each of them.
  • Upon retrieval, many nodes are asked whether they have the shares in question, until N responses are obtained (i.e. enough to satisfy the request). The nodes are again selected in order of response latency; additional nodes are ignored.
  • A verify/repair cycle effectively involves ‘walking the tree’ and identifying any files for which too few nodes hold shares. The available shares (up to N) are retrieved, the missing shares are regenerated, and they are pushed to enough nodes that the total available share count is back up to M.
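
To make the N-of-M arithmetic above concrete, here is a minimal sketch (plain Python, not the Tahoe API; the function and parameter names are just illustrative) of how the overhead and fault-tolerance numbers fall out of the encoding parameters:

```python
def erasure_coding_stats(n: int, m: int, file_size_bytes: int) -> dict:
    """N-of-M encoding: a file is split into M shares, any N of which
    suffice to reconstruct it (Tahoe-LAFS defaults to 3-of-10)."""
    assert 0 < n <= m
    share_size = file_size_bytes / n   # each share is roughly 1/N of the file
    total_stored = share_size * m      # M shares are stored in total
    return {
        "storage_overhead": m / n,         # bytes stored / original size
        "tolerated_share_losses": m - n,   # shares that can vanish safely
        "total_stored_bytes": total_stored,
    }

# Example: Tahoe's default 3-of-10 encoding applied to a 1 GiB file.
print(erasure_coding_stats(3, 10, 1 << 30))
# -> overhead ~3.33x, up to 7 lost shares tolerated
```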

The intention of having a storage cluster that is resilient against disappearing storage nodes is not to allow random people to connect their personal computer at home; the churn would indeed cause a cascading failure. Rather, it’s meant to compensate for both unexpected outages (something something yearly us-east-1 outage), as well as sponsors deliberately pulling out - which would still be an uncommon occurrence.

It means that in the case of uncommon-but-likely-to-happen-eventually failures, we don’t end up back in the situation that we are in now, where we have to scramble to find a new place to migrate things to, or risk a total outage. It also means that the burden of storing 500TB of historical builds can be shared across multiple storage contributors, rather than expecting a single person or organization to foot the bill (which is a much bigger ask).

While I understand your concerns about backend latency, we also have to be realistic and acknowledge that “having a company foot the entire S3 bill” has been an incredibly privileged position for the project to be in, and one that is unlikely to reoccur without severe compromises.

And while having perfectly performant, high-end infrastructure is nice to have, the more important thing is that we have working infrastructure at all. If we cannot reasonably afford high-end infrastructure (or it requires a deal with the devil), then we just cannot afford it, and we will have to look towards more sustainable options without this sort of increasingly-difficult-to-satisfy dependency.

To emphasize: this is not what I am suggesting, and I’m not sure how you got that from my post. What I am suggesting is the technological layer to support distributed storage more easily and more safely; exactly who is selected to provide that storage is a separate policy concern. The point is to open up more options from a practical perspective.

8 Likes

This is a better design than I was expecting actually (perhaps my fault), thanks!

It should be noted that you really need to be careful with this because you need to ensure everyone abides by the availability policy you expect. If every sponsor just puts their storage nodes in us-east-1, then you’ve gained nothing when that failure happens. I’m sure you know this though, I’m just writing it out to sort of go back to that point that it’s definitely an operator issue. (FWIW, I don’t think cache.nixos.org being resilient to us-east-1 outages is important for us but I admit it’s just my silly opinion.)

My point is that performance is not simply nice to have; performance is actually a critical component of the system for users all over the globe in the average case. The CDN isn’t enough to abstract that fact. Performance has to be considered from the start, not after the fact, because the effort to squeeze performance out of an existing design is often exponential over time as the performance profile flattens. And performance cliffs can and do come out of nowhere — so previous indications of good performance are not necessarily a reliable predictor of future performance without the design work to support that.

This does not mean we need to aim for the absolute peak theoretical performance at the expense of everything else. But it is IMO a major factor and needs to be weighed appropriately in the overall future of the system. For example, we might imagine a design where the cache gets split into two parts; maybe the latest 3 weeks of Hydra uploads stay in a hot cache with very low latency, while all older ones get moved to a very high-latency but inexpensive storage layer. That could be a worthwhile tradeoff if we want to retain everything without negatively impacting the average case. Maybe we don’t care about the average case and all of it being medium-speed storage would be OK. But we have to decide that for ourselves, not just say that performance doesn’t matter.
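
As a rough illustration of that kind of split (the window length and tier names are purely hypothetical, not a proposal for the actual implementation), the routing decision could be as simple as:

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(weeks=3)  # hypothetical: "latest 3 weeks of Hydra uploads"

def pick_tier(uploaded_at: datetime, now: datetime | None = None) -> str:
    """Route a store path to the low-latency tier if it was uploaded
    recently, otherwise to the cheap high-latency tier."""
    now = now or datetime.now(timezone.utc)
    return "hot" if now - uploaded_at <= HOT_WINDOW else "cold"

# Example: something built a month ago would be served from cold storage.
print(pick_tier(datetime.now(timezone.utc) - timedelta(days=30)))  # -> "cold"
```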

Performance, as well as storage cost, is exactly one of the tensions that must be resolved when designing solutions for problems like this. It’s not a foregone conclusion.

It’s just meant in jest of course, I know that’s not what you mean (mostly an unfair dig at funny people on reddit.) But really, I’m not sure it solves any more problems than what we’re already dealing with. The problem is about money, not allocating hard drives.

Speaking as someone who has operated online FOSS services and dealt with Faustian bargains like the original one here, where “you are under the free grace of our corporate overlords being nice”, think of it like this. Accepting $90 worth of disk space from a sponsor isn’t really useful, because it’s peanuts; it may go away if someone has a car accident and their insurance payment goes up, or they decide they hate you, or whatever; and you have little to gain. But accepting $9,000 worth of storage is more useful because you can accomplish things with it, and also, people gifting it generally want their $9,000 to be put to good purposes and not wasted. There’s a risk to losing that money, but much more to gain at the same time by acting on it. That doesn’t always make it a good choice, but it’s a variation of that theme where: a customer paying you $100 is much more interested in your success than someone paying $5.

This leaves you in a place where you wouldn’t accept $90 of hard drives from some random Joe Schmo, it’s not worth the effort. You want a partnership that’s mutual and beneficial. But those decisions aren’t so fickle; people don’t just give you $9,000 for nothing normally. It’s part of a deal. Sponsorship, development work, whatever.

So once you’ve hit this point, you are basically no longer in a scenario where the Tahoe model applies, where you have “untrusted storage.” At least, you only need the parts that prevent bitrot; the adversarial scenarios no longer apply. You’re already trusting these people by definition the moment you enter this agreement.

And here’s the rub in my example: you aren’t being given storage in a sponsorship, almost ever. You’re being given money. Due to the logistics of finance departments — they’re not going to give you hard drives on a server that might be secretly owned by bad actors that you need to protect against. They’re going to give you cold hard cash. I think once I literally got on a Skype call with a finance person so that I could log into the account page of whatever cloud service bullshit, and type in the credit card number of theirs with the money they were giving us (all so they wouldn’t have to email the CC number to me.) Most of the time they just waive your account fees entirely, by the way, on the operator side.

That’s how like 99% of corporate sponsorships are done, with credit cards over Zoom or Skype or a phone call to an account manager. Then I just click “Create VPS” or “Create Storage Bucket” as many times as I want in the Web UI. There is no point at which I had to trust random people’s hard drives or the hard drives of a cloud account I do not own. This is what I mean by problems we don’t have: unless you assume literally every random person can contribute to the cluster (random data hoarders who can come and go at any time, causing adverse performance/network impacts), the trust boundary is defined by a checking account, the people putting money into it, and the companies who are then accepted into the program. Not on-node cryptography. And once you get to that conclusion, there are suddenly many more options available.

The case you are describing, where we have a lot of sponsors and losing one of them isn’t catastrophic — yes, we need to aim for that. But if we only had 5 sponsors and 40% of them pulled out one day, the net effect is the same whether they are providing storage or direct deposits to your account. And even in that scenario, a sponsor stopping giving you money is a lot better than storage going away, and in the end it raises effectively the same concerns. You’re going to get timelines, you’re going to negotiate, you need to relocate the data, you’re going to find new money, all of these things. Money is actually much easier from many POVs, because checking accounts are easier to add money to than infrastructure is to migrate and maintain. Finding $9,000 from a Benevolent Corporation to shove into the NixOS.org account to keep things floating for 4 more weeks before July 1st is actually very easy compared to doing a massive migration of 0.5PB of data in the same timeframe. It’s a matter of how much you can grovel to someone with a nice CFO.

Anyway, again. If something like Tahoe is a potential solution, I’m not against it. But this is a big critical system of ours and we should be pretty careful to solve problems we actually have here. And my read is that our problem here is runaway storage cost, and sustainability of the cache long-term, above all else. The other stuff is all just a matter of design.

(I probably shouldn’t have started this subthread to be fair; in the mean time the major concern like I said is plugging the hole in the ship or whatever we need to do.)

4 Likes

I think we agree here. I’m not saying that performance doesn’t matter at all; just that “shooting for the best of the best” probably shouldn’t be a hard requirement given the circumstances.

On which note, with how little space is apparently being used by narinfo, I think the whole performance situation could be improved significantly by keeping all the narinfo directly on the ‘gateway server’ (operated/funded by the Foundation) and only using the storage cluster for the bulk data, which is much less latency-sensitive.
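
A minimal sketch of what that split could look like at the gateway (the paths and upstream URL here are hypothetical, and a real deployment would presumably live in the existing CDN/nginx layer rather than a Python process): narinfo requests are answered from local disk, everything else is redirected to the bulk storage cluster.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

NARINFO_DIR = Path("/var/lib/cache/narinfo")  # hypothetical local narinfo store
BULK_UPSTREAM = "https://bulk.example.org"    # hypothetical storage-cluster frontend

class GatewayHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.endswith(".narinfo"):
            # Latency-sensitive metadata: serve directly from the gateway's disk.
            local = NARINFO_DIR / Path(self.path).name
            if local.is_file():
                body = local.read_bytes()
                self.send_response(200)
                self.send_header("Content-Type", "text/x-nix-narinfo")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_error(404)
        else:
            # Bulk .nar data: redirect to the (higher-latency) storage cluster.
            self.send_response(302)
            self.send_header("Location", BULK_UPSTREAM + self.path)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), GatewayHandler).serve_forever()
```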

This is heavily dependent on the type of sponsor. In large enterprises and startup-y companies, you are correct. But university sponsors are more likely to give you a server or rack space. A VPS provider donating spare capacity (actually quite common!) is likely to give you a custom plan priced at $0.00 renewal. Older small ‘greybeard’ companies might give you a shell account with a network storage mount. And so on.

There’s more than one type of sponsorship, and the type that you are describing - as you implied - also tends to be the one where some sort of concrete return-on-investment is expected. This makes them a risky type of sponsor because you have to make Faustian bargains to get anywhere, and they tend to get riskier as the thing being sponsored grows bigger (given the power dynamics involved in difficult-to-satisfy sponsorship needs).

So no, I don’t think our root problem is the storage cost. I think our root problem is having a storage architecture that leaves expensive options (either in money or in concessions to a sponsor) as the only viable option to maintain continuity, and the runaway storage cost is just the inevitable consequence. That is what I am aiming to address with my long-term suggestion of something like Tahoe-LAFS.

Something more ‘community-scale’ that opens up more sponsoring possibilities that do not necessarily carry the same risks as this type of sponsorship does (but without ruling out the current type of sponsorship), and that more generally removes the complete and total dependency on corporate goodwill for the project’s continued existence (which I hopefully won’t need to explain is a problem).

Sure. That’s why I’m suggesting trialing it in parallel with the existing infrastructure; if it doesn’t work out, we’ll know so before we’re relying on it in any way.

4 Likes

Another question…

… Is there any data handy to see what that growth curve looks like?

3 Likes

I actually started to think about using SWH because I wanted to find a way to reduce the size of our binary cache (for ecological considerations mainly) while preserving our ability to rebuild everything forever (that’s the mission of SWH). So, we have a loader, but I never found the time/energy to continue working on this topic and there are still a lot of things to achieve :confused:

Software Heritage seems to have some kind of interest in Nix, they could host this.

AFAIK, they don’t want to store binaries, only source code.

So, in the short term, this won’t help us, but in the long term, I still think we should consider relying on SWH.

4 Likes

I’d like to pick up on the following topic somebody mentioned:

32K to export the data, RIP. It’s a general good reminder to never rely on sponsors for essential infra.

As a project, we should really analyze how we ended up in this situation, and what to do to avoid it in the future.

It feels like bad planning to rely entirely on a single sponsor paying a huge bill, without creating a backup plan for what to do when the sponsor disappears – thus putting the project in front of a $32k surprise migration bill.

A simple backup plan could have been something as simple as buying 3 Hetzner servers to back up all historical store paths (for ~500 $/month), so that the migration to anything else would be free, vs $32k.


I think it would be smart to make it the NixOS Foundation’s task to ensure we do not get into such situations in the future.

19 Likes

That doesn’t mean we shouldn’t do it. At no point did I say we should rely upon it solely; I simply said it should be available as one of many substitution mechanisms. Would you argue that we should ditch or ignore solutions just because they do not fit your idea of perfect?

Making content-addressed data available via more protocols is a net-good thing. Diversity of protocols creates more availability; I don’t see how you can debate that.

3 Likes

My point is that it literally doesn’t solve the problem that this thread is about, not even poorly. There is already a thread about adding IPFS support for distribution purposes. That’s the correct place to discuss this.

4 Likes

This was already discussed in the Matrix room, but given the high volume of discussion over there, I felt it would be a good idea to make a note about it here as well, for posterity:

I’d personally feel extremely uncomfortable with moving project infrastructure to Cloudflare, given their long history of outright malicious behaviour, including (but definitely not limited to) actively providing cover to a community that has deliberately driven multiple people to suicide, and otherwise harasses marginalized folks on a daily basis. CF is probably about as close to “deal with the devil” as we could get here.

8 Likes

We will be holding the community call on 2023-06-07T15:00:00Z
Planned Agenda

  • Brief budget and timeline review
  • Review/discuss all potential options
  • Brainstorm/Q&A

Video call link: https://meet.google.com/pyr-orzm-ahm
Or dial: ‪(US) +1 252-385-2704‬ PIN: ‪320 231 639‬#
More phone numbers: https://tel.meet/pyr-orzm-ahm?pin=1212541034968

Thank you again for jumping all in on this with us!

10 Likes

It hasn’t really been said explicitly, and this is as good a prompt as any:

  • The problems under discussion have arisen because there has not been any pressure on the growth of storage until now. Huge thanks, again, to the existing sponsors, but it was always going to reach a point where unsustainable growth hit a cliff.
  • The corollary of this is that there likely is a lot of low-hanging fruit in terms of storage reduction, via dedup / smarter gc / hierarchical migration and others discussed above.
  • The efforts and strategy the project puts in place now, both in the short and medium term, will be crucial in attracting new sponsors, by demonstrating the ability to put bounds around the problem. Among other things, it’s no fun for a sponsor to be in the awkward position of being a critical SPoF when they have to withdraw, either. I’m sure a number of potential sponsors will be more interested if the burden and the risk can be shared around.
10 Likes

Do we have some numbers for fixed-output derivation (FOD) outputs? A mix of «old release channels» + «all FODs» + «everything since the last release» might give 95% of the utility and permit slow 100% recovery (at a cost in compute, sure).

2 Likes

In fact, one of the main reasons the NixOS Foundation was created was to ensure continuity of the infrastructure by having the financial reserves to deal with sponsoring of the binary cache ending. The foundation currently has ~€230K in the bank, so we are prepared for this.

Moreover, that $32K is a worst case that only happens if we decide to move all of the binary cache out of S3 and we have to pay for it. There are much cheaper scenarios, e.g. we stay on S3 and we garbage-collect the binary cache. (As I described here, keeping all NixOS release closures ever made and deleting everything else shrinks the binary cache to about 1/6 of its current size.)

16 Likes

ca-derivations have the potential to reduce the growth of new store entries, but the feature is not receiving much love and has been mostly dead over the last year. It would also increase the effectiveness of alternative distribution methods like torrents.

6 Likes

A lot of bug fixing has been done on CA over the last few months.

7 Likes

I think staying in such a financially hostile environment is misguided, irrespective of whether we reduce our overall storage requirements or not. The egress pricing at AWS S3 is egregious.

In general, I think we should pay for our mission-critical infrastructure, and that means finding a sustainable partner.

The best idea to tackle the immediate problem, I find, is Backblaze B2 and their egress fee waiver, which applies if you transfer more than 10 TB and stay for at least 12 months. Their overall storage cost is also much lower, and egress costs to Fastly don’t exist because both of them are part of the Bandwidth Alliance.

Also, I really wish we had two threads, one for solving the immediate problem, and one for discussing long-term solutions. This thread is really long and noisy, and lacks coherence because of that.

25 Likes

Suggestion:

Have cache.nixos.org backed by 2 infrastructures:

  1. “Unsponsored-storage” – There should be a backup storage for cache.nixos.org that stores all store paths in a cheap way, which does not rely on benevolent sponsors or potentially-temporary cost waivers. It need not serve cache.nixos.org’s daily load, but its hosting and egress should be cheap, such that it can be copied to whatever the current daily-load-storage is without large transfer costs.
    • Example: Ceph on Hetzner, at ~1.5 $/TB storage, 0.15 $/TB egress.
    • Also serves as a disaster-recovery backup in case the daily-load-storage disappears, e.g. because it ceases service.
    • All Hydra outputs would be copied onto it.
    • Ideally paid for by the NixOS foundation / covered by donations?
  2. “Daily-load-storage” – Serves cache.nixos.org’s daily load, using whatever good partnerships and sponsorships we can get.

By having our own fallback, we could confidently take special offers or partnerships, without creating single points of failure that are difficult to migrate off of.


Is “unsponsored-storage” feasible?

As a data point:

  • My company runs Ceph-on-NixOS, as two ~500 TB clusters, with 3 Hetzner SX 134 machines each.
  • Each machine costs ~250 $/month excl. VAT, making the total cost of a 500 TB raw cluster around 750 $/month.
  • For durability and High-Availability:
    We use 3x replication. Accordingly, this reduces usable storage to 33% of raw. But Ceph also supports erasure coding, e.g. K=4 data + M=2 parity chunks (K+M=6) gives 66% storage efficiency. You can check it in e.g. MinIO's EC calculator, inputting the values 3, 1, 10, 16, 6, 2 into the text boxes for the mentioned Hetzner setup. (See the back-of-the-envelope comparison after this list.)
  • The Ceph cluster is low maintenance. It requires work mainly for NixOS / Ceph version upgrades, so every 6 months.
    • When a disk fails (which eventually happens with many disks), we email Hetzner and it gets replaced in 15 minutes on average, for no additional cost.
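
For reference, a back-of-the-envelope comparison of the two redundancy schemes mentioned above, applied to the ~500 TB raw / ~750 $/month figures from this post (simple arithmetic only; no Ceph metadata or fill-level overheads accounted for):

```python
RAW_TB = 500          # raw cluster capacity from the example above
COST_PER_MONTH = 750  # 3 x Hetzner SX machines at ~250 $/month each

def usable_tb(raw_tb: float, efficiency: float) -> float:
    """Usable capacity given a storage-efficiency factor."""
    return raw_tb * efficiency

schemes = {
    "3x replication":          1 / 3,  # ~33% of raw is usable
    "erasure coding K=4, M=2": 4 / 6,  # ~66% of raw is usable
}

for name, eff in schemes.items():
    tb = usable_tb(RAW_TB, eff)
    print(f"{name:26s} {tb:6.0f} TB usable  (~{COST_PER_MONTH / tb:.2f} $/TB/month)")
```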

I believe this is something that an infrastructure team could do, even on a volunteer basis; certainly on a paid-less-than-$9k/month basis.

12 Likes

Another concrete suggestion on how to reduce the $32k cost:

Do deduplication using a deduplicating backup software such as bup or bupstash, before the egress.

I’ve investigated this a bit in Investigate deduplication to reduce storage and transfer · Issue #89380 · NixOS/nixpkgs · GitHub

In the latest post from today, I found that a dedup factor of 3.5x (and thus the same egress cost reduction factor) seems immediately achievable.

(I know of nix-specific dedup solutions such as https://tvix.dev from @flokli linked above, but I haven’t had time to compare that yet, and so far have only looked into general-purpose software that’s immediately available and that I’m already familiar with.)
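
For anyone who wants to reproduce that kind of measurement without setting up bup/bupstash first, a crude fixed-size-chunk estimator gives a rough lower bound on the achievable dedup factor (the content-defined chunking those tools use typically does better); the chunk size and invocation here are just examples:

```python
import hashlib
import sys
from pathlib import Path

CHUNK_SIZE = 64 * 1024  # 64 KiB fixed-size chunks; content-defined chunking does better

def dedup_factor(root: str) -> float:
    """Return total_bytes / unique_chunk_bytes for all files under `root`."""
    total = 0
    unique: set[bytes] = set()
    unique_bytes = 0
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.is_symlink():
            continue
        with path.open("rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                total += len(chunk)
                digest = hashlib.sha256(chunk).digest()
                if digest not in unique:
                    unique.add(digest)
                    unique_bytes += len(chunk)
    return total / unique_bytes if unique_bytes else 1.0

if __name__ == "__main__":
    # e.g. python dedup_estimate.py /nix/store
    print(f"estimated dedup factor: {dedup_factor(sys.argv[1]):.2f}x")
```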

1 Like

An alternative to this idea is to go for tape-based storage.

An LTO-8 autoloader, let’s say a PowerVault TL1000 (~5K EUR for a new one), gives you 9 × 30 TB = 270 TB of “active tapes” with a 15-30 year lifetime if stored/maintained properly.

A 30 TB RW LTO-8 tape costs ~80-100 EUR. Storing all the current cache twice would cost about 3K EUR in tapes, plus 5K EUR for the autoloader, plus electricity or colocation costs.

Also, this is very relevant for Hydra store path writes, because ultimately they are really sequential, aren’t they?

It is also a trivially extensible solution because we can just pile up more tapes as we move forward in the future.

(Though someone needs to change the tapes if we outgrow the autoloader capacity or more autoloaders are needed.)

3 Likes

I think we should recognise that we have multiple different kinds of data with very different requirements. A solution that works for one might not work well for others.
Moving forward, perhaps we should discuss solutions for each individually, rather than finding an all-encompassing solution similar to what we currently have.

This is how I’d differentiate:

Kind      Size     Latency Requirements   Throughput Requirements
narinfo   Small    High                   Small
nar       Large    Medium                 High
source    Medium   Low                    Medium

Another aspect I’d like to see explored is nar size vs. latency. Some closures contain lots of rather small paths while others have fewer but larger paths. It might be beneficial to also handle these separately.

3 Likes