The NixOS Foundation's Call to Action: S3 Costs Require Community Support

As an early candidate for a long-term solution (not a short-term one!), I’d suggest considering Tahoe-LAFS. It’s an actively-replicated distributed storage system with much stronger availability properties than something like IPFS or BitTorrent.

Advantages:

  • Distributed across arbitrary low-trust peers of arbitrary size; this would effectively allow third parties to sponsor storage without us being dependent on their continued sponsorship. If any sponsor pulls out, the content just gets redistributed over a couple of others.
  • Cryptographic integrity is built into the addressing; this means that the storage nodes cannot tamper with (or even read) the content that they are storing, which reduces the level of trust necessary and therefore opens up possibilities of more storage sponsors
  • Essentially RAID-over-the-network; it uses erasure coding for efficient redundancy against data loss, and replication is an active task (content blocks are pushed to storage nodes), so you are not dependent on people deciding to ‘seed’ something, and unpopular files are therefore not subject to rot.
  • Can deal with any size storage node; there’s no need to match the size of different nodes (unlike many other RAID-like systems), and even small personal nodes provided by individual contributors would be useful, as long as they are reliably online.

Disadvantages:

  • There is no explicit ‘delete’ functionality, only (optional) time-based expiry and renewal. This is not likely to matter to us, given our “never delete anything” usecase.
  • Performance can be variable; I don’t think this is an issue for us, given that we have a CDN fronting it.
  • Access requires something that speaks the protocol; so we’d need to proxy requests through a centralized server for retrieval (though that server doesn’t need to persistently store anything). That server would also be responsible for triggering eg. repairs of files for which some ‘shares’ (copies) have been lost.
  • I think it doesn’t protect against a malicious storage node pretending to have a file but not actually having it, unless you do regular integrity checks (which is an available option).
  • The only one I can think of that might be a problem for us: there’s no real resource accounting, and any participating node can store data in the storage cluster without it being attributable to them. It may be possible to fix this by running a modified version of the software that rejects all storage requests except for those from the central coordination node; I don’t know whether this is available as a configurable option out-of-the-box.

With a Tahoe-LAFS setup, the only centralized infrastructure that would need to be maintained by the project would be that central coordination server. It would do a fair amount of traffic (it’d essentially be the ‘gateway’ to the storage cluster), but require little to no persistent storage. This would be very, very cheap to maintain, both in terms of money and in terms of sysadmin maintenance.

I feel like it would be worth trialing this in parallel to the existing infrastructure; gradually directing more and more backend requests to the Tahoe cluster and then falling back to the existing S3 storage if they fail. That way we can test in a low-risk way whether a) we actually have sufficient sysadmin capacity to keep this running in the longer term, and b) the software itself can meet our needs.

Edit: I’d be happy to coordinate the technical side of this, of course, insofar my health allows.

6 Likes

That’s very misleading, and assumes the “S3 Standard” storage class. Other storage classes are available. The “Archive Instant Access” tier (part of “Intelligent Tiering”) is $4/TB per month, for example.

It was already mentioned in an earlier reply, but if not already enabled, turning on Intelligent Tiering would be a no-brainer.
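For a rough sense of the stakes, a back-of-envelope sketch (the figures are assumptions: the ~500TB cache size mentioned elsewhere in the thread, S3 Standard at roughly $23/TB/month list price, and the $4/TB/month Archive Instant Access rate quoted above):

```python
# Assumed figures: ~500 TB total cache, S3 Standard at roughly
# $23/TB/month (public list price), Archive Instant Access at the
# $4/TB/month quoted above.
TOTAL_TB = 500
standard = TOTAL_TB * 23             # ~$11,500/month
archive_instant = TOTAL_TB * 4       # ~$2,000/month
saving = standard - archive_instant  # ~$9,500/month

print(f"S3 Standard:            ~${standard:,}/month")
print(f"Archive Instant Access: ~${archive_instant:,}/month")
print(f"Potential saving:       ~${saving:,}/month")
```

Even if only the infrequently-accessed majority of objects moves tiers, the storage line item shrinks by a large factor.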

4 Likes

Is there any way we can be more specific about when this “sometimes” is? It seems to me that some data is extremely valuable, and a lot of the data is probably not valuable at all if it can be regenerated as needed. Would I be correct in supposing, for example, that if only the “fetch” derivations were kept and everything else with infrequent access were pruned from the cache, everything would still be buildable, and the only impact would be bad build times for lesser-used packages (for people who aren’t already using a third-party cache host for their more niche projects)?

4 Likes

We definitely need more support for distributed, content-addressed systems as substituters. Candidates include BitTorrent, IPFS, Hypercore, and Eris. Imagine if every NixOS config seeded content-addressed paths via BitTorrent.

https://github.com/NixOS/nixpkgs/pull/212930

7 Likes

Please keep in mind that these systems do not provide availability guarantees. They are distribution systems, not storage systems, and therefore do not address the problem of durable long-term storage of the binary cache (and thus do not solve the problem that we’re dealing with here right now).

13 Likes

Something like this paragraph worries me. The problem with these sorts of things is metastable and cascading failures. Data replication and resharding are already extremely difficult to achieve reliably in existing systems at PBs-and-beyond scale. I don’t think we want potential storage domains to go completely missing and cause issues like this without careful operator oversight. (That’s assuming we want to actually store large data volumes and not throw stuff away/archive it.)

This hints at something which is that Tahoe does solve a set of problems, but they may not be ours. The coordination could be cheap to run, yes, but that isn’t the only consideration we have and may even be considered a coincidence if none of the other problems really overlap. Our problem isn’t really untrusted storage. It’s just having storage in general.

Tail performance in latency and throughput is critical to the usability of the binary cache, IMO; the CDN is part of that, but not all of it. See for example nixos-org-configurations/212. I’ve in the past seen regular reports from users of multi-second TTFBs in distant locations like Shanghai and Singapore without features like Shielding. That kind of latency destroys everything, and in some cases causes timeouts in some software; a timeout is effectively the same thing to a user as “this file doesn’t exist” or “this file was deleted by the operator in a data loss accident.” They’re operationally different, but morally the same from a user/operator/SLO/I-paid-you-money perspective.

The CDN works so well it’s easy to forget that. Nobody thinks forest fires are a problem when you prevent them in the first place, after all. It’s a combination of factors. One is that S3 is actually a really good and reliable product. Is it expensive as hell? Yes. But it’s good, it’s fast, it has good latency to clients close to the bucket. It’s a well explored problem to serve these S3 files quickly; so that’s why it works so well.

(One latency performance factor that’s unexplored IMO is the fact that the design of narinfo files incurs a form of HOL blocking. When you want to download foo.nar you need all its dependencies, but you don’t know them until you have recursively traversed and downloaded every dependency’s .narinfo file, and that’s a bunch of small latency-sensitive files to grab hold of.)
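To make that concrete, here’s a toy model of the traversal (the store contents and helper are hypothetical; a real .narinfo is key/value text whose References: line lists the dependency store paths, and each level of the walk below stands in for one latency-bound batch of HTTP round trips):

```python
# Hypothetical cache contents: store path -> list of referenced paths.
# In reality these references only become visible after fetching and
# parsing each path's .narinfo file.
NARINFOS = {
    "foo": ["glibc", "openssl"],
    "openssl": ["glibc"],
    "glibc": [],
}

def closure(root):
    """Walk references breadth-first, counting sequential fetch rounds."""
    seen, frontier, rounds = set(), [root], 0
    while frontier:
        rounds += 1  # one batch of narinfo fetches before the next level is known
        nxt = []
        for path in frontier:
            if path in seen:
                continue
            seen.add(path)
            # references only become known after this path's narinfo arrives
            nxt.extend(p for p in NARINFOS[path] if p not in seen)
        frontier = nxt
    return seen, rounds

paths, rounds = closure("foo")
# Three store paths, but their names are only revealed level by level,
# so the downloads serialize behind `rounds` sequential round trips.
```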

If “performance is variable” means “randomly introduced 100ms spikes in the backend from origin to cache”, OK: that would blow something like 50% or more of your entire intra-stack latency budget in a well-oiled distributed system at a relatively large company, but it would work for us, maybe. But if “performance is variable” means that “requesting a .narinfo file from Los Angeles causes a recursive set of requests, randomly choosing a guy in Singapore and another person in Australia to serve them, with totally random latency between them”, that’s going to be quite bad, I think. And if the guy in Australia can then just turn his PC off and require the guy in Singapore to rewrite parts of his hard drive to handle that, that will also have consequences for the availability of other nodes in practice.

I don’t know what the performance profile of Tahoe is, to be fair. It could be worth exploring as an alternative and it may be quite good. Some other problems remain, e.g. provisioning of storage is clearly an operator concern, and I don’t think trusting data hoarders not to unplug their hard drives is a good approach. But I just want to push back on the notion that the CDN makes performance anomalies irrelevant. Performance is absolutely relevant, and only feels irrelevant exactly because the CDN and Amazon took care of it for us. And at large data sizes, performance anomalies can easily snowball into complete denials of service for users. Unfortunately, that might require careful operator experience to account for, and it’s going to be hard to sell that when most of the alternative options are probably better understood from that perspective.

7 Likes

I’d add that, funding-wise, unless some major provider steps up to cover the costs, a one-time capital investment in a build-your-own origin server would practically make more sense than leasing any off-the-shelf VPS storage solution, since a one-time major fundraising campaign would be easier to pull off than an ongoing raise of $1,000s/month (this is assuming the whole thing cannot easily be re-architected in the short term to entirely eliminate the need to store this much…)

There would still be a constant cost of colocation and an associated drive replacement service, but that would be an order of magnitude less ($100s instead of $1,000s/month) and so easier to sustain with a minimal ongoing donation campaign.

5 Likes

From reading through the thread, it sounds like there are two independent problems here. One is caching: serving in-demand assets quickly. The other is that this is as much an archive as it is a cache.

If original sources are unavailable, rebuilding is not an option, and this eventually has a poisoning effect on the Nix ecosystem. If a significant dependency is totally lost, all dependent packages are now at risk of loss, and so on. Based on this, we can’t really think about this as only a cache. Pruning unimportant things may still be possible, but that’s probably very tricky, to the point where deferring this issue is likely a good idea.

This means we need to solve both problems, and for archive purposes, it’s almost impossible to get away from a centralized authority and reduce costs. Full distribution makes no archival guarantees—or it defers responsibility to others to do it themselves. If this archival layer exists, it may then be possible to be more distributed in on-demand cache layers on top, to try and spread out the data transfer costs.

So the question is really how do we pay as little as possible for archiving cold assets and then either reduce transfer costs or distribute them out by some means.

Disregard: I had hoped that CloudFront would be a good option for cutting egress costs (CF to S3 is free, so you pay only CF’s pricing), but it looks like the savings are relatively small. It may however still be a quick win to simply mark cached objects as immutable and then dramatically scale back the storage tier in S3. The hot items should be so hot that they rarely actually go back to origin. Egress only moves a little, but storage savings could be significant.

edit: I reread and we of course already have a CDN, which is amazingly doing a ton. Possibly one final fully centralized cache layer could buffer between the CDN and the buckets enough to lower the S3 tier even further. Since a CDN is distributed, it sees more misses than a single unified layer would.

8 Likes

The general storage model of Tahoe looks something like this (IIRC):

  • You specify an N-of-M encoding, which are the erasure coding parameters; the storage overhead is (M / N), and you can lose (M - N) shares without the data becoming irrecoverable. Optionally you can specify a ‘happy’ parameter that sets the minimum number of distinct nodes the shares are distributed across, but realistically we probably just want that set to M, so I will assume it to be equal to M.
  • Storage nodes are polled for having sufficient free space to store a share, and the nodes are selected in response latency order. A share is pushed to each of them.
  • Upon retrieval, many nodes are asked whether they have the shares in question, until N responses are obtained (ie. enough to satisfy the request). The nodes are again selected in order of response latency; additional nodes are ignored.
  • A verify/repair cycle effectively involves ‘walking the tree’ and identifying any files for which insufficient nodes hold shares. The available shares (up to N) are retrieved, and the missing shares are regenerated and pushed to enough nodes that the total available share count is back up to M.

The intention of having a storage cluster that is resilient against disappearing storage nodes is not to allow random people to connect their personal computer at home; the churn would indeed cause a cascading failure. Rather, it’s meant to compensate for both unexpected outages (something something yearly us-east-1 outage), as well as sponsors deliberately pulling out - which would still be an uncommon occurrence.

It means that in the case of uncommon-but-likely-to-happen-eventually failures, we don’t end up back in the situation that we are in now, where we have to scramble to find a new place to migrate things to, or risk a total outage. It also means that the burden of storing 500TB of historical builds can be shared across multiple storage contributors, rather than expecting a single person or organization to foot the bill (which is a much bigger ask).

While I understand your concerns about backend latency, we also have to be realistic and acknowledge that “having a company foot the entire S3 bill” has been an incredibly privileged position for the project to be in, and one that is unlikely to reoccur without severe compromises.

And that while having perfectly performant and high-end infrastructure is nice to have, the more important thing is that we have working infrastructure at all, and if we cannot reasonably afford to have high-end infrastructure (or it requires a deal with the devil), then we just cannot afford it and will have to look towards more sustainable options without this sort of increasingly-difficult-to-satisfy dependency.

To emphasize: this is not what I am suggesting, and I’m not sure how you got that from my post. What I am suggesting is the technological layer to support distributed storage more easily and more safely; exactly who is selected to provide that storage is a separate policy concern. The point is to open up more options from a practical perspective.

8 Likes

This is a better design than I was expecting actually (perhaps my fault), thanks!

It should be noted that you really need to be careful with this because you need to ensure everyone abides by the availability policy you expect. If every sponsor just puts their storage nodes in us-east-1, then you’ve gained nothing when that failure happens. I’m sure you know this though, I’m just writing it out to sort of go back to that point that it’s definitely an operator issue. (FWIW, I don’t think cache.nixos.org being resilient to us-east-1 outages is important for us but I admit it’s just my silly opinion.)

My point is that performance is not simply nice to have; performance is actually a critical component of the system for users all over the globe in the average case. The CDN isn’t enough to abstract that fact. Performance has to be considered from the start, not after the fact, because the effort to squeeze performance out of an existing design is often exponential over time as the performance profile flattens. And performance cliffs can and do come out of nowhere — so previous indications of good performance are not necessarily a reliable predictor of future performance without the design work to support that.

This does not mean we need to aim for the absolute peak theoretical performance at the expense of everything else. But it is IMO a major factor and needs to be weighed appropriately in the overall future of the system. For example, we might imagine a design where the cache gets split into two parts; maybe the latest 3 weeks of Hydra uploads stay in a hot cache with very low latency, while all older ones get moved to a very high-latency but inexpensive storage layer. That could be a worthwhile tradeoff if we want to retain everything without negatively impacting the average case. Maybe we don’t care about the average case and all of it being medium-speed storage would be OK. But we have to decide that for ourselves, not just say that performance doesn’t matter.
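A minimal sketch of that hypothetical split (the three-week threshold and the tier names are just the ones from my example, not a proposal):

```python
# Route each object to a storage tier by upload age: recent Hydra
# uploads stay hot, everything older moves to cheap cold storage.
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(weeks=3)  # assumed cutoff, purely illustrative

def tier_for(uploaded_at, now):
    """Pick a storage tier based on how old the upload is."""
    return "hot" if now - uploaded_at <= HOT_WINDOW else "cold"

now = datetime(2023, 6, 1, tzinfo=timezone.utc)
recent = tier_for(now - timedelta(days=5), now)   # inside the hot window
old = tier_for(now - timedelta(days=90), now)     # well past it
```

The same decision could run periodically as a migration job rather than at request time; the point is only that the policy is a one-line function once we agree on the cutoff.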

Performance, as well as storage cost, is exactly one of the tensions that must be resolved when designing solutions for problems like this. It’s not a foregone conclusion.

It’s just meant in jest of course, I know that’s not what you mean (mostly an unfair dig at funny people on reddit.) But really, I’m not sure it solves any more problems than what we’re already dealing with. The problem is about money, not allocating hard drives.

Speaking as someone who’s operated online FOSS services and dealt with Faustian bargains like the original one here, where “you are under the free grace of our corporate overlords being nice”, think of it like this. Accepting $90 worth of disk space from a sponsor isn’t really useful, because it’s peanuts; that may go away if someone has a car accident and their insurance payment goes up, or they decide they hate you, or whatever; and you have little to gain. But accepting $9,000 worth of storage is more useful because you can accomplish things with it, and also, people gifting it generally want their $9,000 to be put to good purposes and not wasted. There’s a risk to losing that money, but much more to gain at the same time by acting on it. That doesn’t always make it a good choice, but it’s a variation of that theme where: a customer paying you $100 is much more interested in your success than someone paying $5.

This leaves you in a place where you wouldn’t accept $90 of hard drives from some random Joe Schmo, it’s not worth the effort. You want a partnership that’s mutual and beneficial. But those decisions aren’t so fickle; people don’t just give you $9,000 for nothing normally. It’s part of a deal. Sponsorship, development work, whatever.

So once you’ve hit this point, you are basically no longer in a scenario where the Tahoe model applies, where you have “untrusted storage.” At least, you only need the parts that prevent bitrot; adversarial scenarios no longer apply. You’re already trusting these people by definition the moment you enter this agreement.

And here’s the rub in my example: you aren’t being given storage in a sponsorship, almost ever. You’re being given money. Due to the logistics of finance departments — they’re not going to give you hard drives on a server that might be secretly owned by bad actors that you need to protect against. They’re going to give you cold hard cash. I think once I literally got on a Skype call with a finance person so that I could log into the account page of whatever cloud service bullshit, and type in the credit card number of theirs with the money they were giving us (all so they wouldn’t have to email the CC number to me.) Most of the time they just waive your account fees entirely, by the way, on the operator side.

That’s how like 99% of corporate sponsorships are done, with credit cards over Zoom or Skype or a phone call to an account manager. Then I just click “Create VPS” or “Create Storage Bucket” as many times as I want in the Web UI. There is no point at which I had to trust random people’s hard drives or the hard drives of a cloud account I do not own. This is what I mean by problems we don’t have: unless you assume literally every random person can contribute to the cluster (a random data hoarder who can come and go at any time, causing adverse performance/network impacts), the trust boundary is defined by a checking account, the people putting money into it, and the companies who are then accepted into the program. Not by on-node cryptography. And once you get to that conclusion, there are suddenly many more options available.

The case you are describing, where we have a lot of sponsors and losing one of them isn’t catastrophic: yes, we need to aim for that. But if we only had 5 sponsors and 40% of them pulled out one day, the net effect is the same whether they are providing storage or direct deposits to your account. And even in that scenario, a sponsor stopping giving you money is a lot better than storage going away, while raising effectively the same concerns in the end. You’re going to get timelines, you’re going to negotiate, you need to relocate the data, you’re going to find new money, all of these things. Money is actually much easier from many POVs because checking accounts are easier to add money to than infrastructure is to migrate and maintain. Finding $9,000 from a Benevolent Corporation to shove into the NixOS.org account to keep things floating for 4 more weeks before July 1st is actually very easy compared to doing a massive migration of 0.5PB of data in the same timeframe. It’s a matter of how much you can grovel to someone with a nice CFO.

Anyway, again. If something like Tahoe is a potential solution, I’m not against it. But this is a big critical system of ours and we should be pretty careful to solve problems we actually have here. And my read is that our problem here is runaway storage cost, and sustainability of the cache long-term, above all else. The other stuff is all just a matter of design.

(I probably shouldn’t have started this subthread to be fair; in the mean time the major concern like I said is plugging the hole in the ship or whatever we need to do.)

4 Likes

I think we agree here. I’m not saying that performance doesn’t matter at all; just that “shooting for the best of the best” probably shouldn’t be a hard requirement given the circumstances.

On which note, with how little space is apparently being used by narinfo, I think the whole performance situation could be improved significantly by keeping all the narinfo directly on the ‘gateway server’ (operated/funded by the Foundation) and only using the storage cluster for the bulk data, which is much less latency-sensitive.
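As a sketch of what that routing could look like on the gateway (the rules and names here are hypothetical, just illustrating the split):

```python
# Latency-sensitive .narinfo metadata is served directly from the
# gateway's local disk; bulk .nar data is proxied from the storage
# cluster, where higher latency matters much less.
def route(path):
    if path.endswith(".narinfo"):
        return "local"    # small metadata files, kept on the gateway
    if path.startswith("nar/"):
        return "cluster"  # bulk data, fetched from distributed storage
    return "local"        # nix-cache-info, index pages, etc.
```

Since the narinfo closure walk is the recursive, round-trip-heavy part of substitution, keeping it entirely on one well-placed server sidesteps the cluster’s variable latency for the worst-affected requests.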

This is heavily dependent on the type of sponsor. In large enterprises and startup-y companies, you are correct. But university sponsors are more likely to give you a server or rack space. A VPS provider donating spare capacity (actually quite common!) is likely to give you a custom plan priced at $0.00 renewal. Older small ‘greybeard’ companies might give you a shell account with a network storage mount. And so on.

There’s more than one type of sponsorship, and the type that you are describing - as you implied - also tends to be the one where some sort of concrete return-on-investment is expected. This makes them a risky type of sponsor because you have to make Faustian bargains to get anywhere, and they tend to get riskier as the thing being sponsored grows bigger (given the power dynamics involved in difficult-to-satisfy sponsorship needs).

So no, I don’t think our root problem is the storage cost. I think our root problem is having a storage architecture that leaves expensive options (either in money or in concessions to a sponsor) as the only viable option to maintain continuity, and the runaway storage cost is just the inevitable consequence. That is what I am aiming to address with my long-term suggestion of something like Tahoe-LAFS.

Something more ‘community-scale’ that opens up more sponsoring possibilities which do not necessarily carry the same risks as this type of sponsorship does (but without ruling out the current type of sponsorship), and that more generally removes the complete and total dependency on corporate goodwill for the project’s continued existence (which is, I hope, self-evidently a problem).

Sure. That’s why I’m suggesting trialing it in parallel with the existing infrastructure; if it doesn’t work out, we’ll know before we’re relying on it in any way.

4 Likes

Another question…

… Is there any data handy to see what that growth curve looks like?

3 Likes

I actually started to think about using SWH because I wanted to find a way to reduce the size of our binary cache (for ecological considerations mainly) while preserving our ability to rebuild everything forever (that’s the mission of SWH). So, we have a loader, but I never found the time/energy to continue working on this topic and there are still a lot of things to achieve :confused:

Software Heritage seems to have some kind of interest into Nix, they could host this.

AFAIK, they don’t want to store binaries, only source code.

So, in the short term, this won’t help us, but in the long term, I still think we should consider relying on SWH.

4 Likes

I’d like to pick up on the following topic somebody mentioned:

32K to export the data, RIP. It’s a good general reminder to never rely on sponsors for essential infra.

As a project, we should really analyze how we ended up in this situation, and what to do to avoid it in the future.

It feels like bad planning to rely entirely on a single sponsor paying a huge bill, without creating a backup plan for what to do when the sponsor disappears – thus putting the project in front of a $32k surprise migration bill.

A backup plan could have been something as simple as buying 3 Hetzner servers to back up all historical store paths (for ~$500/month), so that the migration to anything else would be free, vs $32k.


I think it would be smart to make it the NixOS Foundation’s task to ensure we do not get into such situations in the future.

19 Likes

That doesn’t mean we shouldn’t do it. At no point did I say we should rely upon it solely; I simply said it should be available as one of many substitution mechanisms. Would you argue that we should ditch or ignore solutions just because they don’t fit your idea of perfect?

Making content-addressed data available via more protocols is a net good. Diversity of protocols creates more availability; I don’t see how you can debate that.

3 Likes

My point is that it literally doesn’t solve the problem that this thread is about, not even poorly. There is already a thread about adding IPFS support for distribution purposes. That’s the correct place to discuss this.

4 Likes

This was already discussed in the Matrix room, but given the high volume of discussion over there, I felt it would be a good idea to make a note about it here as well, for posterity:

I’d personally feel extremely uncomfortable with moving project infrastructure to Cloudflare, given their long history of outright malicious behaviour, including (but definitely not limited to) actively providing cover to a community that has deliberately driven multiple people to suicide, and otherwise harasses marginalized folks on a daily basis. CF is probably about as close to “deal with the devil” as we could get here.

8 Likes

We will be holding the community call on 2023-06-07T15:00:00Z
Planned Agenda

  • Brief budget and timeline review
  • Review/discuss all potential options
  • Brainstorm/Q&A

Video call link: https://meet.google.com/pyr-orzm-ahm
Or dial: ‪(US) +1 252-385-2704‬ PIN: ‪320 231 639‬#
More phone numbers: https://tel.meet/pyr-orzm-ahm?pin=1212541034968

Thank you again for jumping all in on this with us!

10 Likes

It hasn’t really been said explicitly, and this is as good a prompt as any:

  • The problems under discussion have arisen because there has not been any pressure on the growth of storage until now. Huge thanks, again, to the existing sponsors, but it was always going to reach a point where unsustainable growth hit a cliff.
  • The corollary of this is that there likely is a lot of low-hanging fruit in terms of storage reduction, via dedup / smarter gc / hierarchical migration and others discussed above.
  • The efforts and strategy the project puts in place now, both in the short and medium term, will be crucial to attracting new sponsors, by demonstrating the ability to put bounds around the problem. Among other things, it’s no fun for a sponsor to be in the awkward position of being a critical SPoF when they have to withdraw, either. I’m sure a number of potential sponsors will be more interested if the burden and the risk can be shared around.

10 Likes

Do we have some numbers for fixed-output derivation outputs (FODs)? A mix of «old release channels» + «all FODs» + «everything since the last release» might be 95% of the utility and permit slow 100% recovery (at a cost in compute, sure).

2 Likes