As for this S3 thread, I don't think good enough reliability and availability is achievable by just a couple of community members in their free time. And even if we discount human work, it still won't be close to free (hardware, probably not runnable "at home", etc.)
Just a small clarification: it seems it's 3 TB/day in egress.
Ah right, I misunderstood. It'd still end up considerably cheaper than AWS, though - around $3,400 per month assuming 500 TB of data.
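For transparency, here's the back-of-envelope arithmetic behind that figure (a sketch assuming Backblaze B2-style rates of $5/TB stored and $10/TB egressed; the real bill would also include request fees):

```python
# Hypothetical monthly bill at B2-style rates (request fees ignored).
STORAGE_USD_PER_TB = 5    # $/TB stored per month (assumed rate)
EGRESS_USD_PER_TB = 10    # $/TB transferred out (assumed rate)

storage_tb = 500          # total cache size
egress_tb = 3 * 30        # ~3 TB/day of egress

monthly_usd = storage_tb * STORAGE_USD_PER_TB + egress_tb * EGRESS_USD_PER_TB
print(monthly_usd)  # 3400
```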
So, I don't have a great understanding of how the cache system works beyond "it caches built binaries for common platforms", but wouldn't the solution to data loss be to just reintroduce said binary the next time someone builds it, or whenever Hydra hits it?
The problem is that there are a lot of historical builds in the binary cache; the binary cache doesn't build things on-demand, but rather rebuilds all the (eligible) packages every time the nixpkgs channel gets updated - so that the binaries are there already once a client requests them.
This means that it will never revisit past builds unless someone explicitly creates a task for it. It would probably be possible to retroactively do a "clean" rebuild of past nixpkgs versions, but trying to reproduce the entire history of nixpkgs (which is what currently lives in the binary cache, more or less) would require a lot of computing power. I don't know whether that's viable in practice.
So, a quick survey of "S3-compatible" providers that I am aware of, on a monthly basis (pricing comes from Backblaze's comparison page; I have not verified it):
- AWS S3: $26/TB storage, $90/TB traffic, no minimum storage
- Cloudflare R2: $15/TB storage plus "operation fees", free traffic; possibly sponsorable, but this would create another sponsor dependency
- Backblaze B2: $5/TB storage, $10/TB traffic, no minimum storage, supposedly free migration from S3
- Wasabi: $6/TB storage, free(?) traffic, 90 days minimum storage
- Storj: $4/TB storage plus a "segment fee", $7/TB traffic, no minimum storage, supposedly free migration from S3
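To make the list comparable, here's a hedged sketch that plugs the quoted per-TB rates into one scenario (500 TB stored, ~90 TB/month egress); operation/segment fees and minimum-storage terms are deliberately ignored, so treat these as lower bounds:

```python
# (storage $/TB/mo, egress $/TB) from the quoted rates above; fees ignored.
rates = {
    "AWS S3":        (26, 90),
    "Cloudflare R2": (15, 0),
    "Backblaze B2":  (5, 10),
    "Wasabi":        (6, 0),
    "Storj":         (4, 7),
}

storage_tb, egress_tb = 500, 90
monthly = {name: storage_tb * s + egress_tb * e for name, (s, e) in rates.items()}
for name, usd in sorted(monthly.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${usd:,}/mo")
```

With these rates and this scenario, AWS S3 comes out around $21,100/month versus $3,400 for Backblaze B2.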
I'd say these seem like viable short-term options to migrate to, so that we can hit the one-month deadline while also significantly cutting expenses; after that, it's probably worth revisiting whether something more cost-efficient is possible, and what's feasible in terms of infrastructure maintenance capacity after the infra team gets split?
Had a couple ideas while responding to someone on HN who also asked the "can we just rebuild it all?" question:
- I guess if there's a bit of a Pareto distribution in build output sizes, there might be meaningful savings in rebuilding the packages that take up the most storage and paying to move only the small stuff.
- It might also be possible to triage the sources in the cache: re-fetch all that are still available from the original source or a mirror, and then export only the ones that either couldn't be fetched, or could be fetched but whose contents no longer match the hash.
Hi! I apologize for not having read through the entire thread yet, and from what I've seen, the prevailing sentiment in this thread might not necessarily align with this proposal, but I thought I'd mention it anyway.
When I was at Amazon, I got AWS to cover crate.io's hosting costs. AWS has since set up an open-source sponsorship program that gives out AWS credits with a (last I checked) not-unreasonable application process, and it might be worthwhile to apply if you haven't already.
(I am a little sick right now, so brain don't work good.)
An admittedly biased point about Wasabi - costs increase substantially if you start egressing more than 1x bytes at rest per month.
On the other hand, the per-"segment" costs with Storj are essentially per-object charges, but can be basically eliminated with some packing (e.g. zipping up lots of small files together). The Storj platform natively understands zips and can pull individual files out of a zip without downloading the whole thing, for exactly this reason.
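To illustrate the packing idea (a sketch; file names and contents are made up), storing entries uncompressed keeps each one retrievable with a single ranged read:

```python
import io
import zipfile

# Pack many tiny files into one archive so a per-object "segment" fee
# is paid once per archive rather than once per file. ZIP_STORED keeps
# entries uncompressed, so a single entry can later be served straight
# out of the archive via a ranged read.
def pack(files: dict[str, bytes]) -> bytes:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as z:
        for name, data in files.items():
            z.writestr(name, data)
    return buf.getvalue()

archive = pack({
    "aaaa.narinfo": b"StorePath: /nix/store/aaaa-example\n",
    "bbbb.narinfo": b"StorePath: /nix/store/bbbb-example\n",
})

# Reading one member back does not require unpacking anything else.
with zipfile.ZipFile(io.BytesIO(archive)) as z:
    print(z.read("aaaa.narinfo"))
```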
Since the large-scale data storage is partially intended to benefit research, consider partnering with an academic institution. Many institutions qualify for data egress waivers: Data egress waiver available for eligible researchers and institutions | AWS Public Sector Blog
In my experience, institutions use substantially less than their waiver limit across the org so the data transfer will basically be free for them. This transition could be done with virtually no technical measures at all (i.e., just re-owning the org to an institution) and would then allow migration off AWS in the future basically for free. Some university research group may even be willing to host the data on a storage cluster.
Another thought: if the data is largely duplicative at the bytestream level across objects, would decompressing and recompressing (as one stream) the data before egress save on substantial costs for the transition?
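A quick way to sanity-check that idea (a toy sketch: zlib's 32 KiB window is far too small for real multi-GB objects, where something like zstd with a long-range window would be needed, but it shows the effect):

```python
import random
import zlib

random.seed(0)
obj = random.randbytes(16 * 1024)  # stand-in for an incompressible .nar

# Compressing two identical objects separately pays full price twice;
# compressing them as one stream lets the second copy become cheap
# back-references into the first.
separately = sum(len(zlib.compress(o)) for o in (obj, obj))
together = len(zlib.compress(obj + obj))
print(separately, together)
```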
I think regardless of what service provider we end up dealing with, whether we move away from AWS or not, this is a wakeup call that we're going to need to mitigate some of these problems with good old-fashioned engineering and elbow grease. This is true even if we ran our own servers somewhere. I've written several Nix binary caches, worked on a prototype of heavily optimizing the Fastly cache that serves cache.nixos.org (when I worked there), and have some ideas about CI systems and whatnot. Some notes:
- The overwhelming cost is storage, not transfer. But that's actually because the CDN is effective in practice, yet there is nothing to help storage. That's the "this airplane has red dots all over it" (survivorship bias) effect.
- Caching is still probably not as good as it could be, and overall global hit-rate efficiency could be pushed higher, last I remember. I had some solutions to attack this, but it's detailed work, and with the given breakdown, $900/mo transfer is better than I expected, I think? (Note that hit rate is not the same as cached savings; despite the numbers being 1,500 TiB/29 TiB, the global efficiency isn't nearly that good IIRC.)
- Fastly is probably a reliable partner and I somewhat doubt NixOS is a major contributor to their network profile, so leaning into the cache further is probably OK. (I don't work there anymore, so don't take this as an agreement. IANAL.)
- A significant amount of visible, end-user performance comes from p99 latency on `.narinfo` files, among other things. (More on this later.) So a cache, and more importantly a global cache, is always going to be needed, ignoring storage costs; it still needs to be factored into the design.
- The cache has a very long tail of old data and a reasonable warm/cold ratio, with a large amount of warm data in an absolute sense, and a reasonable full-bandwidth rate. The 95th percentile is 1k requests/min at 2.6 GiB/min, so this is well within reason to handle on a single server, I think.
- Nix caches tend to have a particular storage profile:
- Reads are very frequent and hot. Most importantly, reads are latency sensitive, and must be served immediately.
- The hot path often concerns narinfo files, which are very small. This is a really important case, because it's the hot path in both the 200 and the 404 case.
- Storage is very expensive over time, from `.nar` files.
- Writes can be frequent in an absolute sense, but generally experience a "90/10" read/write skew in my experience. They are also not latency sensitive. Therefore:
- Batch uploads as aggressively as possible.
- Just overall, aggressively trade latency for aggregate throughput and storage size wins.
- A Nix binary cache actually has an extremely simple interface, and in this case that's (mostly) for the better, not worse. Some improvements could be made, but the design allows a bit of freedom while staying simple.
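A minimal sketch of the "batch writes aggressively" point above (all names and thresholds here are hypothetical; `flush_fn` stands in for whatever multi-object upload the backend offers):

```python
import time

class BatchUploader:
    """Buffer writes and flush them as one batch when either a byte
    threshold or an age threshold is hit. Writes tolerate seconds of
    latency, so trading freshness for fewer, larger uploads is fine."""

    def __init__(self, flush_fn, max_bytes=64 * 1024 * 1024, max_age_s=30.0):
        self.flush_fn = flush_fn      # called with a list of (key, data)
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self.pending, self.size, self.first_put = [], 0, None

    def put(self, key, data):
        if self.first_put is None:
            self.first_put = time.monotonic()
        self.pending.append((key, data))
        self.size += len(data)
        too_big = self.size >= self.max_bytes
        too_old = time.monotonic() - self.first_put >= self.max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self.pending:
            self.flush_fn(self.pending)
            self.pending, self.size, self.first_put = [], 0, None

batches = []
up = BatchUploader(batches.append, max_bytes=10, max_age_s=3600)
for i in range(5):
    up.put(f"obj{i}", b"abcd")  # 4 bytes each; flushes once >= 10 bytes
print([len(b) for b in batches])  # prints [3]
```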
Some other things:
- While absolute performance has normalized for components in cloud providers (e.g. storage is faster and closer to CPU speeds than ever), it has not normalized on a cost basis. Compute is vastly cheaper than storage on most cloud providers, even though the relative performance gap between them has closed since, say, 2010. And critically, we are bound by cost, not performance.
- We don't just need absolute numbers. Performance and capacity planning actually requires the distributions behind the numbers. For example, it is important to know the warm/cold data split not just at some moment in time, but over time as well. Was the split 90/10 one year ago and now it's 75/25? That's a big change in direction, and there are a lot of things we can't understand without knowing it.
- Another example: what's the p95 vs p99 object size served by the cache? This is actually a huge piece of the puzzle. There's something like a fixed 2 GiB limit last I remember; I suspect the p99 is somewhere around 1.1 to 1.5 GB due to closures like GHC and LLVM. But the p99 case is also probably the one dominated by transfer time; therefore a more expensive read path might still be profitable in these cases.
- One of the things I wanted to add to the cache was a lot more metrics tracking, because it's really important for questions like these.
- We also need to be able to sample the cache more effectively. I don't know how to tackle this, but it's needed for better investigations.
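On the p95/p99 question, the measurement itself is simple once you can sample object sizes from access logs; a sketch with synthetic numbers (the real distribution is exactly the unknown being discussed):

```python
# Nearest-rank percentile over a sample of served-object sizes.
def percentile(values, p):
    xs = sorted(values)
    idx = min(len(xs) - 1, round(p / 100 * (len(xs) - 1)))
    return xs[idx]

# Synthetic sample: mostly ~1 KB narinfos, some mid-size nars, a couple
# of huge closures (GHC/LLVM-sized). Real numbers would come from logs.
sizes = [1_000] * 90 + [50_000_000] * 8 + [1_400_000_000] * 2
print(percentile(sizes, 95), percentile(sizes, 99))  # 50000000 1400000000
```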
So the key takeaways are:
- Reads need low latency.
- Reads are often, but not always, small and frequent (`.narinfo` files). When I say small, I mean something like 1 KB or less.
- Small files are notoriously difficult to handle for most solutions, so this is really important. But also, the cache should be serving 99.9% of all narinfos directly. Even then, cold reads can really hurt in general, in my experience.
- The overall absolute bandwidth needs are actually relatively minimal, thanks to the cache.
- Writes can be very high latency (seconds or even minutes before artifacts materialize in the cache is OK; it doesn't matter if the latest `llvm` build takes 5 extra minutes to appear after a 1-hour build).
- The absolute file sizes, however, are extremely aggressive in practice for the kinds of systems people run today and the kind of closures we produce.
- If we have to pay money, or someone does, we need to apply some old school principles like this. Arbitraging your compute for better storage will be worth it.
I've done some experiments on this and could write more about it. Notably, I believe algorithms like FastCDC are absolutely practical enough to run in "real time" on the write path of a Nix binary cache, and more importantly have tremendous benefits; some experiments I did with an almost-pure-Rust solution gave me somewhere upwards of 80% deduplication, though this is hard to reproduce reliably without taking samples of the Nix cache over time. This sort of approach requires constructing an index that is then used on the read path, but only if the narinfo was found by a client, and only if it's uncached. So it's not that bad in practice for a lot of uses, I think.
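To make the CDC idea concrete, here's a deliberately simplified content-defined chunker (not real FastCDC: no gear table or normalized chunking, just a byte-fed rolling cut condition), enough to show why identical regions across nar files dedupe even when offsets shift:

```python
import hashlib

def chunks(data: bytes, mask=(1 << 12) - 1, min_size=2048, max_size=65536):
    """Cut where the low bits of a rolling value hit zero, so chunk
    boundaries depend on local content, not on absolute offsets."""
    out, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF
        size = i - start + 1
        if size >= max_size or (size >= min_size and (h & mask) == 0):
            out.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        out.append(data[start:])
    return out

def dedup_ratio(blobs):
    """Fraction of bytes saved by storing each distinct chunk once."""
    total = sum(len(b) for b in blobs)
    unique = {}
    for b in blobs:
        for c in chunks(b):
            unique[hashlib.sha256(c).digest()] = len(c)
    return 1 - sum(unique.values()) / total
```

Prepending bytes to a blob shifts every offset, yet after the first differing chunk the cut points resynchronize, so most chunks hash identically; fixed-size blocks would dedupe almost nothing in the same test.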
And on top of that, given the other parts of the performance profile, you can make other things very simple here if you can forecast correctly. I strongly suspect, for example, given the above numbers, that a custom bit of software doing the above deduplication (a singular multi-threaded concurrent server), in an active leader-standby setup to replicate the indices, with just two servers, could easily provide very high yearly uptime. The dirty secret about cloud stuff is that it can be very reliable in small numbers; you can easily run a couple of VPSs for years at a time with minimal interruption. Two big servers is much simpler than 10 small ones. If you have storage under control, you could easily run these two servers on standard commodity VPS providers for something like $300 USD, thanks to the fact that egress is mostly mitigated by the Fastly cache; Hetzner would give us 20 TB for free, for example, with a 10 Gbit link, at only about 1 EUR of overage per TB. That completely covers the server-to-CDN costs in our case, for free, and the compute with standby replication is dwarfed by storage costs, even then.
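As a sanity check on the egress claim (rates as quoted in this post, unverified; the ~29 TiB/month figure is the origin-to-CDN volume mentioned earlier):

```python
# Monthly origin egress cost on a Hetzner-style plan: ~20 TB included,
# roughly 1 EUR per additional TB (numbers from this thread, unverified).
def origin_egress_eur(tb_out, included_tb=20, eur_per_tb=1.0):
    return max(0.0, tb_out - included_tb) * eur_per_tb

print(origin_egress_eur(29))  # 9.0  (~29 TB/month back to the CDN)
print(origin_egress_eur(15))  # 0.0  (under the included allowance)
```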
I could keep going on. However, this is all engineering. The problem is that right now we have a sinking ship, and I don't know how to handle that reasonably, I'm afraid. Whatever we do, though, once we abandon this ship we're going to need to actually put some resources into this, because the current path isn't going to be easily sustainable. It will be difficult given the amount of resources available, but I think it can be done.
As an early suggestion for a long-term solution (not as a short-term solution!), I'd suggest considering Tahoe-LAFS. It's an actively-replicated distributed storage system with much stronger availability properties than something like IPFS or BitTorrent.
Advantages:
- Distributed across arbitrary low-trust peers of arbitrary size; this would effectively allow third parties to sponsor storage without us being dependent on their continued sponsorship: if any sponsor pulls out, the content just gets redistributed across the others
- Cryptographic integrity is built into the addressing; this means that the storage nodes cannot tamper with (or even read) the content that they are storing, which reduces the level of trust necessary and therefore opens up possibilities of more storage sponsors
- Essentially RAID-over-the-network; uses erasure coding to have efficient redundancy against data loss, and replication is an active task (content blocks are pushed to storage nodes) so you are not dependent on people deciding to āseedā something, therefore it is not subject to rot of unpopular files
- Can deal with any size storage node; thereās no need to match the size of different nodes (unlike many other RAID-like systems), and even small personal nodes provided by individual contributors would be useful, as long as they are reliably online.
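For intuition on the erasure-coding point: Tahoe-LAFS defaults to 3-of-10 shares (k needed out of n total), so redundancy costs n/k in raw storage while tolerating the loss of any n-k shares. A quick sketch of that arithmetic:

```python
def erasure_profile(k, n, file_tb):
    """Raw bytes stored and fault tolerance for k-of-n erasure coding."""
    return {
        "raw_stored_tb": file_tb * n / k,
        "tolerates_lost_shares": n - k,
        "expansion_factor": n / k,
    }

# Tahoe's default 3-of-10 encoding applied to a 500 TB cache:
profile = erasure_profile(3, 10, 500)
print(profile)
```

For comparison, plain 3x replication of the same 500 TB would store 1,500 TB yet tolerate only 2 lost copies.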
Disadvantages:
- There is no explicit "delete" functionality, only (optional) time-based expiry and renewal. This is not likely to matter to us, given our "never delete anything" usecase.
- Performance can be variable; I don't think this is an issue for us, given that we have a CDN fronting it.
- Access requires something that speaks the protocol, so we'd need to proxy requests through a centralized server for retrieval (though that server doesn't need to persistently store anything). That server would also be responsible for triggering e.g. repairs of files for which some "shares" (copies) have been lost.
- I think it doesn't protect against a malicious storage node pretending to have a file without actually having it, unless you do regular integrity checks (which are available as an option out of the box).
- The only one I can think of that might be a problem for us: there's no real resource accounting, and any participating node can store data in the storage cluster without it being attributable to them. It may be possible to fix this by running a modified version of the software that rejects all storage requests except those from the central coordination node; I don't know whether this is available as a configurable option out-of-the-box.
With a Tahoe-LAFS setup, the only centralized infrastructure that would need to be maintained by the project would be that central coordination server. It would see a fair amount of traffic (it'd essentially be the "gateway" to the storage cluster), but require little to no persistent storage. This would be very, very cheap to maintain, both in terms of money and in terms of sysadmin time.
I feel like it would be worth trialing this in parallel to the existing infrastructure; gradually directing more and more backend requests to the Tahoe cluster and then falling back to the existing S3 storage if they fail. That way we can test in a low-risk way whether a) we actually have sufficient sysadmin capacity to keep this running in the longer term, and b) the software itself can meet our needs.
Edit: I'd be happy to coordinate the technical side of this, of course, insofar as my health allows.
That's very misleading, and assumes the "S3 Standard" storage class. Other storage classes are available. The "Archive Instant Access" tier (part of "Intelligent Tiering") is $4/TB per month, for example.
It was already mentioned in an earlier reply, but if it's not already enabled, turning on Intelligent Tiering would be a no-brainer.
Is there any way we can be more specific about when this "sometimes" is? It seems to me that some data is extremely valuable and a lot of the data is probably not valuable at all, if it can be regenerated as needed. Would I be correct in supposing, for example, that if, hypothetically, only the "fetch" derivations were kept and everything else with infrequent access were pruned from the cache, everything would still be buildable, and the only impact would be bad build times for lesser-used packages (for people who aren't already using a third-party cache host for their more niche projects)?
We definitely need more support for distributed, content-addressed systems as substituters. Candidates include Bittorrent, IPFS, Hypercore, Eris. Imagine if every NixOS config seeded content-addressed paths via bittorrent.
Please keep in mind that these systems do not provide availability guarantees. They are distribution systems, not storage systems, and therefore do not address the problem of durable long-term storage of the binary cache (and thus do not solve the problem that we're dealing with here right now).
Something like this paragraph worries me. The problem with these sorts of things is metastable and cascading failures. Data replication and resharding are already extremely difficult to achieve reliably in existing systems operating at PB-and-beyond scale. I don't think we want potential storage domains to go completely missing and cause issues like this without careful operator oversight. (That's assuming we want to actually store large data volumes and not throw stuff away/archive it.)
This hints at something, which is that Tahoe does solve a set of problems, but they may not be ours. The coordination could be cheap to run, yes, but that isn't the only consideration we have, and it may even be considered a coincidence if none of the other problems really overlap. Our problem isn't really untrusted storage. It's just having storage in general.
Tail performance in latency and throughput is critical to the usability of the binary cache IMO; the CDN is part of that, but not all of it. See for example nixos-org-configurations/212. I've in the past seen reports from users in distant locations like Shanghai and Singapore regularly reporting multi-second TTFBs without features like Shielding. That kind of latency destroys everything, and in some cases causes timeouts in some software; a timeout is effectively the same thing to a user as "this file doesn't exist" or "this file was deleted by the operator in a data loss accident." They're operationally different, but morally the same from a user/operator/SLO/I-paid-you-money perspective.
The CDN works so well that it's easy to forget it's there. Nobody thinks forest fires are a problem when you prevent them in the first place, after all. It's a combination of factors. One is that S3 is actually a really good and reliable product. Is it expensive as hell? Yes. But it's good, it's fast, and it has good latency to clients close to the bucket. Serving S3 files quickly is a well-explored problem; that's why it works so well.
(One latency performance factor that's unexplored IMO is the fact that the design of narinfo files incurs a form of HOL blocking. When you want to download `foo.nar`, you need all the dependencies, but you don't know them until you have recursively traversed and downloaded every dependent `.narinfo` file, and that's a bunch of small latency-sensitive files to grab hold of.)
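The shape of that traversal can be sketched as follows (the hypothetical `fetch_narinfo` stands in for an HTTP GET of a `.narinfo`; the point is that each level of the dependency graph costs another latency-bound round of requests, because references are only learned from the previous level):

```python
def closure(root, fetch_narinfo):
    """Breadth-first walk of narinfo references. Requests within a level
    can go out in parallel, but levels are inherently sequential: you
    cannot ask for a dependency before some narinfo has named it."""
    seen, frontier, rounds = set(), {root}, 0
    while frontier:
        rounds += 1  # one latency-bound round trip per dependency level
        next_frontier = set()
        for path in frontier:
            seen.add(path)
            next_frontier.update(fetch_narinfo(path)["references"])
        frontier = next_frontier - seen
    return seen, rounds

# Toy dependency graph (made-up store path names):
graph = {
    "foo": ["glibc", "openssl"],
    "openssl": ["glibc"],
    "glibc": [],
}
paths, rounds = closure("foo", lambda p: {"references": graph[p]})
print(sorted(paths), rounds)  # ['foo', 'glibc', 'openssl'] 2
```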
If "performance is variable" means "randomly introduced spikes of 100 ms in the backend from origin to cache", OK: that would blow, like, 50% or more of your entire intra-stack latency budget in a well-oiled distributed system at a relatively large company, but it would work for us, maybe. But if "performance is variable" means "requesting a `.narinfo` file from Los Angeles causes a recursive set of requests, randomly choosing a guy in Singapore and another person in Australia to serve them both, with totally random latency between them", that's going to be quite bad, I think. And if the guy in Australia can then just turn his PC off and require a guy in Singapore to rewrite parts of his hard drive to handle that, that will also have consequences for the availability of other nodes in practice.
I don't know what the performance profile of Tahoe is, to be fair. It could be worth exploring as an alternative, and it may be quite good. Some other problems remain, e.g. provisioning of storage is clearly an operator concern, and I don't think trusting data hoarders not to unplug their hard drives is a good approach. But I just want to push back on the notion that the CDN makes performance anomalies irrelevant. Performance is absolutely relevant, and only feels irrelevant exactly because the CDN and Amazon took care of performance for us. And at large data sizes, performance anomalies can easily snowball into complete denials of service for users. Unfortunately, that might require careful operator experience to account for, and it's going to be hard to sell that when most of the alternative options are probably better understood from that perspective.
I'd add that, practically speaking on the funding side, unless some major provider steps up to cover the costs, a one-time capital investment to build your own origin server would make more sense than leasing any off-the-shelf VPS storage solution, since a one-time major fundraising campaign is easier to accomplish than an ongoing raise of $1,000s/month (this is assuming the whole thing cannot easily be re-architected in the short term to entirely eliminate the need to store this much…)
There would still be a constant cost of colocation and an associated drive replacement service, but that would be an order of magnitude less ($100s instead of $1,000s/month) and so easier to sustain with a minimal ongoing donation campaign.
From reading through the thread, it sounds like there are two independent problems here. One is caching: serving in-demand assets quickly. The other is that this is as much an archive as it is a cache.
If original sources are unavailable, rebuilding is not an option, and this eventually has a poisoning effect into the nix ecosystem. If a significant dependency is totally lost, all depended packages are now at risk of loss and so on. Based on this, we canāt really think about this as only cache. Pruning unimportant things may still be possible, but thatās probably very tricky, to the point where deferral of this issue is likely a good idea.
This means we need to solve both problems, and for archival purposes it's almost impossible to get away from a centralized authority while also reducing costs. Full distribution makes no archival guarantees, or it defers responsibility to others to do it themselves. If this archival layer exists, it may then be possible to be more distributed in the on-demand cache layers on top, to try and spread out the data transfer costs.
So the question is really how do we pay as little as possible for archiving cold assets and then either reduce transfer costs or distribute them out by some means.
Disregard: I had hoped that CloudFront would be a good option for cutting egress costs (CF-to-S3 transfer is free, so you pay only CF's pricing), but it looks like the savings are relatively small. It may however still be a quick win to simply mark cache objects as immutable and then dramatically scale back the storage tier in S3. The hot items should be so hot that they rarely actually go back to origin. Egress only moves a little, but storage savings could be significant.
Edit: I reread the thread, and of course we already have a CDN, which is amazingly doing a ton. Possibly one final, fully centralized cache layer between the CDN and the buckets would buffer things enough to lower the S3 tier even further. Since a CDN is distributed, it produces more misses than a single unified layer would see.