The NixOS Foundation's Call to Action: S3 Costs Require Community Support

https://github.com/NixOS/foundation/issues/79

3 Likes

As for this S3 thread, I don't think good enough reliability and availability is achievable by just a couple of community members in their free time. And even if we discount the human work, it still won't be close to free (hardware, probably not runnable "at home", etc.)

3 Likes

Just a small clarification: it seems it's 3 TB/day in egress.

Ah right, I misunderstood. It'd still end up considerably cheaper than AWS, though: around $3,400 per month assuming 500 TB of data.

3 Likes

So, I don't have a great understanding of how the cache system works beyond "it caches built binaries for common platforms", but wouldn't the solution to data loss be to just reintroduce said binary the next time someone builds it, or whenever Hydra hits it?

2 Likes

The problem is that there are a lot of historical builds in the binary cache; the binary cache doesn't build things on-demand, but rather rebuilds all the (eligible) packages every time the nixpkgs channel gets updated, so that the binaries are already there once a client requests them.

This means that it will never revisit past builds unless someone explicitly creates a task for it. It would probably be possible to retroactively do a 'clean' rebuild of past nixpkgs versions, but trying to reproduce the entire history of nixpkgs (which is, more or less, what currently lives in the binary cache) would require a lot of computing power. I don't know whether that's viable in practice.

4 Likes

So, here's a quick survey of the "S3-compatible" providers I am aware of, on a monthly basis (S3 pricing comes from Backblaze's comparison page; I have not verified it):

  • AWS S3: $26/TB storage, $90/TB traffic, no minimum storage
  • Cloudflare R2: $15/TB storage plus 'operation fees', free traffic; possibly sponsorable, but this would create another sponsor dependency
  • Backblaze B2: $5/TB storage, $10/TB traffic, no minimum storage, supposedly free migration from S3
  • Wasabi: $6/TB storage, free(?) traffic, 90 days minimum storage
  • Storj: $4/TB storage plus a 'segment fee', $7/TB traffic, no minimum storage, supposedly free migration from S3

I'd say these seem like a viable short-term solution to migrate to, so that we can hit the one-month deadline while also significantly cutting expenses; after that, it's probably worth revisiting whether something more cost-efficient is possible, and what's feasible in terms of infrastructure maintenance capacity after the infra team gets split. (A rough back-of-envelope comparison of the providers above follows.)
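
To make the comparison concrete, here is a quick back-of-envelope sketch in Python using the list prices quoted above (unverified) and the rough figures from earlier in the thread (about 500 TB stored and about 3 TB/day of egress to the CDN). Operation/segment fees, free allowances, and minimum-storage terms are deliberately ignored, so treat the output as ballpark only:

```python
# Rough back-of-envelope using the per-TB list prices quoted above and the
# earlier thread numbers (~500 TB stored, ~3 TB/day egress to the CDN).
# List prices only; operation/segment fees and free tiers are ignored.
STORED_TB = 500
EGRESS_TB_MONTH = 3 * 30

providers = {
    # name: (storage $/TB/month, egress $/TB)
    "AWS S3 Standard": (26, 90),
    "Cloudflare R2":   (15, 0),
    "Backblaze B2":    (5, 10),
    "Wasabi":          (6, 0),
    "Storj":           (4, 7),
}

for name, (storage, egress) in providers.items():
    monthly = STORED_TB * storage + EGRESS_TB_MONTH * egress
    print(f"{name:16s} ~${monthly:,.0f}/month")
```

Run as-is, this lands Backblaze B2 at roughly the $3,400/month figure mentioned earlier, with AWS S3 Standard an order of magnitude higher once egress is included.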

5 Likes

Had a couple of ideas while responding to someone on HN who also asked the "can we just rebuild it all?" question:

  • I guess if there's a bit of a Pareto shape to the distribution of build output sizes, there might be meaningful savings in rebuilding the packages that take up the most storage and paying to move only the small stuff. (A toy sketch of this trade-off follows the list.)
  • It might also be possible to triage the sources in the cache: re-fetch everything that is still available from the original source or a mirror, and then only export the ones that either couldn't be fetched, or could be fetched but whose contents no longer match the hash.
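
Here is a purely synthetic sketch of the first idea. The sizes below are made up (a Pareto sample), just to show the calculation; real numbers would come from the FileSize field of the cache's narinfo metadata:

```python
import random

# Synthetic illustration: if NAR sizes are heavy-tailed, rebuilding only the
# largest outputs means most of the bytes never have to be transferred out of
# S3 at all. These sizes are fake; substitute real narinfo FileSize data.
random.seed(0)
nar_sizes = sorted((random.paretovariate(1.2) for _ in range(100_000)), reverse=True)

total = sum(nar_sizes)
for frac in (0.01, 0.05, 0.10, 0.25):
    top = nar_sizes[: int(len(nar_sizes) * frac)]
    print(f"rebuilding the largest {frac:.0%} of paths would cover "
          f"{sum(top) / total:.0%} of the bytes")
```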
3 Likes

Hi! I apologize for not having read through the entire thread yet, and from what I've seen, the prevailing sentiment in this thread might not necessarily align with this proposal, but I thought I'd mention it anyway.

When I was at Amazon, I got AWS to cover crates.io's hosting costs. AWS has since set up an open-source sponsorship program that gives out AWS credits with a (last I checked) not-unreasonable application process, and it might be worthwhile to apply if you haven't already.

(I am a little sick right now, so brain don't work good.)

12 Likes

An admittedly biased point about Wasabi: costs increase substantially if you start egressing more than 1x your bytes at rest per month.

On the other hand, the per-"segment" costs with Storj are essentially per-object charges, but they can be basically eliminated with some packing (e.g. zipping up lots of small files together). For exactly this reason, the Storj platform natively understands zips and can pull individual files out of a zip without downloading the whole thing.

Since the large-scale data storage is partially intended to benefit research, consider partnering with an academic institution. Many institutions qualify for data egress waivers: "Data egress waiver available for eligible researchers and institutions" (AWS Public Sector Blog).

In my experience, institutions use substantially less than their waiver limit across the org, so the data transfer would basically be free for them. This transition could be done with virtually no technical measures at all (i.e., just transferring ownership of the org to an institution) and would then allow migration off AWS in the future basically for free. Some university research group may even be willing to host the data on a storage cluster.

Another thought: if the data is largely duplicative at the bytestream level across objects, would decompressing and recompressing the data (as one stream) before egress save substantial costs for the transition?
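
As a minimal sketch of that idea, assuming the objects are the usual xz-compressed NARs: decompress each one and feed them into a single zstd stream with long-distance matching, so byte ranges shared across objects only get stored once. The paths, window size, and the "buffer each NAR in memory" shortcut are all placeholder choices, and a real version would also need an index to split the bundle back into objects afterwards:

```python
import subprocess
from pathlib import Path

# Sketch: recompress many .nar.xz objects as one zstd stream with long-distance
# matching (zstd's --long flag), so cross-object duplication compresses away.
# Buffers each NAR in memory for simplicity; a real job would stream.
nars = sorted(Path("cache-sample").glob("*.nar.xz"))   # placeholder directory

sink = subprocess.Popen(
    ["zstd", "-T0", "-19", "--long=27", "-o", "bundle.zst"],
    stdin=subprocess.PIPE,
)
for nar in nars:
    xz = subprocess.run(["xz", "-dc", str(nar)], capture_output=True, check=True)
    sink.stdin.write(xz.stdout)
sink.stdin.close()
sink.wait()
```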

8 Likes

I think that regardless of which service provider we end up dealing with, and whether or not we move away from AWS, this is a wake-up call that we're going to need to mitigate some of these problems with good old-fashioned engineering and elbow grease. That would be true even if we ran our own servers somewhere. I've written several Nix binary caches, worked on a prototype of heavily optimizing the Fastly cache that serves cache.nixos.org (when I worked there), and have some ideas about CI systems and whatnot. Some notes:

  • The overwhelming cost is storage, not transfer. But that's actually because the CDN is effective in practice, while nothing comparable helps with storage. That's the "this airplane has red dots all over it" effect (survivorship bias).
    • Caching is still probably not as good as it could be, and the overall global hit rate could be pushed higher, last I remember. I had some solutions to attack this, but it's detailed work, and given the breakdown above, $900/mo of transfer is better than I expected, I think? (Note that hit rate is not the same as cached savings; despite the numbers being 1,500 TiB/29 TiB, the global efficiency isn't nearly that good, IIRC.)
    • Fastly is probably a reliable partner and I somewhat doubt NixOS is a major contributor to their network profile, so leaning into the cache further is probably OK. (I don't work there anymore, so don't take this as an agreement. IANAL.)
    • A significant amount of visible, end-user performance comes from p99 latency on .narinfo files, among other things. (More on this later.) So a cache, and more importantly a global cache, is always going to be needed regardless of storage costs; it still needs to be factored into the design.
  • The cache has a very long tail of old data and a reasonable warm/cold ratio, with a large amount of warm data in an absolute sense, and a reasonable full bandwidth rate: the 95th percentile is 1k requests/min at 2.6 GiB/min, which I think is well within reason to handle on a single server. (Quick arithmetic on these numbers follows this list.)
  • Nix caches tend to have a particular storage profile:
    • Reads are very frequent and hot. Most importantly, reads are latency sensitive, and must be served immediately.
    • The hot path often concerns .narinfo files, which are very small. This is a really important case, because it's the hot path in both the 200 and the 404 case.
    • Storage is very expensive over time, from .nar files.
    • Writes can be frequent in an absolute sense, but in my experience they generally follow the "90/10" read/write skew. They are also not latency-sensitive. Therefore:
      • Batch uploads as aggressively as possible.
      • Just overall, aggressively trade latency for aggregate throughput and storage size wins.
  • A Nix binary cache actually has an extremely simple interface, and in this case that's (mostly) for the better, not worse. Some improvements could be made, but the design allows a bit of freedom while staying simple.
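
Just to spell out the arithmetic behind the p95 figure quoted above (these are plain unit conversions of the quoted 1k requests/min at 2.6 GiB/min, whatever exactly that rate covers):

```python
# Sanity check on the quoted p95 numbers: 1k requests/min at ~2.6 GiB/min.
reqs_per_min = 1_000
gib_per_min = 2.6

print(f"{reqs_per_min / 60:.0f} req/s")                      # ~17 req/s
print(f"{gib_per_min * 1024 / 60:.0f} MiB/s")                 # ~44 MiB/s
print(f"{gib_per_min * 8 / 60:.2f} Gibit/s sustained")        # ~0.35 Gibit/s
print(f"{gib_per_min * 60 * 24 * 30 / 1024:.0f} TiB/month")   # ~110 TiB/month
```

Roughly 17 requests per second and a third of a gigabit sustained: comfortably single-server territory, as claimed.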

Some other things:

  • While the absolute performance of cloud-provider components has normalized (e.g. storage is faster and closer to CPU speeds than ever), it has not normalized on a cost basis. Compute is vastly cheaper than storage on most cloud providers, even though the relative performance gap between them has closed since, say, 2010. And critically, we are bound by cost, not performance.
  • We don't just need absolute numbers. Performance and capacity planning actually requires the distributions behind the numbers. For example, it is important to know the split between warm and cold data not just at some moment in time, but over time as well. Was the split 90/10 one year ago and is it now 75/25? That's a big change in direction, and there's a lot we can't understand without knowing it.
    • Another example: what's the p95 vs. p99 object size served by the cache? This is actually a huge piece of the puzzle. There's something like a fixed 2 GiB limit, last I remember; I suspect the p99 is somewhere around 1.1 to 1.5 GB due to closures like GHC and LLVM. But the p99 case is also probably dominated by transfer time; therefore a more expensive read path might still be profitable in these cases.
    • A lot more metrics tracking was one of the things I wanted to add to the cache, because it's really important for questions like this.
    • We also need to be able to sample the cache more effectively. I don't know how to tackle this, but it's needed for better investigations. (A rough sampling sketch follows this list.)
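
As one very rough sampling sketch for the size-distribution question: walk part of the bucket listing and compute percentiles over .nar object sizes. The bucket name and the one-million-object cut-off are placeholders; listing the full real bucket would be slow and would itself cost ListObjects requests:

```python
import statistics
import boto3

# Sample object sizes from (part of) the bucket listing and report percentiles
# for NAR objects. "nix-cache" and the sample cut-off are placeholders.
s3 = boto3.client("s3")
sizes = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="nix-cache"):
    for obj in page.get("Contents", []):
        if obj["Key"].startswith("nar/"):
            sizes.append(obj["Size"])
    if len(sizes) >= 1_000_000:   # sample, don't walk the whole bucket
        break

cuts = statistics.quantiles(sizes, n=100)
print(f"p50={cuts[49]:,} p95={cuts[94]:,} p99={cuts[98]:,} bytes")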

So the key takeaways are:

  • Reads need low latency.
  • Reads are often, but not always, small and frequent (.narinfo files).
    • When I say small, I mean something like 1 KB or less.
    • Small files are notoriously difficult to handle for most solutions, so this is really important. But also, the cache should be serving 99.9% of all narinfos directly. Even then, cold reads can really hurt in general, in my experience.
  • The overall absolute bandwidth needs are actually relatively minimal, thanks to the cache.
  • Writes can be very high latency (seconds or even minutes before artifacts materialize in the cache is OK; it doesn't matter if the latest LLVM build takes 5 extra minutes to appear after a 1-hour build).
  • The absolute file sizes, however, are extremely aggressive in practice for the kinds of systems people run today and the kind of closures we produce.
  • If we have to pay money, or someone does, we need to apply some old school principles like this. Arbitraging your compute for better storage will be worth it.

I've done some experiments on this and could write more about it. Notably, I believe algorithms like FastCDC are absolutely practical to run in "real time" on the write path of a Nix binary cache, and more importantly they have tremendous benefits; some experiments I did with an almost-pure-Rust solution gave me somewhere upwards of 80% deduplication, though this is difficult to reproduce reliably without taking samples of the Nix cache over time. This approach requires constructing an index that is then used on the read path, but only if the narinfo was found by a client, and only if it's uncached. So it's not that bad in practice for a lot of uses, I think.
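
To make the shape of that write path concrete, here is a toy content-defined-chunking sketch in Python. It is a simplified gear-hash chunker plus a chunk index, not the real FastCDC (which adds normalized chunking and is far faster), and all names here are mine:

```python
import hashlib

# Toy content-defined chunking: cut whenever a rolling "gear" fingerprint hits
# a mask, so identical regions chunk the same way regardless of offset.
GEAR = [int.from_bytes(hashlib.blake2b(bytes([i]), digest_size=8).digest(), "big")
        for i in range(256)]
MASK = (1 << 16) - 1                 # ~64 KiB average chunk size
MIN_CHUNK, MAX_CHUNK = 16 * 1024, 256 * 1024

def chunks(data: bytes):
    start, fp = 0, 0
    for i, byte in enumerate(data):
        fp = ((fp << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (fp & MASK) == 0) or length >= MAX_CHUNK:
            yield data[start : i + 1]
            start, fp = i + 1, 0
    if start < len(data):
        yield data[start:]

def index_nar(store: dict, index: dict, name: str, nar: bytes):
    """Store unique chunks once; record the chunk list needed to rebuild this NAR."""
    index[name] = []
    for chunk in chunks(nar):
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)      # dedup: one copy per content hash
        index[name].append(digest)

# Read path (only hit on a cache miss): b"".join(store[d] for d in index[name])
```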

And on top of that, given the other parts of the performance profile, you can make other things very simple here if you can forecast correctly. For example, I strongly suspect that, given the above numbers, a custom bit of software doing the above deduplication (a single multi-threaded concurrent server), running in an active leader-standby setup to replicate the indices, with just two servers, could easily provide very high yearly uptime. The dirty secret about cloud stuff is that it can be very reliable in small numbers; you can easily run a couple of VPSes for years at a time with minimal interruption. Two big servers are much simpler than ten small ones. If you have storage under control, you could easily run these two servers on standard commodity VPS providers for something like $300 USD, thanks to the fact that egress is mostly mitigated by the Fastly cache; Hetzner would give us 20 TB of traffic for free, for example, with a 10 Gbit link, and only about €1 per extra TB. That essentially covers the server-to-CDN costs in our case for free, and the compute with standby replication is dwarfed by storage costs even then.

I could keep going. However, this is all engineering. The problem is that right now we have a sinking ship, and I don't know how to handle that reasonably, I'm afraid. But whatever we do, once we abandon this ship we're going to need to actually put some resources into this, because the current path isn't going to be easily sustainable. It will be difficult given the amount of resources available, but I think it can be done.

26 Likes

As an early suggestion for a long-term solution (not a short-term one!), I'd suggest considering Tahoe-LAFS. It's an actively-replicated distributed storage system with much stronger availability properties than something like IPFS or BitTorrent.

Advantages:

  • Distributed across arbitrary low-trust peers of arbitrary size; this would effectively allow third parties to sponsor storage without us being dependent on their continued sponsorship. If any sponsor pulls out, the content just gets redistributed over the others.
  • Cryptographic integrity is built into the addressing; this means that storage nodes cannot tamper with (or even read) the content they are storing, which reduces the level of trust necessary and therefore opens up the possibility of more storage sponsors.
  • Essentially RAID-over-the-network; it uses erasure coding for efficient redundancy against data loss, and replication is an active task (content blocks are pushed to storage nodes), so you are not dependent on people deciding to 'seed' something, and it is therefore not subject to the rot of unpopular files. (A small numeric sketch of the erasure-coding overhead follows this list.)
  • Can deal with any size of storage node; there's no need to match the sizes of different nodes (unlike many other RAID-like systems), and even small personal nodes provided by individual contributors would be useful, as long as they are reliably online.
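
To put rough numbers on the erasure-coding point: with k-of-n encoding, any k of the n shares reconstruct the data, so the raw-storage overhead is n/k and you can lose up to n-k shares. I believe Tahoe's default is 3-of-10; the 500 TB figure is the rough cache size discussed earlier:

```python
# k-of-n erasure coding: overhead and loss tolerance for a few parameter choices.
def erasure(k: int, n: int, data_tb: float):
    expansion = n / k       # raw bytes stored per logical byte
    tolerated = n - k       # shares that can be lost without data loss
    return expansion, tolerated, data_tb * expansion

for k, n in [(3, 10), (7, 10), (10, 16)]:
    exp, lost_ok, raw = erasure(k, n, 500)
    print(f"{k}-of-{n}: {exp:.2f}x overhead, survives losing {lost_ok} shares, "
          f"~{raw:,.0f} TB raw for 500 TB of data")
```

The trade-off is visible immediately: 3-of-10 is very robust but more than triples the raw storage, while something like 7-of-10 halves the overhead at the cost of tolerating fewer lost shares.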

Disadvantages:

  • There is no explicit 'delete' functionality, only (optional) time-based expiry and renewal. This is not likely to matter to us, given our "never delete anything" use case.
  • Performance can be variable; I don't think this is an issue for us, given that we have a CDN fronting it.
  • Access requires something that speaks the protocol, so we'd need to proxy retrieval requests through a centralized server (though that server doesn't need to persistently store anything). That server would also be responsible for triggering e.g. repairs of files for which some 'shares' (copies) have been lost.
  • I think it doesn't protect against a malicious storage node pretending to have a file but not actually having it, unless you do regular integrity checks (which is an option that is available by default).
  • The only one I can think of that might be a problem for us: there's no real resource accounting, and any participating node can store data in the storage cluster without it being attributable to them. It may be possible to fix this by running a modified version of the software that rejects all storage requests except those from the central coordination node; I don't know whether this is available as a configurable option out of the box.

With a Tahoe-LAFS setup, the only centralized infrastructure the project would need to maintain is that central coordination server. It would handle a fair amount of traffic (it'd essentially be the 'gateway' to the storage cluster), but require little to no persistent storage. This would be very, very cheap to maintain, both in money and in sysadmin time.

I feel like it would be worth trialing this in parallel with the existing infrastructure: gradually directing more and more backend requests to the Tahoe cluster and falling back to the existing S3 storage if they fail. That way we can test, in a low-risk way, whether a) we actually have sufficient sysadmin capacity to keep this running in the longer term, and b) the software itself can meet our needs.
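
A minimal sketch of that try-then-fall-back routing, assuming a hypothetical internal Tahoe gateway URL; the endpoints are placeholders, and a real deployment would want streaming responses, tighter timeouts, and per-backend hit metrics:

```python
import requests
from flask import Flask, Response

# Toy proxy: ask the (hypothetical) Tahoe gateway first, fall back to the
# existing S3-backed cache on any failure. URLs are placeholders.
TAHOE_GATEWAY = "http://tahoe-gateway.internal:3456/cache"
S3_FALLBACK = "https://nix-cache.s3.amazonaws.com"

app = Flask(__name__)

@app.route("/<path:key>")
def fetch(key):
    for base in (TAHOE_GATEWAY, S3_FALLBACK):
        try:
            r = requests.get(f"{base}/{key}", timeout=5)
            if r.status_code == 200:
                return Response(r.content,
                                content_type=r.headers.get("Content-Type"))
        except requests.RequestException:
            continue
    return Response(status=404)
```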

Edit: I'd be happy to coordinate the technical side of this, of course, insofar as my health allows.

6 Likes

That's very misleading, and assumes the "S3 Standard" storage class. Other storage classes are available: the "Archive Instant Access" tier (part of "Intelligent-Tiering") is $4/TB per month, for example.

It was already mentioned in an earlier reply, but if it's not already enabled, turning on Intelligent-Tiering would be a no-brainer.
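
For reference, one way to do this is a lifecycle rule that transitions objects into Intelligent-Tiering; a rough boto3 sketch, with the bucket name as a placeholder. Worth checking the per-object monitoring fee first, and I believe objects under 128 KB (i.e. most .narinfo files) aren't auto-tiered at all, so scoping the rule to the nar/ prefix is probably the sensible version:

```python
import boto3

# Sketch: transition NAR objects into S3 Intelligent-Tiering via a lifecycle
# rule. Bucket name is a placeholder; verify fees before enabling for real.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="nix-cache",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-nars",
            "Status": "Enabled",
            "Filter": {"Prefix": "nar/"},
            "Transitions": [{"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}],
        }]
    },
)
```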

4 Likes

Is there any way we can be more specific about when this "sometimes" is? It seems to me that some of the data is extremely valuable and a lot of it is probably not valuable at all, if it can be regenerated as needed. Would I be correct in supposing, for example, that if only the "fetch" derivations were kept and everything else with infrequent access were pruned from the cache, everything would still be buildable, and the only impact would be bad build times for lesser-used packages (for people who aren't already using a third-party cache host for their more niche projects)?

4 Likes

We definitely need more support for distributed, content-addressed systems as substituters. Candidates include BitTorrent, IPFS, Hypercore, and Eris. Imagine if every NixOS config seeded its content-addressed paths via BitTorrent.

https://github.com/NixOS/nixpkgs/pull/212930

7 Likes

Please keep in mind that these systems do not provide availability guarantees. They are distribution systems, not storage systems, and therefore do not address the problem of durable long-term storage of the binary cache (and thus do not solve the problem that we're dealing with here right now).

13 Likes

Something like this paragraph worries me. The problem with these sorts of things is metastable and cascading failures. Data replication and resharding are already extremely difficult to get right reliably in existing systems that operate at PB-and-beyond scale. I don't think we want potential storage domains to go completely missing and cause issues like this without careful operator oversight. (That's assuming we actually want to store large data volumes and not throw stuff away/archive it.)

This hints at something: Tahoe does solve a set of problems, but they may not be ours. The coordination could be cheap to run, yes, but that isn't the only consideration we have, and it may even be a coincidence if none of the other problems really overlap. Our problem isn't really untrusted storage. It's just having storage at all.

Tail performance in latency and throughput is critical to the usability of the binary cache, IMO; the CDN is part of that but not all of it. See for example nixos-org-configurations/212. In the past I've seen reports from users regularly seeing multi-second TTFBs in distant locations like Shanghai and Singapore without features like Shielding. That kind of latency destroys everything, and in some cases causes timeouts in some software; a timeout is effectively the same thing to a user as "this file doesn't exist" or "this file was deleted by the operator in a data loss accident." They're operationally different, but morally the same from a user/operator/SLO/I-paid-you-money perspective.

The CDN works so well that it's easy to forget all this. Nobody thinks forest fires are a problem when you prevent them in the first place, after all. It's a combination of factors. One is that S3 is actually a really good and reliable product. Is it expensive as hell? Yes. But it's good, it's fast, and it has good latency to clients close to the bucket. Serving these S3 files quickly is a well-explored problem; that's why it all works so well.

(One latency factor that's unexplored, IMO, is that the design of narinfo files incurs a form of head-of-line blocking. When you want to download foo.nar you need all of its dependencies, but you don't know them until you have recursively traversed and downloaded every dependent .narinfo file, and that's a lot of small, latency-sensitive files to grab hold of.)
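
To make that recursion concrete, here is a rough sketch of the narinfo walk a client effectively has to do before it knows the full closure (and so before it can start pulling the big .nar files); the store hash in the example call is a placeholder:

```python
import requests

# Recursive .narinfo closure walk against the public cache: each level of the
# dependency graph costs another round of tiny, latency-sensitive requests.
CACHE = "https://cache.nixos.org"

def closure(store_hash: str, seen=None):
    seen = seen if seen is not None else set()
    if store_hash in seen:
        return seen
    seen.add(store_hash)
    info = requests.get(f"{CACHE}/{store_hash}.narinfo", timeout=10).text
    for line in info.splitlines():
        if line.startswith("References:"):
            for ref in line.split()[1:]:
                closure(ref.split("-", 1)[0], seen)  # hash part of "<hash>-<name>"
    return seen

# e.g. closure("<some store hash>") -> set of hashes, one narinfo round-trip each
```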

If "performance is variable" means "randomly introduced spikes of 100 ms in the backend between origin and cache", OK: that would blow 50% or more of your entire intra-stack latency budget in a well-oiled distributed system at a relatively large company. But it would work for us, maybe. If "performance is variable" instead means "requesting a .narinfo file from Los Angeles kicks off a recursive set of requests, randomly choosing a guy in Singapore and another person in Australia to serve them, with totally random latency between them", that's going to be quite bad, I think. And if the guy in Australia can then just turn his PC off and force the guy in Singapore to rewrite parts of his hard drive to handle that, that will also have consequences for the availability of other nodes in practice.

I don't know what the performance profile of Tahoe is, to be fair. It could be worth exploring as an alternative, and it may be quite good. Some other problems remain, e.g. provisioning of storage is clearly an operator concern, and I don't think trusting data hoarders not to unplug their hard drives is a good approach. But I just want to push back on the notion that the CDN makes performance anomalies irrelevant. Performance is absolutely relevant, and only feels irrelevant precisely because the CDN and Amazon took care of performance for us. And at large data sizes, performance anomalies can easily snowball into complete denials of service for users. Unfortunately, accounting for that might require careful operator experience, and it's going to be hard to sell that when most of the alternative options are probably better understood from that perspective.

7 Likes

I'd add that, practically speaking, on the funding side: unless some major provider steps up to cover the costs, a one-time capital investment in a build-your-own origin server would make more sense than leasing any off-the-shelf VPS storage solution, since a one-time major fundraising campaign would be easier to pull off than an ongoing raise of $1,000s/month (this is assuming the whole thing cannot easily be re-architected in the short term to eliminate the need to store this much...)

There would still be a constant cost of colocation and an associated drive replacement service, but that would be an order of magnitude less ($100s instead of $1,000s/month) and so easier to sustain with a minimal ongoing donation campaign.
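
A toy break-even calculation for that capex-vs-opex point; every number below is a placeholder assumption (hardware, colocation, and the cloud bill being replaced are all guesses), purely to show the shape of the comparison:

```python
# Toy payback calculation: one-time hardware purchase vs. ongoing cloud bill.
# All figures are placeholder assumptions, not quotes.
server_capex = 30_000      # assumed one-time: chassis + disks for ~500 TB usable
colo_monthly = 500         # assumed colocation, power, drive-replacement budget
cloud_monthly = 4_000      # assumed monthly bill being replaced

payback_months = server_capex / (cloud_monthly - colo_monthly)
print(f"payback in ~{payback_months:.0f} months")   # under a year with these guesses
```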

5 Likes

From reading through the thread, it sounds like there are two independent problems here. One is caching, i.e. serving in-demand assets quickly. The other is that this is as much an archive as it is a cache.

If original sources are unavailable, rebuilding is not an option, and this eventually has a poisoning effect on the Nix ecosystem: if a significant dependency is totally lost, all dependent packages are now at risk of loss, and so on. Based on this, we can't really think of this as only a cache. Pruning unimportant things may still be possible, but that's probably very tricky, to the point where deferring that question is likely a good idea.

This means we need to solve both problems, and for archival purposes it's almost impossible to get away from a centralized authority and still reduce costs. Full distribution makes no archival guarantees, or it defers the responsibility to others to do it themselves. If this archival layer exists, it may then be possible to be more distributed in the on-demand cache layers on top, to try to spread out the data transfer costs.

So the question is really: how do we pay as little as possible for archiving cold assets, and then either reduce transfer costs or distribute them out by some means?

Disregard: I had hoped that CloudFront would be a good option for cutting egress costs (CloudFront-to-S3 transfer is free, so you pay only CloudFront's pricing), but it looks like the savings are relatively small. It may, however, still be a quick win to simply mark cached objects as immutable and then dramatically scale back the storage tier in S3. The hot items should be so hot that they rarely actually go back to origin. Egress only moves a little, but the storage savings could be significant.
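
A minimal sketch of the "mark as immutable" part, assuming it's done by rewriting object metadata in place. NAR files are content-addressed, so a long max-age plus `immutable` should be safe for them; the bucket name, key, and content type below are placeholders, rewriting metadata means one copy-in-place per object (so realistically a batch job), and objects over 5 GB would need a multipart copy instead:

```python
import boto3

# Rewrite an object's Cache-Control in place so the CDN can treat it as
# immutable. Bucket, key, and content type are placeholder assumptions.
s3 = boto3.client("s3")
key = "nar/<some-hash>.nar.xz"   # hypothetical key

s3.copy_object(
    Bucket="nix-cache",
    Key=key,
    CopySource={"Bucket": "nix-cache", "Key": key},
    MetadataDirective="REPLACE",   # REPLACE drops old metadata, so resupply it
    CacheControl="public, max-age=31536000, immutable",
    ContentType="application/x-nix-nar",   # guessed content type
)
```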

edit: I reread the thread and of course we already have a CDN, which is already doing an amazing amount. Possibly one final, fully centralized cache layer between the CDN and the buckets would buffer things enough to lower the S3 tier even further. Since a CDN is distributed, it sees more misses than a single unified layer would.

8 Likes