The NixOS Foundation's Call to Action: S3 Costs Require Community Support

I think we should seriously consider running and/or organizing our own infrastructure here. I’m not a fan of being dependent on the goodwill of corporations - they do fundamentally exist to turn a profit, and “paying for a FOSS project’s storage” doesn’t generally do that, so eventually the tap runs dry. In this case we got advance notice, but if we switch over to another sponsored provider, there’s really no certainty that we won’t have much worse luck next time.

This amount of storage is well within range of what can be organized at small scale, and 3TB of transfer is essentially nothing; even a $5 VPS gives you that amount of traffic for free nowadays. At the very least we should probably be paying for a cheap storage/hosting provider with some sort of hard contractual obligation - whether it’s something like Backblaze B2 or just a plain ol’ server with lots of hard drives at Hetzner/OVH/Leaseweb/etc.

I don’t think decentralized storage is a viable option for a backing store; availability is a notorious problem for that, and the likelihood of data loss is very high.

Edit: A typical price for non-AWS/Azure/GCP storage at various providers is $5/TB/month.

11 Likes

No (to running an S3 replacement on our own). We barely manage to maintain the infra we already have, and in one month we’re supposed to start maintaining this as well? If it goes down for an hour, no one will be able to pull anything from cache.nixos.org (except whatever happens to be cached by Fastly, I guess).

14 Likes

This is why I also mentioned options like Backblaze B2; they charge $5 per TB per month for storage and $10 per TB of traffic, if I’m not mistaken. For 500TB of data and 3TB of monthly transfer, that works out to around $2500 per month, which is significantly less than the current $9000/month estimate.

The “running your own infrastructure” option is for the case where costs need to be reduced even further than that.

Of course there’s no reason we can’t move to something like B2 for an immediate cost reduction, and then in the long term work out something more sustainable, without the “shutdown in one month” pressure.

8 Likes

Also, something that just occurred to me: we may be able to significantly reduce transfer-out costs by routing traffic through AWS Lightsail (their VPS service). If I’m not mistaken, traffic between S3 and Lightsail is free, and Lightsail itself has cheaper egress.

3 Likes

We barely manage to maintain the infra we already have.

It sounds like you’re overworked: have you ever publicly called for volunteers to join the infrastructure team?
I think managing NixOS infrastructure is genuinely cool, and given that a large portion of NixOS users do sysadmin work either professionally or as a hobby, it shouldn’t be too hard to find someone.

7 Likes

https://github.com/NixOS/foundation/issues/79

3 Likes

As for this S3 thread, I don’t think good enough reliability and availability is achievable with just a couple of community members working in their free time. And even if we discount the human work, it still won’t be close to free (hardware, probably not runnable “at home”, etc.).

3 Likes

Just a small clarification: it seems it’s 3TB/day in egress.

Ah right, I misunderstood. It’d still end up considerably cheaper than AWS, though - around $3400 per month assuming 500TB of data (500TB × $5 for storage plus roughly 90TB × $10 for egress).

3 Likes

So, I don’t have a great understanding of how the cache system works beyond “it caches built binaries for common platforms”, but wouldn’t the solution to data loss be to just reintroduce said binary the next time someone builds it, or whenever Hydra gets to it?

2 Likes

The problem is that there are a lot of historical builds in the binary cache; the binary cache doesn’t build things on-demand, but rather rebuilds all the (eligible) packages every time the nixpkgs channel gets updated - so that the binaries are there already once a client requests them.

This means that it will never revisit past builds unless someone explicitly creates a task for it. It would probably be possible to retroactively do a ‘clean’ rebuild of past nixpkgs versions, but to try and reproduce the entire history of nixpkgs (which is what currently lives in the binary cache, more or less) would require a lot of computing power. I don’t know whether that’s viable in practice.

4 Likes

So, a quick survey of the “S3-compatible” providers I’m aware of, priced monthly (the S3 pricing comes from Backblaze’s comparison page; I have not verified it), with rough totals worked out below the list:

  • AWS S3: $26/TB storage, $90/TB traffic, no minimum storage
  • Cloudflare R2: $15/TB storage plus ‘operation fees’, free traffic; possibly sponsorable, but this would create another sponsor dependency
  • Backblaze B2: $5/TB storage, $10/TB traffic, no minimum storage, supposedly free migration from S3
  • Wasabi: $6/TB storage, free(?) traffic, 90 days minimum storage
  • Storj: $4/TB storage plus ‘segment fee’, $7/TB traffic, no minimum storage, supposedly free migration from S3
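
For a rough sense of the monthly totals, here’s a back-of-the-envelope sketch using the per-TB prices above and the 500TB / ~3TB-per-day figures quoted earlier in the thread; it ignores operation/segment fees, minimum-storage terms, and whatever tiering the current bucket actually uses, so treat the numbers as ballpark only:

```python
# Rough monthly cost comparison using the per-TB prices quoted above.
# Volumes are the thread's estimates (500 TB stored, ~3 TB/day egress),
# not measured figures; fees like R2 operations or Storj segments are ignored.
STORED_TB = 500
EGRESS_TB_PER_MONTH = 3 * 30

providers = {
    # name: (storage $/TB/month, egress $/TB)
    "AWS S3 (Standard)": (26, 90),
    "Cloudflare R2": (15, 0),
    "Backblaze B2": (5, 10),
    "Wasabi": (6, 0),
    "Storj": (4, 7),
}

for name, (storage_price, egress_price) in providers.items():
    total = STORED_TB * storage_price + EGRESS_TB_PER_MONTH * egress_price
    print(f"{name:20s} ~${total:,.0f}/month")
```

(The AWS line is the Standard-class list price, which comes out well above the ~$9000/month estimate mentioned earlier, so again: ballpark only.)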

I’d say that these seem like a viable short-term solution to migrate to, so that we can hit the deadline of a month, while also significantly cutting expenses; after that, it’s probably worth revisiting whether something more cost-efficient is possible, and what’s possible in terms of infrastructure maintenance capacity after the infra team gets split?

5 Likes

Had a couple ideas while responding to someone on HN who also asked the “can we just rebuild it all?” question:

  • I guess if there’s a bit of Pareto in the distribution of build output sizes, there might be meaningful savings in rebuilding the packages that take up the most storage and only paying to move the small stuff.
  • It might also be possible to triage the sources in the cache: re-fetch everything that is still available from the original source or a mirror, and then only export the ones that either couldn’t be fetched, or could be fetched but whose contents no longer match the recorded hash (rough sketch below).
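
A minimal sketch of what that triage pass could look like, assuming a hypothetical dump of source URLs and their recorded hashes (the real list would have to come from the cache/nixpkgs metadata, and this ignores mirrors entirely):

```python
# Hypothetical triage pass over fixed-output ("fetch") sources:
# re-download each source from its recorded upstream URL, hash it, and
# only keep (i.e. pay to export) the ones we can no longer reproduce.
import hashlib
import urllib.request

def sha256_of_url(url: str) -> str | None:
    """Download `url` and return its sha256 hex digest, or None if the fetch fails."""
    try:
        with urllib.request.urlopen(url, timeout=60) as resp:
            h = hashlib.sha256()
            for chunk in iter(lambda: resp.read(1 << 20), b""):
                h.update(chunk)
            return h.hexdigest()
    except Exception:
        return None

def needs_export(upstream_url: str, expected_sha256: str) -> bool:
    """True if the source must be copied out of the cache: it is either
    unfetchable or the upstream contents no longer match the recorded hash."""
    actual = sha256_of_url(upstream_url)
    return actual is None or actual != expected_sha256

# sources.tsv is a hypothetical "url<TAB>sha256" dump of the cache metadata.
if __name__ == "__main__":
    with open("sources.tsv") as f:
        for line in f:
            url, expected = line.rstrip("\n").split("\t")
            if needs_export(url, expected):
                print(f"must export: {url}")
```
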
3 Likes

Hi! I apologize for not having read through the entire thread yet; from what I’ve seen, the prevailing sentiment in this thread might not necessarily align with this proposal, but I thought I’d mention it anyway.

When I was at Amazon, I got AWS to cover crates.io’s hosting costs. AWS has since set up an open-source sponsorship program that gives out AWS credits with a (last I checked) not-unreasonable application process, and it might be worthwhile to apply if you haven’t already.

(I am a little sick right now, so brain don’t work good.)

12 Likes

An admittedly biased point about Wasabi: costs increase substantially if you start egressing more than 1× your bytes at rest per month.

On the other hand, the per-“segment” costs with Storj are essentially per-object charges, but can be basically eliminated with some packing (e.g. zipping up lots of small files together). The Storj platform natively understands zips and can pull individual files out of zips without downloading the whole zip for this reason.

Since the large-scale data storage is partially intended to benefit research, consider partnering with an academic institution. Many institutions qualify for data egress waivers: Data egress waiver available for eligible researchers and institutions | AWS Public Sector Blog

In my experience, institutions use substantially less than their waiver limit across the whole organization, so the data transfer would basically be free for them. This transition could be done with virtually no technical measures at all (i.e., just transferring ownership of the AWS organization to an institution) and would then allow migration off AWS in the future basically for free. Some university research group may even be willing to host the data on a storage cluster.

Another thought: if the data is largely duplicative at the bytestream level across objects, would decompressing and recompressing (as one stream) the data before egress save on substantial costs for the transition?

8 Likes

I think that regardless of which service provider we end up dealing with, and whether or not we move away from AWS, this is a wake-up call that we’re going to need to mitigate some of these problems with good old-fashioned engineering and elbow grease. This would be true even if we ran our own servers somewhere. I’ve written several Nix binary caches, worked on a prototype for heavily optimizing the Fastly cache that serves cache.nixos.org (when I worked there), and have some ideas about CI systems and whatnot. Some notes:

  • The overwhelming cost is storage, not transfer costs. But that’s actually because the CDN is effective in practice, yet there is nothing to help storage. That’s the “this airplane has red dots all over it” effect.
    • Caching is still probably not as good as it could be, and the overall global hit-rate efficiency could be pushed higher, last I remember. I had some solutions to attack this, but it’s detailed work, and given the breakdown above, $900/mo in transfer is better than I expected, I think? (Note that hit rate is not the same as cached savings; despite the numbers being 1,500TiB/29TiB, the global efficiency isn’t nearly that good, IIRC.)
    • Fastly is probably a reliable partner and I somewhat doubt NixOS is a major contributor to their network profile, so leaning into the cache further is probably OK. (I don’t work there anymore so don’t take this as an agreement. IANAL.)
    • A significant amount of visible, end-user performance comes from p99 latency on .narinfo files, among other things. (More on this later.) So a cache, and more importantly a global cache, is always going to be needed, ignoring storage costs; it still needs to be factored into the design.
  • The cache has a very long tail of old data and a reasonable warm/cold ratio, with a large amount of warm data in an absolute sense, and a reasonable overall bandwidth rate: the 95th percentile is about 1k requests/minute at 2.6GiB/minute, so this is well within reason to handle on a single server, I think.
  • Nix caches tend to have a particular storage profile:
    • Reads are very frequent and hot. Most importantly, reads are latency sensitive, and must be served immediately.
    • The hot path often concerns narinfo files, which are very small. This is a really important case, because it’s the hot path in both the 200 and the 404 case (see the sketch after this list).
    • Storage is very expensive over time, from .nar files.
    • Writes can be frequent in an absolute sense, but in my experience they generally follow the usual “90/10” read/write skew. They are also not latency sensitive. Therefore:
      • Batch uploads as aggressively as possible.
      • Just overall, aggressively trade latency for aggregate throughput and storage size wins.
  • A Nix binary cache actually has an extremely simple interface, and in this case that’s (mostly) for the better, not worse. Some improvements could be made, but the design allows a bit of freedom while staying simple.
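
To illustrate the narinfo hot-path point above, here’s a toy sketch of a read path with a small in-memory cache that also remembers 404s (negative caching), so that both the 200 and the 404 case are answered without touching backing storage. The names and the backing_store interface are hypothetical, not how cache.nixos.org is actually implemented:

```python
# Toy narinfo read path: answer both hits (200) and known misses (404)
# from memory, and only fall through to the (slow, expensive) backing
# store on a cold key. The `backing_store` interface is hypothetical.
import time

_MISSING = object()          # sentinel: "we know this narinfo does not exist"
_cache: dict[str, tuple[float, object]] = {}
_TTL_HIT = 24 * 3600         # narinfos are immutable once published
_TTL_MISS = 60               # a miss may become a hit after the next channel bump

def get_narinfo(store_hash: str, backing_store) -> bytes | None:
    """Return the .narinfo body, or None for a 404."""
    now = time.monotonic()
    entry = _cache.get(store_hash)
    if entry is not None:
        expires, value = entry
        if now < expires:
            return None if value is _MISSING else value

    # Cold path: one round trip to the backing store (S3, a dedup index, ...).
    body = backing_store.fetch(f"{store_hash}.narinfo")   # hypothetical API
    if body is None:
        _cache[store_hash] = (now + _TTL_MISS, _MISSING)  # negative cache
        return None
    _cache[store_hash] = (now + _TTL_HIT, body)
    return body
```

In the real setup Fastly plays this role globally; the point is just that the 404 path deserves the same caching treatment as the 200 path.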

Some other things:

  • While the absolute performance of cloud components has normalized (e.g. storage is faster and closer to CPU speeds than ever), it has not normalized on a cost basis. Compute is vastly cheaper than storage on most cloud providers, even though the relative performance gap between them has closed since, say, 2010. And critically, we are bound by cost, not performance.
  • We don’t just need absolute numbers. Performance and capacity planning actually requires the distributions behind the numbers. For example, it is important to know what the warm/cold split of the data is, not just at some moment in time, but over time as well. Was the split 90/10 one year ago and is it 75/25 now? That’s a big change in direction, and there are a lot of things we can’t understand without knowing it.
    • Another example: what’s the p95 vs p99 object size served by the cache? This is actually a huge piece of the puzzle. There’s something like a fixed 2GiB limit, last I remember; I suspect the p99 is somewhere around 1.1 to 1.5GB due to closures like GHC and LLVM. But the p99 case is also probably the one dominated by transfer time; therefore a more expensive read path might still be profitable in these cases (see the sketch after this list).
    • This was one of the things I wanted to add to the cache: a lot more metrics tracking, because it’s really important for questions like this.
    • We also need to be able to sample the cache more effectively. I don’t know how to tackle this, but it’s needed for better investigations.
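
As a concrete example of the kind of distribution data that would help, here’s a tiny sketch that computes size percentiles from a sample; the input format (one object size in bytes per line, e.g. pulled from an S3 inventory report or a log sample) is an assumption:

```python
# Nearest-rank percentiles of object sizes, one size in bytes per line on stdin.
import math
import sys

def percentile(sorted_vals: list[int], p: float) -> int:
    """Nearest-rank percentile of an already-sorted list."""
    idx = max(0, math.ceil(p / 100 * len(sorted_vals)) - 1)
    return sorted_vals[idx]

sizes = sorted(int(line) for line in sys.stdin if line.strip())
for p in (50, 95, 99):
    print(f"p{p}: {percentile(sizes, p) / 2**20:.1f} MiB")
```
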

So the key takeaways are:

  • Reads need low latency.
  • Reads are often, but not always, small and frequent (.narinfo files).
    • When I say small, I mean something like 1KB or less.
    • Small files are notoriously difficult to handle for most solutions, so this is really important. But also, the cache should be serving 99.9% of all narinfos directly. Even then, cold reads can really hurt in general, in my experience.
  • The overall absolute bandwidth needs are actually relatively minimal, thanks to the cache.
  • Writes can be very high latency (seconds or even minutes before artifacts materialize in the cache is OK; it doesn’t matter if the latest llvm build takes 5 extra minutes to appear after a 1 hour build)
  • The absolute file sizes, however, are extremely aggressive in practice for the kinds of systems people run today and the kind of closures we produce.
  • If we have to pay money, or someone does, we need to apply some old school principles like this. Arbitraging your compute for better storage will be worth it.

I’ve done some experiments on this and I could write more about it. Notably, I believe algorithms like FastCDC are absolutely practical enough to run in “real time” on the write path of a Nix binary cache, and more importantly they have tremendous benefits; some experiments I did with an almost-pure-Rust solution gave me somewhere upwards of 80% deduplication, though this is hard to reproduce reliably without taking samples of the Nix cache over time. This sort of approach requires an index that has to be constructed on the write path and then used on the read path, but only if the narinfo was found by a client, and only if it’s uncached. So it’s not that bad in practice for a lot of uses, I think.
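
For a rough idea of what that write path looks like, here’s a simplified sketch. A naive fixed-size chunker stands in for FastCDC (a real implementation would use content-defined boundaries, which is what keeps deduplication effective when bytes shift), and the index layout is made up for illustration:

```python
# Simplified chunk-and-dedup write path for an uploaded NAR.
# Fixed-size chunking stands in for FastCDC here; content-defined
# chunking is what you would actually use, since it keeps chunk
# boundaries stable when data is inserted or shifted.
import hashlib

CHUNK_SIZE = 64 * 1024  # FastCDC would target an *average* chunk size instead

def store_nar(nar_bytes: bytes, chunk_store: dict[str, bytes]) -> list[str]:
    """Split a NAR into chunks, store only previously-unseen chunks,
    and return the ordered list of chunk hashes (the per-NAR index
    that the read path later uses to reassemble the file)."""
    index = []
    for off in range(0, len(nar_bytes), CHUNK_SIZE):
        chunk = nar_bytes[off:off + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in chunk_store:          # deduplication happens here
            chunk_store[digest] = chunk
        index.append(digest)
    return index

def read_nar(index: list[str], chunk_store: dict[str, bytes]) -> bytes:
    """Reassemble a NAR from its index; only needed on a CDN miss."""
    return b"".join(chunk_store[d] for d in index)
```

A toy like this obviously can’t show the interesting part (how often chunks repeat across store paths, which is where the ~80% figure comes from); content-defined chunking and real cache samples are needed for that.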

And on top of that, given the other parts of the performance profile, you can make other things very simple here if you can forecast correctly. For example, given the above numbers, I strongly suspect that a custom bit of software to do the above deduplication (a single multi-threaded concurrent server), running in an active leader-standby setup to replicate the indices, with just two servers, could easily provide very high yearly uptime. The dirty secret about cloud stuff is that it can be very reliable in small numbers; you can easily run a couple of VPSs for years at a time with minimal interruption. Two big servers are much simpler than 10 small ones. If you have storage under control, you could easily run these two servers on standard commodity VPS providers for something like $300 USD, thanks to the fact that egress is mostly mitigated by the Fastly cache; Hetzner would give us 20TB of traffic for free, for example, with a 10Gbit link, and charge only about €1 per additional TB. That completely covers the server-to-CDN transfer costs in our case, for free, and the compute for standby replication is dwarfed by storage costs even then.

I could keep going. However, this is all engineering. The problem is that right now we have a sinking ship, and I’m afraid I don’t know how to handle that part reasonably. Whatever we do, though, once we abandon this ship we’re going to need to actually put some resources into this, because the current path isn’t going to be easily sustainable. It will be difficult given the amount of resources available, but I think it can be done.

26 Likes

As an early idea for a long-term solution (not a short-term one!), I’d suggest considering Tahoe-LAFS. It’s an actively-replicated distributed storage system with much stronger availability properties than something like IPFS or BitTorrent.

Advantages:

  • Distributed across arbitrary low-trust peers of arbitrary size; this would effectively allow third parties to sponsor storage without us being dependent on their continued sponsorship - if any sponsor pulls out, the content just gets redistributed across a couple of others
  • Cryptographic integrity is built into the addressing; this means that the storage nodes cannot tamper with (or even read) the content that they are storing, which reduces the level of trust necessary and therefore opens up possibilities of more storage sponsors
  • Essentially RAID-over-the-network; uses erasure coding to have efficient redundancy against data loss, and replication is an active task (content blocks are pushed to storage nodes) so you are not dependent on people deciding to ‘seed’ something, therefore it is not subject to rot of unpopular files
  • Can deal with any size storage node; there’s no need to match the size of different nodes (unlike many other RAID-like systems), and even small personal nodes provided by individual contributors would be useful, as long as they are reliably online.

Disadvantages:

  • There is no explicit ‘delete’ functionality, only (optional) time-based expiry and renewal. This is not likely to matter to us, given our “never delete anything” use case.
  • Performance can be variable; I don’t think this is an issue for us, given that we have a CDN fronting it.
  • Access requires something that speaks the protocol; so we’d need to proxy requests through a centralized server for retrieval (though that server doesn’t need to persistently store anything). That server would also be responsible for triggering eg. repairs of files for which some ‘shares’ (copies) have been lost.
  • I think it doesn’t protect against a malicious storage node pretending to have a file but not actually having it, unless you do regular integrity checks (an option that is available out of the box).
  • The only one I can think of that might be a problem for us: there’s no real resource accounting, and any participating node can store data in the storage cluster without it being attributable to them. It may be possible to fix this by running a modified version of the software that rejects all storage requests except for those from the central coordination node; I don’t know whether this is available as a configurable option out-of-the-box.

With a Tahoe-LAFS setup, the only centralized infrastructure that would need to be maintained by the project would be that central coordination server. It would do a fair amount of traffic (it’d essentially be the ‘gateway’ to the storage cluster), but require little to no persistent storage. This would be very, very cheap to maintain, both in terms of money and in terms of sysadmin maintenance.

I feel like it would be worth trialing this in parallel to the existing infrastructure; gradually directing more and more backend requests to the Tahoe cluster and then falling back to the existing S3 storage if they fail. That way we can test in a low-risk way whether a) we actually have sufficient sysadmin capacity to keep this running in the longer term, and b) the software itself can meet our needs.
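
A minimal sketch of what that trial gateway could look like, assuming the standard Tahoe-LAFS web gateway (GET /uri/&lt;capability&gt;) and a hypothetical index mapping cache keys to capabilities; the hostnames are placeholders:

```python
# Toy read path for the trial described above: try the Tahoe-LAFS
# cluster first, fall back to the existing S3 bucket on any failure.
# The capability index (cache key -> Tahoe read cap) and the URLs are
# assumptions for illustration only.
import urllib.error
import urllib.request

TAHOE_GATEWAY = "http://tahoe-gateway.internal:3456"   # placeholder host
S3_FALLBACK = "https://nix-cache.s3.amazonaws.com"     # placeholder bucket URL

def fetch(key: str, cap_index: dict[str, str]) -> bytes:
    """Fetch an object by its cache key, preferring Tahoe over S3."""
    cap = cap_index.get(key)
    if cap is not None:
        try:
            with urllib.request.urlopen(f"{TAHOE_GATEWAY}/uri/{cap}", timeout=10) as r:
                return r.read()
        except (urllib.error.URLError, OSError):
            pass  # fall through to S3; also a good place to count failures
    with urllib.request.urlopen(f"{S3_FALLBACK}/{key}", timeout=30) as r:
        return r.read()
```
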

Edit: I’d be happy to coordinate the technical side of this, of course, insofar my health allows.

6 Likes

That’s very misleading, and assumes the “S3 Standard” storage class. Other storage classes are available. The “Archive Instant Access” tier (part of “Intelligent Tiering”) is $4/TB per month, for example.

It was already mentioned in an earlier reply, but if it’s not already enabled, turning on Intelligent-Tiering would be a no-brainer.
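
For reference, a minimal sketch of what enabling that could look like with boto3; the bucket name is a placeholder, only the bucket owners could actually run this, and note that this call replaces any existing lifecycle configuration rather than appending to it:

```python
# Sketch: add a lifecycle rule that transitions all objects to the
# INTELLIGENT_TIERING storage class. Bucket name is a placeholder.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="nix-cache",  # placeholder; requires owner credentials
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "move-everything-to-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to every object
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)
```
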

4 Likes

Is there any way we can be more specific about when this “sometimes” is? It seems to me that some data is extremely valuable and a lot of the data is probably not valuable at all, if it can be regenerated as needed. Would I be correct in supposing, for example, that if only the “fetch” derivations were kept and everything else with infrequent access were pruned from the cache, everything would still be buildable? The only impact would then be bad build times for lesser-used packages (for people who aren’t already using a third-party cache host for their more niche projects).

4 Likes