As for this S3 thread, I don't think good enough reliability and availability is achievable by just a couple of community members in their free time. And even if we discount human work, it still won't be close to free (hardware, probably not runnable "at home", etc.)
Just a small clarification: it seems it's 3 TB/day in egress.
Ah right, I misunderstood. It'd still end up considerably cheaper than AWS, though - around $3,400 per month assuming 500 TB of data.
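For transparency, here's the back-of-envelope arithmetic behind that figure (a sketch assuming Backblaze B2-style rates of $5/TB stored and $10/TB egressed; the real bill would also include request fees):

```python
# Hypothetical monthly bill at B2-style rates (request fees ignored).
STORAGE_USD_PER_TB = 5    # $/TB stored per month (assumed rate)
EGRESS_USD_PER_TB = 10    # $/TB transferred out (assumed rate)

storage_tb = 500          # total cache size
egress_tb = 3 * 30        # ~3 TB/day of egress

monthly_usd = storage_tb * STORAGE_USD_PER_TB + egress_tb * EGRESS_USD_PER_TB
print(monthly_usd)  # 3400
```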
So, I don't have a great understanding of how the cache system works beyond "it caches built binaries for common platforms", but wouldn't the solution to data loss be to just reintroduce said binary the next time someone builds it, or whenever Hydra hits it?
The problem is that there are a lot of historical builds in the binary cache; the binary cache doesn't build things on-demand, but rather rebuilds all the (eligible) packages every time the nixpkgs channel gets updated - so that the binaries are there already once a client requests them.
This means that it will never revisit past builds unless someone explicitly creates a task for it. It would probably be possible to retroactively do a "clean" rebuild of past nixpkgs versions, but trying to reproduce the entire history of nixpkgs (which is what currently lives in the binary cache, more or less) would require a lot of computing power. I don't know whether that's viable in practice.
So, a quick survey of "S3-compatible" providers that I am aware of, on a monthly basis (pricing comes from Backblaze's comparison page; I have not verified it):
- AWS S3: $26/TB storage, $90/TB traffic, no minimum storage
- Cloudflare R2: $15/TB storage plus "operation fees", free traffic; possibly sponsorable, but this would create another sponsor dependency
- Backblaze B2: $5/TB storage, $10/TB traffic, no minimum storage, supposedly free migration from S3
- Wasabi: $6/TB storage, free(?) traffic, 90 days minimum storage
- Storj: $4/TB storage plus a "segment fee", $7/TB traffic, no minimum storage, supposedly free migration from S3
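To make the list comparable, here's a hedged sketch that plugs the quoted per-TB rates into one scenario (500 TB stored, ~90 TB/month egress); operation/segment fees and minimum-storage terms are deliberately ignored, so treat these as lower bounds:

```python
# (storage $/TB/mo, egress $/TB) from the quoted rates above; fees ignored.
rates = {
    "AWS S3":        (26, 90),
    "Cloudflare R2": (15, 0),
    "Backblaze B2":  (5, 10),
    "Wasabi":        (6, 0),
    "Storj":         (4, 7),
}

storage_tb, egress_tb = 500, 90
monthly = {name: storage_tb * s + egress_tb * e for name, (s, e) in rates.items()}
for name, usd in sorted(monthly.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${usd:,}/mo")
```

With these rates and this scenario, AWS S3 comes out around $21,100/month versus $3,400 for Backblaze B2.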
I'd say these seem like viable short-term options to migrate to, so that we can hit the one-month deadline while also significantly cutting expenses; after that, it's probably worth revisiting whether something more cost-efficient is possible, and what's feasible in terms of infrastructure maintenance capacity after the infra team gets split?
Had a couple ideas while responding to someone on HN who also asked the "can we just rebuild it all?" question:
- I guess if there's a bit of a Pareto distribution in build output sizes, there might be meaningful savings in rebuilding the packages that take up the most storage and paying to move only the small stuff.
- It might also be possible to triage the sources in the cache: re-fetch all that are still available from the original source or a mirror, and then export only the ones that either couldn't be fetched, or could be fetched but whose contents no longer match the hash.
Hi! I apologize for not having read through the entire thread yet, and from what I've seen, the prevailing sentiment in this thread might not necessarily align with this proposal, but I thought I'd mention it anyway.
When I was at Amazon, I got AWS to cover crate.io's hosting costs. AWS has since set up an open-source sponsorship program that gives out AWS credits with a (last I checked) not-unreasonable application process, and it might be worthwhile to apply if you haven't already.
(I am a little sick right now, so brain don't work good.)
An admittedly biased point about Wasabi - costs increase substantially if you start egressing more than 1x bytes at rest per month.
On the other hand, the per-"segment" costs with Storj are essentially per-object charges, but can be basically eliminated with some packing (e.g. zipping up lots of small files together). The Storj platform natively understands zips and can pull individual files out of a zip without downloading the whole thing, for exactly this reason.
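To illustrate the packing idea (a sketch; file names and contents are made up), storing entries uncompressed keeps each one retrievable with a single ranged read:

```python
import io
import zipfile

# Pack many tiny files into one archive so a per-object "segment" fee
# is paid once per archive rather than once per file. ZIP_STORED keeps
# entries uncompressed, so a single entry can later be served straight
# out of the archive via a ranged read.
def pack(files: dict[str, bytes]) -> bytes:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as z:
        for name, data in files.items():
            z.writestr(name, data)
    return buf.getvalue()

archive = pack({
    "aaaa.narinfo": b"StorePath: /nix/store/aaaa-example\n",
    "bbbb.narinfo": b"StorePath: /nix/store/bbbb-example\n",
})

# Reading one member back does not require unpacking anything else.
with zipfile.ZipFile(io.BytesIO(archive)) as z:
    print(z.read("aaaa.narinfo"))
```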
Since the large-scale data storage is partially intended to benefit research, consider partnering with an academic institution. Many institutions qualify for data egress waivers: Data egress waiver available for eligible researchers and institutions | AWS Public Sector Blog
In my experience, institutions use substantially less than their waiver limit across the org so the data transfer will basically be free for them. This transition could be done with virtually no technical measures at all (i.e., just re-owning the org to an institution) and would then allow migration off AWS in the future basically for free. Some university research group may even be willing to host the data on a storage cluster.
Another thought: if the data is largely duplicative at the bytestream level across objects, would decompressing and recompressing (as one stream) the data before egress save on substantial costs for the transition?
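A quick way to sanity-check that idea (a toy sketch: zlib's 32 KiB window is far too small for real multi-GB objects, where something like zstd with a long-range window would be needed, but it shows the effect):

```python
import random
import zlib

random.seed(0)
obj = random.randbytes(16 * 1024)  # stand-in for an incompressible .nar

# Compressing two identical objects separately pays full price twice;
# compressing them as one stream lets the second copy become cheap
# back-references into the first.
separately = sum(len(zlib.compress(o)) for o in (obj, obj))
together = len(zlib.compress(obj + obj))
print(separately, together)
```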
I think regardless of what service provider we end up dealing with, whether we move away from AWS or not, this is a wakeup call that we're going to need to mitigate some of these problems with good old-fashioned engineering and elbow grease. This is true even if we ran our own servers somewhere. I've written several Nix binary caches, worked on a prototype of heavily optimizing the Fastly cache that serves cache.nixos.org (when I worked there), and have some ideas about CI systems and whatnot. Some notes:
- The overwhelming cost is storage, not transfer. But that's actually because the CDN is effective in practice, yet there is nothing to help storage. That's the "this airplane has red dots all over it" (survivorship bias) effect.
- Caching is still probably not as good as it could be, and overall global hit-rate efficiency could be pushed higher, last I remember. I had some solutions to attack this, but it's detailed work, and with the given breakdown, $900/mo transfer is better than I expected, I think? (Note that hit rate is not the same as cached savings; despite the numbers being 1,500 TiB/29 TiB, the global efficiency isn't nearly that good IIRC.)
- Fastly is probably a reliable partner and I somewhat doubt NixOS is a major contributor to their network profile, so leaning into the cache further is probably OK. (I don't work there anymore, so don't take this as an agreement. IANAL.)
- A significant amount of visible, end-user performance comes from p99 latency on `.narinfo` files, among other things. (More on this later.) So a cache, and more importantly a global cache, is always going to be needed, ignoring storage costs; it still needs to be factored into the design.
- The cache has a very long tail of old data and a reasonable warm/cold ratio, with a large amount of warm data in an absolute sense, and a reasonable full-bandwidth rate. The 95th percentile is 1k requests/min at 2.6 GiB/min, so this is well within reason to handle on a single server, I think.
- Nix caches tend to have a particular storage profile:
- Reads are very frequent and hot. Most importantly, reads are latency sensitive, and must be served immediately.
- The hot path often concerns narinfo files, which are very small. This is a really important case, because it's the hot path in both the 200 and the 404 case.
- Storage is very expensive over time, from `.nar` files.
- Writes can be frequent in an absolute sense, but generally experience a "90/10" read/write skew in my experience. They are also not latency sensitive. Therefore:
- Batch uploads as aggressively as possible.
- Just overall, aggressively trade latency for aggregate throughput and storage size wins.
- A Nix binary cache actually has an extremely simple interface, and in this case that's (mostly) for the better, not worse. Some improvements could be made, but the design allows a bit of freedom while staying simple.
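A minimal sketch of the "batch writes aggressively" point above (all names and thresholds here are hypothetical; `flush_fn` stands in for whatever multi-object upload the backend offers):

```python
import time

class BatchUploader:
    """Buffer writes and flush them as one batch when either a byte
    threshold or an age threshold is hit. Writes tolerate seconds of
    latency, so trading freshness for fewer, larger uploads is fine."""

    def __init__(self, flush_fn, max_bytes=64 * 1024 * 1024, max_age_s=30.0):
        self.flush_fn = flush_fn      # called with a list of (key, data)
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self.pending, self.size, self.first_put = [], 0, None

    def put(self, key, data):
        if self.first_put is None:
            self.first_put = time.monotonic()
        self.pending.append((key, data))
        self.size += len(data)
        too_big = self.size >= self.max_bytes
        too_old = time.monotonic() - self.first_put >= self.max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self.pending:
            self.flush_fn(self.pending)
            self.pending, self.size, self.first_put = [], 0, None

batches = []
up = BatchUploader(batches.append, max_bytes=10, max_age_s=3600)
for i in range(5):
    up.put(f"obj{i}", b"abcd")  # 4 bytes each; flushes once >= 10 bytes
print([len(b) for b in batches])  # prints [3]
```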
Some other things:
- While absolute performance has normalized for components in cloud providers (e.g. storage is faster and closer to CPU speeds than ever), it has not normalized on a cost basis. Compute is vastly cheaper than storage on most cloud providers, even though the relative performance gap between them has closed since, say, 2010. And critically, we are bound by cost, not performance.
- We don't just need absolute numbers. Performance and capacity planning actually requires the distributions behind the numbers. For example, it is important to know the warm/cold data split not just at some moment in time, but over time as well. Was the split 90/10 one year ago and now it's 75/25? That's a big change in direction, and there are a lot of things we can't understand without knowing it.
- Another example: what's the p95 vs p99 object size served by the cache? This is actually a huge piece of the puzzle. There's something like a fixed 2 GiB limit last I remember; I suspect the p99 is somewhere around 1.1 to 1.5 GB due to closures like GHC and LLVM. But the p99 case is also probably the one dominated by transfer time; therefore a more expensive read path might still be profitable in these cases.
- One of the things I wanted to add to the cache was a lot more metrics tracking, because it's really important for questions like these.
- We also need to be able to sample the cache more effectively. I don't know how to tackle this, but it's needed for better investigations.
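On the p95/p99 question, the measurement itself is simple once you can sample object sizes from access logs; a sketch with synthetic numbers (the real distribution is exactly the unknown being discussed):

```python
# Nearest-rank percentile over a sample of served-object sizes.
def percentile(values, p):
    xs = sorted(values)
    idx = min(len(xs) - 1, round(p / 100 * (len(xs) - 1)))
    return xs[idx]

# Synthetic sample: mostly ~1 KB narinfos, some mid-size nars, a couple
# of huge closures (GHC/LLVM-sized). Real numbers would come from logs.
sizes = [1_000] * 90 + [50_000_000] * 8 + [1_400_000_000] * 2
print(percentile(sizes, 95), percentile(sizes, 99))  # 50000000 1400000000
```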
So the key takeaways are:
- Reads need low latency.
- Reads are often, but not always, small and frequent (`.narinfo` files). When I say small, I mean something like 1 KB or less.
- Small files are notoriously difficult to handle for most solutions, so this is really important. But also, the cache should be serving 99.9% of all narinfos directly. Even then, cold reads can really hurt in general, in my experience.
- The overall absolute bandwidth needs are actually relatively minimal, thanks to the cache.
- Writes can be very high latency (seconds or even minutes before artifacts materialize in the cache is OK; it doesn't matter if the latest `llvm` build takes 5 extra minutes to appear after a 1-hour build).
- The absolute file sizes, however, are extremely aggressive in practice for the kinds of systems people run today and the kind of closures we produce.
- If we have to pay money, or someone does, we need to apply some old school principles like this. Arbitraging your compute for better storage will be worth it.
I've done some experiments on this and could write more about it. Notably, I believe algorithms like FastCDC are absolutely practical enough to run in "real time" on the write path of a Nix binary cache, and more importantly have tremendous benefits; some experiments I did with an almost-pure-Rust solution gave me somewhere upwards of 80% deduplication, though this is hard to reproduce reliably without taking samples of the Nix cache over time. This sort of approach requires constructing an index that is then used on the read path, but only if the narinfo was found by a client, and only if it's uncached. So it's not that bad in practice for a lot of uses, I think.
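To make the CDC idea concrete, here's a deliberately simplified content-defined chunker (not real FastCDC: no gear table or normalized chunking, just a byte-fed rolling cut condition), enough to show why identical regions across nar files dedupe even when offsets shift:

```python
import hashlib

def chunks(data: bytes, mask=(1 << 12) - 1, min_size=2048, max_size=65536):
    """Cut where the low bits of a rolling value hit zero, so chunk
    boundaries depend on local content, not on absolute offsets."""
    out, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF
        size = i - start + 1
        if size >= max_size or (size >= min_size and (h & mask) == 0):
            out.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        out.append(data[start:])
    return out

def dedup_ratio(blobs):
    """Fraction of bytes saved by storing each distinct chunk once."""
    total = sum(len(b) for b in blobs)
    unique = {}
    for b in blobs:
        for c in chunks(b):
            unique[hashlib.sha256(c).digest()] = len(c)
    return 1 - sum(unique.values()) / total
```

Prepending bytes to a blob shifts every offset, yet after the first differing chunk the cut points resynchronize, so most chunks hash identically; fixed-size blocks would dedupe almost nothing in the same test.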
And on top of that, given the other parts of the performance profile, you can make other things very simple here if you can forecast correctly. I strongly suspect, for example, given the above numbers, that a custom bit of software doing the above deduplication (a singular multi-threaded concurrent server), in an active leader-standby setup to replicate the indices, with just two servers, could easily provide very high yearly uptime. The dirty secret about cloud stuff is that it can be very reliable in small numbers; you can easily run a couple of VPSs for years at a time with minimal interruption. Two big servers is much simpler than 10 small ones. If you have storage under control, you could easily run these two servers on standard commodity VPS providers for something like $300 USD, thanks to the fact that egress is mostly mitigated by the Fastly cache; Hetzner would give us 20 TB for free, for example, with a 10 Gbit link, at only about 1 EUR of overage per TB. That completely covers the server-to-CDN costs in our case, for free, and the compute with standby replication is dwarfed by storage costs, even then.
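As a sanity check on the egress claim (rates as quoted in this post, unverified; the ~29 TiB/month figure is the origin-to-CDN volume mentioned earlier):

```python
# Monthly origin egress cost on a Hetzner-style plan: ~20 TB included,
# roughly 1 EUR per additional TB (numbers from this thread, unverified).
def origin_egress_eur(tb_out, included_tb=20, eur_per_tb=1.0):
    return max(0.0, tb_out - included_tb) * eur_per_tb

print(origin_egress_eur(29))  # 9.0  (~29 TB/month back to the CDN)
print(origin_egress_eur(15))  # 0.0  (under the included allowance)
```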
I could keep going on. However, this is all engineering. The problem is that right now we have a sinking ship, and I don't know how to handle that reasonably, I'm afraid. Whatever we do, though, once we abandon this ship we're going to need to actually put some resources into this, because the current path isn't going to be easily sustainable. It will be difficult given the amount of resources available, but I think it can be done.
As an early suggestion for a long-term solution (not as a short-term solution!), I'd suggest considering Tahoe-LAFS. It's an actively-replicated distributed storage system with much stronger availability properties than something like IPFS or BitTorrent.
Advantages:
- Distributed across arbitrary low-trust peers of arbitrary size; this would effectively allow third parties to sponsor storage without us being dependent on their continued sponsorship: if any sponsor pulls out, the content just gets redistributed across the others
- Cryptographic integrity is built into the addressing; this means that the storage nodes cannot tamper with (or even read) the content that they are storing, which reduces the level of trust necessary and therefore opens up possibilities of more storage sponsors
- Essentially RAID-over-the-network; uses erasure coding to have efficient redundancy against data loss, and replication is an active task (content blocks are pushed to storage nodes) so you are not dependent on people deciding to āseedā something, therefore it is not subject to rot of unpopular files
- Can deal with any size storage node; thereās no need to match the size of different nodes (unlike many other RAID-like systems), and even small personal nodes provided by individual contributors would be useful, as long as they are reliably online.
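For intuition on the erasure-coding point: Tahoe-LAFS defaults to 3-of-10 shares (k needed out of n total), so redundancy costs n/k in raw storage while tolerating the loss of any n-k shares. A quick sketch of that arithmetic:

```python
def erasure_profile(k, n, file_tb):
    """Raw bytes stored and fault tolerance for k-of-n erasure coding."""
    return {
        "raw_stored_tb": file_tb * n / k,
        "tolerates_lost_shares": n - k,
        "expansion_factor": n / k,
    }

# Tahoe's default 3-of-10 encoding applied to a 500 TB cache:
profile = erasure_profile(3, 10, 500)
print(profile)
```

For comparison, plain 3x replication of the same 500 TB would store 1,500 TB yet tolerate only 2 lost copies.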
Disadvantages:
- There is no explicit "delete" functionality, only (optional) time-based expiry and renewal. This is not likely to matter to us, given our "never delete anything" usecase.
- Performance can be variable; I don't think this is an issue for us, given that we have a CDN fronting it.
- Access requires something that speaks the protocol, so we'd need to proxy requests through a centralized server for retrieval (though that server doesn't need to persistently store anything). That server would also be responsible for triggering e.g. repairs of files for which some "shares" (copies) have been lost.
- I think it doesn't protect against a malicious storage node pretending to have a file without actually having it, unless you do regular integrity checks (which are available as an option out of the box).
- The only one I can think of that might be a problem for us: there's no real resource accounting, and any participating node can store data in the storage cluster without it being attributable to them. It may be possible to fix this by running a modified version of the software that rejects all storage requests except those from the central coordination node; I don't know whether this is available as a configurable option out-of-the-box.
With a Tahoe-LAFS setup, the only centralized infrastructure that would need to be maintained by the project would be that central coordination server. It would see a fair amount of traffic (it'd essentially be the "gateway" to the storage cluster), but require little to no persistent storage. This would be very, very cheap to maintain, both in terms of money and in terms of sysadmin time.
I feel like it would be worth trialing this in parallel to the existing infrastructure; gradually directing more and more backend requests to the Tahoe cluster and then falling back to the existing S3 storage if they fail. That way we can test in a low-risk way whether a) we actually have sufficient sysadmin capacity to keep this running in the longer term, and b) the software itself can meet our needs.
Edit: I'd be happy to coordinate the technical side of this, of course, insofar as my health allows.
That's very misleading, and assumes the "S3 Standard" storage class. Other storage classes are available. The "Archive Instant Access" tier (part of "Intelligent Tiering") is $4/TB per month, for example.
It was already mentioned in an earlier reply, but if it's not already enabled, turning on Intelligent Tiering would be a no-brainer.
Is there any way we can be more specific about when this "sometimes" is? It seems to me that some data is extremely valuable and a lot of the data is probably not valuable at all, if it can be regenerated as needed. Would I be correct in supposing, for example, that if, hypothetically, only the "fetch" derivations were kept and everything else with infrequent access were pruned from the cache, everything would still be buildable, and the only impact would be bad build times for lesser-used packages (for people who aren't already using a third-party cache host for their more niche projects)?
We definitely need more support for distributed, content-addressed systems as substituters. Candidates include Bittorrent, IPFS, Hypercore, Eris. Imagine if every NixOS config seeded content-addressed paths via bittorrent.
Please keep in mind that these systems do not provide availability guarantees. They are distribution systems, not storage systems, and therefore do not address the problem of durable long-term storage of the binary cache (and thus do not solve the problem that we're dealing with here right now).
Something like this paragraph worries me. The problem with these sorts of things is metastable and cascading failures. Data replication and resharding are already extremely difficult to achieve reliably in existing systems operating at PB-and-beyond scale. I don't think we want potential storage domains to go completely missing and cause issues like this without careful operator oversight. (That's assuming we want to actually store large data volumes and not throw stuff away/archive it.)
This hints at something, which is that Tahoe does solve a set of problems, but they may not be ours. The coordination could be cheap to run, yes, but that isn't the only consideration we have, and it may even be considered a coincidence if none of the other problems really overlap. Our problem isn't really untrusted storage. It's just having storage in general.
Tail performance in latency and throughput is critical to the usability of the binary cache IMO; the CDN is part of that, but not all of it. See for example nixos-org-configurations/212. I've in the past seen reports from users in distant locations like Shanghai and Singapore regularly reporting multi-second TTFBs without features like Shielding. That kind of latency destroys everything, and in some cases causes timeouts in some software; a timeout is effectively the same thing to a user as "this file doesn't exist" or "this file was deleted by the operator in a data loss accident." They're operationally different, but morally the same from a user/operator/SLO/I-paid-you-money perspective.
The CDN works so well that it's easy to forget it's there. Nobody thinks forest fires are a problem when you prevent them in the first place, after all. It's a combination of factors. One is that S3 is actually a really good and reliable product. Is it expensive as hell? Yes. But it's good, it's fast, and it has good latency to clients close to the bucket. Serving S3 files quickly is a well-explored problem; that's why it works so well.
(One latency performance factor that's unexplored IMO is the fact that the design of narinfo files incurs a form of HOL blocking. When you want to download `foo.nar`, you need all the dependencies, but you don't know them until you have recursively traversed and downloaded every dependent `.narinfo` file, and that's a bunch of small latency-sensitive files to grab hold of.)
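The shape of that traversal can be sketched as follows (the hypothetical `fetch_narinfo` stands in for an HTTP GET of a `.narinfo`; the point is that each level of the dependency graph costs another latency-bound round of requests, because references are only learned from the previous level):

```python
def closure(root, fetch_narinfo):
    """Breadth-first walk of narinfo references. Requests within a level
    can go out in parallel, but levels are inherently sequential: you
    cannot ask for a dependency before some narinfo has named it."""
    seen, frontier, rounds = set(), {root}, 0
    while frontier:
        rounds += 1  # one latency-bound round trip per dependency level
        next_frontier = set()
        for path in frontier:
            seen.add(path)
            next_frontier.update(fetch_narinfo(path)["references"])
        frontier = next_frontier - seen
    return seen, rounds

# Toy dependency graph (made-up store path names):
graph = {
    "foo": ["glibc", "openssl"],
    "openssl": ["glibc"],
    "glibc": [],
}
paths, rounds = closure("foo", lambda p: {"references": graph[p]})
print(sorted(paths), rounds)  # ['foo', 'glibc', 'openssl'] 2
```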
If "performance is variable" means "randomly introduced spikes of 100 ms in the backend from origin to cache", OK: that would blow, like, 50% or more of your entire intra-stack latency budget in a well-oiled distributed system at a relatively large company, but it would work for us, maybe. But if "performance is variable" means "requesting a `.narinfo` file from Los Angeles causes a recursive set of requests, randomly choosing a guy in Singapore and another person in Australia to serve them both, with totally random latency between them", that's going to be quite bad, I think. And if the guy in Australia can then just turn his PC off and require a guy in Singapore to rewrite parts of his hard drive to handle that, that will also have consequences for the availability of other nodes in practice.
I don't know what the performance profile of Tahoe is, to be fair. It could be worth exploring as an alternative, and it may be quite good. Some other problems remain, e.g. provisioning of storage is clearly an operator concern, and I don't think trusting data hoarders not to unplug their hard drives is a good approach. But I just want to push back on the notion that the CDN makes performance anomalies irrelevant. Performance is absolutely relevant, and only feels irrelevant exactly because the CDN and Amazon took care of performance for us. And at large data sizes, performance anomalies can easily snowball into complete denials of service for users. Unfortunately, that might require careful operator experience to account for, and it's going to be hard to sell that when most of the alternative options are probably better understood from that perspective.
I'd add that, practically speaking on the funding side, unless some major provider steps up to cover the costs, a one-time capital investment to build your own origin server would make more sense than leasing any off-the-shelf VPS storage solution, since a one-time major fundraising campaign is easier to accomplish than an ongoing raise of $1,000s/month (this is assuming the whole thing cannot easily be re-architected in the short term to entirely eliminate the need to store this much…)
There would still be a constant cost of colocation and an associated drive replacement service, but that would be an order of magnitude less ($100s instead of $1,000s/month) and so easier to sustain with a minimal ongoing donation campaign.
From reading through the thread, it sounds like there are two independent problems here. One is caching: serving in-demand assets quickly. The other is that this is as much an archive as it is a cache.
If original sources are unavailable, rebuilding is not an option, and this eventually has a poisoning effect into the nix ecosystem. If a significant dependency is totally lost, all depended packages are now at risk of loss and so on. Based on this, we canāt really think about this as only cache. Pruning unimportant things may still be possible, but thatās probably very tricky, to the point where deferral of this issue is likely a good idea.
This means we need to solve both problems, and for archival purposes it's almost impossible to get away from a centralized authority while also reducing costs. Full distribution makes no archival guarantees, or it defers responsibility to others to do it themselves. If this archival layer exists, it may then be possible to be more distributed in the on-demand cache layers on top, to try and spread out the data transfer costs.
So the question is really how do we pay as little as possible for archiving cold assets and then either reduce transfer costs or distribute them out by some means.
Disregard: I had hoped that CloudFront would be a good option for cutting egress costs (CF-to-S3 transfer is free, so you pay only CF's pricing), but it looks like the savings are relatively small. It may however still be a quick win to simply mark cache objects as immutable and then dramatically scale back the storage tier in S3. The hot items should be so hot that they rarely actually go back to origin. Egress only moves a little, but storage savings could be significant.
Edit: I reread the thread, and of course we already have a CDN, which is amazingly doing a ton. Possibly one final, fully centralized cache layer between the CDN and the buckets would buffer things enough to lower the S3 tier even further. Since a CDN is distributed, it produces more misses than a single unified layer would see.