I have a suggestion for how to get the data out, although it risks costing even more money in the process: can we deduplicate the data by stuffing it into a content-addressed store on top of S3, and then pull that deduplicated store out? It might be worth looking at a random sample of the stores first (how large would it need to be?) to see how much that would actually save before trying it at full scale.
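To make the sampling idea concrete, here is a minimal sketch of measuring whole-object deduplication savings on a sample. It assumes the sample is already available as in-memory byte strings; the function name `dedup_savings` and the toy sample are hypothetical, and a real run would stream objects from the bucket instead.

```python
import hashlib

def dedup_savings(blobs):
    """Return (raw_bytes, stored_bytes, fraction_saved) for an iterable of bytes.

    Hypothetical helper: each blob is addressed by its SHA-256 digest, so
    identical blobs are stored only once.
    """
    store = {}  # content address (hex digest) -> size of the stored blob
    raw = 0
    for blob in blobs:
        raw += len(blob)
        store.setdefault(hashlib.sha256(blob).hexdigest(), len(blob))
    stored = sum(store.values())
    saved = 1 - stored / raw if raw else 0.0
    return raw, stored, saved

# Toy sample: the first and third objects are byte-identical.
sample = [b"libfoo-1.0" * 100, b"libbar-2.3" * 100, b"libfoo-1.0" * 100]
raw, stored, saved = dedup_savings(sample)
print(f"raw={raw} stored={stored} saved={saved:.1%}")
```

This only catches exact duplicates of whole objects. Finding shared content inside different objects would need chunk-level deduplication, and the sketch says nothing about how large a sample has to be to be representative.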
Free solutions are Cool and Good, however: I don't have an understanding of the scale of Nix in relation to cloud users in general, but it would probably be necessary to discuss this with any platform we migrate to, especially if they aren't huge, because the providers do in fact have costs, and the free solutions probably aren't meant for a sudden ingress of 150TB that keeps increasing on an ongoing basis.
I specifically remember seeing some EU related open science / reproducible research related information, which may be a string worth pulling on in the long term. I don't know what to expect; maybe these solutions will not be a good fit for our requirements. Perhaps the NLNet connection can yield some information here?
A cursory search yielded this https://open-research-europe.ec.europa.eu/for-authors/data-guidelines#approvedrepositories , which has some lists of recommended data providers. Though I expect most scientists won't be using hundreds of terabytes either.
These look superficially interesting
- https://b2share.eudat.eu
- https://zenodo.org/ this is especially interesting, since it came out of CERN and IIUC they have actually big data requirements
These also look superficially interesting from policy / community perspective:
Unrelatedly, I also found About RDA | RDA (might be a global thing).
Yes. Providers which credibly belong to the bandwidth alliance for the foreseeable future can be considered.
Scaleway also has an open source sponsoring program, though its budget may be limited (presumably it's possible to ask for more than the base credit of 2400 €, but not sure). Their object storage charges 12 €/TB-month, so it would seem to be significantly less expensive than AWS and R2.
Is a mix of the three possible? I mean, using the new storage for the new objects and the most frequently accessed objects (presumably the 27 or 91 TiB of store paths mentioned above), while keeping AWS S3 for the rest until the egress costs for it can be handled. (This doesn't help much if the storage alone costs more than the egress.)
One provider which on paper offers such a service is BunnyCDN's perma-cache. Put it between Fastly and AWS S3, get it populated gradually as objects are requested (meaning no egress costs except what you'd have had anyway, if people behave the same), then at some point use it as source for migration to the destination storage.
I personally can't give you advice about the migration process to something cheaper. But I encourage people to think about the size of the repo. It's currently 425 TiB of data that needs to be served with high availability.
If you look at the nixpkgs commit graph (Contributors to NixOS/nixpkgs · GitHub), it's easy to see that activity has increased drastically over the last few years. Even the nixpkgs tarball size went from 25 MiB to 36 MiB in just one year!
In my opinion, the project must consider pruning some data from archives. It would be interesting to see the size of the biggest unused closures, instead of pruning everything that is past a certain date or hasn't been accessed for a while.
The current storage rate isn't sustainable in the long run, unless you want to throw money away. This also has some ecological implications: the more you store, the more hard drives must be spinning (and it's not linear, due to redundancy).
In my opinion, the project must consider pruning some data from archives. It would be interesting to see the size of the biggest unused closures, instead of pruning everything that is past a certain date or hasn't been accessed for a while.
Perhaps just keep source archives and patches? We don't need to keep every single build of every single channel revision: these can always be rebuilt, but not if their sources are gone.
Case in point: fakeroue
had its source removed, but it was still cached and thus could be recovered. Situations like this are where having the cache is invaluable.
I remember that at the beginning of 2022 there were figures of around 300 TiB in the S3 buckets floating around. Mostly rumors though. And if those rumors are even close to the truth, that means that the cache also increased by about 50% in size, not only the nixpkgs tarball…
Yes, as much as my already hour long bisection session will hate this, I have to agree…
I think that applying GC/prune only to the "cold" storage would be a good first step. And even there, we could start with block-level deduplication, though I am not yet sure what that should look like.
Static block sizes at the FS level will probably not get us far enough, as the NARs are compressed, and a single-byte change in the contained data might change the compressed result of all compressed blocks after it, even if it is just due to a slight alignment change.
Then we have deduplication with floating block sizes, like the restic backup repository format uses. There again though I am not aware of any easy to use wrapper that would be able to apply a similar deduplication method to "dynamic" filesystems.
I remember though, that recently (Q4'22/Q1'23) a software got announced, that can do some partial NAR deduplication, but didn't provide any parity or reconstruction mechanisms. Sadly I do not remember the name, nor did I follow the development and whether or not those things got fixed.
If there is interest in keeping legacy packages for some reason, maybe a community / 3rd-party substituter could be created. Software Heritage seems to have some kind of interest in Nix; they could host this.
Then we have deduplication with floating block sizes, like the restic backup repository format uses. There again though I am not aware of any easy to use wrapper that would be able to apply a similar deduplication method to "dynamic" filesystems.
I remember though, that recently (Q4'22/Q1'23) a software got announced, that can do some partial NAR deduplication,
I assume you are talking about this?
It does content-defined chunking and deduplication on the uncompressed NARs using fastcdc.
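To illustrate why content-defined chunking helps where fixed block sizes do not, here is a toy sketch. It is not FastCDC itself: it uses a simplified gear-style rolling hash with no minimum or maximum chunk size, and all names and parameters (`GEAR`, `CUT_MASK`, the 64-byte average chunk size, the random test data) are assumptions chosen for the demo.

```python
import random

_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]  # one random value per byte value
MASK64 = (1 << 64) - 1
CUT_MASK = 0x3F << 58  # cut when the top 6 bits are zero: ~64-byte average chunks (toy size)

def cdc_chunks(data):
    """Split data wherever the rolling hash of the most recent bytes hits the cut mask."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Contributions older than 64 bytes shift out of the 64-bit hash.
        h = ((h << 1) + GEAR[byte]) & MASK64
        if (h & CUT_MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def fixed_chunks(data, size=64):
    """Split data into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def shared(old, new):
    """Count chunks of `old` that also occur in `new`."""
    new_set = set(new)
    return sum(1 for chunk in old if chunk in new_set)

data_rng = random.Random(1)
data = bytes(data_rng.getrandbits(8) for _ in range(65536))
modified = b"X" + data  # a single byte inserted at the front shifts everything

fixed_shared = shared(fixed_chunks(data), fixed_chunks(modified))
cdc_old = cdc_chunks(data)
cdc_shared = shared(cdc_old, cdc_chunks(modified))
print(f"fixed-size: {fixed_shared} of {len(fixed_chunks(data))} chunks reusable")
print(f"content-defined: {cdc_shared} of {len(cdc_old)} chunks reusable")
```

Because the cut points depend only on nearby content, the chunk boundaries realign shortly after the inserted byte, so almost every chunk can be reused. With fixed-size blocks every block after the insertion is shifted and no longer matches. As noted above, this only works on uncompressed data.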
@domenkozar @ron et al, I probably missed it above, but just wondering if we have Intelligent Tiering (Amazon S3 Intelligent-Tiering Storage Class | AWS) on the buckets?
Another option would be a move to Backblaze's B2 @ $5/TB/month and $0 egress via Fastly. Note, I've only used them for a small side project where it worked fine, but others may have experience with them at scale that would be good to hear.
But I also think a sensible data retention policy and prune/dedupe/GC would help control future growth rates.
I'm curious if this would be possible by straight-up using torrents.
transmission provides a CLI, and there appear to be FUSE filesystems that will pull torrents on the fly.
With 512Tb, it would be 4,400 users if they all provision 128gb and ~2,000 for caches of 512gb.
I can't do math, apparently. 425 TiB is ~467,000 GB, so about 3,650 users with a 128GB cache, or about 910 users with a 512GB cache.
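A quick sanity check of the swarm-size arithmetic, as a sketch. It assumes 425 TiB means 425 × 2^40 bytes, that "GB" means 10^9 bytes, and that each peer holds a disjoint slice of the cache; the helper name `peers_needed` and the `replicas` parameter are hypothetical.

```python
import math

TIB = 2**40  # bytes in one tebibyte
GB = 10**9   # bytes in one (decimal) gigabyte

cache_bytes = 425 * TIB

def peers_needed(total_bytes, per_peer_gb, replicas=1):
    """Peers required if each donates per_peer_gb and every byte is stored `replicas` times."""
    return math.ceil(total_bytes * replicas / (per_peer_gb * GB))

print(round(cache_bytes / GB))                   # 467292 (GB)
print(peers_needed(cache_bytes, 128))            # 3651
print(peers_needed(cache_bytes, 512))            # 913
print(peers_needed(cache_bytes, 128, replicas=3))  # 10953
```

With `replicas=1` a single peer going offline makes part of the cache unavailable, so a real swarm would need several copies of every slice and correspondingly more peers.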
Torrenting is usually fast for me, but the only thing I'd be really worried about is residential internet upload speeds.
If this were to happen, I'd happily donate a terabyte or two on a machine I have sitting around.
In my opinion the problem with distributing storage load to users continues to be that you need "guaranteed" availability, and I think you don't really have that with a swarm. There are really several issues here that will probably get conflated on-and-off.
- bandwidth costs
- hot storage costs
- cold storage costs
This can be further subdivided by someone that has a clue (though see above, this has probably been largely discussed above).
I would very much hope it isn't necessary to GC the archives (note we don't actually have full reproducibility for a lot of things, though I'm not sure how important this is; definitely keep the sources), but perhaps putting infrequently accessed data in higher latency storage will decrease costs?
Also, due to the growth curve, I expect a lot of the oldest data is actually not that much of the storage usage? How much could we save by GCing how much?
Edit: and of course, if one checks the finances thread it actually has already been stated that a lot of the data is in colder storage, and Eelco says that GC-ing would remove a lot of data. IIUC.
The cache of all packages from, say, the last couple of years is much less than that: it should be totally feasible. The rest of the cache is accessed much less frequently and could probably be stored and served from a single host without a CDN.
Also, an interesting point is: Half a petabyte in storage and 3 TB transfer a day? Shit. That's nothing, unless... | Hacker News.
Basically, if we agree we don't care about 99.9999whatever% reliability we could self-host the cache and cut a substantial cost: it's not as much data or bandwidth as it may seem.
I think we should seriously consider running and/or organizing our own infrastructure here. I'm not a fan of being dependent on the goodwill of corporations: they do fundamentally exist to turn a profit, and "paying for a FOSS project's storage" doesn't generally do that, so eventually the tap runs dry. In this case we got advance notice, but if we switch over to another sponsored provider, there's really no certainty that we won't have much worse luck next time.
This amount of storage is well within range of what can be organized at small scale, and 3TB of transfer is essentially nothing; even a $5 VPS gives you that amount of traffic for free nowadays. At the very least we should probably be paying for a cheap storage/hosting provider with some sort of hard contractual obligation, whether it's something like Backblaze B2 or just a plain ol' server with lots of hard drives at Hetzner/OVH/Leaseweb/etc.
I don't think decentralized storage is a viable option for a backing store; availability is a notorious problem for that, and the likelihood of data loss is very high.
Edit: A typical price for non-AWS/Azure/GCP storage at various providers is $5/TB/month.
No (to running an S3 replacement on our own). We barely manage to maintain the infra we already have. And in one month we're supposed to start maintaining this? If this doesn't work for an hour, no one will be able to pull stuff from cache.nixos.org (except lucky stuff cached by Fastly I guess).
This is why I also mentioned options like Backblaze B2; they charge $5 per TB per month in storage, and $10 per TB of traffic, if I'm not mistaken. For 500TB of data and 3TB of monthly transfer, this works out to around $2500 per month, which is significantly less than the current $9000/month estimate.
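The estimate can be reproduced with a few lines, as a sketch. The prices are the ones quoted in this thread, not verified against any provider's current price list, and the function name `monthly_cost` is hypothetical.

```python
def monthly_cost(stored_tb, egress_tb, storage_per_tb=5.0, egress_per_tb=10.0):
    """Monthly bill in USD for a flat per-TB storage price plus a per-TB egress price."""
    return stored_tb * storage_per_tb + egress_tb * egress_per_tb

# 500 TB stored, 3 TB of egress per month.
print(monthly_cost(500, 3))        # 2530.0

# If the 3 TB figure is per day (as the Hacker News title quoted earlier in the
# thread says), a 30-day month has roughly 90 TB of egress.
print(monthly_cost(500, 3 * 30))   # 3400.0

# Egress at $0, as suggested earlier for B2 behind Fastly.
print(monthly_cost(500, 3 * 30, egress_per_tb=0.0))  # 2500.0
```

In all three cases storage dominates the bill, so the per-TB storage price and the amount of data kept matter far more here than the egress assumption.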
The "running your own infrastructure" option is for if costs need to be reduced further than that.
Of course there's no reason we can't move to something like B2 for an immediate cost reduction, and then in the long term work out something more sustainable, without the "shutdown in one month" pressure.
Also, something that just occurred to me: we may be able to significantly reduce transfer-out costs by sending it through AWS Lightsail (their VPS service). If I'm not mistaken, traffic between S3 and Lightsail is free, and Lightsail itself has cheaper egress.
We barely manage to maintain the infra we already have.
It sounds like you're overworked: have you ever publicly called for volunteers to join the infrastructure team?
I think managing NixOS infrastructure is genuinely cool: given that a large share of NixOS users do sysadmin work either professionally or as a hobby, it shouldn't be too hard to find someone.