My point is that it literally doesn’t solve the problem that this thread is about, not even poorly. There is already a thread about adding IPFS support for distribution purposes. That’s the correct place to discuss this.
This was already discussed in the Matrix room, but given the high volume of discussion over there, I felt it would be a good idea to make a note about it here as well, for posterity:
I’d personally feel extremely uncomfortable with moving project infrastructure to Cloudflare, given their long history of outright malicious behaviour, including (but definitely not limited to) actively providing cover to a community that has deliberately driven multiple people to suicide, and otherwise harasses marginalized folks on a daily basis. CF is probably about as close to “deal with the devil” as we could get here.
We will be holding the community call on 2023-06-07T15:00:00Z
Planned Agenda
- Brief budget and timeline review
- Review/discuss all potential options
- Brainstorm/Q&A
Video call link: https://meet.google.com/pyr-orzm-ahm
Or dial: (US) +1 252-385-2704 PIN: 320 231 639#
More phone numbers: https://tel.meet/pyr-orzm-ahm?pin=1212541034968
Thank you again for jumping all in on this with us!
It hasn’t really been said explicitly, and this is as good a prompt as any:
- The problems under discussion have arisen because there has not been any pressure on the growth of storage until now. Huge thanks, again, to the existing sponsors, but it was always going to reach a point where unsustainable growth hit a cliff.
- The corollary of this is that there likely is a lot of low-hanging fruit in terms of storage reduction, via dedup / smarter gc / hierarchical migration and others discussed above.
- The efforts and strategy the project puts in place now, in both the short and medium term, will be crucial to attracting new sponsors, by demonstrating that the problem can be kept within bounds. Among other things, it’s no fun for a sponsor to end up in the awkward position of being a critical SPoF when they have to withdraw, either. I’m sure a number of potential sponsors will be more interested if the burden and the risk can be shared around.
Do we have some numbers for fixed-output derivation outputs (FODs)? A mix of «old release channels» + «all FODs» + «everything since the last release» might give 95% of the utility and permit slow 100% recovery (at a cost in compute, sure).
In fact, one of the main reasons the NixOS Foundation was created was to ensure continuity of the infrastructure by having the financial reserves to deal with sponsoring of the binary cache ending. The foundation currently has ~€230K in the bank, so we are prepared for this.
Moreover, that $32K is a worst case that only happens if we decide to move all of the binary cache out of S3 and we have to pay for it. There are much cheaper scenarios, e.g. we stay on S3 and we garbage-collect the binary cache. (As I described here, keeping all NixOS release closures ever made and deleting everything else shrinks the binary cache to a small fraction of its current size.)
ca-derivations have the potential to reduce the growth of new store entries, but the feature is not receiving much love and has been mostly dormant over the last year. It would also increase the effectiveness of alternative distribution methods like torrents.
A lot of bug fixing has been done on CA over the last few months.
I think staying in such a financially hostile environment is misguided, irrespective of whether we reduce our overall storage requirements or not. The egress pricing of AWS S3 is simply egregious.
In general, I think we should pay for our mission-critical infrastructure, and that means finding a sustainable partner.
The best idea for tackling the immediate problem, I find, is Backblaze B2 and their egress-fee waiver, which applies if you transfer more than 10 TB and stay for at least 12 months. Their overall storage cost is also much lower, and egress costs to Fastly don’t exist, because both of them are part of the Bandwidth Alliance.
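To put rough numbers on this, here is a back-of-the-envelope sketch. The per-TB prices are my approximations of publicly listed rates (not quotes), and both the cache size and the monthly CDN-miss egress are placeholders to be replaced with real figures:

```python
# Rough monthly-cost comparison for hosting the full cache on S3 vs. Backblaze B2.
# All prices are approximate public list prices in USD per TB and are assumptions,
# not quotes; CACHE_TB and MONTHLY_EGRESS_TB are placeholders.

CACHE_TB = 500            # placeholder for the current cache size
MONTHLY_EGRESS_TB = 50    # placeholder for monthly egress to the CDN on cache misses

PRICES = {
    # name: (storage $/TB-month, egress $/TB towards the CDN)
    "AWS S3 standard": (23.0, 90.0),
    "Backblaze B2":    (6.0,   0.0),  # egress to Fastly waived via the Bandwidth Alliance
}

for name, (storage, egress) in PRICES.items():
    total = CACHE_TB * storage + MONTHLY_EGRESS_TB * egress
    print(f"{name:16s} ~${total:,.0f}/month")
```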
Also, I really wish we had two threads, one for solving the immediate problem, and one for discussing long-term solutions. This thread is really long and noisy, and lacks coherence because of that.
Suggestion:
Have `cache.nixos.org` backed by 2 infrastructures:
- “Unsponsored-storage” – a backup storage for `cache.nixos.org` that stores all store paths cheaply, without relying on benevolent sponsors or potentially-temporary cost waivers. It need not serve `cache.nixos.org`’s daily load, but its hosting and egress should be cheap, such that it can be copied to whatever the current daily-load-storage is without large transfer costs.
  - Example: Ceph on Hetzner, at ~1.5 $/TB storage, 0.15 $/TB egress.
  - Also serves as a disaster-recovery backup in case the daily-load-storage disappears, e.g. because it ceases service.
  - All Hydra outputs would be copied onto it.
  - Ideally paid for by the NixOS Foundation / covered by donations?
- “Daily-load-storage” – serves `cache.nixos.org`’s daily load, using whatever good partnerships and sponsorships we can get.
By having our own fallback, we could confidently take special offers or partnerships, without creating single points of failure that are difficult to migrate off of.
Is “unsponsored-storage” feasible?
As a data point:
- My company runs Ceph-on-NixOS, as two ~500 TB clusters, with 3 Hetzner SX 134 machines each.
- Each machine costs ~250 $/month excl. VAT, making the total cost of a 500 TB raw cluster around 750 $/month.
- For durability and high availability: we use 3x replication, which reduces usable storage to 33% of raw. But Ceph also supports erasure coding, e.g. K+M=6 with M=2 parity chunks (i.e. K=4 data chunks), which gives ~66% storage efficiency. You can check it in e.g. MinIO's EC calculator, inputting the values 3, 1, 10, 16, 6, 2 into the text boxes for the Hetzner setup mentioned above. (A small back-of-the-envelope version of this calculation follows after this list.)
- The Ceph cluster is low maintenance. It requires work mainly for NixOS / Ceph version upgrades, so every 6 months.
- When a disk fails (which eventually happens with many disks), we email Hetzner and it gets replaced in 15 minutes on average, for no additional cost.
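To make the replication vs. erasure-coding trade-off concrete, here is a minimal Python sketch of the arithmetic, using only the rough figures from this post (3 machines at ~250 $/month, ~500 TB raw); they are our own rough numbers, not a Hetzner quote:

```python
# Back-of-the-envelope usable-capacity and cost figures for the Hetzner Ceph
# setup described above. All inputs are the post's own rough numbers.

RAW_TB = 500
MONTHLY_COST = 3 * 250  # three storage machines at ~250 $/month each

SCHEMES = {
    "3x replication":          1 / 3,  # usable fraction of raw capacity
    "erasure coding K=4, M=2": 4 / 6,  # data chunks / total chunks
}

for name, efficiency in SCHEMES.items():
    usable_tb = RAW_TB * efficiency
    print(f"{name:24s} ~{usable_tb:5.0f} TB usable, "
          f"~${MONTHLY_COST / usable_tb:.2f} per usable TB per month")
```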
I believe this is something that an infrastructure team could do, even on a volunteer basis; certainly on a paid-less-than-$9k/month basis.
Another concrete suggestion on how to reduce the $32k cost:
Do deduplication using deduplicating backup software such as `bup` or `bupstash`, before the egress.
I’ve investigated this a bit in Investigate deduplication to reduce storage and transfer · Issue #89380 · NixOS/nixpkgs · GitHub
In the latest post there from today, I found that a dedup factor of 3.5x (and thus the same egress-cost reduction factor) seems immediately achievable.
(I know of Nix-specific dedup solutions such as https://tvix.dev from @flokli linked above, but I haven’t had time to compare that yet, and so far have only looked into general-purpose software that’s immediately available and that I’m already familiar with.)
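For anyone who wants to reproduce a rough estimate locally, here is a minimal Python sketch of the idea. It is not how `bup`/`bupstash` work internally (those use content-defined chunking and their own storage formats); it merely counts duplicate fixed-size chunks across locally downloaded, uncompressed NAR files, which gives a quick lower-bound estimate of the dedup factor:

```python
# Estimate a deduplication factor over a directory of uncompressed NAR files by
# splitting them into fixed-size chunks and counting unique chunk hashes.
# Content-defined chunking (as used by bup/bupstash/casync) usually dedups
# better, so treat this as a lower bound.

import hashlib
import sys
from pathlib import Path

CHUNK_SIZE = 64 * 1024  # 64 KiB chunks; an arbitrary choice for the estimate

def estimate(directory: str) -> None:
    total_bytes = 0
    unique_hashes: set[bytes] = set()
    unique_bytes = 0
    for path in Path(directory).rglob("*.nar"):
        with path.open("rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                total_bytes += len(chunk)
                digest = hashlib.sha256(chunk).digest()
                if digest not in unique_hashes:
                    unique_hashes.add(digest)
                    unique_bytes += len(chunk)
    if unique_bytes:
        print(f"{total_bytes / 1e9:.1f} GB scanned, {unique_bytes / 1e9:.1f} GB unique, "
              f"dedup factor ~{total_bytes / unique_bytes:.2f}x")

if __name__ == "__main__":
    estimate(sys.argv[1] if len(sys.argv) > 1 else ".")
```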
An alternative to this idea is to go for tape-based storage.
An LTO-8 autoloader, let’s say a PowerVault TL1000 (~5K EUR for a new one), gives you 9 × 30 TB = 270 TB of “active tapes” with a 15-30 year lifetime if stored/maintained properly.
A 30 TB RW LTO-8 tape costs ~80-100 EUR. Storing all the current cache twice would cost ~3K EUR in tapes + 5K EUR for the autoloader + electricity or colocation costs.
Also, this is a good fit for Hydra store-path writes, because ultimately those are really sequential, aren’t they?
It is also a trivially extensible solution because we can just pile up more tapes as we move forward in the future.
(though someone needs to change the tapes if we outgrow the autoloader capacity or more autoloaders are needed.)
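For reference, the same figures as a tiny calculation sketch; all inputs are the numbers quoted above and should be treated as rough assumptions:

```python
# Sketch of the tape arithmetic above, using the post's own rough figures.

CACHE_TB = 500          # placeholder for the current cache size
COPIES = 2              # store everything twice, as suggested
TAPE_TB = 30            # per-tape capacity figure used above
TAPE_EUR = 90           # ~80-100 EUR per tape
AUTOLOADER_EUR = 5000   # PowerVault TL1000 figure from above

tapes_needed = -(-CACHE_TB * COPIES // TAPE_TB)  # ceiling division
total_eur = tapes_needed * TAPE_EUR + AUTOLOADER_EUR
print(f"{tapes_needed} tapes, ~{total_eur:,} EUR one-off "
      f"(plus electricity or colocation)")
```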
I think we should recognise that we have multiple different kinds of data with very different requirements. A solution that works for one might not work well for others.
Moving forward, perhaps we should discuss solutions for each individually, rather than finding an all-encompassing solution similar to what we currently have.
This is how I’d differentiate:
| Kind | Size | Latency Requirements | Throughput Requirements |
|---|---|---|---|
| narinfo | Small | High | Small |
| nar | Large | Medium | High |
| source | Medium | Low | Medium |
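If these kinds were split across different backends, the routing itself would be cheap, because the kind is mostly visible in the request path (`.narinfo` vs. `nar/…`). Here is a minimal illustrative sketch, with made-up backend URLs; note that “source” NARs are not distinguishable from other NARs at the URL level, so separating those would need extra metadata:

```python
# Hypothetical per-kind router for a binary cache front-end. The backend URLs
# are placeholders; only the path shapes (.narinfo, nar/...) reflect what a
# Nix binary cache actually serves. Source NARs look like any other NAR here.

def backend_for(request_path: str) -> str:
    if request_path.endswith(".narinfo"):
        return "https://narinfo-backend.example"  # small objects, latency-sensitive
    if request_path.startswith("nar/"):
        return "https://nar-backend.example"      # large objects, throughput-heavy
    return "https://default-backend.example"      # nix-cache-info and anything else

print(backend_for("abc123.narinfo"))
print(backend_for("nar/xyz.nar.xz"))
```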
Another aspect I’d like to see explored is nar size vs. latency. Some closures contain lots of rather small paths while others have fewer but larger paths. It might be beneficial to also handle these separately.
That’s impressive deduplication, and I’m sure you could improve it even further by stripping store paths (obviously keeping a tiny record to put them back), but I have yet to come across a deduplicating archiver that’s fast. I highly doubt something like `bup` would be fast enough to serve data in real time, which is a requirement for any binary cache.
I think we can postpone the migration process and instead have all historical data archived in S3 Glacier Deep Archive for now; it would cost us less than 500 USD/month for 500 TB of data. Meanwhile we can have Hydra push new paths to R2 or other cheaper alternatives and call it a day. This would indeed cause a loss of access to historical data for a brief period, but given the timeline, that is still reasonable.
Edit: the retrieval fee for Glacier would still be massive, but that’s how AWS works ¯\_(ツ)_/¯
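For concreteness, a rough sketch of both figures. The storage rate is AWS’s published Deep Archive price; the retrieval and egress rates are assumptions (they vary by retrieval tier), so treat the restore number as an order-of-magnitude guess only:

```python
# Back-of-the-envelope Glacier Deep Archive costs for parking historical data.
# Storage rate is the published ~$0.00099/GB-month; retrieval and internet
# egress rates are assumptions and depend on the retrieval tier chosen.

ARCHIVE_TB = 500
STORAGE_PER_GB_MONTH = 0.00099   # Deep Archive storage
RETRIEVAL_PER_GB = 0.0025        # assumed bulk-retrieval rate
EGRESS_PER_GB = 0.09             # assumed internet-egress rate

gb = ARCHIVE_TB * 1000
print(f"storage: ~${gb * STORAGE_PER_GB_MONTH:,.0f}/month")
print(f"one full restore out of AWS: ~${gb * (RETRIEVAL_PER_GB + EGRESS_PER_GB):,.0f}")
```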
And speaking of historical data and its research potential, we could contact research facilities that have the motivation and ability to store and serve it for us, namely CERN; they did use Nix for a while with LHCb[1]. And they are by all means experts in handling HUGE amounts of data over an extended time span.
I want to suggest that we leave the possibility of garbage-collecting the binary cache out of the discussion—at least as far as the short-term solution is concerned. Garbage collecting the binary cache is a big problem in two ways:
- It is a question of policy: Do we want to garbage collect the cache? What do we want to keep? What can be deleted? This should be a community decision which can’t be achieved in a month’s time.
- It is an engineering problem, i.e. how do we determine what store paths are to be deleted according to our established policy.
Additionally, garbage collecting the binary cache is risky, as we may wind up intentionally or erroneously deleting data we may miss—be it in 2 months or 10 years.
For the policy discussion, one problem is that there is not a lot of information about the cache available (although this has recently improved). Also it’d invariably involve looking to the future, i.e. how will our cache grow, how will storage costs develop, etc.
For the engineering side, the big problem is that the store paths correspond to build recipes dynamically generated by a Turing-complete language, making it anything but trivial to determine all store paths in the cache stemming from a specific nixpkgs revision. Assuming all store paths in the binary cache have been created by `hydra.nixos.org` (is this true for all time?), we have the `/eval/<id>/store-paths` endpoint available (from which `store-paths.xz` is generated) as a starting point. That will of course never contain build-time-only artifacts or intermediate derivations that never show up as a job in Hydra—among those, though, are the most valuable store paths in the binary cache, i.e. patches, sources etc. Even though we have tools to (reliably?) track those down, it becomes more difficult to do so for every historical nixpkgs revision. Additionally there is the question of the completeness of the data Hydra has (what happens to evals associated with deleted jobsets?). (If we were to garbage collect the binary cache, I think we should probably try figuring out what is safe to delete rather than determining what to keep.)
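To illustrate what the “figure out what is safe to delete” direction could look like, here is a rough sketch that walks reference closures from a list of roots (e.g. the contents of `store-paths.xz` for every release we decide to keep) by reading `.narinfo` metadata from the cache; everything in the bucket but not in the resulting set would be a deletion candidate. It is only an illustration: the URL shapes and narinfo fields are the standard binary-cache ones, but the completeness caveats and the sheer number of objects are exactly the open questions raised above.

```python
# Compute the set of store paths reachable from a list of roots by following
# the References field of each .narinfo on cache.nixos.org. Ignores rate
# limits, retries, parallelism and the scale of the real bucket.

import sys
import urllib.request

CACHE = "https://cache.nixos.org"

def narinfo_references(store_path: str) -> list[str]:
    # /nix/store/<hash>-<name>  ->  https://cache.nixos.org/<hash>.narinfo
    path_hash = store_path.removeprefix("/nix/store/").split("-", 1)[0]
    with urllib.request.urlopen(f"{CACHE}/{path_hash}.narinfo") as resp:
        for line in resp.read().decode().splitlines():
            if line.startswith("References:"):
                return [f"/nix/store/{name}" for name in line.split(":", 1)[1].split()]
    return []

def closure(roots: list[str]) -> set[str]:
    seen: set[str] = set()
    todo = list(roots)
    while todo:
        path = todo.pop()
        if path not in seen:
            seen.add(path)
            todo.extend(narinfo_references(path))
    return seen

if __name__ == "__main__":
    roots = [line.strip() for line in sys.stdin if line.strip()]
    print(f"{len(closure(roots))} paths reachable from the given roots")
```

Fed with e.g. `xz -d < store-paths.xz`, this would print the size of the “keep” set for one eval.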
Garbage collection is a long-term solution, if it is one at all.
It seems to me that finding a way to deduplicate the (cold) storage, or archiving little-used data more cost-effectively, is a similar amount of engineering effort, but less risky and comparatively uncontroversial—while still offering a sizeable cost reduction.
See also the apparent difficulty of getting an offline version of nixpkgs for a given version, or whatever was in this thread exactly: Using NixOS in an isolated environment .
It is far from complete, or good, but I made GitHub - nix-how/marsnix: Taking Nix Offline prior to NixCon 2022 to work on Nixpkgs offline, and it worked great. It fetches all of the inputs (FODs) so that it’s possible to recompute the outputs (input-addressed derivations) entirely offline.
The bandwidth is a smaller part of the cost thanks to the CDN, but there’s engineering work in the Nix client that can be done to help reduce it.
Improving Nix’s ability to handle multiple binary caches would make it easier for users to use a local/alternative cache. Right now companies need to set up their own caching server using external tools, and explicitly work around issues with Nix to make it usable. For example: Setting up a private Nix cache for fun and profit
I started some of that work in Allow missing binary caches by default by arcuru · Pull Request #7188 · NixOS/nix · GitHub, maybe I’ll go pick it back up.
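As a small illustration of the multiple-binary-caches idea: since every binary cache exposes `<hash>.narinfo`, a client or monitoring script can cheaply check which configured cache has a given path before fetching. This is only a sketch of that lookup, not of the Nix-internal changes discussed in the PR; every cache URL except cache.nixos.org is a placeholder:

```python
# Check a list of substituters, in priority order, for a store path's narinfo.
# Only cache.nixos.org is real; the first entry is a placeholder for a local
# or company-internal cache.

import urllib.error
import urllib.request

SUBSTITUTERS = [
    "https://my-local-cache.example",  # placeholder
    "https://cache.nixos.org",
]

def find_cache(store_path_hash: str) -> str | None:
    for cache in SUBSTITUTERS:
        url = f"{cache}/{store_path_hash}.narinfo"
        try:
            with urllib.request.urlopen(url) as resp:
                if resp.status == 200:
                    return cache
        except urllib.error.URLError:
            continue  # unreachable cache or missing path: try the next one
    return None

# Usage: pass the 32-character hash part of a store path.
print(find_cache("0" * 32))  # placeholder hash, expected to be found nowhere
```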