Firstly, I want to say I really appreciate the work already done to support Content Addressed derivations. I think this is a great thing to have, independent of any potential storage improvements.
I know these are very early days for your work (and it’s very impressive what you’ve achieved so far), but I think it will be a tough sell for many users to effectively double the storage space required for the Nix store (once in the Nix store, once in IPFS) as more packages become CA. Mounting the files from IPFS via FUSE is not free either, and can introduce a significant performance bottleneck. Perhaps IPFS import/export is a feature better suited to beefier buildfarm/cache(ix) infrastructure?
Substitution via traditional HTTPS (e.g. from a Cloudflare gateway) would allow the 90% of users who just want prebuilt dependencies, without thinking about IPFS, to still drive demand for IPFS-hosted build products. Having those derivations loaded into the Nix store as usual, taking no extra space, would provide a smooth transition path. Laptop users like me also wouldn’t have to run the bandwidth- and CPU-hungry IPFS daemon constantly just to avoid warm-up time and high latency when requesting objects.
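To make the idea concrete, gateway substitution could look like an ordinary binary-cache entry in nix.conf, assuming (hypothetically) a gateway that serves the usual narinfo/nar binary-cache layout over HTTPS. The gateway URL and key below are placeholders, not real endpoints:

```
# nix.conf -- sketch only; the gateway URL and key name are made up
substituters = https://ipfs-gateway.example.org/cache https://cache.nixos.org
trusted-public-keys = cache.example.org-1:<base64-key> cache.nixos.org-1:<base64-key>
```

From the client’s point of view nothing changes: it’s just another HTTP substituter, and no IPFS daemon is involved.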
How would one go about proposing/working on gateway substitution as an extension to Obsidian’s work?
Can you please expand on this a bit? Don’t clients push .drv “buildplans” to remote builders? These buildplans actually ARE content-addressed: the name contains the hash of its content, and the result of the build carries the hash of the buildplan. It’s the remote builder that has the power to label arbitrary data as being built from a given buildplan hash, not the clients. AFAIK, the only arbitrary data clients are allowed to push is fixed-output derivations, and they can’t lie about what those contain since they’re also content-addressed. Is there a sneaky loophole hiding here that I don’t know about?
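As a toy illustration of the property being assumed here (this is not Nix’s actual hashing scheme, which also mixes in the store directory, output names, and more): when the name is a pure function of the bytes, the receiver can always recompute it and detect tampering.

```python
import hashlib

def store_name(content: bytes, name: str) -> str:
    # Toy content-addressing: the path name is derived from a hash of
    # the content itself, so a receiver can recompute it and reject
    # anything that doesn't match.
    digest = hashlib.sha256(content).hexdigest()[:32]
    return f"{digest}-{name}"

drv = b"Derive([...outputs...],[...inputs...])"
path = store_name(drv, "hello.drv")

# Receiver-side check: recomputing the name from the bytes must agree.
assert path == store_name(drv, "hello.drv")
assert path != store_name(b"tampered", "hello.drv")
```

Under a scheme like this, neither side can relabel data after the fact, which is the intuition behind the question above.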
If I’ve got this wrong I’m very happy to be corrected!
Thanks for thinking about how to drive adoption. You make good points. My idea was to drive adoption with source code archival, where the space duplication is far less of a concern, but there’s no technical reason we couldn’t pursue both tracks.
They used to copy the .drv file and its closure, but more recently they send over just the derivation being built (parsed, to be used in memory only on the remote side) without its dependent .drvs.
Yes
I’m a bit confused about what you mean, but anyway: for traditional “input-addressed” derivations, the output path computation is quite complicated and is performed by `hashDerivationModulo`. The daemon has no hope of verifying this unless it has the entire derivation closure, so it blindly trusts the output path in the derivation that is sent over.
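Very roughly, and glossing over most details of the real `hashDerivationModulo`, the idea is that a derivation is hashed over its text with every input .drv reference replaced by that input’s own recursively computed hash. The sketch below (hypothetical names, not Nix code) shows why verification needs the whole closure: each input’s text must be available to recurse into.

```python
import hashlib

def hash_modulo(name, drvs, cache=None):
    """Toy sketch: hash a derivation's text with each input .drv name
    substituted by that input's own (recursively computed) hash.
    Verifying the result requires the entire derivation closure."""
    if cache is None:
        cache = {}
    if name not in cache:
        text = drvs[name]["text"]
        for inp in drvs[name]["inputs"]:
            text = text.replace(inp, hash_modulo(inp, drvs, cache))
        cache[name] = hashlib.sha256(text.encode()).hexdigest()
    return cache[name]

# Hypothetical two-derivation closure: "app.drv" depends on "lib.drv".
drvs = {
    "lib.drv": {"text": "Derive(lib, src=...)", "inputs": []},
    "app.drv": {"text": "Derive(app, input=lib.drv)", "inputs": ["lib.drv"]},
}
h = hash_modulo("app.drv", drvs)  # changes whenever lib.drv's text changes
```

A daemon that only receives `app.drv` by itself cannot run this computation, which is exactly the trust gap described above.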
I have already removed the trust requirement for (fixed and floating) content-addressed derivations, since the daemon ignores any output paths sent over as part of those derivations.
Actually, I think it will currently accept any store path, not just content-addressed ones (usually built by fixed-output derivations). This is one reason why it only works for “trusted users” on the daemon side.
Thanks again for the detailed reply. I had to sit on this for a while and do some more reading and thinking.
It seemed like you had a roadmap of features to implement. I guess whoever is managing that roadmap would be the person to ask?
This is probably massively naive on my part, but this seems like a security hole that should be possible to close:
1. Send the full derivation closure (.drv files or parsed, does it matter?) so that the builder can verify output paths for itself.
2. Disallow accepting arbitrary store paths except from trusted users (to avoid breaking existing workflows?); for input-addressed paths, verify that the received path matches the computed hash.
3. The remote builder must build or substitute any missing store paths from its own trusted sources.
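The check proposed above could be sketched like this. This is purely illustrative: the function names are made up, and the real output-path computation over a closure is far more involved than a single hash.

```python
import hashlib

def expected_output_path(drv_closure_text: str) -> str:
    # Stand-in for the real output-path computation over the full closure.
    return "/nix/store/" + hashlib.sha256(drv_closure_text.encode()).hexdigest()[:32]

def accept_build_request(drv_closure_text: str, claimed_output: str, trusted: bool) -> bool:
    """Hypothetical daemon-side check: recompute the output path from the
    received closure instead of trusting the client's claim outright."""
    if trusted:
        return True  # existing workflows for trusted users keep working
    return claimed_output == expected_output_path(drv_closure_text)
```

With something like this in place, an untrusted client could still submit builds, but could no longer attach an arbitrary path to them.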
Is there any use-case/scenario that this breaks? I’m very interested in the security model of remote builders, so if you can recommend a part of the codebase that I should read to get a better idea of how this all works please drop me a link.
There is in fact some news: we scoped out an NLnet project, “Peer-to-Peer Access to Our Software Heritage”, and it has been approved. Once that is done, I hope we’ll have a better shot at merging the work we did last year, because the SWH → IPFS → Nix workflow will hopefully make the use-cases more apparent.
IPFS doesn’t rely on use of the DHT — you can always connect to a peer directly and it asks existing peers regardless of “who” the main DHT says you should ask.
For the SWH ↔ IPFS bridge, for example, we completely ignored the DHT to start: there will just be a dedicated bridge node with a well-known address, and one can connect to that.
(I believe one can also get peers from peers, which means that a bandwidth-saturated bridge node could in principle prioritize letting all its wannabe peers know who each other are, so they can act together as a CDN.)
Bottom line is the architecture is very modular, and one can experiment with many different strategies. I think our use-cases (distributing source code from archive, distributing builds) are great ones to test various strategies with too.
I’ve read the RFC, good job! But I have an IPFS-specific question: is it possible to use the content raw from the store and plug it into IPFS (this is not recommended by IPFS, but since the store is read-only it’s not so bad), instead of adding the files into IPFS and duplicating disk space usage?
We have not tried to do that yet. It is certainly possible in principle, yes, but in our implementation we have been more focused on integrating the concepts than on tuning performance. The idea is to get a specification/interface we feel good about, and then afterwards we can finesse the implementation without needing to change the spec/interface.
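For what it’s worth, go-ipfs does have an experimental “filestore” mode that references file data in place rather than copying it into the block store, which might fit a read-only /nix/store. Roughly (double-check the flags against your IPFS version, and the store path below is just a placeholder):

```
# Enable the experimental filestore, then add without copying blocks:
ipfs config --json Experimental.FilestoreEnabled true
ipfs add --nocopy -r /nix/store/<hash>-some-package
```

The usual caveat is that filestore-referenced data must never change or move on disk, which is exactly the guarantee a read-only store gives you.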
Just wanted to say, this would be a killer feature IMHO. The number of problems I wouldn’t have at work if a newly deployed machine could, out of the box, substitute from any nearby machine that has the data! We’ve been mucking around so much with internal substituters and their configuration, and also with sharing host stores with VMs. I quite literally can hardly wait.
I guess content addressing and coordination/planning is the biggest blocker? Or are you having trouble with funding as well?
Is Obsidian Systems still working on this? I’m sitting here pulling at 200 KB/s from cache.nixos.org and just wishing it were over IPFS, as another computer in my house had to go through the same thing yesterday.