Peer-to-peer binary cache RFC/working group/poll

This is to gauge community interest and help in pursuing the implementation of a peer-to-peer binary cache system for Nix, following a recommendation to do so.

While decentralised, self-sustainable technology is a dream for many, it is worth evaluating whether the idea is necessary or realistic before investing oneself in such a large task.

We don’t know for sure how many people would be willing to use the work, and it may be that the foundation is able to keep securing sponsorships and donations, in which case decentralisation becomes moot.

Disclaimer: I don’t have intimate knowledge of Nix/BitTorrent internals and in no way am I a security expert.

See details of existing discussion starting here.

The idea is to shoehorn BitTorrent (if feasible at all) into a binary cache substituter: leechers query store paths from the swarm on demand, which reduces the risk of leaking private derivation data, since attackers would have to brute-force the private hash plus the derivation name and version.

Seeders would easily opt into the swarm thanks to a new services.nix-serve-p2p.enable: bool = false option enabling a service that joins the swarm. It could be suggested as a comment in the generated config to raise awareness and help the idea gain traction.

A services.nix-serve-p2p.max-upload-speed option lets users decide how much bandwidth they wish to contribute.

To prevent further attacks, failed queries would be rate limited: a requested path is first checked against the union of the public store-paths.xz listings from registered HTTP substituters to determine whether it is publicly known.
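
A rough sketch of that check, assuming the public listings are fetched from a store-paths.xz endpoint (the URL, helper names and overall interface are illustrative, not an existing mechanism):

```python
# Sketch: only answer swarm queries for hashes found in the union of the
# public store-paths.xz listings; everything else counts against a
# failed-query budget. The URL below is an assumption for illustration.
import lzma
import urllib.request

PUBLIC_SUBSTITUTERS = ["https://cache.nixos.org"]  # those flagged `public = true`

def load_public_hashes() -> set[str]:
    known: set[str] = set()
    for base in PUBLIC_SUBSTITUTERS:
        with urllib.request.urlopen(f"{base}/store-paths.xz") as resp:
            for line in lzma.decompress(resp.read()).decode().splitlines():
                # /nix/store/<hash>-<name>  ->  keep only the hash part
                known.add(line.removeprefix("/nix/store/").split("-", 1)[0])
    return known

def should_answer(queried_hash: str, known: set[str]) -> bool:
    """Publicly known paths are served normally; unknown ones should be
    rate limited (the rate limiter itself is not shown here)."""
    return queried_hash in known
```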

An additional nix.settings.substituters.?.public: bool = false option could be added to distinguish public from private substituters, with https://cache.nixos.org/ being public.

A further, more paranoid step would be a per-package private: bool = false option to ensure derivations stay private at all costs. The resulting store path would become a dotfile so that such paths are easy to filter out in the rate-limiting system above.

20 Likes

This is an interesting project; I hope it will succeed. But one question isn’t answered by this thread or the links.

Which problem is it trying to solve?

This can’t solve the issue of long-term storage and distribution for the official substituter. It could help reduce the network load on the substituter, but that is the CDN’s job, and it does it well (except in some countries).

This could make it easier for people to share derivations between computers, but a p2p model seems complicated with regard to trust, security and privacy. And it doesn’t guarantee availability.

3 Likes

Why not both? The full binary cache could be split into smaller shards and distributed among a number of volunteers: companies, universities, or anybody with spare bandwidth and some disk space. I bet the bandwidth required for the historical artifacts is so low that a single 1 Gbps connection could handle it.

1 Like

Does BitTorrent provide any way to ensure content is always retrievable from somewhere?

Here and there:

if we were to go all in with community load sharing, storage could purge the binary caches (reproducible output, not “valuable” sources) and reduce costs on that front as a result.

This avenue would make sense if there were a large number of seeders able to mostly take over from the existing hosting solution, or if its benefits outweighed the costs of S3/Fastly.

I think a distributed storage model like this is the best path to a long-term-sustainable, no-monetary-cost solution. There is so much good will in the general user community. If it’s as simple as responding to an “uncomment this line to help support this community” note in the nixos-generate-config default config, I think we’d easily get distributed storage serving capacity adequate to handle the binary cache. And if I’m wrong about this and we only get, say 50% of the needed capacity from volunteers, that’s a 50% cost reduction on whatever service picks up the rest of the load.

On top of that, per the transparency report, it looks like CDN costs are far, far higher than storage costs:

  • Storage: ~€10k/month
  • Fastly: Estimated at over €50k/month (this is hard to take into account with a buffer)

and to my poor man’s ears, €600,000/year sounds like an astronomical sword of Damocles if the foundation were to pay it in full some day (it rarely ends well when people depend on the generosity of for-profit companies; there are endless such tales, as painful as it is to admit).

This is precisely what torrents excel at.

It is a well-known meme that people primarily use torrents to download their favourite Linux distributions, and many projects, such as Arch, recommend it as a download method.

I don’t know whether there is prior art in integrating the torrent protocol into package managers themselves, however. But in any case, its power has been put to use in many areas, big and small.

Regarding security, traditional hosted torrent files contain hashes for all data blocks. Clients detect corruption/tampering thanks to that, and, I believe, implement blocklisting of unreliable/malicious nodes.
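
For illustration, a minimal sketch of that per-piece check (the piece data and expected hash are made up; real clients read the hashes from the .torrent metadata):

```python
# Each piece listed in the torrent metadata has a SHA-1 hash; a downloaded
# piece that does not match is discarded and its sender can be blocklisted.
import hashlib

def piece_ok(piece: bytes, expected_sha1_hex: str) -> bool:
    return hashlib.sha1(piece).hexdigest() == expected_sha1_hex

piece = b"example piece data"                       # made-up payload
expected = hashlib.sha1(piece).hexdigest()          # would come from the metadata
print(piece_ok(piece, expected))                    # True
print(piece_ok(b"tampered piece data", expected))   # False -> reject and re-request
```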

If for some reason we don’t want to store the source-of-truth torrent metadata files on the official servers, magnet links can do the trick.

I assume in this case the client gathers seeders’ responses and checks that the majority agrees on the hashes; I would have to look into it further.

(On an unrelated note, torrents support web seeds as a fallback, though that is redundant given Nix already has HTTP substituters.)

8 Likes

For BitTorrent, I’d mainly be afraid of the latency and of how it deals with a huge number of tiny torrents (one per /nix/store/$hash).

1 Like

Torrents need seeders; as long as every block is reachable from at least one seeder (i.e. people can download and seed partial subsets of the torrent, and those subsets can still add up to 100% availability), the full torrent can be completed.

Web seeds exist as a fallback in case all seeders disappear.

In our case, this would be the regular HTTP substituter, or someone in the swarm who rebuilt the derivation from source, I guess!

1 Like

I don’t think any software will ever provide this without introducing redundancy: what’s to stop the last remaining node holding the data from going offline?

The best I can think of is to use something like par2 and treat a node going offline as corruption. You can extend the stored data with recovery files that can cover up to x% of the total cache being lost. Then you must make sure that fewer than n% of the nodes ever go offline faster than you can add new ones.
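
To make the idea concrete, here is a toy sketch of the simplest possible recovery scheme: a single XOR parity shard, which lets any one lost data shard be rebuilt (par2 and Reed-Solomon codes generalise this to recovering several):

```python
def make_parity(shards: list[bytes]) -> bytes:
    """XOR equally-sized shards together into one parity shard."""
    parity = bytearray(len(shards[0]))
    for shard in shards:
        for i, byte in enumerate(shard):
            parity[i] ^= byte
    return bytes(parity)

def recover_missing(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing shard from the survivors plus the parity."""
    return make_parity(survivors + [parity])

# Three "nodes" each hold one shard; the node holding shards[1] goes offline.
shards = [b"aaaa", b"bbbb", b"cccc"]
parity = make_parity(shards)
assert recover_missing([shards[0], shards[2]], parity) == shards[1]
```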

1 Like

I wonder if we could merge many paths into one (or a few) torrent, and still serve a portion of those blocks which are on disk.

Or if the protocol could be easily tweaked to our needs to be more in line with how the store functions.

I wonder if we could merge many paths into one (or a few) torrent, and still serve a portion of those blocks which are on disk.

This is what Library Genesis/Sci-Hub do: it would be too inefficient to have one torrent per book/article. The entire repository is 2.4 million books split into 2400 torrents of 1000 books each.
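
A toy sketch of that approach applied to store paths, chunking a sorted path list into fixed-size bundles (the numbers and paths are made up):

```python
def make_bundles(store_paths: list[str], per_bundle: int = 1000) -> list[list[str]]:
    """Split a sorted list of store paths into fixed-size bundles,
    one torrent per bundle, like the Library Genesis repository layout."""
    paths = sorted(store_paths)
    return [paths[i:i + per_bundle] for i in range(0, len(paths), per_bundle)]

paths = [f"/nix/store/{i:032x}-pkg-{i}" for i in range(2500)]
bundles = make_bundles(paths)
print(len(bundles), len(bundles[0]))  # 3 bundles, 1000 paths in the first
```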

2 Likes

But how would people map paths they want to torrents?

EDIT: on the whole, I don’t think BitTorrent was designed for use cases similar to ours, though it might work somehow.

2 Likes

Maybe some sort of hashmap could map the store path to a torrent.

Good question.

A terrible idea off the top of my head is to fetch the torrent metadata and ask peers from the torrent that contains the (hashed?) path of interest. Not optimal :slight_smile:

Yeah, the protocol uses a distributed hash table, so that sounds about right!

In 2005, first Vuze and then the BitTorrent client introduced distributed tracking using distributed hash tables which allowed clients to exchange data on swarms directly without the need for a torrent file.

Would it be possible to replace the torrent info hash with the nix store path?

I would hope so, yes!

We don’t need to maintain compatibility with regular torrent clients for what it’s worth.
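
A minimal sketch of what that could look like, assuming a custom swarm whose DHT is keyed by a 20-byte digest of the store path (the function and example path are purely illustrative):

```python
import hashlib

def derive_dht_key(store_path: str) -> bytes:
    """Map a /nix/store path to a 20-byte key, the same size as a regular
    BitTorrent info-hash, so peers holding that path can be announced and
    looked up in a DHT without ever creating a .torrent file."""
    return hashlib.sha1(store_path.encode()).digest()

key = derive_dht_key("/nix/store/0c0p8kb32hjjrhpmrz1xbcdxsjkhpi1a-hello-2.12.1")
print(key.hex())  # the key to announce/query in the swarm's DHT
```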

1 Like

Torrents aren’t really designed to handle fine-grained files, and there isn’t any great way around that.

I think IPFS is more suitable, but it isn’t without issues. For example, finding the files in the DHT is really slow.

How could we integrate IPFS with Nix without too much effort?

  • The Nix foundation could keep a table mapping Nix hashes to IPFS hashes. This is a cheap operation that could easily be done with a Redis server (see the sketch after this list). It’s also easily implementable in the Nix language.
  • The foundation should also keep a tracker to speed up the lookups on the DHT. Not sure about the details of this.
  • Users would enable an option such as nixpkgs.experimental-p2p = true to start a service capable of resolving IPFS addresses. Inevitably, using IPFS will leak some information about the user to the public internet.
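
A minimal sketch of the first bullet, assuming a plain Redis instance run by the foundation (the key scheme, host and CID value are illustrative):

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def publish_mapping(nix_hash: str, ipfs_cid: str) -> None:
    """Record which IPFS object holds the NAR for a given store hash."""
    r.set(f"nar:{nix_hash}", ipfs_cid)

def resolve(nix_hash: str) -> str | None:
    """Look up the IPFS address to substitute a store path from."""
    return r.get(f"nar:{nix_hash}")

publish_mapping("0c0p8kb32hjjrhpmrz1xbcdxsjkhpi1a", "bafybeihypotheticalcid")
print(resolve("0c0p8kb32hjjrhpmrz1xbcdxsjkhpi1a"))
```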

This is doable, but we need to write a Proof Of Concept to actually evaluate the solution’s effectiveness.

8 Likes

If we really want to use torrents, a possible approach could be:

  • The Nix foundation provides a tracker and a service mapping Nix hashes to torrent magnet links.
  • The torrents could be bundles of 50 MB < X < 500 MB, created by analysing the download patterns from the binary cache. Users who are downloading GNOME Shell will probably need the Nautilus file manager as well, so those packages should end up in the same bundle. Cluster analysis techniques can help us achieve this (see the sketch after this list).
  • Bundles are going to contain a lot of duplicate data, especially considering that we need to keep multiple versions of the same package available (e.g. GNOME Shell v44 and v44.1). Torrent clients are able to selectively download specific files from a bundle, so in theory this is doable.
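
As referenced above, a toy sketch of grouping packages into bundles by co-download patterns (greedy merging by pair co-occurrence; the session log is made up, and real cluster analysis would also enforce the bundle size limits):

```python
from collections import Counter
from itertools import combinations

# Each entry: the set of packages one client fetched in a session.
sessions = [
    {"gnome-shell-44.1", "nautilus-44.1", "gtk4-4.10"},
    {"gnome-shell-44.1", "nautilus-44.1"},
    {"firefox-113", "gtk4-4.10"},
]

# Count how often each pair of packages is downloaded together.
pair_counts = Counter()
for session in sessions:
    for a, b in combinations(sorted(session), 2):
        pair_counts[(a, b)] += 1

# Greedily place the most co-downloaded pairs into the same bundle.
bundle_of: dict[str, int] = {}
bundles: list[set[str]] = []
for (a, b), _count in pair_counts.most_common():
    if a not in bundle_of and b not in bundle_of:
        bundles.append({a, b})
        bundle_of[a] = bundle_of[b] = len(bundles) - 1
    elif a in bundle_of and b not in bundle_of:
        bundles[bundle_of[a]].add(b)
        bundle_of[b] = bundle_of[a]
    elif b in bundle_of and a not in bundle_of:
        bundles[bundle_of[b]].add(a)
        bundle_of[a] = bundle_of[b]

print(bundles)  # packages that tend to be fetched together, per bundle
```
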
5 Likes

Does this mean we cannot take advantage of the very handy hard-link option (nix.settings.auto-optimise-store)?

1 Like

(I wrote my response in parallel to ranfdev above, so there is some overlap.)

I’m not an expert in this, but from what I can tell:

BitTorrent alone is not enough for this task. BitTorrent

  • is designed for distributing a static data set across multiple peers, where each peer strives to acquire the whole data set.
    • does not have functionality for the data set to vary over time (e.g. store paths to be added to a .torrent file).
  • was not designed for “sharding” a data set across multiple peers, where each peer strives to host only a subset of the data set.

Requirements

One would need to have software (on top of BitTorrent, or standalone) that handles the following:

  • Authenticity: A trusted producer (Hydra) produces a new torrent/storepath file to be stored.
    Trusted means that it is the source of truth based on which machines in the swarm decide whether or not to store a file (to ensure that the swarm only stores NixOS-related files).
  • Sharding: No single machine in the swarm can be expected to store all Hydra output, due to its size.
    So some (optionally decentralised) database must decide which machines in the swarm should hold which store paths.
  • Rebalancing: As nodes come online and disappear, some (optionally decentralised) algorithm needs to trigger re-balancing of the above assignment.
  • Trustless: Individual swarm participants do not need to be trusted.

“Fault tolerance” is achieved by making sure that the Sharding and Rebalancing steps always store all data across multiple machines.
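
For illustration, a minimal sketch of one assignment scheme that covers the Sharding, Rebalancing and fault-tolerance points: rendezvous (highest-random-weight) hashing with a replication factor. The node names and parameters are assumptions, not an existing Nix mechanism.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]
REPLICAS = 3  # each store path is kept on this many swarm members

def _score(node: str, store_path: str) -> int:
    digest = hashlib.sha256(f"{node}|{store_path}".encode()).digest()
    return int.from_bytes(digest, "big")

def owners(store_path: str, nodes: list[str] = NODES, k: int = REPLICAS) -> list[str]:
    """The k nodes responsible for a store path. When a node joins or
    leaves, only the paths it scores highest on change owners, so
    rebalancing moves a minimal amount of data."""
    return sorted(nodes, key=lambda n: _score(n, store_path), reverse=True)[:k]

print(owners("/nix/store/0c0p8kb32hjjrhpmrz1xbcdxsjkhpi1a-hello-2.12.1"))
```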

Existing software that meets these requirements?

I did a quick Google for existing solutions, e.g. searching for “distributed untrusted object store”. Some results and rough analysis:

  • Ceph / MinIO / GlusterFS
    • These are open-source distributed network file systems and object stores.
    • Are not trustless (thus not relevant here), but provide good examples for sharding and rebalancing.
  • Storj
  • IPFS
    • Does not seem to take care of sharding and rebalancing.
      From my research, IPFS nodes only store data that their respective users have chosen to “pin”.
      Thus also has no builtin fault tolerance or availability controls at all (HN discussion).

    • IPFS+Filecoin:
      Filecoin is made by the same people as IPFS.
      In their FAQ they describe:

      IPFS allows peers to store, request, and transfer verifiable data with each other, while Filecoin is designed to provide a system of persistent data storage.

      See also “IPFS and Filecoin” linked from Wikipedia.

      • Again here the question is whether one can run a private Filecoin network with the cryptocurrency aspect removed.
        Unclear if anybody has tried that (I found one question here).

In summary, the protocols for what’s needed seem to exist, but so far they appear to be entangled with public cryptocurrencies, and nobody is using them for community-based file hosting yet.

11 Likes