Suggestion/Feature: use BitTorrent or IPFS to download packages

If I only request chunks known to be theoretically buildable by Hydra or
present in r-ryantm Cachix cache, I am not too likely to leak a chunk
containing something specific to me.

Hmm… I think the solution of having an adapted P2P system would work
too? (cf. the hash(hash(…)) idea above) Maybe one already exists, but
IIRC anyway current P2P systems just don’t scale enough to handle
nixpkgs, so we’d need a new one anyway.

My point was that a single filter would allow us to use any protocol
as-is. Actually, ed2k did scale quite well back in the day…

That sounds more or less similar to the hash(hash(…)) idea above? To
make the full protocol I was thinking of more explicit (checking it’s
the same as the one you were thinking of — a rough sketch in code
follows the list):

  • User searches the DHT for hash(hash(chunk))
  • The peers who claim having the chunk answer
  • User establishes secure connections to these peers. For each peer:
    • A secure connection is established between the user and the peer
    • Peer sends a nonce
    • User answers with hash(hash(chunk) || nonce)
    • User has now proven he knows the chunk’s hash and is therefore
      allowed to download, Peer sends the chunk
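A rough sketch of that challenge-response in Python, purely for illustration (the naive version, before the MitM discussion below; `chunk`, `peer_chunk` and `send` are placeholders for the actual data and transport):

```python
import hashlib
import secrets

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# --- User side ---------------------------------------------------------
chunk_hash = h(chunk)            # the user already knows hash(chunk)
dht_key = h(chunk_hash)          # hash(hash(chunk)) is what the DHT is searched for

# --- Peer side, after the secure connection is established --------------
nonce = secrets.token_bytes(32)  # peer picks the nonce and sends it to the user

# --- User side: prove knowledge of hash(chunk) --------------------------
proof = h(chunk_hash + nonce)    # hash(hash(chunk) || nonce)

# --- Peer side: verify, and only then send the chunk --------------------
expected = h(h(peer_chunk) + nonce)
if secrets.compare_digest(proof, expected):
    send(peer_chunk)             # placeholder for the actual transfer
```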

Not the same one. This protocol, by the way, has an obvious MITM attack
which is cheap — that is why I said that a secure two-party random
string generation is needed: we need both the client and the server to
be able to ensure that the nonce is good.

Of course, if you request just a single chunk, the attacker knows its
structure, and it contains a weak enough password, the leak helps with
offline brute-forcing on a GPU. Just requesting from the server won’t
do, though, because an honest server will make sure that the random-salt
agreement ends up with a different salt during a replay (and a dishonest
server being able to provide you with a file containing your password is
bad enough even without a second attacker).

That is an interesting other threat model… but here I’m not sure there’s
much that can be done against it. The inner hash can become bcrypt
or anything similar, though, which can make things harder.

Also, maybe something from the zero-knowledge proof domain could help
here? I’m not familiar with it, though :confused:

Well, there is a secure multi-party protocol that reveals only the
desired outputs. But we have a lot of paths to request, and a lot of
served paths, so it will be capital-E Expensive.

I am not sure how to do such an oblivious search in a reasonably secure
and efficient way. Although I guess I could ask a few people to see if
the current state of the art is indeed better nowadays (but I have
doubts it scales well).

Hmm… I think the solution of having an adapted P2P system would work
too? (cf. the hash(hash(…)) idea above) Maybe one already exists, but
IIRC anyway current P2P systems just don’t scale enough to handle
nixpkgs, so we’d need a new one anyway.

My point was that a single filter would allow us to use any protocol
as-is. Actually, ed2k did scale quite well back in the day…

Well, I must say I didn’t do a survey, just took @vcunat’s word for bad
performance of IPFS… and even though implementations tend to go slower,
I do have some hope that protocols don’t also tend to include random
nonsense in a superlinear fashion.

Do you think ed2k would actually scale to nixpkgs scale? I mean, I
currently have ~4M files in my store, not even counting that they should
be further split into chunks to increase parallelism, and I have done a
garbage collection relatively recently… I guess we’re hitting a billion
files if we include even just a year’s worth of nixpkgs bumps, and with
many people on the network we’d likely get at least that.

That sounds more or less similar to the hash(hash(…)) idea above? To
make the full protocol I was thinking of more explicit (checking it’s
the same as the one you were thinking of):

  • User searches the DHT for hash(hash(chunk))
  • The peers who claim having the chunk answer
  • User establishes secure connections to these peers. For each peer:
    • A secure connection is established between the user and the peer
    • Peer sends a nonce
    • User answers with hash(hash(chunk) || nonce)
    • User has now proven he knows the chunk’s hash and is therefore
      allowed to download, Peer sends the chunk

Not the same one. This protocol, by the way, has an obvious MITM attack
which is cheap — that is why I said that a secure two-party random
string generation is needed: we need both the client and the server to
be able to ensure that the nonce is good.

Well… I assumed that the “A secure connection is established between
the user and the peer” was a standard TLS-or-similar protocol that
already checks that both client and server agree on the session key,
which is just as good as agreeing on the nonce.

However, the MitM you describe is indeed an issue, and it isn’t
prevented by making sure client and server agree on the nonce, as the
attacker could just relay the nonce-agreement messages.

A solution to this may be to use the DH-derived encryption key to derive
the nonce, because with this the attacker can’t be MitM’ing (because if
they’re MitM’ing, then the DH-derived key wouldn’t be the same).
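A sketch of that channel-binding idea, assuming the secure channel exposes its DH-derived key as `session_key` (the label and function names are made up for illustration):

```python
import hashlib
import hmac

def channel_nonce(session_key: bytes) -> bytes:
    # Both sides derive the nonce from the channel's own key material, so a
    # relaying MitM (who has a *different* DH-derived key on each leg)
    # cannot forward a valid proof from one connection to the other.
    return hmac.new(session_key, b"nix-p2p-chunk-auth", hashlib.sha256).digest()

def proof(chunk_hash: bytes, session_key: bytes) -> bytes:
    # hash(hash(chunk) || nonce), with the nonce bound to this channel
    return hashlib.sha256(chunk_hash + channel_nonce(session_key)).digest()
```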

Also, maybe something from the zero-knowledge proof domain could help
here? I’m not familiar with it, though :confused:

Well, there is a secure multi-party protocol that reveals only the
desired outputs. But we have a lot of paths to request, and a lot of
served paths, so it will be capital-E Expensive.

I am not sure how to do such an oblivious search in a reasonably secure
and efficient way. Although I guess I could ask a few people to see if
the current state of the art is indeed better nowadays (but I have
doubts it scales well).

Well… if they don’t answer I can ask people from my side, just tell me
if that’s required :slight_smile:

Just my two cents: has anyone thought about a distributed filesystem
instead of a P2P file-sharing solution?

Most of them are built with clusters in mind (i.e. in-LAN usage), but a
few are built for internet scale, like Ceph, Tahoe-LAFS and a few
others; some are nearly POSIX filesystems, while many are meant to be
used only via their APIs.

I have essentially ZERO experience with them, only old and occasional
personal exploration, mostly for LAN usage; however, by instinct they
seem more interesting than a file-sharing solution for such a task…

– Ingmar

Yes, that’s my view as well. The real P2P systems seem relatively difficult to design in a way that uses resources efficiently. Hosting several “larger” mirrors around the world isn’t too hard, even if we wanted to avoid standard CDNs.


For reference, the implementer of the last solution tried IPFS first and ran into the bottlenecks (and notified upstream about them).


Well, NixOS does have the advantage of being also a configuration
management system. Meaning that turning the thing on could be a simple
on-off switch, while on deb-based distros it’s likely a big time
investment to get things to work.

Well… if I were able to painlessly share the downloads between my
computers over my LAN, I’d be happy to do it. But I don’t, because I’m
too lazy to set up a cache.

Points taken :smiley: !

You made me want to believe again that it is possible. Let’s see where it leads.

This. It doesn’t have to be a globally-distributed network to be useful, just a bit more fluid than nix-copy-closure. It would be so cool if I could join my machines together: nix-store join <shared-secret> <host-ip>. In that context there is no need to hide keys either.


Do you think ed2k would actually scale to nixpkgs scale? I mean, I
currently have ~4M files in my store, not even counting that they should
be further split into chunks to increase parallelism, and I have done a
garbage collection relatively recently… I guess we’re hitting a billion
files if we include even just a year’s worth of nixpkgs bumps, and with
many people on the network we’d likely get at least that.

I think we should index store paths. I have a mere 1.86M files right
now, but this corresponds to just 27k store paths. This sounds fine for
ed2k search performance, if my memory serves me well (in the model of
every person sharing some amount of random stuff collected over the
years).

We could also compare large BitTorrent trackers with Hydra from the POV
of daily entry flow…

Not the same one. This protocol, by the way, has an obvious MITM attack
which is cheap — that is why I said that a secure two-party random
string generation is needed: we need both the client and the server to
be able to ensure that the nonce is good.

Well… I assumed that the “A secure connection is established between
the user and the peer” was a standard TLS-or-similar protocol that
already checks that both client and server agree on the session key,
which is just as good as agreeing on the nonce.

However, the MitM you describe is indeed an issue, and it isn’t
prevented by making sure client and server agree on the nonce, as the
attacker could just relay the nonce-agreement messages.

A solution to this may be to use the DH-derived encryption key to derive
the nonce, because with this the attacker can’t be MitM’ing (because if
they’re MitM’ing, then the DH-derived key wouldn’t be the same).

DH is expensive, I would just do a hash-based commitment to personal
nonces, then reveal the committed values, then xor.
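A sketch of that commit-reveal-xor exchange, illustrative only; `send` and `recv` stand for whatever transport carries the messages:

```python
import hashlib
import secrets

def commit(value: bytes) -> bytes:
    return hashlib.sha256(value).digest()

# Each side picks its own random contribution and first sends only a
# commitment, so neither party can choose its value after seeing the other's.
my_part = secrets.token_bytes(32)
send(commit(my_part))
their_commitment = recv()

# Only after both commitments are exchanged are the values revealed.
send(my_part)
their_part = recv()
assert commit(their_part) == their_commitment, "peer changed its value"

# The agreed nonce is the xor of both contributions.
nonce = bytes(a ^ b for a, b in zip(my_part, their_part))
```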

Also, maybe something from the zero-knowledge proof domain could help
here? I’m not familiar with it, though :confused:

Well, there is a secure multi-party protocol that reveals only the
desired outputs. But we have a lot of paths to request, and a lot of
served paths, so it will be capital-E Expensive.

I am not sure how to do such an oblivious search in a reasonably secure
and efficient way. Although I guess I could ask a few people to see if
the current state of the art is indeed better nowadays (but I have
doubts it scales well).

Well… if they don’t answer I can ask people from my side, just tell me
if that’s required :slight_smile:

Well, I guess I was too pessimistic: by just looking at IACR preprint
server one can find out that the so-called «Private set intersection» is
enough at least in the case of one server and one client…

Do you think ed2k would actually scale to nixpkgs scale? I mean, I
currently have ~4M files in my store, not even counting that they should
be further split into chunks to increase parallelism, and I have done a
garbage collection relatively recently… I guess we’re hitting a billion
files if we include even just a year’s worth of nixpkgs bumps, and with
many people on the network we’d likely get at least that.

I think we should index store paths. I have a mere 1.86M files right
now, but this corresponds to just 27k store paths. This sounds fine for
ed2k search performance, if my memory serves me well (in the model of
every person sharing some amount of random stuff collected over the
years).

I guess the only issue is finding a way to share files securely
then. Unless we want to setup only private-network build sharing, but
this’d reduce the use cases quite a lot IMO. And it’d also potentially
make setup more complex, because then we need a way to define the
boundaries of the sharing. (well, either that or use a model like
syncthing, maybe?)

Well, that or have hydra publish the list of hashes that should go on
the p2p network, but this sounds both less elegant and less efficient
(eg. for people who patch glibc in a way not built by hydra, for people
wanting to share armv7 builds (want want want), etc.)

We could also compare large BitTorrent trackers with Hydra from the POV
of daily entry flow…

Indeed :slight_smile:

A solution to this may be to use the DH-derived encryption key to derive
the nonce, because with this the attacker can’t be MitM’ing (because if
they’re MitM’ing, then the DH-derived key wouldn’t be the same).

DH is expensive, I would just do a hash-based commitment to personal
nonces, then reveal the committed values, then xor.

Well, I assume every modern encrypted channel between User and Peer
would have already done a DH exchange to bring a bit of
forward-secrecy. So it shouldn’t cost anything.

Also, with the hash-based commitment, what blocks the attacker from just
relaying the hash-commitment messages and getting the file?

The only solution to avoid that IMO is to make the nonce depend on the
encryption key (so that it can’t be used outside of this encrypted
channel). And to prevent the attack where the attacker establishes an
encrypted channel with both ends and relays, the only tool I know of for
unauthenticated connections is DH.

Well, I guess I was too pessimistic: by just looking at IACR preprint
server one can find out that the so-called «Private set intersection» is
enough at least in the case of one server and one client…

Hmm… I guess the issue is then that we’d need to compute the private set
intersection with everyone, which doesn’t really sound reasonable. :confused:

[1] appears to give a few more set operation primitives. With these
primitives I think I can see a way of doing the job:

  • Let I be the set of users; user i’s set of local paths is U_i
  • Compute Ω = ⋃_{i ∈ I} {(i, x) | x ∈ U_i}
  • When user j wants to download path P, it computes
    Ω ∩ {(i, P) | i ∈ I}
    and then can recover the list of users having path P (a plain,
    non-private sketch of these set operations follows this list)
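Here is that sketch as ordinary Python set operations, just to pin down what is being computed; the whole point of [1] is to do this without revealing the U_i or P, which this deliberately ignores:

```python
# Toy, non-private illustration with two users and made-up store paths.
users = {
    "alice": {"/nix/store/aaa-glibc-2.27", "/nix/store/bbb-bash-4.4"},
    "bob":   {"/nix/store/bbb-bash-4.4", "/nix/store/ccc-curl-7.63"},
}

# Ω = ⋃_i {(i, x) | x ∈ U_i}
omega = {(i, x) for i, paths in users.items() for x in paths}

# User j wants path P: intersect Ω with {(i, P) | i ∈ I}
P = "/nix/store/bbb-bash-4.4"
candidates = omega & {(i, P) for i in users}
holders = {i for (i, _) in candidates}   # {"alice", "bob"}
```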

Now, the open questions (I haven’t read the paper completely yet, the
above protocol just appears like only using primitives the paper
provides) are:

  • How slow would it be to update Ω when a user joins / leaves?
  • How slow would the set intersection be? (as it’s not a DHT anymore)
  • How badly could a malicious user mess with Ω? (replaying
    previously-present users, DoS by making it arbitrarily large, etc.)
  • Is it actually possible to implement with what the paper gives?
  • Is it actually secure? (I think so, but have thought about it only
    for like half an hour, so…)

[1] https://link.springer.com/content/pdf/10.1007%2F11535218_15.pdf

Well, that or have hydra publish the list of hashes that should go on
the p2p network, but this sounds both less elegant and less efficient
(eg. for people who patch glibc in a way not built by hydra, for people
wanting to share armv7 builds (want want want), etc.)

Well, a subcommunity can always define its own set of publicly defined
safe paths and locally trusted keys.

A solution to this may be to use the DH-derived encryption key to derive
the nonce, because with this the attacker can’t be MitM’ing (because if
they’re MitM’ing, then the DH-derived key wouldn’t be the same).

DH is expensive, I would just do a hash-based commitment to personal
nonces, then reveal the committed values, then xor.

Well, I assume every modern encrypted channel between User and Peer
would have already done a DH exchange to bring a bit of
forward-secrecy. So it shouldn’t cost anything.

Oh well, propagating between the network layers is a different pain.

Also, with the hash-based commitment, what blocks the attacker from just
relaying the hash-commitment messages and getting the file?

Hm, next thing is to make sure our threat model is at all compatible
with blocking relaying… I wonder if using IP addresses in the nonce
would help or not (maybe not — with coordinated attackers in different
LANs)

The only solution to avoid that IMO is to make the nonce depend on the
encryption key (so that it can’t be used outside of this encrypted
channel). And to prevent the attack where the attacker establishes an
encrypted channel with both ends and relays, the only tool I know of for
unauthenticated connections is DH.

Hm, you are probably right. Painful.

Well, I guess I was too pessimistic: by just looking at IACR preprint
server one can find out that the so-called «Private set intersection» is
enough at least in the case of one server and one client…

Hmm… I guess the issue is then that we’d need to compute the private set
intersection with everyone, which doesn’t really sound reasonable. :confused:

That’s true; but then maybe there is a better match…

  • How badly could a malicious user mess with Ω? (replaying
    previously-present users, DoS by making it arbitrarily large, etc.)

Well, it is actually hard to defend against claims of having a lot of
paths — these claims can even be true (and still cheap for an attacker
to make).


Well, that or have hydra publish the list of hashes that should go on
the p2p network, but this sounds both less elegant and less efficient
(eg. for people who patch glibc in a way not built by hydra, for people
wanting to share armv7 builds (want want want), etc.)

Well, a subcommunity can always define its own set of publicly defined
safe paths and locally trusted keys.

Actually, thinking more about it… we need Hydra to push a file that
associates the .drv hash with the fixed-output hash anyway. There is a
much simpler solution, then (though not optimal): just symmetrically
encrypt every derivation on the p2p network, and have Hydra publish the
keys to decrypt its builds along with the .drv → fixed-output hash
mapping. This also avoids the case where weak secrets could be guessed,
because an attacker would simply not have the encryption key for the
derivation.
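A minimal sketch of that scheme, assuming a symmetric cipher from the Python `cryptography` package; the helper names (`p2p_store`, `publish_mapping`, `lookup_key`, `p2p_fetch`) are made up placeholders for whatever Hydra and the p2p client would actually do:

```python
from cryptography.fernet import Fernet

def publish_build(drv_hash: str, nar_bytes: bytes):
    # Hydra-side: encrypt the NAR with a per-derivation key before it hits
    # the p2p network, and publish the key next to the .drv -> output mapping.
    key = Fernet.generate_key()
    ciphertext = Fernet(key).encrypt(nar_bytes)
    p2p_store(drv_hash, ciphertext)        # placeholder
    publish_mapping(drv_hash, key)         # signed, alongside the fo hash

def fetch_build(drv_hash: str) -> bytes:
    # Client-side: only someone who got the key from Hydra's signed mapping
    # can decrypt; an attacker guessing weak secrets inside the build never
    # sees the plaintext in the first place.
    key = lookup_key(drv_hash)             # placeholder
    ciphertext = p2p_fetch(drv_hash)       # placeholder
    return Fernet(key).decrypt(ciphertext)
```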

The only issue is that, if we want to reuse only off-the-shelf tools,
we might run into problems, because it’d require on-the-fly
encryption/decryption to avoid storing each file twice on disk. A FUSE
FS could do the job, but…

But here comes the thing I heard about just earlier today:
dat. From what I’ve heard, it’s doing
basically this “sync encrypted files” protocol.

In addition, it looks like it also supports syncing from https, making
degradation to CDN mode graceful and potentially (not sure about that
yet though) even taking a regular CDN as one of the peers, getting
maximum performance for everyone.

There is a Rust library in progress, meaning that C bindings are (at
least compared to a hand-rolled protocol) close.

Downside: the cost of encryption (well… there’ll be encryption anyway)
and potential scalability questions. There is a will to scale to nixpkgs
scale, but their IRC channel appeared not to know whether it’d actually
work out… I guess few projects end up at this scale of data volume, so
we’d likely be guinea pigs here anyway.

The only solution to avoid that IMO is to make the nonce depend on the
encryption key (so that it can’t be used outside of this encrypted
channel). And to prevent the attack where the attacker establishes an
encrypted channel with both ends and relays, the only tool I know of for
unauthenticated connections is DH.

Hm, you are probably right. Painful.

Painful indeed. Then, apart from the idea of dat (or a similar protocol)
above, I can see only this solution and ZKPs to handle this. And this
list is in order from most-reasonable to least-reasonable, I think…

  • How badly could a malicious user mess with Ω? (replaying
    previously-present users, DoS by making it arbitrarily large, etc.)

Well, it is actually hard to defend against claims of having a lot of
paths — these claims can even be true (and still cheap for an attacker
to make).

Indeed… that’s likely something we can’t defend against well, so we’ll
need graceful degradation whatever the protocol (i.e. giving up and
either fetching from the CDN or building locally).

I want to talk about something simpler, which might be a first step towards the global P2P cache but also has immediate value: using a protocol more suitable than HTTP and SSH for transferring .nar files.

The problems are:

  1. there are huge .nar files (GraalVM is a single 2.5 GB file); transfers of huge files tend to break even on datacenter links, and our current protocols have no means to resume a transfer (BitTorrent splits big files into 4 MB chunks; a small chunking sketch follows this list)
  2. file transfers should have lower priority than regular SSH and HTTP (BitTorrent has the uTP protocol, which addresses this: BEP 29)
  3. lack of P2P: transfers might be accelerated if there were a tracker keeping track of the locations of derivations
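For illustration, here is BitTorrent-style chunking of a .nar with per-piece hashes, which is what makes resuming cheap (only the missing or corrupted pieces are re-fetched). This is a sketch, not an existing Nix feature:

```python
import hashlib

PIECE_SIZE = 4 * 1024 * 1024  # 4 MiB, a common BitTorrent piece size

def piece_hashes(nar_path: str) -> list[str]:
    """Split a .nar into fixed-size pieces and hash each one, so an
    interrupted transfer can resume by re-requesting only the pieces
    that are missing or fail verification."""
    hashes = []
    with open(nar_path, "rb") as f:
        while piece := f.read(PIECE_SIZE):
            hashes.append(hashlib.sha256(piece).hexdigest())
    return hashes
```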

So what about making a P2P system suitable for private installations, an alternative not to cache.nixos.org but to its little brother, nix-serve (GitHub - edolstra/nix-serve: A standalone Nix binary cache server)?
It looks simpler and doable, and the security concerns could be postponed.

(The main use case I have in mind is CI/CD, so to speak mass nix-copy-closure --to .... With 100+ target machines it is a nightmare; the second case is a smarter nix-build --substituters ... which is able to find the files in the nearest rack/DC/continent.)


I have an implementation of a local multicast/avahi-based discovery of other Nix nodes (running my software) where they can share store paths that have been downloaded from Hydra without adding another trust anchor. Obviously you are restricted to the local network segment (or multiple segments, if mDNS forwarding exists in the network).

It works rather well. I have been using this non-stop on all of my machines for about 4 months now. It is a great feeling that most of the stuff you download comes from the machines next to you, especially if the download speeds at your current location are very limited.
The only issue I am facing is buggy avahi bindings that aren’t really safe to use from Rust. I am planning on migrating this to plain IPv6 (site-)local multicast instead. A while ago I got a globally unique local multicast address from IANA for this. I will try to get some work done on it during the holidays.

Currently I rely on a fork of Nix, since I had to add one primitive to the store protocol which has yet to receive any kind of review or comment: add support for queryPathFromFileHash by andir · Pull Request #3099 · NixOS/nix · GitHub
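(Not the implementation described above, which is Rust + avahi, but a rough illustration of the same discovery idea using the python-zeroconf library and a made-up `_nix-cache._tcp` service type; the hard part, keeping trust anchored in Hydra’s signatures, is deliberately skipped here.)

```python
import socket
from zeroconf import Zeroconf, ServiceInfo, ServiceBrowser

SERVICE = "_nix-cache._tcp.local."

class CacheListener:
    def add_service(self, zc, type_, name):
        info = zc.get_service_info(type_, name)
        if info and info.addresses:
            addr = socket.inet_ntoa(info.addresses[0])
            print(f"found LAN binary cache at http://{addr}:{info.port}")

    def remove_service(self, zc, type_, name): pass
    def update_service(self, zc, type_, name): pass

zc = Zeroconf()
# Announce our own cache (e.g. a local nix-serve instance on port 5000)...
zc.register_service(ServiceInfo(
    SERVICE, "mymachine." + SERVICE,
    addresses=[socket.inet_aton("192.168.1.20")], port=5000))
# ...and watch for caches announced by other machines on the segment.
ServiceBrowser(zc, SERVICE, CacheListener())
```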


I think that with https://github.com/NixOS/nix/issues/3260, turning nix-serve into a BitTorrent tracker will be trivial.

Just to mention that the latest IPFS release includes support for UnixFS metadata (to store permissions and sticky bits). Previously, it was not possible to store these file attributes, which are part of a nar file.
See js-ipfs 0.41.0 released | IPFS Blog & News for details :wink:

I believe NARs intentionally do not store these things (or timestamps). Perhaps just the eXecutable bit would be useful from these, if I read the post right.

Ah yes, my bad, sorry. I didn’t remember the issue with IPFS correctly: it was about the executable bit and not the other attributes.

The main issue was performance problems. And they still have performance problems; they are focusing on fixing them in 2020. GitHub - ipfs/roadmap: IPFS Project && Working Group Roadmaps Repo


Hi,
just as a two-cent idea: IPFS is a monster, barely usable even just for
play, and despite its goal of being fully P2P, IPNS did not work when
the Russian Federation tried to cut itself off from the internet.
In general, fully distributed software can’t really scale and doesn’t
really work well in practice (see GNUnet as an example).

BitTorrent, on the contrary, has proven to scale well. It’s not
completely distributed, but the tracker does not need many resources, so
for instance a simple BitTorrent client that shares /nix/store and
optionally mirrors nixpkgs or even an entire cache could be usable
without much complexity, and could “run Nix{,OS} infrastructure” even
from a cheap VPS with a known domain attached. The only needed tweaks
would be how much upload bandwidth and how many concurrent connections
to accept.

Collaboration for development is more complex, since while it’s easy
to share a git repo via torrent, it’s not easy to handle commits and
propagate changes; but even just shifting much of the cache load onto
Nix users would IMVHO be a very good move, and a real start toward
being distributed.

GitTorrent (abandoned? [1]) and GHTorrent [2] have not spread or
evolved much, but they push in a similar direction. Still, having the
resources to share a repo is one thing; having the resources, in
storage and bandwidth terms, to serve thousands of archives is another.

[1]
https://blog.printf.net/articles/2015/05/29/announcing-gittorrent-a-decentralized-github/
GitHub - cjb/GitTorrent: A decentralization of GitHub using BitTorrent and Bitcoin

[2] https://ghtorrent.org/

Yes, sadly IPFS has not reached its goal yet. But I don’t see a reason why it shouldn’t in the future. They have millions in investment from the Filecoin ICO, hopefully some smart people working on it, and they incorporate findings from academia. I’m still optimistic that it will be usable some day.

What do you think about an architecture like Tox Bootstrap Nodes?

Do you see a pragmatic solution where my computer uses packages from another computer on my local network when installing packages? Computers should share them and announce this to each other. Maybe some zeroconf magic? And nix-serve.