Suggestion/Feature: use BitTorrent or IPFS to download packages

We already have most of what's needed to use BitTorrent or IPFS (or some similar distributed network) to download packages from, instead of HTTP. We have a cryptographic hash of every package; Hydra signs them, so it's secure; and lots of people download those packages and have their PCs plugged into the internet. Why don't we take advantage of those features of Nix and add an option to download packages from such networks, decreasing the pressure on cache.nixos.org and possibly decreasing download time on slow connections? (BTW, I don't know much about the exact way to implement that, nor am I experienced with such distributed networks. It's just an idea I had tonight.)

You may be interested in this :slight_smile:

Personally I think IPFS is a useless monster, but we (as a society) need distributed solutions for tons of reasons: being independent from someone else's computers, and being able to keep a distro up in a world of proprietary solutions, now that the "free mirror service" from universities, ISPs, etc. that was common in the past is disappearing. Any FOSS user/developer, and IMVHO any citizen, should now push for distributed or at least decentralized solutions instead of relying on classic client-server or proprietary platforms like GitHub. That's why I'm not fond of Discourse instead of a classic mailing list, or of Nix's GitHub usage instead of the Linux model around LKML, etc.

However, at the current state of the technology I think it's "hic sunt leones": tons of partial solutions, proofs of concept, etc. exist, but nothing strong, simple, and well understood.

I do not have enough knowledge or time, but I would definitely appreciate a sort of user-shared NixOS, meaning a distributed mailing list (perhaps I2P mail, like I2P-Bote, for now), a distributed website (perhaps ZeroNet), and distributed storage with a build service + binary cache on top, so NixOS would really depend exclusively on its devs and users, and other paid/offered services would play only a marginal role. For this last part I do not have ideas; however, if something like Popcorn Time exists and works, I think the same can be done for distros…

– Ingmar

I understand (and share) the sentiment, but there are usually some difficulties.

For example, I'm convinced it's noticeably more difficult to use a mailing list as an issue tracker than a real issue tracker. It's similar with Discourse vs. a mailing list – and yes, we have the option to self-host a Discourse instance (as it's open source), but… that would just take more human work, and we prefer to spend it on other nix* issues. It's similar with the binary cache – it's not easy to build something comparable to "commercial CDNs", so as long as that works very well and we have lots of much worse things to improve…

That doesn’t really stop anyone from working on such solutions, but just be prepared that there may not be enough incentive for most of the community to switch if the alternative is worse in some respects.

I am not talking about "switching" - I just thought it would be awesome if we added an option to use a "distributed network" as a binary cache. Maybe it could be BitTorrent: use the existing AWS infrastructure as a tracker, and expand nix-serve to allow sharing your store (or perhaps parts of it) via BitTorrent with other users. Just a thought, though, as I don't know much about the internals of BitTorrent.

It might work; there are various options to try. AWS even has an option to serve files as a BitTorrent seed, but there seemed to be a risk that it might draw too much traffic from AWS and thus cost too much money.

IIRC the part of IPFS with bad performance was the one that tracks which file chunks are present on each machine.

For reference, we have a (partially finished) HTTP-based solution with rsync among servers and planned mirrorbits for redirection to a "suitable" server (the closest one that works well). It syncs from cache.nixos.org. This even ran for a few months on two servers IIRC, but (almost) no one used it, and currently it's probably not actively used or developed.

A recurring argument is that this is a security issue in some cases, because you can then know what packages are in a given store, and deduce the running version of, say, web servers hosted there.

It reduces the usefulness of the common Nix users "[who] download those packages and have their PCs plugged into the internet".

A recurring argument is that this is a security issue in some cases, because you can then know what packages are in a given store, and deduce the running version of, say, web servers hosted there.

This is only a security issue if you count security through obscurity as "security". Web servers can often be fingerprinted, etc. It's actually pretty rare (euphemism) for a server to leak no information about its version, even when configured specifically not to.

Now, yes, leaking the full store path does leak more information. But
relying on it for security sounds like a mistake to me, from a security
point of view.

Yes, this should likely not be turned on by default. But I do think there would be enough users turning it on for it to make a significant decrease in cache accesses, assuming it's left as a commented-out option in the file generated by "nixos-generate-config", telling users it makes downloads faster in exchange for leaking the versions of software they use to more people than just Hydra (and assuming the caches are actually able to serve IPFS, or whatever).

(Oh, and yes, I'm assuming we don't give an option to easily turn on the IPFS substituter without also turning on the serving daemon. Such "antisocial" behaviour cannot be banned, but I think it's better to push users of the substituter to also serve the things they download this way.)

The main security question is: how easy is it to differentiate files downloaded from the Hydra cache from those built locally? It seems to me like the only sane way to do this is to disallow any of the locally-built packages from being pulled from the machine (otherwise we get nasty security implications).

If I understand you correctly, the performance of BitTorrent should be better, as it is more optimised?

Possibly. I don't really know. We need to serve very many *.nar files, and their list grows quite quickly. I don't think BitTorrent is really designed for a scheme like that either, but maybe it could work well.

The main security question is: how easy is it to differentiate files downloaded from the Hydra cache from those built locally? It seems to me like the only sane way to do this is to disallow any of the locally-built packages from being pulled from the machine (otherwise we get nasty security implications).

So long as the system used for sending the files is actually a
content-addressed database that does not leak which paths you're
downloading (note: I think IPFS doesn't fit this bill), you should be
safe: you are required to already know the hash of the content (and hence
the content) to download the information.

The problem is, I'm not sure there currently is a content-addressed
distributed system that doesn't leak the hashes of the chunks being
downloaded.
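
To make the distinction concrete, here is a rough sketch (my own illustration, not any existing protocol; SHA-256 is just a stand-in): in a naive DHT the lookup key is the content hash itself, so anyone observing the lookup learns something they can immediately fetch, while indexing by the hash of the hash only reveals a key that is useless without already knowing the content hash.

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Naive scheme: the DHT key *is* the content hash, so observing a lookup
# hands out the very capability needed to download the chunk.
def naive_dht_key(chunk_hash: str) -> str:
    return chunk_hash

# Blinded scheme: index by hash(hash(chunk)); an observer only learns a
# key it cannot turn back into a downloadable request without already
# knowing hash(chunk).
def blinded_dht_key(chunk_hash: str) -> str:
    return h(chunk_hash.encode())

secret_chunk = b"chunk of a locally built path that embeds a secret"
chunk_hash = h(secret_chunk)

print("naive key  :", naive_dht_key(chunk_hash))    # leaks the fetch capability
print("blinded key:", blinded_dht_key(chunk_hash))  # reveals nothing fetchable
```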

Which, indeed, is a big issue I hadn’t heard of before. Thanks!

downloads faster in exchange for leaking the versions of software they use to more people than just Hydra

Note that the leak is not limited to the time of the build, though.

Of course, it is not too hard to fetch things you don’t run to make such
fingerprint measurements a more interesting task.

The problem is, I'm not sure there currently is a content-addressed
distributed system that doesn't leak the hashes of the chunks being
downloaded.

Well, leaking from which party to which? Is it a leak if a client can
check which server agrees to provide a chunk?

Securely checking who has a specific chunk, while leaking that chunk only
to the selected provider, could be rather expensive…

The main security question is: how easy is it to differentiate files downloaded from the Hydra cache from those built locally? It seems to me like the only sane way to do this is to disallow any of the locally-built packages from being pulled from the machine (otherwise we get nasty security implications).

It is useless to distribute files without a widely trusted signature; so either something has been fetched, or reproduced perfectly, or it should not be distributed anyway.
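
One hypothetical way to enforce that on the serving side: before seeding a store path, check that its .narinfo carries a Sig: line from a key you already consider widely trusted, and refuse to share anything else. A rough sketch (the trusted key name is a placeholder, and real signature verification is left to Nix itself):

```python
import pathlib

# Placeholder: keys whose signatures we treat as "widely trusted".
TRUSTED_KEY_NAMES = {"cache.nixos.org-1"}

def signing_keys(narinfo_text: str) -> set[str]:
    """Collect key names from the 'Sig:' lines of a .narinfo
    (each looks like 'Sig: <key-name>:<base64 signature>')."""
    keys = set()
    for line in narinfo_text.splitlines():
        if line.startswith("Sig:"):
            keys.add(line[len("Sig:"):].strip().split(":", 1)[0])
    return keys

def ok_to_seed(narinfo_path: pathlib.Path) -> bool:
    """Seed only paths signed by a trusted key; unsigned, locally built
    paths (which may embed secrets) are never offered to the swarm."""
    return bool(signing_keys(narinfo_path.read_text()) & TRUSTED_KEY_NAMES)
```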

The problem is, I'm not sure there currently is a content-addressed
distributed system that doesn't leak the hashes of the chunks being
downloaded.

Well, leaking from which party to which? Is it a leak if a client can
check which server agrees to provide a chunk?

Securely checking who has a specific chunk, while leaking that chunk only
to the selected provider, could be rather expensive…

The issue is that if the client leaks the hash of the chunk it's looking
for to people other than those who already have it, then these other
people can request the same hash (unless additional security protections
are in place, like searching for the chunk by hash(hash(…)) and then,
once authenticated with the server, checking that both sides know the
hash… but I don't know of any replication system that advertises
actually doing that).

This in turn means security issues, as this hash can be the hash of a
chunk that contains e.g. a password, and then anyone can request it from
the builder's store. Basically #47860, but way worse.

Well, leaking from which party to which? Is it a leak if a client can
check which server agrees to provide a chunk?

Securely checking who has a specific chunk, while leaking that chunk only
to the selected provider, could be rather expensive…

The issue is that if the client leaks the hash of the chunk it's looking
for to people other than those who already have it, then these other
people can request the same hash (unless additional security protections
are in place, like searching for the chunk by hash(hash(…)) and then,
once authenticated with the server, checking that both sides know the
hash… but I don't know of any replication system that advertises
actually doing that).

This in turn means security issues, as this hash can be the hash of a
chunk that contains e.g. a password, and then anyone can request it from
the builder's store. Basically #47860, but way worse.

To solve this scenario it is probably enough to bite the bullet and sync
the list of paths signed by every widely used public key. I.e. the list
of paths built by Hydra is smaller than the channel tarball anyway, and
we could try some incremental tricks.

If I only request chunks known to be theoretically buildable by Hydra,
or present in the r-ryantm Cachix cache, I am not too likely to leak a
chunk containing something specific to me.

But even then I leak my upgrade patterns.

(and another story is that I can find out the list of all people willing
to serve a specific Hydra-built chunk)
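
As a rough client-side sketch of that bullet-biting (the file location and its format are made up for illustration, and the sync mechanism itself is out of scope): sync the list of publicly known path hashes once, and only ever ask the swarm for chunks that appear in it.

```python
from pathlib import Path

# Hypothetical local copy of the synced list of Hydra/Cachix path hashes,
# one hash per line.
PUBLIC_HASHES_FILE = Path("/var/lib/nix-p2p/public-path-hashes")

def load_public_hashes() -> frozenset[str]:
    return frozenset(PUBLIC_HASHES_FILE.read_text().split())

def may_request_from_swarm(path_hash: str, public: frozenset[str]) -> bool:
    """Request over P2P only chunks already known to be public; anything
    else falls back to a private substituter or a local build, so no
    hash specific to this machine is ever broadcast."""
    return path_hash in public
```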

If you want to do a chunk search with moderate disclosure, you could
probably broadcast 4 characters of the hash, then run a relatively
expensive oblivious match check with a random small subset of the servers
who claim to have something starting with those 4 characters. For the
oblivious check, note that a fraction of 6e-8 (i.e. one out of 64^4) is
usually one path, or maybe two, so agreeing on a random salt and asking
the server to search for a salted hash in its list is not that expensive
anymore.
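
Roughly, in code (purely a sketch of the idea: the salt agreement is simplified to the server picking a fresh salt per request, SHA-256 stands in for whatever hash would really be used, and the example hashes are made up):

```python
import hashlib
import secrets

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# --- client side ------------------------------------------------------------
def prefix_query(wanted_hash: str) -> str:
    # Broadcast only 4 characters: roughly one path in 64^4 (~6e-8) matches,
    # so a server typically holds zero, one or two candidates.
    return wanted_hash[:4]

def salted_probe(wanted_hash: str, salt: str) -> str:
    # Reveal only H(salt || hash); it matches an entry on the server only
    # if the server already knows that exact hash.
    return sha256_hex((salt + wanted_hash).encode())

# --- server side ------------------------------------------------------------
def has_match(store_hashes: set[str], prefix: str, salt: str, probe: str) -> bool:
    candidates = [x for x in store_hashes if x.startswith(prefix)]
    return any(sha256_hex((salt + x).encode()) == probe for x in candidates)

# Example round; the server owns the salt, so replaying a captured probe fails.
store = {"abcd1f0e99", "abce77d2c3"}
wanted = "abcd1f0e99"
salt = secrets.token_hex(16)
print(has_match(store, prefix_query(wanted), salt, salted_probe(wanted, salt)))  # True
```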

Of course, if you request just a single chunk, and the attacker knows its
structure and it contains a weak enough password, the leak helps with
offline brute-forcing on a GPU. Just requesting from the server won't
do, though, because an honest server will make sure that the random-salt
agreement ends up with a different salt on replay (and a dishonest
server being able to provide you with a file containing your password is
bad enough even without a second attacker).

I did not intend to start a complete discussion about security and adequate technologies to mitigate the risks. I just personally tried to implement such a scheme for .deb packages years ago, only to discover that there already existed projects doing it very well (apt-p2p, debtorrent), but none gained traction within the Debian or Ubuntu communities. I cannot stress this enough: there are several perfectly working implementations that nobody uses.

Before arguing more, we should wonder why this idea pops up every once in a while, sometimes gets implemented, but is never used. And what makes Nix different from deb-based distros?

From what I could gather, there is simply no traction to switch to P2P when there are reliable mirrors everywhere on the planet. These mirrors are a kind of P2P by themselves, just less decentralised.
And then all these nasty issues of information disclosure kick in, and the project stalls and dies before really getting used.

I understand how this idea appeals to a programmer, but in real life we have CDNs, and torrents are only used for ISO images, to allow recovery from network instability. Somehow, there must be a reason.

To solve this scenario it is probably enough to bite the bullet and sync
the list of paths signed by every widely used public key. I.e. the list
of paths built by Hydra is smaller than the channel tarball anyway, and
we could try some incremental tricks.

If I only request chunks known to be theoretically buildable by Hydra,
or present in the r-ryantm Cachix cache, I am not too likely to leak a
chunk containing something specific to me.

Hmm… I think the solution of having an adapted P2P system would work
too? (cf. the hash(hash(…)) idea above) Maybe one already exists, but
IIRC current P2P systems just don't scale well enough to handle nixpkgs,
so we'd need a new one anyway.

But even then I leak my upgrade patterns.

(and another story is that I can find out the list of all people willing
to serve a specific Hydra-built chunk)

If you want to do a chunk search with moderate disclosure, you could
probably broadcast 4 characters of the hash, then run a relatively
expensive oblivious match check with a random small subset of the servers
who claim to have something starting with those 4 characters. For the
oblivious check, note that a fraction of 6e-8 (i.e. one out of 64^4) is
usually one path, or maybe two, so agreeing on a random salt and asking
the server to search for a salted hash in its list is not that expensive
anymore.

That sounds more or less similar to the hash(hash(…)) idea above? To
make the full protocol I was thinking of more explicit (so we can check
it's the same as the one you were thinking of):

  • User searches the DHT for hash(hash(chunk))
  • The peers who claim to have the chunk answer
  • User establishes secure connections to these peers. For each peer:
    • A secure connection is established between the user and the peer
    • Peer sends a nonce
    • User answers with hash(hash(chunk) || nonce)
    • User has now proven they know the chunk's hash and are therefore
      allowed to download; Peer sends the chunk

Maybe some P2P protocol already does this? I must say I haven't checked,
but it sounds quite safe under the assumed threat model (i.e. anyone
listening on the network can no longer get a private file without
already knowing its hash, which is proof of knowing the file's contents,
given that the hash is no longer transmitted to anyone else).

(There's actually no need for Peer to prove it knows the chunk by
sending a hash with the nonce, because Peer will send the chunk to User
and not the other way around, so only User needs to prove it already
"knows" the chunk.)
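
For what it's worth, here is a minimal sketch of that exchange in code (purely illustrative: the DHT lookup is stubbed out, the "secure connection" is assumed to already exist, and SHA-256 stands in for whatever hash the real system would use):

```python
import hashlib
import secrets

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# DHT index key: peers advertise hash(hash(chunk)), never hash(chunk) itself.
def dht_key(chunk_hash: bytes) -> bytes:
    return h(chunk_hash)

class Peer:
    """A peer that already has the chunk and is willing to serve it."""
    def __init__(self, chunk: bytes):
        self.chunk = chunk
        self.chunk_hash = h(chunk)
        self.nonce = b""

    def challenge(self) -> bytes:
        self.nonce = secrets.token_bytes(32)
        return self.nonce

    def serve(self, proof: bytes) -> bytes | None:
        # Only a user who already knows hash(chunk) can compute this proof;
        # the peer itself never has to prove anything, since it is the one
        # sending the data.
        if proof == h(self.chunk_hash + self.nonce):
            return self.chunk
        return None

def fetch(peer: Peer, wanted_hash: bytes) -> bytes | None:
    """User side: wanted_hash is hash(chunk); in reality the peer would be
    found by looking up dht_key(wanted_hash) first."""
    nonce = peer.challenge()
    chunk = peer.serve(h(wanted_hash + nonce))
    if chunk is not None and h(chunk) == wanted_hash:  # verify what we got
        return chunk
    return None

chunk = b"some chunk of a .nar file"
print(fetch(Peer(chunk), h(chunk)) == chunk)  # True
```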

Of course, if you request just a single chunk, and the attacker knows its
structure and it contains a weak enough password, the leak helps with
offline brute-forcing on a GPU. Just requesting from the server won't
do, though, because an honest server will make sure that the random-salt
agreement ends up with a different salt on replay (and a dishonest
server being able to provide you with a file containing your password is
bad enough even without a second attacker).

That is an interesting other threat model… but here I'm not sure there's
much that can be done against it. The inner hash could become bcrypt
or something similar, though, which would make things harder.

Also, maybe something from the zero-knowledge proof domain could help
here? I’m not familiar with it, though :confused:
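
If it helps picture it, a tiny sketch of that bcrypt point, using PBKDF2 from the standard library as a stand-in for bcrypt and sketched as a drop-in for the salted_probe() function earlier in the thread (so the whole probe derivation becomes slow, rather than strictly the inner hash; parameters are arbitrary):

```python
import hashlib

def slow_salted_probe(wanted_hash: str, salt: str) -> str:
    # Deliberately expensive KDF: an attacker who captured a probe must pay
    # the same half-million iterations for every candidate chunk they guess.
    return hashlib.pbkdf2_hmac(
        "sha256", wanted_hash.encode(), salt.encode(), iterations=500_000
    ).hex()
```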

I did not intend to start a complete discussion about security and adequate technologies to mitigate the risks. I just personally tried to implement such a scheme for .deb packages years ago, only to discover that there already existed projects doing it very well (apt-p2p, debtorrent), but none gained traction within the Debian or Ubuntu communities. I cannot stress this enough: there are several perfectly working implementations that nobody uses.

Well, it is an interesting discussion :slight_smile: Also, these working
implementations are for Debian, which doesn't build private files (well…
most of the time, and I'd guess that when it does, these implementations
aren't safe to use).

OTOH, nix also builds the configuration, including private stuff,
meaning we do need some security-specific improvements.

Before arguing more, we should wonder why this idea pops up every once in a while, sometimes gets implemented, but is never used. And what makes Nix different from deb-based distros?

Well, NixOS does have the advantage of also being a configuration
management system, meaning that turning the thing on could be a simple
on-off switch, while on deb-based distros it's likely a big time
investment to get things working. Actually, we could even make it a
1-character deletion, if it came commented out in the default
nixos-generate-config output.

Here, the drawback (requiring a specific P2P system) becomes the
advantage: people are much more likely to actually do it.

And the important thing about P2P systems is that people actually
participate. Without that, there's no reason to use them, as they'll
usually just be slower than downloading from a CDN.

From what I could gather, there is simply no traction to switch to P2P when there are reliable mirrors everywhere on the planet. These mirrors are a kind of P2P by themselves, just less decentralised.
And then all these nasty issues of information disclosure kick in, and the project stalls and dies before really getting used.

I understand how this idea appeals to a programmer, but in real life we have CDNs, and torrents are only used for ISO images, to allow recovery from network instability. Somehow, there must be a reason.

Well… if I were able to painlessly share the downloads between my
computers over my LAN, I'd be happy to do it. But I don't, because I'm
too lazy to set up a cache.

The whole idea would be to make turning on P2P much less work than
setting up a cache (because everyone would detect everyone else anyway),
and potentially even more secure too (because of #47860: people just
can't be trusted to actually keep their caches private when they need to be).

So yes, there must be a reason. I believe this reason is the complexity
of setting up a P2P system and its non-discoverability by newcomers,
two issues NixOS could solve.