Suggestion/Feature: use BitTorrent or IPFS to download packages

Do you think ed2k would actually scale to nixpkgs scale? I mean, I
currently have ~4M files in my store, not even counting that they should
be further split into chunks to increase parallelism, and I have done a
garbage-collect relatively recently… I guess we'd be hitting a billion
files if we include even just a year's worth of nixpkgs bumps, and with
many people on the network we'd likely get at least that.

I think we should index store paths. I have a mere 1.86M files right
now, but this corresponds to just 27k store paths. This sounds fine for
ed2k search performance, if my memory serves me well (in the model of
every person sharing some amount of random stuff collected over the
years).

We could also compare large BitTorrent trackers with Hydra from the POV
of daily entry flow…

Not the same one. This protocol, by the way, has an obvious MITM attack
which is cheap; that is why I said that secure two-party random string
generation is needed: we need both client and server to be able to
ensure that the nonce is good.

Well… I assumed that the “A secure connection is established between
the user and the peer” step was a standard TLS-or-similar protocol,
which already checks that both client and server agree on the session
key, which is just as good as agreeing on the nonce.

However, the MitM you describe is indeed an issue, and it isn’t
prevented by making sure client and server agree on the nonce, as the
attacker could just relay the nonce-agreement messages.

A solution to this may be to use the DH-derived encryption key to derive
the nonce, because with this the attacker can’t be MitM’ing (because if
they’re MitM’ing, then the DH-derived key wouldn’t be the same).

DH is expensive, I would just do a hash-based commitment to personal
nonces, then reveal the committed values, then xor.
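
To make the commit-reveal-xor idea concrete, here is a minimal sketch (the framing and function names are mine, not a proposed wire format):

```python
import hashlib
import secrets

def commit(nonce: bytes) -> bytes:
    """Hash-based commitment to a locally chosen random nonce."""
    return hashlib.sha256(nonce).digest()

def joint_nonce(my_nonce: bytes, their_nonce: bytes, their_commitment: bytes) -> bytes:
    """Check the peer's reveal against its earlier commitment, then xor."""
    if hashlib.sha256(their_nonce).digest() != their_commitment:
        raise ValueError("peer's reveal does not match its commitment")
    return bytes(a ^ b for a, b in zip(my_nonce, their_nonce))

# Both roles played locally: exchange commitments first, then reveal.
alice_nonce, bob_nonce = secrets.token_bytes(32), secrets.token_bytes(32)
alice_commitment, bob_commitment = commit(alice_nonce), commit(bob_nonce)
shared = joint_nonce(alice_nonce, bob_nonce, bob_commitment)
assert shared == joint_nonce(bob_nonce, alice_nonce, alice_commitment)
```

Note that this only guarantees neither side can bias the nonce; as discussed below, it does nothing against an attacker who simply relays the messages.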

Also, maybe something from the zero-knowledge proof domain could help
here? I’m not familiar with it, though :confused:

Well, there is a secure multi-party protocol that reveals only the
desired outputs. But we have a lot of paths to request, and a lot of
served paths, so it will be capital-E Expensive.

I am not sure how to do such an oblivious search in a reasonably secure
and efficient way. Although I guess I could ask a few people to see if
the current state of the art is indeed better nowadays (but I have
doubts it scales well).

Well… if they don’t answer I can ask people from my side, just tell me
if that’s required :slight_smile:

Well, I guess I was too pessimistic: by just looking at IACR preprint
server one can find out that the so-called «Private set intersection» is
enough at least in the case of one server and one client…
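
For a rough idea of what that looks like, here is a toy sketch of the classic DH-based PSI flow (semi-honest model, a toy 127-bit group and a naive hash-to-group; purely illustrative, in no way a secure implementation):

```python
import hashlib
import secrets

# Toy parameters: a 127-bit Mersenne prime. A real deployment would need a
# proper prime-order group (or elliptic curves) and a real hash-to-group.
P = 2**127 - 1

def h(element: str) -> int:
    """Map a store path (or any string) into the group (toy hash-to-group)."""
    return int.from_bytes(hashlib.sha256(element.encode()).digest(), "big") % P

def blind(elements, secret):
    return [pow(h(e), secret, P) for e in elements]

# Client holds A, server holds B; the client wants to learn only A ∩ B.
A = ["/nix/store/aaa-glibc", "/nix/store/bbb-bash", "/nix/store/ccc-hello"]
B = ["/nix/store/bbb-bash", "/nix/store/ddd-curl"]

a = secrets.randbelow(P - 2) + 1              # client's secret exponent
b = secrets.randbelow(P - 2) + 1              # server's secret exponent

client_to_server = blind(A, a)                # H(x)^a, sent in known order
server_reply_1 = [pow(v, b, P) for v in client_to_server]   # H(x)^(ab)
server_reply_2 = set(blind(B, b))             # H(y)^b for the server's set

client_side = {pow(v, a, P) for v in server_reply_2}        # H(y)^(ab)
print([x for x, v in zip(A, server_reply_1) if v in client_side])
# ['/nix/store/bbb-bash']
```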

Do you think ed2k would actually scale to nixpkgs scale? I mean, I
currently have ~4M files in my store, not even counting that they should
be further split into chunks to increase parallelism, and I have done a
garbage-collect relatively recently… I guess we'd be hitting a billion
files if we include even just a year's worth of nixpkgs bumps, and with
many people on the network we'd likely get at least that.

I think we should index store paths. I have a mere 1.86M files right
now, but this corresponds to just 27k store paths. This sounds fine for
ed2k search performance, if my memory serves me well (in the model of
every person sharing some amount of random stuff collected over the
years).

I guess the only issue is finding a way to share files securely
then. Unless we want to set up only private-network build sharing, but
this’d reduce the use cases quite a lot IMO. And it’d also potentially
make setup more complex, because then we need a way to define the
boundaries of the sharing. (well, either that or use a model like
syncthing, maybe?)

Well, that or have hydra publish the list of hashes that should go on
the p2p network, but this sounds both less elegant and less efficient
(eg. for people who patch glibc in a way not built by hydra, for people
wanting to share armv7 builds (want want want), etc.)

We could also compare large BitTorrent trackers with Hydra from the POV
of daily entry flow…

Indeed :slight_smile:

A solution to this may be to use the DH-derived encryption key to derive
the nonce, because with this the attacker can’t be MitM’ing (because if
they’re MitM’ing, then the DH-derived key wouldn’t be the same).

DH is expensive, I would just do a hash-based commitment to personal
nonces, then reveal the committed values, then xor.

Well, I assume every modern encrypted channel between User and Peer
would have already done a DH exchange to bring a bit of
forward-secrecy. So it shouldn’t cost anything.

Also, with the hash-based commitment, what blocks the attacker from just
relaying the hash-commitment messages and getting the file?

The only solution to avoid that, IMO, is to make the nonce depend on the
encryption key (so that it can't be used outside of this encrypted
channel). And to prevent the attack where the attacker establishes an
encrypted channel with both ends and relays between them, DH is the only
thing I know of for unauthenticated connections.
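
A minimal sketch of what “make the nonce depend on the encryption key” could look like, assuming both ends already hold the DH/TLS-derived session key (the label and function name are made up):

```python
import hashlib
import hmac
import os

def channel_bound_nonce(session_key: bytes, client_random: bytes,
                        server_random: bytes) -> bytes:
    """Derive the nonce from this channel's session key plus both sides'
    randomness, so it is meaningless outside this particular channel."""
    label = b"nix-p2p nonce v1"  # made-up label
    return hmac.new(session_key, label + client_random + server_random,
                    hashlib.sha256).digest()

# A relaying MitM holds a different session key on each of its two channels,
# so the nonces it sees on each side no longer match:
k_client_side, k_server_side = os.urandom(32), os.urandom(32)
cr, sr = os.urandom(32), os.urandom(32)
assert channel_bound_nonce(k_client_side, cr, sr) != channel_bound_nonce(k_server_side, cr, sr)
```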

Well, I guess I was too pessimistic: by just looking at IACR preprint
server one can find out that the so-called «Private set intersection» is
enough at least in the case of one server and one client…

Hmm… I guess the issue is then that we’d need to compute the private set
intersection with everyone, which doesn’t really sound reasonable. :confused:

[1] appears to give a few more set operation primitives. With these
primitives I think I can see a way of doing the job:

  • Let I be the set of users, and for each user i ∈ I let U_i be its set of local paths
  • Compute Ω = ∪_{i ∈ I} {(i, x) | x ∈ U_i}
  • When user j wants to download path P, it computes
    Ω ∩ {(i, P) | i ∈ I}
    and then can recover the list of users having path P
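
To make the intended lookup concrete, here is the same logic written out in the clear, i.e. without any of the private-set machinery from the paper (names and data are made up):

```python
from typing import Dict, Set, Tuple

def build_omega(local_paths: Dict[str, Set[str]]) -> Set[Tuple[str, str]]:
    """Ω = union over users i of {(i, x) | x ∈ U_i}."""
    return {(user, path) for user, paths in local_paths.items() for path in paths}

def who_has(omega: Set[Tuple[str, str]], users: Set[str], wanted: str) -> Set[str]:
    """Intersect Ω with {(i, P) | i ∈ I} and recover the users holding P."""
    return {user for user, _ in omega & {(user, wanted) for user in users}}

local_paths = {
    "alice": {"/nix/store/aaa-glibc", "/nix/store/bbb-bash"},
    "bob":   {"/nix/store/bbb-bash", "/nix/store/ccc-hello"},
}
omega = build_omega(local_paths)
print(who_has(omega, set(local_paths), "/nix/store/bbb-bash"))  # {'alice', 'bob'}
```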

Now, the open questions (I haven't read the paper completely yet; the
above protocol just appears to use only primitives the paper provides)
are:

  • How slow would it be to update Ω when a user joins / leaves?
  • How slow would the set intersection be? (as it’s not a DHT anymore)
  • How bad could a malicious user mess with Ω? (replaying
    previously-present users, DoS by making it arbitrarily large, etc.)
  • Is it actually possible to implement with what the paper gives?
  • Is it actually secure? (I think so, but have thought about it only
    for like half an hour, so…)

[1] https://link.springer.com/content/pdf/10.1007%2F11535218_15.pdf

Well, that or have hydra publish the list of hashes that should go on
the p2p network, but this sounds both less elegant and less efficient
(eg. for people who patch glibc in a way not built by hydra, for people
wanting to share armv7 builds (want want want), etc.)

Well, a subcommunity can always define its own set of publicly defined
safe paths and locally trusted keys.

A solution to this may be to use the DH-derived encryption key to derive
the nonce, because with this the attacker can’t be MitM’ing (because if
they’re MitM’ing, then the DH-derived key wouldn’t be the same).

DH is expensive, I would just do a hash-based commitment to personal
nonces, then reveal the committed values, then xor.

Well, I assume every modern encrypted channel between User and Peer
would have already done a DH exchange to bring a bit of
forward-secrecy. So it shouldn’t cost anything.

Oh well, propagating between the network layers is a different pain.

Also, with the hash-based commitment, what blocks the attacker from just
relaying the hash-commitment messages and getting the file?

Hm, next thing is to make sure our threat model is at all compatible
with blocking relaying… I wonder if using IP addresses in the nonce
would help or not (maybe not — with coordinated attackers in different
LANs)

The only solution to avoid that, IMO, is to make the nonce depend on the
encryption key (so that it can't be used outside of this encrypted
channel). And to prevent the attack where the attacker establishes an
encrypted channel with both ends and relays between them, DH is the only
thing I know of for unauthenticated connections.

Hm, you are probably right. Painful.

Well, I guess I was too pessimistic: by just looking at IACR preprint
server one can find out that the so-called «Private set intersection» is
enough at least in the case of one server and one client…

Hmm… I guess the issue is then that we’d need to compute the private set
intersection with everyone, which doesn’t really sound reasonable. :confused:

That’s true; but then maybe there is a better match…

  • How bad could a malicious user mess with Ω? (replaying
    previously-present users, DoS by making it arbitrarily large, etc.)

Well, it is actually hard to defend against claims of having a lot of
paths — these claims can even be true (and still cheap for an attacker
to make).


Well, that or have hydra publish the list of hashes that should go on
the p2p network, but this sounds both less elegant and less efficient
(eg. for people who patch glibc in a way not built by hydra, for people
wanting to share armv7 builds (want want want), etc.)

Well, a subcommunity can always define its own set of publicly defined
safe paths and locally trusted keys.

Actually, thinking more about it… in any case we need hydra to push a
file that associates the .drv hash with the fixed-output hash. There is
a much simpler solution, then (though not optimal): just symmetrically
encrypt every derivation on the p2p network, and have hydra publish the
keys to decrypt its builds along with the .drv → fixed-output hash
mapping. This also avoids the case where weak secrets could be guessed,
because an attacker would simply not have the encryption key for the
derivation.
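
A minimal sketch of that encrypt-and-publish-the-key idea, using AES-GCM from the third-party `cryptography` package; the metadata fields are invented, not an actual Hydra format:

```python
import hashlib
import json
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_nar(nar_bytes: bytes, drv_hash: str):
    """Encrypt one NAR for the p2p network; return the ciphertext plus the
    entry Hydra would publish for builds it wants to make public."""
    key = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)
    ciphertext = AESGCM(key).encrypt(nonce, nar_bytes, drv_hash.encode())
    entry = {                                   # hypothetical published metadata
        "drv": drv_hash,
        "nar_sha256": hashlib.sha256(nar_bytes).hexdigest(),
        "key": key.hex(),
        "nonce": nonce.hex(),
    }
    return ciphertext, entry

def decrypt_nar(ciphertext: bytes, entry: dict) -> bytes:
    return AESGCM(bytes.fromhex(entry["key"])).decrypt(
        bytes.fromhex(entry["nonce"]), ciphertext, entry["drv"].encode())

ciphertext, entry = encrypt_nar(b"fake NAR contents", "abc123.drv")
print(json.dumps(entry, indent=2))
assert decrypt_nar(ciphertext, entry) == b"fake NAR contents"
```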

The only issue is, if we want to re-use only off-the-shelf tools, then
we might have issues because it’d require on-the-fly
encryption/decryption to not store each file twice on the disk. A FUSE
FS could do the job, but…

But here comes the thing I heard about just earlier today:
dat. From what I’ve heard, it’s doing
basically this “sync encrypted files” protocol.

In addition, it looks like it also supports syncing from https, making
degradation to CDN mode graceful and potentially (not sure about that
yet though) even taking a regular CDN as one of the peers, getting
maximum performance for everyone.

There is a Rust library in progress, meaning that C bindings are (at
least compared to a hand-rolled protocol) close.

Downside: the cost of encryption (well… there'll be encryption anyway)
and potential scalability questions. There is a will to scale to nixpkgs
scale, but their IRC channel did not appear to know whether it would
actually work out… I guess few projects end up at this scale of data
volume, so we'd likely be guinea pigs here anyway.

The only solution to avoid that, IMO, is to make the nonce depend on the
encryption key (so that it can't be used outside of this encrypted
channel). And to prevent the attack where the attacker establishes an
encrypted channel with both ends and relays between them, DH is the only
thing I know of for unauthenticated connections.

Hm, you are probably right. Painful.

Painful indeed. Then, apart from the idea of dat (or a similar protocol)
above, I can see only this solution and ZKPs to handle this. And this
list is in order from most-reasonable to least-reasonable, I think…

  • How bad could a malicious user mess with Ω? (replaying
    previously-present users, DoS by making it arbitrarily large, etc.)

Well, it is actually hard to defend against claims of having a lot of
paths — these claims can even be true (and still cheap for an attacker
to make).

Indeed… that's likely something we can't defend against well, so we'll
need graceful degradation whatever the protocol (i.e. giving up and
either fetching from the CDN or building locally).

I want to talk about something simpler, which might be the first step towards the global P2P cache, but also has immediate value: using a protocol more suitable than HTTP and SSH for transferring .nar files.

The problems are:

  1. there are huge .nars (GraalVM is a single 2.5 GB file); transfers of huge files tend to break even on datacenter wires, and our current protocols have no means of resuming a transfer (BitTorrent sends big files in 4 MB chunks; see the sketch after this list)
  2. file transfers should have lower priority than regular SSH and HTTP traffic (BitTorrent has the uTP protocol, BEP 29, which addresses this)
  3. lack of P2P: transfers could be accelerated if there were a tracker keeping track of where derivations are located
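
To illustrate point 1, a small sketch of BitTorrent-style chunking: hash a .nar in 4 MB pieces so a broken transfer can be verified and resumed piece by piece (the manifest format is invented):

```python
import hashlib
import os

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB pieces, as in the BitTorrent example above

def make_manifest(nar_path: str) -> dict:
    """Per-piece hashes let the receiver verify and resume instead of
    restarting a multi-gigabyte transfer from scratch."""
    pieces = []
    with open(nar_path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            pieces.append(hashlib.sha256(chunk).hexdigest())
    return {"name": os.path.basename(nar_path),
            "length": os.path.getsize(nar_path),
            "chunk_size": CHUNK_SIZE,
            "pieces": pieces}

def missing_pieces(manifest: dict, partial_path: str) -> list:
    """Indices of the pieces still to fetch, given a partial download."""
    todo = []
    with open(partial_path, "rb") as f:
        for i, expected in enumerate(manifest["pieces"]):
            chunk = f.read(manifest["chunk_size"])
            if hashlib.sha256(chunk).hexdigest() != expected:
                todo.append(i)
    return todo
```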

So what about making a P2P system suitable for private installations, an alternative not to cache.nixos.org but to its little brother, GitHub - edolstra/nix-serve: A standalone Nix binary cache server?
It looks simpler and doable, and the security concerns could be postponed.

(The main use case I have in mind is CI/CD, so to say, mass nix-copy-closure --to .... With 100+ target machines it is a nightmare; the second case is a smarter nix-build --substituters ... which is able to find the files in the nearest rack/DC/continent.)
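
As a stopgap for the 100+ machine fan-out, the copies can at least be parallelized today; a rough sketch (hosts and the store path are placeholders):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

HOSTS = [f"deploy@node{i:03}.example.org" for i in range(100)]      # placeholders
STORE_PATH = "/nix/store/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-my-release"

def copy_to(host: str):
    proc = subprocess.run(["nix-copy-closure", "--to", host, STORE_PATH],
                          capture_output=True, text=True)
    return host, proc.returncode

with ThreadPoolExecutor(max_workers=16) as pool:
    for host, returncode in pool.map(copy_to, HOSTS):
        if returncode != 0:
            print(f"copy to {host} failed (exit {returncode})")
```

This of course does not reduce the total upload bandwidth leaving the source machine, which is exactly what a tracker/P2P approach would help with.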


I have an implementation of a local multicast/avahi-based discovery of other Nix nodes (running my software) where they can share store paths that have been downloaded from Hydra without adding another trust anchor. Obviously you are restricted to the local network segment (or multiple segments, if mDNS forwarding exists in the network).

It works rather well. I have been using this non-stop on all of my machines for about 4 months now. It is a great feeling that most of the stuff you download comes from the machines next to you, especially if the download speeds at your current location are very limited.
The only issue I am facing is buggy avahi bindings that aren't really safe to use from Rust. I am planning on migrating this to plain IPv6 (site-)local multicast instead. A while ago I got a globally unique local multicast address from IANA for this. I will try to get some work done on it during the holidays.
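
For anyone curious what such an announcement could look like, a bare-bones sketch of IPv6 multicast announce/listen (the group address, port and message format here are invented, not the IANA-assigned address mentioned above):

```python
import json
import socket
import struct

GROUP = "ff05::1234"   # placeholder site-local multicast group
PORT = 28470           # placeholder port

def announce(store_hashes):
    """Broadcast the store path hashes this node can serve."""
    sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_MULTICAST_HOPS, 5)
    # Selecting the outgoing interface (IPV6_MULTICAST_IF) is omitted for brevity.
    sock.sendto(json.dumps({"v": 1, "hashes": store_hashes}).encode(), (GROUP, PORT))

def listen():
    """Print announcements from other nodes on the same multicast group."""
    sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    mreq = socket.inet_pton(socket.AF_INET6, GROUP) + struct.pack("@I", 0)
    sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_JOIN_GROUP, mreq)
    while True:
        data, addr = sock.recvfrom(65535)
        print(addr[0], json.loads(data)["hashes"])
```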

Currently I rely on a fork of Nix, since I had to add one primitive to the store protocol, which has yet to receive any kind of review or comment: add support for queryPathFromFileHash by andir · Pull Request #3099 · NixOS/nix · GitHub


I think that with https://github.com/NixOS/nix/issues/3260, turning nix-serve into a BitTorrent tracker will be trivial.
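
For a sense of what the serving side would need to produce, here is a minimal sketch of generating single-file BitTorrent metainfo for a .nar (the announce URL is a placeholder; a real nix-serve integration would of course also have to answer tracker announces):

```python
import hashlib
import os

def bencode(value) -> bytes:
    """Minimal bencoding for ints, byte strings, lists and dicts."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, str):
        return bencode(value.encode())
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        items = sorted((k.encode() if isinstance(k, str) else k, v)
                       for k, v in value.items())
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(type(value))

def nar_torrent(nar_path: str, announce: str,
                piece_length: int = 4 * 1024 * 1024) -> bytes:
    pieces = b""
    with open(nar_path, "rb") as f:
        while chunk := f.read(piece_length):
            pieces += hashlib.sha1(chunk).digest()   # BitTorrent uses SHA-1 piece hashes
    info = {"name": os.path.basename(nar_path),
            "length": os.path.getsize(nar_path),
            "piece length": piece_length,
            "pieces": pieces}
    return bencode({"announce": announce, "info": info})

# Usage (placeholder paths):
# open("graalvm.nar.torrent", "wb").write(
#     nar_torrent("graalvm.nar", "http://nix-serve.example.org:6881/announce"))
```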

Just to mention that the latest IPFS release includes support for UnixFS metadata (to store permissions and sticky bits). Previously, it was not possible to store these file attributes, which are part of a NAR file.
See js-ipfs 0.41.0 released | IPFS Blog & News for details :wink:

I believe NARs intentionally do not store these things (or timestamps). Perhaps just the eXecutable bit would be useful from these, if I read the post right.

Ah yes, my bad, sorry. I didn't remember the issue with IPFS correctly: it was about the executable bit and not the other attributes.

The main issue was performance, and they still have performance problems. They are focusing on fixing them in 2020. GitHub - ipfs/roadmap: IPFS Project && Working Group Roadmaps Repo


Hi,
just as a two-cent idea: IPFS is a monster, barely usable even just for
play, and despite its goal of being fully P2P, IPNS did not work when
the Russian Federation tried to cut itself off from the internet.
In general, all fully distributed software can't really scale and
can't really work well (see GNUnet as an example).

BitTorrent, on the contrary, has proven to scale well. It's not
completely distributed, but the tracker does not need many resources,
so for instance a simple BitTorrent client that shares /nix/store and
optionally mirrors nixpkgs or even an entire cache can be usable
without much complexity and can “run Nix{,OS} infrastructure” even
from a cheap VPS with a known domain attached. The only tweaks needed
would be how much upload bandwidth and how many concurrent connections
to accept.

Collaboration for development is more complex, since while it's easy
to share a git repo via torrent, it's not easy to handle commits and
propagate changes; but even just shifting much of the cache load onto
Nix users would IMVHO be a very good move, and a real start towards
being distributed.

GitTorrent (abandoned? [1]) and GHTorrent [2] have not spread nor
evolved much, but they push in a similar direction. Still, having the
resources to share a repo is one thing; having the resources to serve
thousands of archives, in storage and bandwidth terms, is another.

[1]
https://blog.printf.net/articles/2015/05/29/announcing-gittorrent-a-decentralized-github/
GitHub - cjb/GitTorrent: A decentralization of GitHub using BitTorrent and Bitcoin

[2] https://ghtorrent.org/

Yes, sadly IPFS has not reached its goal yet. But I don't see a reason why it shouldn't in the future. They have millions in investment from the Filecoin ICO, hopefully some smart people working on it, and they incorporate findings from academia. I'm still optimistic that it will be usable some day.

What do you think about an architecture like Tox Bootstrap Nodes?

Do you see a pragmatic solution where my computer uses packages from another computer on my local network when installing packages? Computers should share them and announce it to each other. Maybe some zeroconf magic? And nix-serve.

I did put some work into that a few months ago. It worked reasonably
well, but it was based on a weird idea that required me to add a patch
to the Nix daemon: GitHub - andir/local-nix-cache: A poor and hacky attempt at re-serving local nix packages that came from trusted sources

I do not recommend using that for anything important. In the meantime I
have had the idea to use IPv6 site-local multicast to exchange messages
between Nix-local-cache nodes.


Hi,

Yes, sadly IPFS has not reached its goal yet. But I don't see a
reason why it shouldn't in the future.

Mh, IMVHO because:

  • it’s already a monster

  • no one else in history has reached the point of a fully scalable
    distributed solution, even in an ideal IPv6 scenario with a
    global address for every peer, without any kind of NAT or other
    obstacles, not counting dummy addresses via privacy extensions…

We know MANY fully decentralized and free solutions, from email to
Usenet, passing through tons of less used/known solutions. They
scale, they work well, their complexity is an order of magnitude
simpler than fully distributed solutions, and their performance tends
to be excellent. Of course they demand a bunch of servers, reachable
and known, but hey, a domain name does not cost that much, at least
in the western world (which is for now the biggest FOSS population
in the world), and personal servers, from VPSes to universities,
voluntary mirrors run by various FOSS users, etc., are easy and
numerous enough to be an answer that avoids depending on a single
entity or megacorp cloud, not in a hypothetical future but today, at
a price cheap enough to make it easy to migrate to a distributed
solution IF we (society as a whole) ever create one…

What do you think about an architecture like Tox Bootstrap Nodes?

I know them too superficially to say much; from the little I know,
they are all promise, though mostly failed from the start…

Do you see a pragmatic solution where my computer uses packages from
another computer on my local network when installing packages?
Computers should share them and announce it to each other. Maybe some
zeroconf magic? And nix-serve.

My knowledge of Nix is too limited to answer, BUT IMVHO it can be done,
like other package managers do, and by itself it would greatly reduce
the dependency on a central server: a big company LAN, instead of
generating tons of traffic to download identical bits, can do a single
download and spread it internally, with FAR better performance and FAR
less load on the Nix{,OS} infrastructure.

Recently Framasoft published a nice vignette [1] that should be
considered with care by ANYONE interested in IT. The idea that “the
cloud” can scale, be cheap, be reliable, and be friendly, at any scale,
forever and ever, is like the ancient Simon's Chronicles' "hey, we can
use the network as the backup, keep pushing bits around instead of
wasting disk space". A simply crazy idea that only people without ANY
IT knowledge and not much logical reasoning capability can have; and
even though that is evidently true, many people, highly skilled ones
included, tend to forget it or not think about it.

A FOSS project that relies on centralized services is like a “free”
citizen in a dictatorship: free only to the extent of the dictator's
leash, working not for freedom but for the dictator itself, who can
benefit from the “free work” and say “hey, look, this is an opponent,
it speaks, it is alive, so I'm not that bad”.

[1] https://framablog.org/wp-content/uploads/2020/03/installer-nos-instances.png


That doesn't mean it will never happen :slight_smile: I'm still optimistic, but not for the near future.

In this context, it’s just a program that downloads files from the network.

But we have many files in the binary cache, like millions. I don't remember exactly, but maybe it was 80 TB. I can't mirror that at home, so it would be great to mirror just some parts, have others do other parts, and together we'd have most of it.

So I configure my server to share its Nix store and tell others to use it as a mirror? That should work, and I think some do that, but it needs manual setup. It would be great if mirrors were found automatically. But we might just maintain a list somewhere, like a GitHub wiki.

Hi,

In this context, it’s just a program that downloads files from the
network.

the issue is “how to reach someone to download from”; that's always the
issue, not only because of NAT and the absence of widespread IPv6 with
a global static address per device (not counting privacy extensions),
but because, even with all the algorithms one can devise, the “real
bootstrap” of a fully distributed service is flooding the entire
network for every request. “Supernodes”, trackers, “bootstrap nodes”,
and various kinds of “metadata prefetching” do mitigate this, but they
mean either not being fully distributed or having a constant DDoS from
peers looking for files…

But we have many files in the binary cache, like millions. I don't
remember exactly, but maybe it was 80 TB. I can't mirror that at home,
so it would be great to mirror just some parts, have others do other
parts, and together we'd have most of it.

No need for that: everyone just torrents out their own /nix/store. If I
have something I offer it, and so do others. This means that the cache
becomes a mere backup and a tracker for any contributing Nix{,OS} user.

So I configure my server to share its Nix store and tell others to
use it as a mirror? That should work, and I think some do that, but it
needs manual setup. It would be great if mirrors were found
automatically. But we might just maintain a list somewhere, like a
GitHub wiki.

Well… On a LAN a fully distributed system might work: if the LAN is
small, little things like avahi prove to be effective enough. At
internet scale, IMVHO only torrents have proven to be a solution; all
the others do not offer usable performance… About generic mirrors: many
distros do have them. So perhaps it is only a matter of popularity and
age: ancient distros were born in an era when mirroring the distro you
use, if you could, was common; these days people do not think about
that, they tend to consider the network as “nature”…

– Ingmar

Just for the record, 90% of what Cachix served in February (~10TB) comes from CDN cache at basically unlimited speed. It’s going to be hard to beat that.


Hi,

Just for the record, 90% of what Cachix served in February (~10TB)
comes from CDN cache at basically unlimited bandwidth. It’s going to
be hard to beat that.

It's not a matter of performance: CDNs are servers operated by third
parties that today might work well, might offer free services for
certain projects, etc., but that is only today.

Changing from that model to another, if someday it can't be used
anymore, is not a quick thing. Having "another way" operational and
tested turns a potential disaster into a potential marginal issue.

BTW, if I can update my NixOS systems from a single personal server,
no CDN can beat my LAN speed. And if that server is actually any of
my NixOS instances, with a single line in my config, an open port, and
a simple service I can trust well… it's even better.

Framasoft recently published a nice vignette [1]; it teaches a lesson
many should learn, IMO…

[1] https://framablog.org/wp-content/uploads/2020/03/installer-nos-instances.png


Overall, after years of following this nix+CDN thread, I see problems in the motivation/difficulty ratio:

  • Designing a good system with IPFS aims is hard (apparently). And if someone manages it, can they expect lots of profit to recover the large investment? Centralized CDNs seem both easier to design and easier to monetize.
  • If the new solution won’t allow us to (eventually) shut down the current cache implementation, the motivation is rather lowered, unless we can significantly outperform it in some way.
    For example, we already had a running prototype of a homemade CDN service (not based on IPFS but on a relatively usual design of https servers updated through rsync). It only served binaries from recent channels (not all those 100+ TB)… nice, but typically you wouldn't get better service from it than from our official CDN; therefore almost no one used it and it cost money to run, so understandably it was shut down after some time.
  • Speed of LAN is certainly superior, but I think such use cases have far simpler solutions than building a decentralized CDN :wink: