"Simple" overlay network / VPN on NixOS?

I’m looking to have a way to connect to some machines on my home network from my portable devices irrespective of my location and what network I’m on. Presumably, this requires some sort of overlay network/VPN. It seems many others are using mesh networks to achieve this.

I don’t really understand how networking works (I barely understand how to set up nginx), and while I did look into setting up headscale, the documentation is completely lacking on the headscale and NixOS sides and overwhelming on the tailscale side. For example, the first thing I ran into when trying to enable the module was an error telling me I need to set dns.base_domain which is evidently different from server_url, neither of which are clearly explained in their respective module options nor in the headscale docs. I’d like to not use tailscale as it is closed-source after all.

I imagine others have already set up something really simple like this before, so rather than going in circles for hours, I ask: is there a simple ready-to-use configuration that doesn’t require sifting through docs to get going?

EDIT: I should also mention, I have a few domains I could use if needed.

1 Like

The wiki article on tailscale seems reasonably detailed. Headscale (I believe) is a self-hosted version of tailscale that uses the same VPN protocol (WireGuard) under the hood and provides similar features.

You might want to start with tailscale; I’d expect there to be more material on setting it up compared to headscale.


I know you didn't ask for a solution that requires sifting through docs, but IMHO using plain WG is worth it.

As someone who is hand-rolling a number of these mesh networks, I would strongly suggest using WireGuard (WG; see the NixOS wiki and ArchWiki). I’ve used OpenVPN and IPsec in the past and there was always something wrong with my setup, probably due to a mountain of PEBCAKs.

The beauty of WG is that it very neatly plugs into the network stack, allowing you to set up the routing/firewall rules/DNS entries exactly as you want them to be.

My WG provisioning code pre-dates *scale, so I have personally never found a good reason to use them; treat the following as kind of an “old man yelling at the cloud”.

The value add of *scale is that they manage a few things on top of vanilla WG, namely:

  • DNS resolution
  • Key management
  • ACLs to restrict access within the network

That’s great, but they often introduce “spooky action at a distance” into the system, making the config harder to reason about instead of easier.

Even if you do go for *scale, I would suggest setting up two machines talking over WG without *scale – just to get a feel for the tech.
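
For reference, here is a rough sketch of what that two-node setup can look like declaratively on NixOS (untested as written; the interface name, keys, addresses, port, and the key file path are all placeholders you’d replace):

# Node A: reachable at a public/static address (placeholder keys and IPs)
{
  networking.firewall.allowedUDPPorts = [ 51820 ];
  networking.wireguard.interfaces.wg0 = {
    ips = [ "10.100.0.1/24" ];                      # address inside the tunnel
    listenPort = 51820;
    privateKeyFile = "/etc/wireguard/private.key";  # hypothetical path; keeps the key out of the store
    peers = [{
      publicKey = "NODE_B_PUBLIC_KEY";
      allowedIPs = [ "10.100.0.2/32" ];             # only B's tunnel address routes to this peer
    }];
  };
}

# Node B: e.g. a laptop behind NAT
{
  networking.wireguard.interfaces.wg0 = {
    ips = [ "10.100.0.2/24" ];
    privateKeyFile = "/etc/wireguard/private.key";
    peers = [{
      publicKey = "NODE_A_PUBLIC_KEY";
      endpoint = "NODE_A_PUBLIC_IP:51820";
      allowedIPs = [ "10.100.0.1/32" ];
      persistentKeepalive = 25;                     # keep the NAT mapping alive
    }];
  };
}

Generate the key pairs with wg genkey / wg pubkey; if ping 10.100.0.1 works from B, the tunnel is up.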

I don’t want to hijack this thread with my rambling, so I’ll stop here, but I’d be happy to expand on this (and update the NixOS wiki article so that it’d be reasonably useful for someone who has not used WG before).

1 Like

That’s my understanding as well, yes. But the wiki article only appears to refer to tailscale module options, and in any case, that’s only the client side of the setup, which is presumably the “easy” part - the server side is left up to the reader.

That’s fair, I’ll look into using wg itself, and appreciate the links! I do agree that *scale seemed very “magical”, which might make things harder to debug in the future.

1 Like

Yeah, that’s probably the reason for the headscale docs’ complexity/brevity.

I do agree that *scale seemed very “magical”, which might make things harder to debug in the future.

To belabor the point a little more: I am not advocating for everyone to hand-roll their own mesh network solution (unless they want to, of course). The DNS management by *scale alone is pretty darn convenient.

What I am specifically suggesting is to just try out running WG on two nodes and setting up ping between them. That alone should make the WireGuard configuration settings much clearer. If you go this way, you might want to try imperative management first, using wg or wg-quick rather than the NixOS options.

If you want to dig deeper:

  • Bring up a proper firewall (nftables recommended) on both nodes and let them connect to each other over telnet/netcat
  • Add a third node and turn one of the first two nodes into a ‘hub’ of sorts, so that connections flow A <> B <> C without A and C connecting directly (see the sketch after this list)
  • Set up a mesh-only DNS that would allow resolving all node names into addresses
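
A rough sketch of the hub idea, reusing the placeholder addressing from the earlier snippet (A = .1, B = .2 as the hub, C = .3); again, keys and endpoints are placeholders:

# Hub B: knows both spokes and forwards traffic between them
{
  boot.kernel.sysctl."net.ipv4.ip_forward" = 1;     # B must route packets between A and C
  networking.firewall.allowedUDPPorts = [ 51820 ];
  networking.wireguard.interfaces.wg0 = {
    ips = [ "10.100.0.2/24" ];
    listenPort = 51820;
    privateKeyFile = "/etc/wireguard/private.key";
    peers = [
      { publicKey = "A_PUBLIC_KEY"; allowedIPs = [ "10.100.0.1/32" ]; }
      { publicKey = "C_PUBLIC_KEY"; allowedIPs = [ "10.100.0.3/32" ]; }
    ];
  };
}

# Spoke A (C is the mirror image): send the whole tunnel subnet via B
{
  networking.wireguard.interfaces.wg0 = {
    ips = [ "10.100.0.1/24" ];
    privateKeyFile = "/etc/wireguard/private.key";
    peers = [{
      publicKey = "B_PUBLIC_KEY";
      endpoint = "B_PUBLIC_IP:51820";
      allowedIPs = [ "10.100.0.0/24" ];             # traffic for C also goes through B
      persistentKeepalive = 25;
    }];
  };
}

Depending on how strict your firewall is, you may also need to allow forwarding on B.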

1 Like

This article is probably the fastest way I’ve seen to accomplish this. I just ran through it again last night as I assimilated some new boxen.

Basic punchline:

  1. Get a tailscale account (I authed via Github, do whatever makes you happy).
  2. Add tailscale to your packages.
  3. Add config.services.tailscale.port to your UDP firewall if you use one, and add tailscale0 to your trustedInterfaces if you want.
  4. Set services.tailscale.enable = true.
  5. Cut a new one-time token from Tailscale for joining your network (it’ll last 90 days, which is configurable).
  6. Add a new one-shot systemd service (coming up after tailscale.service and network-pre.target) to start and connect to tailscale, if it isn’t already running, using the token from (5) (see the sketch after this list).
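
Roughly, steps 2–6 as a NixOS snippet (a sketch rather than a verbatim copy of the article; the auth key path is a placeholder, and I believe newer NixOS releases also offer services.tailscale.authKeyFile, which can replace the one-shot unit):

{ config, pkgs, ... }:
{
  environment.systemPackages = [ pkgs.tailscale ];            # step 2
  services.tailscale.enable = true;                           # step 4

  networking.firewall = {                                     # step 3
    allowedUDPPorts = [ config.services.tailscale.port ];
    trustedInterfaces = [ "tailscale0" ];
  };

  # step 6: join the tailnet on boot if not already connected
  systemd.services.tailscale-autoconnect = {
    description = "Automatic connection to Tailscale";
    after = [ "network-pre.target" "tailscale.service" ];
    wants = [ "network-pre.target" "tailscale.service" ];
    wantedBy = [ "multi-user.target" ];
    serviceConfig.Type = "oneshot";
    script = ''
      sleep 2   # give tailscaled a moment to settle
      status="$(${pkgs.tailscale}/bin/tailscale status --json | ${pkgs.jq}/bin/jq -r .BackendState)"
      if [ "$status" = "Running" ]; then
        exit 0
      fi
      # /run/keys/tailscale-authkey is a placeholder for wherever you keep the token from step 5
      ${pkgs.tailscale}/bin/tailscale up --authkey "$(cat /run/keys/tailscale-authkey)"
    '';
  };
}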

After that, your system should come up and be visible on the Tailscale machines page.

If you want something a bit heavier duty, the Nebula support in NixOS is quite good.

To add on about Nebula: it’s sort of like a fully self-hosted version of Tailscale, with its own mesh DNS resolver and good Prometheus support for metrics. You’d probably use it, for example, if you want to put lighthouses (nodes visible to everyone, which distribute information about local addresses on the mesh) on infrastructure you run, or if you would like to do more complex meshing setups.

I used Tinc years ago, and the performance of Nebula is much better, especially considering all the ZFS replication I was doing over it. This article by Jim Salter was what got me interested enough in it to write the NixOS module and use it for most of my personal infrastructure over the past 5 years, so happy to answer questions too :slight_smile:

2 Likes

I have been a massive fan of netbird.
I used it in an enterprise environment where traditional SSL-based VPNs were limiting and we needed an overlay network with SSO/zero-trust.
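
On the NixOS side it’s pretty minimal too; a sketch (enrolment/SSO login against your management server is still done interactively with netbird up):

# Minimal sketch: runs the netbird client daemon; joining a network
# (SSO or a setup key against your management URL) is done with `netbird up`.
{
  services.netbird.enable = true;
}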

1 Like

Appreciate all the answers and suggestions! I did go through the official WG overview to understand what’s happening, and I do feel like things are a bit clearer in that respect.

In practical terms, I did try to set up nebula, and while the lighthouse server has a public IP (and I have no problem ssh’ing into it), the nebula handshake seems to time out. Googling mostly turns up basic setup errors (e.g. using incorrect IPs) which I don’t appear to have made: I used the nebula IPs everywhere (except for the contents of the static host map, of course), and the nebula service otherwise provides no clear errors. I even went a step further and made all the nodes punchy, but it didn’t seem to make a difference. It could be something funky with the VPS provider for the lighthouse, but I guess I’ll give up for now; setting up a VPN seems too advanced, with little to go on for debugging.

I’ve had certain VPS providers block some UDP ports. I had more luck trying 500 or 4500, since those are classically used by IPsec.
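
If you do need to move off 4242, the listen port is configurable; a sketch, assuming the NixOS module exposes listen.port directly (otherwise settings.listen.port should have the same effect). Remember to open the host firewall and point the clients’ staticHostMap at the new port too:

# Sketch: run the lighthouse on UDP 4500 instead of 4242 (placeholder network name)
{
  services.nebula.networks.NAME.listen.port = 4500;
  networking.firewall.allowedUDPPorts = [ 4500 ];
  # clients then need "HOSTX_PUBLIC_IP:4500" in their staticHostMap
}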

If you post your config, I can help debug if I see anything obvious.

1 Like

Good point. So, it turns out I did in fact make a basic setup error: failing to verify that my provider’s firewall allowed traffic on UDP 4242. After correcting that, the tunnel to the lighthouse works, but the clients still fail to see each other.

I’m guessing it’s some quirk of my home network; I’ll have to check the router settings maybe… (It’s running FreshTomato, don’t know if that matters.) I don’t think it’s a NixOS-side config question any longer, but for completeness, I’ll post my configuration. I even tried (temporarily) disabling the HOSTY firewall completely, with no improvement.

Config follows:

CA keys were generated with

nebula-cert ca -encrypt -name CA -out-qr CA.png -argon-memory 10485760 -duration 100h

plus a passphrase, of course. The aggressive expiration was for testing purposes; I still have a couple of days left.

HOSTX is the lighthouse, hosted on Hetzner Cloud with the publicly routable IP HOSTX_PUBLIC_IP.

# HOSTX's config
{
  services.nebula.networks.NAME = {
    enable = true;
    ca = "/etc/nebula/CA.crt";
    # nebula-cert sign -ca-crt CA.crt -ca-key CA.key -name HOSTX -ip '192.168.100.1/24'
    cert = "/etc/nebula/HOSTX.crt";
    key = "/etc/nebula/HOSTX.key";
    isLighthouse = true;
  
    settings = {
      punchy = {
        punch = true;
        respond = true;
      };
    };
  };
}

HOSTY is a NixOS client

# HOSTY's config
{
  services.nebula.networks.NAME = {
    enable = true;
    ca = "/etc/nebula/CA.crt";
    # nebula-cert sign -ca-crt CA.crt -ca-key CA.key -name HOSTY -ip '192.168.100.2/24'
    cert = "/etc/nebula/HOSTY.crt";
    key = "/etc/nebula/HOSTY.key";
    lighthouses = [ "192.168.100.1" ];
    staticHostMap = {
      "192.168.100.1" = [ "HOSTX_PUBLIC_IP:4242" ];
    };
    settings = {
      punchy = {
        punch = true;
        respond = true;
      };
    };
  };
}

CA.crt and the respective client key/cert pairs were sent to both machines from the signing host.

HOSTZ is an Android client using the official nebula app, which generated its own key pair for me, so I simply copied the public key (HOSTZ.pub, below) to my signing host and signed it with the CA:

nebula-cert sign -ca-crt CA.crt -ca-key CA.key -name HOSTZ -ip '192.168.100.3/24' -in-pub HOSTZ.pub

after which I copied HOSTZ.crt back to the client. Then, I added a single entry under Lighthouses/Static Hosts, identical to the staticHostMap above.

Since I’ve already set up SSH from HOSTZ (Android) to HOSTY (NixOS) in the past, which does work without nebula (when the machines are on the same network), I decided to use that as a quick test of connectivity.
From Termux running on HOSTZ I attempted to run:

ssh USER@192.168.100.2

which resulted in (HOSTY journal):

TIME HOSTY nebula[433028]: time="TIME" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3933893443 localIndex=3933893443 remoteIndex=0 udpAddrs="[HOSTZ_PUBLIC_IP:PORT]" vpnIp=192.168.100.3
TIME HOSTY nebula[433028]: time="TIME" level=info msg="Handshake timed out" durationNs=5609451322 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3933893443 localIndex=3933893443 remoteIndex=0 udpAddrs="[HOSTZ_PUBLIC_IP:PORT]" vpnIp=192.168.100.3

and symmetrical messages show up in the HOSTZ logs in the nebula app.

1 Like

One recommendation is to set isRelay to true on your lighthouse. I always double up lighthouses as relays. Then you’d add relays = [ "192.168.100.1" ] on the client. This helps handle cases where X can get to Y and Z but Y can’t get to Z for whatever reason.
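
Concretely, something like this on top of your existing config (a sketch, using your addressing):

# Lighthouse (HOSTX): also act as a relay
services.nebula.networks.NAME.isRelay = true;

# Clients (e.g. HOSTY): allow relaying their traffic through the lighthouse
services.nebula.networks.NAME.relays = [ "192.168.100.1" ];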

You may also want to configure a firewall that’s permissive to start with:

firewall = {
  inbound  = [ { port = "any"; proto = "any"; host = "any"; } ];
  outbound = [ { port = "any"; proto = "any"; host = "any"; } ];
};

Another related thing: if you’re using a hostname for the underlay connection, add it to networking.hosts to make sure it always resolves, especially if it’s on a device with an occasionally limited network connection.
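
e.g. something like this (the hostname is hypothetical):

# Pin the lighthouse's underlay hostname so it resolves even when DNS is flaky
networking.hosts."HOSTX_PUBLIC_IP" = [ "lighthouse.example.org" ];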

So far I’ve only used IPs, but I’ll keep that in mind if I switch to hostnames, thanks.

Ah, for some reason I assumed that it would be permissive by default, or that there wouldn’t be a firewall at that level at all. In any case, I just tried setting isRelay = true; on the lighthouse (HOSTX) and adding the firewall config as mentioned (on both the lighthouse HOSTX and the NixOS client HOSTY); it now shows messages like

time="TIME" level=info msg="Attempt to relay through hosts" localIndex=3099701361 relays="[192.168.100.1]" remoteIndex=0 vpnIp=192.168.100.2
time="TIME" level=info msg="Re-send CreateRelay request" localIndex=3099701361 relay=192.168.100.1 remoteIndex=0 vpnIp=192.168.100.2
time="TIME" level=info msg="send CreateRelayRequest" initiatorRelayIndex=1288521606 relay=192.168.100.1 relayFrom=192.168.100.3 relayTo=192.168.100.2

in quick succession in the HOSTZ logs, until the handshake times out.

Note that only the control server part of the tailscale infrastructure is closed-source, and that is what headscale implements.
For me, the most important feature compared to manually set up WG is the NAT traversal that tailscale does, without the speed drop of relaying traffic.

In “theory” it should be “trivial” to switch the control server of an existing network, but I have never attempted to set up headscale myself…

Also worth noting that even if the control server were open-source, it is still they who run it, so in theory a bad actor that gained control of tailscale’s company servers could add machines to your network. To counter that, there is a feature for signing your keys (they call it “locking”, I think): in that case, even though you have no control over the control server, only you will be able to add signed keys to it (since it’s the open-source client that checks those signatures).

Sure, you’re right, it’s closed-source, which precludes self-hosting, and moreover I must provide some data to the company to even use it (even if the data happens to be falsified), etc. I figured that was implied. And headscale is underdocumented since they assume all their users are coming from tailscale, I guess.

1 Like

When using headscale (it’s just the key server), you will still use the tailscale client, which is all open-source.
There is no excuse for headscale to be underdocumented: people coming from tailscale’s proprietary control server have no idea how to set it up, since it is the only component in the standard setup that users have no control over.

Yes, I don’t find it easy to understand what’s going on (one example was mentioned in my original question), which is what brought me here looking for alternatives in the first place :slight_smile:

Though at this point, it’s more of a networking (NAT? firewall?) issue, so my choice of overlay network no longer appears to matter. At least it no longer feels like magic :grin:

Depending on your use case (making specific services available to specific clients), Yggdrasil would be an interesting option.
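
For reference, the NixOS side can be fairly small; a sketch assuming the services.yggdrasil module and a placeholder public peer:

# Minimal sketch: join the Yggdrasil network via one public peer (address is a placeholder)
{
  services.yggdrasil = {
    enable = true;
    persistentKeys = true;                          # keep a stable address across reboots
    settings.Peers = [ "tls://peer.example.org:443" ];
  };
}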

I’ve had the impression for a long time that Yggdrasil is even harder to set up than a manual WG mesh…
Is that still the case?
And does Yggdrasil solve the NAT traversal problems of the common user? Meaning that in most cases all of my machines are behind some NAT, so there is no central, globally accessible node.

It does, if you include Tor and I2P as transfer networks next to clearnet and local networks.