How to test a NixOS server upgrade?

I’m self-hosting a NixOS server with several services. It’s a simple setup: a Hetzner VPS, configured with a Flake.

Generally it works very well, but recently it happened twice that something was broken after an upgrade. My process is to manually test everything important and roll back if necessary, but this leads to downtime between the upgrade and the rollback.

What strategies are people using to make sure upgrades (or other changes to configuration) work before they apply them to production systems?

I was thinking it would be nice to run my system in a VM or some kind of container and test it there. I know there is the nixos-build-vms command. Before hacking together a solution around it, I’d like to get some advice. Considerations:

  1. CPU architecture (the server is aarch64, my laptop x86_64 and some problems may be architecture specific),
  2. stateful data (like databases),
  3. networking (e.g. obtaining Let’s Encrypt certificates requires public access to the server; there’s also a WireGuard VPN to my home server and laptop).

I use build-vm as a sort-of staging environment. It’s perfect for precisely this, and also for experimenting with feature branches and such.
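
For reference: `nixos-rebuild build-vm --flake .#myhost` (hostname hypothetical) builds a QEMU VM variant of your configuration without touching the real system. A minimal sketch of giving that VM more resources, assuming a nixpkgs recent enough to have the `virtualisation.vmVariant` option:

```nix
{
  # Only affects the VM built by `nixos-rebuild build-vm`,
  # not the real system configuration.
  virtualisation.vmVariant.virtualisation = {
    memorySize = 4096;    # MiB of RAM for the test VM
    cores = 4;
    diskSize = 20 * 1024; # MiB; the disk image is throwaway
  };
}
```

The resulting `./result/bin/run-<hostname>-vm` script boots the VM against a qcow2 image in the current directory; delete that file to reset the VM’s state between experiments.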

No way around spending money on a real staging machine if you want to test behavior on real hardware.

For the kind of smoke testing you’re likely going to do, I’d imagine functional testing on x86_64 will normally be enough for your average VPS web-server-y tasks, though.

… should never be replicated outside of production environments. There’s no way to assert that state won’t cause issues using a test/staging environment, short of writing a test suite for the software you’re deploying (which ideally you’d do upstream).

Best practice here, IMO, is to use btrfs/zfs and make a snapshot part of your update flow. Regular backups are essential too, so you should have an exhaustive list of significant state for that purpose anyway; at that point, snapshotting it for quick downgrades when necessary is quite doable.
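
As a sketch of what “snapshot as part of the update flow” could look like with ZFS, assuming you use `system.autoUpgrade` (so that `nixos-upgrade.service` exists) and a state dataset I’ll call `tank/state`:

```nix
{ pkgs, ... }:
{
  # One-shot unit that snapshots the state dataset right before
  # the automatic upgrade runs, for quick rollback of data.
  systemd.services.pre-upgrade-snapshot = {
    before = [ "nixos-upgrade.service" ];
    requiredBy = [ "nixos-upgrade.service" ];
    path = [ pkgs.zfs ];
    serviceConfig.Type = "oneshot";
    script = ''
      zfs snapshot -r "tank/state@pre-upgrade-$(date +%Y%m%d-%H%M%S)"
    '';
  };
}
```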

Erase-your-darlings/impermanence & co. can help, too.
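
For context, the interface of the nix-community impermanence module looks roughly like this (paths are just examples); anything not listed vanishes on reboot, which keeps your list of significant state honest:

```nix
{
  # Assumes the impermanence module is imported and the root
  # filesystem is wiped on boot; only these paths survive.
  environment.persistence."/persist" = {
    directories = [
      "/var/lib/postgresql"
      "/var/lib/acme"
    ];
    files = [ "/etc/machine-id" ];
  };
}
```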

This has to be handled on a case-by-case basis. You can apply a special module that is only evaluated for your VM instance to fudge the networking a little.

E.g. the NixOS ACME module can be set up to generate self-signed certs (and does so by default as a fallback if you can’t verify your domain).
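
A sketch of such a VM-only override, here pointing ACME at a local test CA instead of the real Let’s Encrypt (the URL is a placeholder for whatever step-ca/pebble instance you’d run):

```nix
{
  # Evaluated only for the build-vm variant; the real system
  # keeps pointing at the production Let's Encrypt endpoint.
  virtualisation.vmVariant = {
    security.acme.defaults.server = "https://ca.example.test/acme/acme/directory";
  };
}
```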

Personally, I set up some wonky domain smudging, plus a local DNS resolver and a bridge network to assign an IP to the VM, which allows fully testing my nginx subdomains as well.
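
If you don’t need a full bridge, plain port forwarding from the qemu-vm module already goes a long way for smoke tests (the ports here are just examples):

```nix
{
  virtualisation.vmVariant.virtualisation.forwardPorts = [
    # Reach the VM's nginx from the host at localhost:8443
    { from = "host"; host.port = 8443; guest.port = 443; }
    # SSH into the VM at localhost:2222
    { from = "host"; host.port = 2222; guest.port = 22; }
  ];
}
```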


Bottom line: it doesn’t match production perfectly, of course. Test environments never do. Some things cannot really be tested (e.g. actual confirmation of your domain ownership by Let’s Encrypt).

This is perfectly OK, though: you can get close enough to catch most real issues, and test all but the gnarliest, external-service-dependent features.


I made my own ad-hoc thing based on NixOS tests that does this, including iptables trickery to make the same global IP addresses work within the test, and step-ca to emulate Let’s Encrypt and such. I don’t depend on many external services, which makes the network-related part a lot easier.
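
For anyone who wants to start down the same path, this is the rough shape of such a test with `pkgs.testers.runNixOSTest` (the node name and services are illustrative, the iptables/step-ca parts are left out, and importing your real configuration.nix directly is a simplification; in practice you’d factor out the hardware-specific bits):

```nix
pkgs.testers.runNixOSTest {
  name = "server-upgrade-smoke";
  nodes.server = { pkgs, ... }: {
    imports = [ ./configuration.nix ];
    # test-only overrides (dummy secrets, fake ACME, ...) go here
    environment.systemPackages = [ pkgs.curl ];
  };
  testScript = ''
    server.wait_for_unit("multi-user.target")
    server.wait_for_unit("nginx.service")
    server.succeed("curl -fk https://localhost/")
  '';
}
```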

I run all these tests on x86_64, so machines that are usually a foreign architecture simply get changed to x86_64 as well. I test state migrations (e.g. from one major version of some distributed software to the next, with a data format change) using specialisations, which can be activated at any point during the test. Secrets are replaced with dummy values via a NixOS module that implements the interface of agenix, the secrets manager I use on the real machines.
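
A sketch of the specialisation trick, with PostgreSQL standing in for “software with a data format change” (versions are just an example; a real test would also script the actual migration steps):

```nix
{ pkgs, ... }:
{
  # The base system runs the old version...
  services.postgresql = {
    enable = true;
    package = pkgs.postgresql_15;
  };

  # ...and the specialisation holds the new one. In the test script
  # it can be activated at any point with:
  #   server.succeed(
  #     "/run/current-system/specialisation/upgraded/bin/switch-to-configuration test"
  #   )
  specialisation.upgraded.configuration = {
    services.postgresql.package = pkgs.postgresql_16;
  };
}
```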

The NixOS test machinery can also run all the VMs interactively, which lets you test things manually on your local machine. Very useful for debugging errors in complex changes.
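
Concretely, if the test is exposed as a flake check, interactive mode is just the test’s `driverInteractive` attribute (attribute path and file layout illustrative):

```nix
# Assuming the test above lives in ./tests/upgrade.nix and is
# wired up as a flake check:
{
  checks.x86_64-linux.server-upgrade-smoke =
    import ./tests/upgrade.nix { inherit pkgs; };
}
# Build and launch the interactive driver:
#   nix build .#checks.x86_64-linux.server-upgrade-smoke.driverInteractive
#   ./result/bin/nixos-test-driver
# That drops you into a Python REPL where you can call start_all(),
# server.shell_interact(), and the other test-script methods by hand.
```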
