Nixops - automatic rollback?

supermarin · February 24, 2021, 2:50pm

I’m interested in thoughts on this problem:
The other day I was rolling out updates with manual scripts using nix-build and nix-copy-closure, followed by switch-to-configuration.
After updating the script to support more than one prod box, I ended up deploying completely wrong network settings (the IPs for machine A onto machine B). This rendered the machine completely unusable; I couldn’t even SSH into it. I had to take the whole prod box down, re-provision a server, update DNS and redeploy. I’ve now switched to nixops to avoid as many similar mistakes as possible, but it doesn’t address this issue.

The problem: is there a way I’m missing to roll the machine back to its previous state? This particular example was on DigitalOcean. They do have a recovery console, but it comes up too late to access GRUB (and I’m not even sure GRUB would be populated with previous generations).

A potential solution I’m thinking of is automatic rollback: once a deploy is complete, a countdown starts on the target machine that waits for the ops machine to call back in and do a healthcheck. A successful healthcheck kills the countdown process and everything stays as deployed. If the target machine doesn’t hear back before the countdown runs out, it runs nixos-rebuild switch --rollback on itself.
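To make the countdown idea concrete, here is a minimal sketch of what could run on the target machine after a deploy. It is only an illustration under stated assumptions: the ops machine confirms a healthy deploy by creating a marker file over SSH (the file path, the `countdown` function and its parameters are all hypothetical names, not part of nixops or any other tool), and the rollback is done with `nixos-rebuild switch --rollback` run as root.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable


def rollback() -> None:
    """Roll back to the previous system generation (needs root on the target)."""
    subprocess.run(["nixos-rebuild", "switch", "--rollback"], check=True)


def countdown(
    confirm_file: Path,
    timeout_s: float,
    on_timeout: Callable[[], None] = rollback,
    poll_s: float = 1.0,
) -> bool:
    """Wait up to `timeout_s` seconds for `confirm_file` to appear.

    Returns True if the deploy was confirmed in time. Otherwise calls
    `on_timeout` (the rollback by default) and returns False.
    The confirmation file is an assumed convention: the ops machine would
    create it over SSH once its healthcheck has passed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if confirm_file.exists():
            return True
        if time.monotonic() >= deadline:
            break
        time.sleep(poll_s)
    on_timeout()
    return False
```

With this shape, the deploy script would start something like `countdown(Path("/run/deploy-confirmed"), timeout_s=120)` on the target right after switching, then SSH back in and create that file. If the new configuration breaks networking, the SSH step never succeeds and the target rolls itself back. The marker file would have to be removed before each deploy so a stale confirmation is not picked up.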

What are your thoughts on the above - and is there a common practice that I’m missing?

sorki · February 24, 2021, 3:15pm

There’s Nixus which can do automatic rollbacks.

Previous generations are kept as profiles in /nix/var/nix/profiles/, and you can also use one of them directly to switch back, e.g. /nix/var/nix/profiles/system-123-link/bin/switch-to-configuration switch
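As a rough sketch of using that directory directly, the snippet below finds the numbered `system-<N>-link` entries and builds the command for switching to one of them. The helper names are made up for illustration, and it assumes the default system profile naming shown in the path above.

```python
import re
from pathlib import Path
from typing import List, Optional

PROFILES_DIR = Path("/nix/var/nix/profiles")
_GEN_RE = re.compile(r"^system-(\d+)-link$")


def generation_number(name: str) -> Optional[int]:
    """Return N for an entry named 'system-N-link', else None."""
    match = _GEN_RE.match(name)
    return int(match.group(1)) if match else None


def list_generations(profiles_dir: Path = PROFILES_DIR) -> List[int]:
    """List the system generation numbers found in the profiles directory, ascending."""
    numbers = (generation_number(p.name) for p in profiles_dir.iterdir())
    return sorted(n for n in numbers if n is not None)


def switch_command(generation: int, profiles_dir: Path = PROFILES_DIR) -> List[str]:
    """Build the argv that activates the given generation (run it as root)."""
    script = profiles_dir / f"system-{generation}-link" / "bin" / "switch-to-configuration"
    return [str(script), "switch"]
```

On a real machine you would pass the result to something like `subprocess.run(switch_command(list_generations()[-2]), check=True)` to go back one generation, assuming at least two generations exist.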

I’ve been considering adding healthchecks as systemd services, so even switch would tell you when something is not up to spec (and you could possibly react to that with rollback).

Tangentially related, there’s the WIP pull request #84204 in NixOS/nixpkgs by danielfullmer, “nixos/systemd-boot: boot counting and automatic fallback”, which you might find interesting as well (requires EFI and systemd-boot).

nrdxp · February 25, 2021, 1:08am

There is also deploy-rs which has this feature built in and enabled by default.