I’m interested in thoughts on this problem:
The other day I was rolling out updates with manual scripts that ran nix-build and nix-copy-closure, followed by switch-to-configuration.
After updating the script to support more than one prod box, I ended up deploying completely wrong network settings (the IPs for machine A onto machine B). This rendered the machine completely unusable; I couldn’t even SSH into it. I had to take the whole prod box down, re-provision a server, update DNS and redeploy. I’ve since switched to nixops to avoid as many similar mistakes as possible, but it doesn’t address this issue.
The problem: is there a way I’m missing to roll the machine back to its previous state? This particular example was on DigitalOcean. They do have a recovery console, but it comes up too late to access GRUB (and I’m not even sure GRUB would be populated with previous generations).
A potential solution I’m thinking of is automatic rollback: once a deploy is complete, a countdown starts on the target machine and waits for the ops machine to call back in and run a healthcheck. A successful healthcheck kills the countdown process and everything stays as deployed. If the target machine doesn’t hear back before the countdown runs out, it automatically runs nixos-rebuild switch --rollback on itself.
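
To make the idea concrete, here is a rough sketch of the countdown side, meant to run on the target machine right after the switch. It is only an illustration under assumptions, not a tested deployment tool: the marker-file mechanism, the function names and all parameters are hypothetical, and the rollback assumes the previous generation is registered in the system profile, which may not hold if switch-to-configuration was invoked by hand.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable


def wait_for_confirmation(
    marker: Path,
    timeout_s: float,
    rollback: Callable[[], None],
    poll_s: float = 1.0,
) -> bool:
    """Wait for the ops machine to confirm the deploy.

    The ops machine confirms by creating `marker` (hypothetical mechanism,
    e.g. over SSH after its healthcheck passes). Returns True if the marker
    appears before the deadline; otherwise calls `rollback` and returns False.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if marker.exists():
            # Consume the marker so it cannot confirm a later deploy.
            marker.unlink()
            return True
        time.sleep(poll_s)
    # Nobody called back in time: assume the new configuration is broken.
    rollback()
    return False


def nixos_rollback() -> None:
    """Switch back to the previous generation; must run as root on the target."""
    subprocess.run(["nixos-rebuild", "switch", "--rollback"], check=True)
```

On the target this could be started detached from the deploy's SSH session, for example as wait_for_confirmation(Path("/run/deploy-confirmed"), 300, nixos_rollback), where the path and the five-minute timeout are placeholders. The ops machine would then create the marker over SSH once its healthcheck passes. If the new network settings are wrong, that SSH call can never arrive, so the countdown expires and the machine rolls itself back.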
What are your thoughts on the above, and is there a common practice that I’m missing?