Sorry, forgot to answer this one:
@matthewcroughan: Is it possible you could share the way you implement that timebomb?
Sharing the solution below, and hopefully it’ll make sense why it shouldn’t live in nixos-rebuild.
The timebomb itself is implemented in systemd. It activates on deploy and machine restart, and shuts off after executing.
If the revision has already been marked as healthy, the timebomb is a no-op even when it executes.
systemd.services.rollback = {
  enable = true;
  description = "automatically rollback to the previous rev if unhealthy";
  script = ''
    #!${pkgs.bash}/bin/bash
    sleep 5
    healthy=$(readlink /srv/revisions/healthy)
    latest=$(readlink /srv/revisions/latest)
    if [[ "$latest" != "$healthy" ]]; then
      echo "!!! ERROR !!! $latest is unhealthy!" >> /var/log/rollback.log
      echo "rolling back to $healthy" >> /var/log/rollback.log
      nixos-rebuild --rollback switch >> /var/log/rollback.log
    else
      echo "$latest is healthy. nothing to do." >> /var/log/rollback.log
    fi
  '';
  wantedBy = ["multi-user.target"];
};
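To try the comparison outside a real machine, the check at the heart of the script can be pulled into a small function. This is only a sketch: `REV_DIR` and `needs_rollback` are names made up here, and the real service hardcodes /srv/revisions.

```shell
# Sketch only. REV_DIR is a hypothetical override so the check can be run
# against a scratch directory instead of /srv/revisions.
REV_DIR="${REV_DIR:-/srv/revisions}"

# Succeeds (exit 0) when latest and healthy point at different targets,
# which is the case where the timebomb would roll back.
needs_rollback() {
  # A missing symlink reads as an empty string, so it counts as a mismatch.
  healthy=$(readlink "$REV_DIR/healthy" || true)
  latest=$(readlink "$REV_DIR/latest" || true)
  [ "$latest" != "$healthy" ]
}
```

Called as `if needs_rollback; then ...; fi`, it takes the same branch as the `[[ "$latest" != "$healthy" ]]` test in the service.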
This part is irrelevant, just posting it for the full context:
# Build staging and prod, so we know of failures as early as possible
- name: build all boxes
  run: |
    nix build .#nixosConfigurations.staging.config.system.build.toplevel -o staging
    nix build .#nixosConfigurations.production.config.system.build.toplevel -o production
# Copy deployments to staging & prod. Activate only staging.
- name: deploy staging
  if: github.ref == 'refs/heads/main'
  run: |
    nix-copy-closure --to example.com staging
    readlink staging | xargs -I {} ssh example.com "ln -s {} /srv/revisions/$GIT_SHA"
    ssh example.com "/srv/revisions/$GIT_SHA/bin/switch-to-configuration switch"
    curl -i https://staging.example.com/ping | grep -i x-api-version | cut -f2 -d' ' | tr -d '\r' | xargs -I {} ssh example.com "unlink /srv/revisions/healthy && ln -sf /srv/revisions/{} /srv/revisions/healthy"
Note the most important bit: after the NixOS configuration is deployed, I don’t just use $GIT_SHA to mark it as healthy, but go full circle and curl the web server to see if it’s been restarted and is running the newest version. This ensures that:
- NixOS has been rebuilt, updated, and deployed
- the newest code is up and running
- using SSH to mark /srv/revisions/healthy ensures the machine is still accessible
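The long curl pipeline above is easier to follow when split into steps. Below is a rough sketch of that split. `api_version` and `verify_deployed` are hypothetical helper names, and the explicit comparison against the expected SHA is an added assumption: the pipeline in the workflow simply marks whatever version the server reports.

```shell
# Reads an HTTP response (headers included) on stdin and prints the value of
# the X-Api-Version header, using the same cut/tr steps as the workflow.
api_version() {
  grep -i '^x-api-version:' | cut -f2 -d' ' | tr -d '\r'
}

# Hypothetical stricter check: succeeds only when the server at $2 reports
# exactly the revision given in $1.
verify_deployed() {
  expected=$1
  url=$2
  running=$(curl -si "$url" | api_version)
  [ -n "$running" ] && [ "$running" = "$expected" ]
}
```

For example, `verify_deployed "$GIT_SHA" https://staging.example.com/ping` would fail the step early if the old code were still being served.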
…continued
- name: copy deployment to production, don't activate
  if: github.ref == 'refs/heads/main'
  run: |
    nix-copy-closure --to api.example.com production
    readlink production | xargs -I {} ssh api.example.com "ln -s {} /srv/revisions/$GIT_SHA"
Then another irrelevant part, omitted from above: I also symlink $GIT_SHA to latest, and have a local shell.nix that exposes some shorthands like ops deploy api. latest is there to be offered as a default argument; otherwise you can use any git SHA you want to deploy.
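The default-argument idea could look roughly like this. It is a guess at the shape, not the actual ops shorthand: `revision_path` and `REV_DIR` are invented names for illustration.

```shell
# Sketch only. REV_DIR is a hypothetical override of the revisions directory.
REV_DIR="${REV_DIR:-/srv/revisions}"

# Prints the path of the revision to activate: the given git SHA, or the
# latest symlink when no argument is passed.
revision_path() {
  rev=${1:-latest}
  printf '%s\n' "$REV_DIR/$rev"
}
```

A deploy shorthand could then run something like `ssh api.example.com "$(revision_path "$1")/bin/switch-to-configuration switch"`, mirroring the staging activation step.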