System inoperable after automatic upgrades

Hi!

I have a NixOS machine that’s running 24/7. It’s currently running NixOS 24.05, with a small number of packages pulled from unstable - none of them touch the core installation, and none are likely to be relevant here. The system also uses lanzaboote and agenix.

I enabled automatic upgrades in my configuration.nix:

system.autoUpgrade = {
  enable = true;
  flake = "/etc/nixos";
  flags = [ "--update-input" "nixpkgs" ];
  randomizedDelaySec = "1h";
};
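
For reference, system.autoUpgrade runs this as a systemd unit with a timer; as far as I can tell the unit is called nixos-upgrade (which matches the nixos-upgrade-start entry in the journal below), but treat the name as an assumption. The schedule and the last run can be checked with something like:

# when the upgrade timer last fired and when it fires next
systemctl list-timers nixos-upgrade.timer
# status and previous-boot journal of the upgrade run
systemctl status nixos-upgrade.service
journalctl -b -1 -u nixos-upgrade.service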

This usually works fine. However, I have now twice had the situation where the system became inoperable after performing that update during the night: the network connection went down and it no longer responded to pings.

After resetting the system and looking at the journal from the previous boot (journalctl -b -1), this is what I get:

https://faui2k11.de/random/journal-trimmed.txt

The following lines look particularly relevant:

Aug 06 05:11:00 pandora systemd[1]: nixos-rebuild-switch-to-configuration.service: Failed to open /run/systemd/transient/nixos-rebuild-switch-to-configuration.service: No such file or directory
[...]
Aug 06 05:11:00 pandora systemd[1]: nixos-rebuild-switch-to-configuration.service: Failed to open /run/systemd/transient/nixos-rebuild-switch-to-configuration.service: No such file or directory
[...]
Aug 06 05:11:05 pandora systemd[1]: Reexecuting requested from client PID 625338 ('systemctl') (unit nixos-rebuild-switch-to-configuration.service)...
Aug 06 05:11:05 pandora systemd[1]: Reexecuting.
Aug 06 05:11:05 pandora systemd[1]: systemd 255.9 running in system mode (+PAM +AUDIT -SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +BPF_FRAMEWORK -XKBCOMMON +UTMP -SYSVINIT default-hierarchy=unified)
Aug 06 05:11:05 pandora systemd[1]: Detected architecture x86-64.
Aug 06 05:11:05 pandora systemd[1]: bpf-lsm: LSM BPF program attached
Aug 06 05:12:35 pandora systemd[1]: Failed to fork off sandboxing environment for executing generators: Protocol error
Aug 06 05:12:35 pandora systemd[1]: Freezing execution.
Aug 06 05:13:00 pandora dbus-daemon[679]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Aug 06 05:13:00 pandora nixos-upgrade-start[625362]: Failed to execute operation: Connection timed out

Any advice on how I could debug this further?

Thanks!


Bumping this thread, as I just observed the same situation again (the system went down while performing an automated update during the night).

Two additional observations I made in the meantime:

Aug 06 05:11:00 pandora systemd[1]: nixos-rebuild-switch-to-configuration.service: Failed to open /run/systemd/transient/nixos-rebuild-switch-to-configuration.service: No such file or directory

This line seems to be a red herring: I have seen it a few times during upgrades that went through smoothly as well.

The system is also running the Prometheus node exporter, so I could look up a few metrics from right before it crashed. I was able to confirm that the tmpfs at /run did not overflow (note that /tmp is not on tmpfs on this machine). The reason I suspected that is that some googling turned up reports of systemd having issues with tmp directories, but those sound unrelated to me.
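
Roughly the kind of check I mean, assuming the stock node_exporter filesystem metrics and a Prometheus instance on localhost:9090 (both assumptions, adjust to your setup):

# free vs. total bytes on the /run tmpfs shortly before the crash
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=node_filesystem_avail_bytes{mountpoint="/run"}'
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=node_filesystem_size_bytes{mountpoint="/run"}'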

Again, any ideas and pointers would be appreciated :slight_smile:

I might just be talking to myself here, but having a (public) log might also be useful at some point :wink:

This just happened again on my laptop, this time when manually running nixos-rebuild switch.

The output of nixos-rebuild switch looked like this:

[agenix] chowning...
setting up /etc...
restarting systemd...
Error: Failed to reset failed units

Caused by:
    Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
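
If I read the activation flow correctly, this is the step where switch-to-configuration clears previously failed units over D-Bus; the manual equivalent would be something like:

# clear the "failed" state of all units (what the activation step appears to do via D-Bus)
systemctl reset-failed
# quick check whether PID 1 still responds over D-Bus at all
systemctl is-system-running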

The journal contained the by now familiar “Protocol error”:

Nov 17 22:30:54 hyperion systemd[1]: Failed to fork off sandboxing environment for executing generators: Protocol error
Nov 17 22:30:54 hyperion systemd[1]: Freezing execution.

It’s worth noting that my laptop is running NixOS unstable, so this was not magically fixed by the rewrite of switch-to-configuration.

I’ve occasionally seen the same failure mode (systemd[1]: Failed to fork off sandboxing environment for executing generators: Protocol error during a systemd upgrade re-exec leading to it becoming non-responsive), but only on occasions where I’ve ended up upgrading systemd, downgrading systemd, then upgrading it again (generally as a result of deploy-rs’ automatic rollbacks on activation failure). Probably unrelated to your case?