Nixos-rebuild stuck in an activation loop

damscal · November 3, 2024, 6:21pm

Hi,

I recently had to restore one of my systems from a backup that is a few months old, and I’m now trying to reapply my most recent tested configuration. The build succeeds, but the activation gets stuck in an endless loop. The strange thing is that this configuration activated fine before I restored the system from backup.
I’m using flakes on NixOS 24.05, aarch64. The machine is a virtual private server on Oracle Cloud, so I cannot interact with the boot screen. I access the VPS through code-server (i.e. remote VS Code) and ssh. Shortly after I run nixos-rebuild test, both code-server and ssh get disconnected, so I saved the output of nixos-rebuild to a file and force-rebooted the system from the Oracle Cloud console.

This is the output:

stopping the following units: acme-finished-nextcloud.deeporange.onira.world.target, acme-fixperms.service, acme-lockfiles.service, acme-nextcloud.deeporange.onira.world.timer, audit.service, code-server.service, kmod-static-nodes.service, logrotate-checkconf.service, mount-pstore.service, network-local-commands.service, network-setup.service, nextcloud-cron.timer, nscd.service, phpfpm-nextcloud.service, phpfpm.slice, phpfpm.target, postgresql.service, redis-nextcloud.service, resolvconf.service, run-wrappers.mount, systemd-binfmt.service, systemd-bootctl.socket, systemd-creds.socket, systemd-hostnamed.socket, systemd-modules-load.service, systemd-oomd.service, systemd-oomd.socket, systemd-sysctl.service, systemd-timesyncd.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket, systemd-udevd.service, systemd-update-done.service, systemd-vconsole-setup.service, systemd-zram-setup@zram0.service, var-lib-nextcloud.mount, var-lib-postgresql.mount
NOT restarting the following changed units: -.mount, getty@tty1.service, nix.mount, serial-getty@ttyAMA0.service, systemd-fsck@dev-disk-by\x2dlabel-ESP.service, systemd-journal-flush.service, systemd-logind.service, systemd-random-seed.service, systemd-remount-fs.service, systemd-update-utmp.service, systemd-user-sessions.service
activating the configuration...
removing group ‘redis-nextcloud’
removing group ‘postgres’
removing group ‘nextcloud’
removing user ‘redis-nextcloud’
removing user ‘postgres’
removing user ‘nextcloud’
setting up /etc...
removing obsolete symlink ‘/etc/systemd/user-generators’...
removing obsolete file ‘/etc/mtab’...
restarting systemd...
restarting sysinit-reactivation.target
reloading the following units: dbus.service, firewall.service, reload-systemd-vconsole-setup.service, srv.mount, var-lib-acme.mount, var-lib-nixos.mount, var-lib-systemd.mount, var-log.mount
restarting the following units: acme-deeporange.onira.world.timer, dhcpcd.service, home-manager-damscal.service, nginx.service, nix-daemon.service, persist.mount, sshd.service, swap.mount, systemd-journald.service
activating the configuration...
setting up /etc...
restarting sysinit-reactivation.target
restarting the following units: acme-deeporange.onira.world.timer, dhcpcd.service, home-manager-damscal.service, nginx.service, nix-daemon.service, persist.mount, sshd.service, swap.mount, systemd-journald.service
activating the configuration...
setting up /etc...
restarting sysinit-reactivation.target
restarting the following units: acme-deeporange.onira.world.timer, dhcpcd.service, home-manager-damscal.service, nginx.service, nix-daemon.service, persist.mount, sshd.service, swap.mount, systemd-journald.service
activating the configuration...
setting up /etc...

The last part keeps looping endlessly.
Any ideas of what might be causing this?
Thank you.

waffle8946 · November 3, 2024, 6:38pm

Please share your config, particularly anything around activation scripts.
It might also be good to run a store repair, just in case.

damscal · November 3, 2024, 10:01pm

I performed a store repair, to no avail…

My configurations are in this repo: GitHub - damscal/nix-garden at master

This particular system is called deeporange. It is defined, among others, in flake.nix; its configuration modules are in ./nixos/configurations/deeporange/ and in ./nixos/common-assets/.
I believe I have only one activation script, in the module ./nixos/common-assets/optin-persistence.nix. I tried running nixos-rebuild with that activation script commented out, but the result was the same. I also tried disabling the modules nextcloud-server.nix, code-server.nix, sops.nix and ephemeral-btrfs.nix entirely, with no change.

Is there a way to get a more verbose activation log to understand what’s happening here?

ElvishJerricco · November 3, 2024, 10:50pm

restarting persist.mount seems really odd, though I can’t think of any reason that would lead to this loop behavior.

damscal · November 4, 2024, 6:23pm

I’ve nailed down the issue. It was in fact persist.mount.

Apparently the fileSystems.<name>.device option doesn’t always support specifying the location of block devices by partlabel. Specifying it by label fixed it.

ElvishJerricco · November 4, 2024, 6:28pm

No, that’s not correct. There’s nothing special about by-partlabel that would make it unusable. I asked around on matrix about this yesterday and it sounds like this is actually a bizarre symptom of this issue, which has been resolved and a fix will be in nixos-unstable before long.

damscal · November 4, 2024, 6:43pm

Good to know! In my case I confirmed it was the one and only line that was causing the endless activation loop. Maybe it’s some kind of edge case… my persist mount is a btrfs subvolume with neededForBoot set to true.

ElvishJerricco · November 4, 2024, 6:45pm

I suspect you had changed the fileSystems configuration somehow and that was causing switch-to-configuration to want to restart those mount units, leading to the aforementioned bug / PR, which leads to the looping.

damscal · November 4, 2024, 6:56pm

Yes, after my last backup I had changed the fs label, so when I restored from the backup I changed the fileSystems config to use by-partlabel as a quick fix, instead of changing the fs label again. However, for some reason by-partlabel is not working in my case.
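
For readers following along, the two ways of naming the device that are being compared here might look roughly like the sketch below. The mount point, the label values and the subvol option are placeholders chosen for illustration and are not taken from the deeporange configuration; only the option names (device, fsType, options, neededForBoot) are standard NixOS options.

```nix
{
  # Hypothetical values: "/persist", "persist" and "subvol=persist" are
  # placeholders, not copied from the repository discussed in this thread.
  fileSystems."/persist" = {
    # Variant A: refer to the partition by its GPT partition label.
    # device = "/dev/disk/by-partlabel/persist";

    # Variant B: refer to the filesystem by its filesystem label.
    device = "/dev/disk/by-label/persist";

    fsType = "btrfs";
    options = [ "subvol=persist" ];
    neededForBoot = true;
  };
}
```

Both variants are ordinary device paths. As suggested earlier in the thread, what seems to matter for the loop is that this block changed between the running and the new configuration, not which of the two paths it uses.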

ElvishJerricco · November 4, 2024, 6:58pm

Like I said, it’s not because you used by-partlabel, it’s because changing the fileSystems configuration led to the issue fixed by the PR I linked. by-partlabel is not the problem.

damscal · November 4, 2024, 7:50pm

I see. Then do you know if there is a way to change my fileSystems config (e.g. to use by-partlabel) without hitting the bug? I mean, other than switching to nixos-unstable and waiting for the fix.

ElvishJerricco · November 4, 2024, 7:54pm

You can probably do nixos-rebuild boot instead of nixos-rebuild switch and then reboot.