The `/run/current-system` symlink is not created on boot, which results in a kernel panic during init

dsferruzza · June 18, 2023, 12:28pm

Hello!

I have a dedicated server (from Scaleway) on which I installed NixOS many years ago.
Yesterday, I upgraded NixOS from 22.11 to 23.05 and rebooted.
The server did not come back up, and I noticed that the init process was not found, which resulted in a kernel panic.

I rebooted in rescue mode, used nixos-enter to chroot into my system and tried to investigate/fix things.
I noticed that the /run/current-system symlink was not there, which breaks a lot of stuff.
This might be because I was in a chroot (so systemd was not up) but I believe this is also the case when booting normally.
After long hours, I managed to mitigate the issue by adding the following in /etc/nixos/configuration.nix:

system.activationScripts = {
  fixboot.text = ''
    ln -sfn "$(readlink -f "$systemConfig")" /run/current-system
  '';
};

This is basically the same command that runs at the end of the activation script, but sooner.
For the record, here is the end of my generated activation script (/run/current-system/activate):

# Make this configuration the current configuration.
# The readlink is there to ensure that when $systemConfig = /system
# (which is a symlink to the store), /run/current-system is still
# used as a garbage collection root.
ln -sfn "$(readlink -f "$systemConfig")" /run/current-system

# Prevent the current configuration from being garbage-collected.
mkdir -p /nix/var/nix/gcroots
ln -sfn /run/current-system /nix/var/nix/gcroots/current-system

exit $_status

This fixes the issue and allows my server to boot normally, but I really don’t understand why I need to do this.
The issue is clearly caused by /run/current-system not being there when systemd is supposed to run and take over, but I have no idea why.
Maybe the normal symlink command does not run?
Maybe the symlink is deleted before the end of the init script?
Also, this may not be related to the upgrade of NixOS from 22.11 to 23.05: I hadn’t rebooted my server for a while (it is possible that I didn’t reboot after previous upgrades), so maybe something broke a while ago.

Does anyone have any idea of what could be the cause or how to further investigate?
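One way to test the "maybe the normal symlink command does not run" hypothesis is to look for statements that could end the activation script before its last lines. The sketch below is only an illustration: `find_early_exits` is a hypothetical helper name, and the sample file stands in for a real generated script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print numbered lines that contain `exit` as a whole word.
find_early_exits() {
  grep -nE '(^|[^[:alnum:]_])exit([^[:alnum:]_]|$)' "$1"
}

# Throwaway sample that imitates the shape of a generated activation script.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
echo "setting up /etc..."
[[ -e /run/booted-system ]] || exit 0
ln -sfn "$(readlink -f "$systemConfig")" /run/current-system
exit $_status
EOF

matches="$(find_early_exits "$sample")"
echo "$matches"
rm -f "$sample"
```

On a real system the same function could be pointed at `/nix/var/nix/profiles/system/activate`. The final `exit $_status` is expected; any match that appears before the `ln -sfn ... /run/current-system` line deserves a closer look.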

For the record, here is how I chroot from rescue mode (it was not easy to figure this out):

reboot the server in rescue mode using Scaleway’s dashboard (an Ubuntu rescue image is selected)
SSH into it
sudo dpkg-reconfigure locales (add 232. fr_FR ISO-8859-1 and 233. fr_FR.UTF-8 UTF-8; for some reason I need an ISO-8859-1 locale to be able to mount my /boot partition)
export LANG=POSIX
export LC_ALL=POSIX
bash <(curl -L https://nixos.org/nix/install)
. /home/david/.nix-profile/etc/profile.d/nix.sh
nix-env -f '<nixpkgs>' -iA nixos-install-tools (to have nixos-enter)
sudo mount /dev/sda9 /mnt
sudo mount /dev/sda1 /mnt/boot
sudo $(which nixos-enter)
PATH=$PATH:/nix/var/nix/profiles/system/sw/bin/ (I need this because at this point there is no /run/current-system so $PATH is quite empty and I cannot run any command)
ln -sfn "$(readlink -f /nix/var/nix/profiles/system)" /run/current-system
su david
at this point I can run most commands as usual (if they don’t rely on systemd) which allows me to run nixos-rebuild boot

R-VdP · June 18, 2023, 2:25pm

I’ve seen this happen before when the activation script somehow exits early. In my case there was an activation snippet that ran exit 1, so the script never reached the part where it was supposed to create the symlink.
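A minimal sketch of this failure mode, assuming the activation script is a single shell script with the custom snippets placed before the final symlink step. Everything here runs in a temporary directory; `WORKDIR`, `store-path` and the file names are made up for the demonstration and nothing touches the real system.

```shell
#!/usr/bin/env bash
# Simulate an activation script in which a snippet exits before the last step.
workdir="$(mktemp -d)"
mkdir "$workdir/store-path"

cat > "$workdir/activate" <<'EOF'
#!/bin/sh
# --- custom snippet (hypothetical) ---
[ -e "$WORKDIR/booted-system" ] || exit 0
# --- final step, imitating the end of the activation script ---
ln -sfn "$WORKDIR/store-path" "$WORKDIR/current-system"
EOF

# booted-system does not exist, so the snippet's `exit 0` ends the whole script.
WORKDIR="$workdir" sh "$workdir/activate"
status=$?

if [ -L "$workdir/current-system" ]; then
  result="symlink created"
else
  result="symlink missing"
fi
echo "exit status: $status, $result"
rm -rf "$workdir"
```

Note that the script still reports success: an `exit 0` in a snippet skips the symlink step without leaving any error behind, which makes this hard to spot.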

Do you have any custom activation snippets? Could you try deactivating them and see if that solves the issue?

I’m talking about this option: https://search.nixos.org/options?channel=unstable&show=system.activationScripts&from=0&size=50&sort=relevance&type=options&query=system.activationscripts

dsferruzza · June 18, 2023, 4:00pm

Indeed!

I had an old module to enable systemd’s user linger feature and it was exiting in some cases:

# Stop when the system is not running, e.g. during nixos-install
[[ -e /run/booted-system ]] || exit 0
# ... (rest of the script)

I guess this was wrapped at some point so that it would not exit the whole activation script, or something like that.
Anyway, as services.logind.killUserProcesses is false by default, I got rid of the whole thing and it’s working!
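For anyone who needs to keep a snippet like this, a possible alternative is to guard the work with an `if` rather than calling `exit`, so the rest of the script still runs. This is only a sketch in a temporary directory; `WORKDIR`, `store-path` and the echoed message are placeholders, not part of any real module.

```shell
#!/usr/bin/env bash
# Simulate an activation script whose snippet is guarded instead of exiting.
workdir="$(mktemp -d)"
mkdir "$workdir/store-path"

cat > "$workdir/activate" <<'EOF'
#!/bin/sh
# --- custom snippet (hypothetical): skipped when the system is not running ---
if [ -e "$WORKDIR/booted-system" ]; then
  echo "snippet work would run here"
fi
# --- final step, imitating the end of the activation script ---
ln -sfn "$WORKDIR/store-path" "$WORKDIR/current-system"
EOF

WORKDIR="$workdir" sh "$workdir/activate"
target="$(readlink "$workdir/current-system")"
echo "current-system -> $target"
rm -rf "$workdir"
```

Wrapping the snippet body in a subshell, `( ... )`, should have a similar effect, since `exit` inside a subshell only ends that subshell.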

Thanks a lot!