Grub2 error upgrading from 23.11 to 24.05

Roo · July 13, 2024, 12:00am

I’m running NixOS on pretty old hardware, and while I didn’t take notes, I’m pretty sure I followed the manual partitioning and installation process.

I did get it to install just fine, but given the error I’m seeing and my hazy recollection of the install process, it’s entirely possible I painted myself into this corner. Simply wiping the boot drive and starting again may be the best path forward (and being configuration-based is a strength of NixOS here). Still, I’m stubborn, so let’s see what happens.

Possibly related - this machine wouldn’t even successfully run a sudo nixos-rebuild switch as it seemed to be out of storage space (the boot drive is a 60GB SSD). I “fixed” this by running a manual sudo nix-collect-garbage -d which took ages but seemed to free up gobs of space.
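For a small boot drive like this, one option is to let NixOS prune the store on a schedule rather than waiting for a rebuild to run out of space. A minimal sketch for configuration.nix, assuming a weekly run and a 30-day retention window (both values are illustrative assumptions, not taken from this machine's config):

```nix
{
  # Run the garbage collector automatically instead of by hand.
  nix.gc.automatic = true;
  # systemd calendar expression; "weekly" is an assumed schedule.
  nix.gc.dates = "weekly";
  # Passed to nix-collect-garbage; the 30-day window is an assumption.
  nix.gc.options = "--delete-older-than 30d";
}
```

Unlike a blanket nix-collect-garbage -d, this keeps generations newer than the cutoff, so recent rollbacks should remain available.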

I’m doing an upgrade from 23.11 to 24.05:

$ sudo nix-channel --add https://channels.nixos.org/nixos-24.05 nixos
$ sudo nixos-rebuild switch --upgrade
unpacking channels...
building Nix...
building the system configuration...
updating GRUB 2 menu...
installing the GRUB 2 boot loader on /dev/sda...
Installing for i386-pc platform.
/nix/store/54aaggq6lpvdi9l7aqzwwa03b5l06mln-grub-2.12/sbin/grub-install: warning: this GPT partition label contains no BIOS Boot Partition; embedding won't be possible.
/nix/store/54aaggq6lpvdi9l7aqzwwa03b5l06mln-grub-2.12/sbin/grub-install: error: embedding is not possible, but this is required for cross-disk install.
/nix/store/2js95l6lpapk9bkic375ssdby06sxvkr-install-grub.pl: installation of GRUB on /dev/sda failed: Inappropriate ioctl for device
warning: error(s) occurred while switching to the new configuration

boo… no joy

The super nice thing is that I can very easily recover to a sane, booting 23.11 configuration by doing:

$ sudo nix-channel --add https://channels.nixos.org/nixos-23.11 nixos
$ sudo nixos-rebuild switch --upgrade
unpacking channels...
building Nix...
building the system configuration...
updating GRUB 2 menu...
activating the configuration...
setting up /etc...
reloading user units for roo...
setting up tmpfiles

Now, maybe because the previous upgrade failed, this attempt to ‘switch back’ is really a no-op? But based on what is logged, it does seem to go through all the steps.

What I just realized (rubber duck debugging here) is that /dev/sda is a tragically old IDE drive. smartctl claims:

  9 Power_On_Hours          -O--CK   001   001   000    -    147655

Yup, 16.8 years of power-on time… I’ve got it in the system more for a laugh than for actual storage. It’s only 40GB (it was one of my very old boot drives way back).

Hmm… but looking at my /etc/nixos/configuration.nix file I see

  # Use the GRUB 2 boot loader.
  boot.loader.grub.enable = true;
  # boot.loader.grub.efiSupport = true;
  # boot.loader.grub.efiInstallAsRemovable = true;
  # boot.loader.efi.efiSysMountPoint = "/boot/efi";
  # Define on which hard drive you want to install Grub.
  boot.loader.grub.device = "/dev/sda"; # or "nodev" for efi only
  boot.loader.grub.configurationLimit = 10;

well - that’s very interesting

The correct boot drive is /dev/sdc

$ sudo fdisk -l /dev/sdc
Disk /dev/sdc: 55.9 GiB, 60022480896 bytes, 117231408 sectors
Disk model: SPCC Solid State
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x6ed051ef

Device     Boot     Start       End   Sectors  Size Id Type
/dev/sdc1  *         2048 101607423 101605376 48.4G 83 Linux
/dev/sdc2       101607424 117229567  15622144  7.4G 82 Linux swap / Solaris

Weird how this ‘bad’ nixos config hasn’t been a problem until the upgrade?

Roo · July 13, 2024, 12:06am

Well - that was it. The wrong GRUB device was exactly the problem, and fixing /etc/nixos/configuration.nix to point GRUB at /dev/sdc solved it.

This time the upgrade went smoothly

$ sudo nix-channel --add https://channels.nixos.org/nixos-24.05 nixos
$ sudo nixos-rebuild switch --upgrade
unpacking channels...
building Nix...
building the system configuration...
these 3 derivations will be built:
  /nix/store/9rsq51p0hqdwihzfqdd1nq4kciwxhwaj-grub-config.xml.drv
  /nix/store/8an8w2rxp2yyjykfblf0lqqfb77qlzdd-install-grub.sh.drv
  /nix/store/crxc3cgcwrnx3i5gpr46rw62686cbc5k-nixos-system-backup-24.05.2704.a046c1202e11.drv
building '/nix/store/9rsq51p0hqdwihzfqdd1nq4kciwxhwaj-grub-config.xml.drv'...
building '/nix/store/8an8w2rxp2yyjykfblf0lqqfb77qlzdd-install-grub.sh.drv'...
building '/nix/store/crxc3cgcwrnx3i5gpr46rw62686cbc5k-nixos-system-backup-24.05.2704.a046c1202e11.drv'...
updating GRUB 2 menu...
installing the GRUB 2 boot loader on /dev/sdc...
Installing for i386-pc platform.
Installation finished. No error reported.
stopping the following units: NetworkManager-wait-online.service, NetworkManager.service, audit.service, cron.service, kmod-static-nodes.service, logrotate-checkconf.service, mount-pstore.service, network-interfaces.target, network-local-commands.service, network-setup.service, nscd.service, nullmailer.service, prometheus-node-exporter.service, resolvconf.service, systemd-modules-load.service, systemd-oomd.service, systemd-oomd.socket, systemd-sysctl.service, systemd-timesyncd.service, systemd-vconsole-setup.service, zfs-import-backup.service, zfs-mount.service, zfs-share.service, zfs-zed.service
NOT restarting the following changed units: getty@tty1.service, systemd-fsck@dev-disk-by\x2did-ata\x2dST340014A_3JX10Y8Q\x2dpart1.service, systemd-fsck@dev-disk-by\x2did-ata\x2dWDC_WD40EZRZ\x2d00GXCB0_WD\x2dWCC7K7ALHREE\x2dpart1.service, systemd-journal-flush.service, systemd-logind.service, systemd-random-seed.service, systemd-remount-fs.service, systemd-udev-settle.service, systemd-update-utmp.service, systemd-user-sessions.service, user-runtime-dir@1000.service, user@1000.service
activating the configuration...
removing group ‘systemd-journal-gateway’
removing user ‘systemd-journal-gateway’
setting up /etc...
removing obsolete symlink ‘/etc/pulse/client.conf’...
restarting systemd...
reloading user units for roo...
restarting sysinit-reactivation.target
reloading the following units: dbus.service, firewall.service, reload-systemd-vconsole-setup.service
restarting the following units: nixos-upgrade.timer, sshd.service, systemd-journald.service, zfs-scrub.timer, zpool-trim.timer
starting the following units: NetworkManager-wait-online.service, NetworkManager.service, audit.service, cron.service, kmod-static-nodes.service, logrotate-checkconf.service, mount-pstore.service, network-local-commands.service, network-setup.service, nscd.service, nullmailer.service, prometheus-node-exporter.service, resolvconf.service, systemd-modules-load.service, systemd-oomd.socket, systemd-sysctl.service, systemd-timesyncd.service, systemd-vconsole-setup.service, zfs-import-backup.service, zfs-mount.service, zfs-share.service, zfs-zed.service
the following new units were started: NetworkManager-dispatcher.service, sysinit-reactivation.target, systemd-hostnamed.service, systemd-tmpfiles-resetup.service

Problem solved.

Roo · July 17, 2024, 1:00pm

Super weird… checked in on that machine today… and it has rebooted for an update…

Now /dev/sda is root (again?!)

Odd… but sure, not a big deal. Note to others: sometimes device names change around. In the past this has only happened when I’ve made hardware changes (adding or removing drives), but apparently there are other scenarios.

This is why referring to devices by a more stable identifier, such as the /dev/disk/by-id paths, is safer.
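As a sketch of what that could look like for the GRUB setting in configuration.nix: the by-id name below is a made-up placeholder, not the real ID of this SSD, and would need to be replaced with whichever entry under /dev/disk/by-id/ points at the boot disk (ls -l /dev/disk/by-id/ shows the targets).

```nix
{
  boot.loader.grub.enable = true;
  # Hypothetical by-id path: substitute the actual entry for the boot SSD.
  # Use the whole-disk link (no -partN suffix), since GRUB is installed
  # to the disk itself rather than to a partition.
  boot.loader.grub.device = "/dev/disk/by-id/ata-EXAMPLE_SSD_SERIAL";
}
```

Because the link follows the drive itself, it should not matter whether the kernel enumerates the SSD as /dev/sda or /dev/sdc on a given boot.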

Roo · July 19, 2024, 2:03pm

Oh no - appears to be a race condition (?) – root is once again /dev/sdc

Shawn8901 · July 19, 2024, 6:43pm

That’s possibly some nice reading material:

https://wiki.archlinux.org/title/persistent_block_device_naming

Quoting the problem description:

If your machine has more than one drive sharing a naming scheme, the order in which their corresponding device nodes are added is arbitrary. This may result in block device names (e.g. /dev/sda and /dev/sdb, /dev/nvme0n1 and /dev/nvme1n1, /dev/mmcblk0 and /dev/mmcblk1) switching around on each boot, culminating in an unbootable system, kernel panic, or a block device disappearing.

Since you are facing exactly this problem, it might be a good idea to switch away from /dev/sdX names.
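For mounts, the same idea applies. A sketch of what stable references could look like in the NixOS config, assuming an ext4 root; the UUIDs are placeholders, and the real ones can be read with blkid or ls -l /dev/disk/by-uuid/. A hardware-configuration.nix generated by nixos-generate-config usually already looks like this, so it may only be the GRUB device that needs changing.

```nix
{
  fileSystems."/" = {
    # Placeholder UUID; fsType is an assumption about this install.
    device = "/dev/disk/by-uuid/00000000-0000-0000-0000-000000000000";
    fsType = "ext4";
  };

  swapDevices = [
    # Placeholder UUID for the swap partition.
    { device = "/dev/disk/by-uuid/11111111-1111-1111-1111-111111111111"; }
  ];
}
```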