Bootloader (Grub) Repair with ZFS root install

I have a nixos build on an encrypted ZFS root using the following guide:

https://openzfs.github.io/openzfs-docs/Getting%20Started/NixOS/Root%20on%20ZFS.html

I managed to overwrite grub while exploring another distro.

Looking around, I found this, which has suggestions for recovering the bootloader. I can mount the boot pool, but when I import the root pool (with successful key load), I cannot see the pool.

I must admit I don’t like Grub, but certainly don’t understand it enough to go it alone so I need a guide.

Can anyone help me understand what I am doing wrong? While I would like to understand the fix, I am happy just copying scripts at the moment to get back into my system.

TIA

John

Oops - should have added this for the bootloader: Bootloader - NixOS Wiki

Some assumptions, please clarify any differences:

  • you’ve booted nixos install media off some other media (usb etc) when taking the steps above and having these issues
  • you’re running manual commands (zpool import bpool, zpool import -l rpool, etc)
  • bpool imports successfully, as you’ve said
  • you’re using zfs encryption (not, say, luks)
  • rpool has some kind of issue, but you are prompted for key load as part of pool import

So it’s not clear to me what you can’t “see” about the pool. Can you run a zpool status to confirm the pool exists, and a zfs list -o space to confirm your data is there?

I suspect the issue is mounting the datasets in the right place, because at least the base OS datasets are mountpoint=legacy and need to be manually mounted. You might also have some mount ordering issues if some other datasets are defg

Happy to talk through more, but in general the install guide you have is close to what you want anyway. Basically, you want to:

  • boot the install media
  • skip damaging steps like partitioning and creating zfs pools
  • instead, import the existing pools in the same way; in particular with zpool import -R /mnt ...
  • mount at least the filesystem root, nix store, and boot/efi partitions, the same as during installation
  • redo (at least) the boot loader installation that nixos-install does. You could update your channel in the running installer system, and just rerun the install, really, but a smaller subset would be something like:
   ln -sfn /proc/mounts /mnt/etc/mtab
   nixos-enter --root /mnt
   NIXOS_INSTALL_BOOTLOADER=1 nixos-rebuild switch
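
The steps above can be sketched as a single script. This is a sketch under assumptions, not a tested recovery procedure: the pool names bpool and rpool and the /mnt altroot come from this thread, while the run helper and the DRY_RUN variable are hypothetical additions so the commands can be previewed before anything is executed. It imports rpool first so that the root filesystem is in place before /boot is mounted on top of it, and it leaves out mounting the EFI partitions, which depends on your disk layout.

```shell
#!/bin/sh
# Sketch only. DRY_RUN and run() are hypothetical helpers, not part of NixOS or ZFS.
# With DRY_RUN=1 (the default) every command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

repair_bootloader() {
    # Import with an altroot of /mnt; -l requests encryption keys during import,
    # -f is usually needed because the pools were last used by another system.
    # rpool goes first so that / is mounted before /boot lands on top of it.
    run zpool import -f -R /mnt -l rpool
    run zpool import -f -R /mnt bpool

    # Any mountpoint=legacy datasets and the EFI partitions must be mounted
    # under /mnt by hand at this point, the same way as during installation.

    # The same small subset of nixos-install quoted above.
    run ln -sfn /proc/mounts /mnt/etc/mtab
    run nixos-enter --root /mnt -c 'NIXOS_INSTALL_BOOTLOADER=1 nixos-rebuild switch'
}
```

Calling repair_bootloader only prints the four commands; DRY_RUN=0 repair_bootloader would run them, as root, from the install media.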

The latter article you linked on the bootloader covers that in more detail. Clearly the issue is in the middle bit, but I’m not clear on what state you’re in at that point.

Thanks for the comments. The assumptions are correct. My issue turned out to be that the import did not ask for a key for the encrypted rpool - I’d obviously picked up some bad info from somewhere and was expecting that! Finding the zfs load-key command fixed that.

nixos-enter works.
NIXOS_INSTALL_BOOTLOADER=1 nixos-rebuild switch will rebuild, then fail on "updating GRUB 2 menu".
(NIXOS_INSTALL_BOOTLOADER=1 /nix/var/nix/profiles/system/bin/switch-to-configuration boot gives the same result, as expected.)

My output is:

updating GRUB 2 menu…
mount: /boot/efis/*: can’t find in /etc/fstab.
mount: /boot/efi: special device /boot/efis/nvme-Samsung_SSD_980_AAA-part1 does not exist.
dmesg(1) may have more information after failed mount system call.
installing the GRUB 2 boot loader on /dev/disk/by-id/nvme-Samsung_SSD_980_AAA…
Installing for i386-pc platform.
/nix/store/9zhx9svcp271w0vpr91nli8q3bhh88np-grub-2.06/sbin/grub-install: error: unknown filesystem.
/nix/store/qy924qlwcr93kx61mn4wb2a3nn49c7iv-install-grub.pl: installation of GRUB on /dev/disk/by-id/nvme-Samsung_SSD_980_AAA failed: No such file or directory

The disk is part of a 3 disk mirror - the others (BBB and CCC) are not mentioned in the output. The partition is mounted:

/dev/nvme0n1p1 on /mnt/boot/efis/nvme-Samsung_SSD_AAA-part1 type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)

Strangely, /dev/nvme1n1p1 and /dev/nvme1n1p1 are mounted similarly, but I also have the following:

/dev/nvme1n1p1

dmesg was no help whatsoever. So at least I have got to the “I hate GRUB and don’t understand it” phase.

Yes. Perhaps it helps to clarify that, technically, it is not the pool that is encrypted, it’s a dataset-level property. Typically, and in the install instructions, the root dataset is encrypted, and all the others are expected to inherit from that. But you could, if you wanted, have an unencrypted child dataset. It’s also important to understand that some pool metadata is not encrypted (in the same way as if you stacked on top of LUKS, for example). This has some consequences, and some advantages, that are covered in the zfs manpages.

Yes, or the zpool import -l flag which implies that, but again may depend on your mount setup.
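
To see which datasets still need a key after an import, the keystatus property can be filtered. A minimal sketch, assuming the pool is called rpool as in this thread; keys_missing is a hypothetical helper name, and it only reads the tab-separated output of zfs get, so it can be tried on sample text first.

```shell
#!/bin/sh
# keys_missing is a hypothetical helper: it reads "name<TAB>keystatus" lines
# (the format printed by `zfs get -H -o name,value keystatus`) and prints the
# datasets whose key is not loaded.
keys_missing() {
    awk -F '\t' '$2 == "unavailable" { print $1 }'
}

# On the install media, after `zpool import -f -R /mnt rpool` (without -l),
# one might run:
#   zfs get -H -r -o name,value keystatus rpool | keys_missing
#   zfs load-key -a     # prompts for each key that is not loaded yet
#   zfs mount -a        # then mounts the datasets that ZFS manages itself
```

Datasets with mountpoint=legacy are not handled by zfs mount -a and still need a manual mount.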

As for the GRUB issues after this point, it would probably help if you could show:

  • your partitioning
  • your nixos config, especially all the boot.* and filesystems.* entries
  • mount properties of zfs filesystems zfs list -o space,mounted,canmount,mountpoint (at least the datasets relevant for install)
  • actual mounts df -h at the time you run the nixos-enter

Also, this strikes me as somewhat suspicious or unexpected:

Many thanks for continuing to help.
My 3 disks have the following partition structure:
p5 - grub2.core.img
p1 - fat32 EFI (boot, esp)
p2 - zfs bpool
p4 - unknown (swap)
p3 - zfs rpool

When I enter nixos-enter and import bpool, zfs list -o space,mounted,canmount,mountpoint results:

NAME                 AVAIL  USED   USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  MOUNTED  CANMOUNT  MOUNTPOINT
bpool                3.50G  126M   0B        96K     0B             126M       no       off       /mnt/boot
bpool/nixos          3.50G  125M   0B        96K     0B             125M       no       off       none
bpool/nixos/root     3.50G  125M   0B        125M    0B             0B         yes      on        /mnt/boot
rpool                2.02T  3.06T  0B        192K    0B             3.06T      no       off       /mnt
rpool/nixos          2.02T  3.06T  0B        192K    0B             3.06T      no       off       /mnt
rpool/nixos/home     2.02T  2.99T  168K      658G    0B             2.35T      no       on        /home
rpool/nixos/root     2.02T  69.3G  869M      68.5G   0B             0B         yes      on        /mnt
rpool/nixos/var      2.02T  2.00G  0B        192K    0B             2.00G      no       off       /var
rpool/nixos/var/lib  2.02T  1.23G  408K      1.23G   0B             0B         no       on        /var/lib
rpool/nixos/var/log  2.02T  789M   840K      788M    0B             0B         no       on        /var/log
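
One way to read a listing like the one above is to pick out the datasets that ZFS is allowed to mount (canmount=on) but that are not currently mounted. A small sketch; not_mounted is a hypothetical helper that expects the tab-separated output of zfs list -H -o name,mounted,canmount,mountpoint.

```shell
#!/bin/sh
# not_mounted is a hypothetical helper: it reads lines of
# "name<TAB>mounted<TAB>canmount<TAB>mountpoint" and prints every dataset that
# could be mounted but is not, together with its mountpoint.
not_mounted() {
    awk -F '\t' '$2 == "no" && $3 == "on" { print $1 " -> " $4 }'
}

# Typical use on the install media:
#   zfs list -H -o name,mounted,canmount,mountpoint | not_mounted
```

Applied to the listing above, this would flag rpool/nixos/home, rpool/nixos/var/lib and rpool/nixos/var/log.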

df -h results:

Filesystem        Size  Used  Avail  Use%  Mounted on
rpool/nixos/root  2.1T  69G   2.1T   4%    /
devtmpfs          3.2G  0     3.2G   0%    /dev
tmpfs             32G   8.0K  32G    1%    /dev/shm
tmpfs             16G   8.0K  16G    1%    /run
tmpfs             32G   456K  32G    1%    /run/wrappers
bpool/nixos/root  3.7G  125M  3.6G   4%    /mnt/boot

I’ve had to stop investigating any more for the moment, but the error:

mount: /boot/efi: special device /boot/efis/nvme-Samsung_SSD_980_AAA-part1 does not exist

looks as though it is because the UUIDs have changed for the disks, so they are not being mounted.
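
Whether the path named in the error is actually mounted can be checked directly before drawing conclusions about UUIDs. A minimal sketch: missing_mounts is a hypothetical helper that takes the expected mount points as arguments and reads a /proc/mounts-style listing on standard input; the paths in the usage comment are the ones from the error output above.

```shell
#!/bin/sh
# missing_mounts is a hypothetical helper: it prints each expected mount point
# (given as arguments) that does not appear in the /proc/mounts-style listing
# read from standard input.
missing_mounts() {
    awk -v want="$*" '
        BEGIN { n = split(want, w, " ") }
        { seen[$2] = 1 }    # field 2 of /proc/mounts is the mount point
        END { for (i = 1; i <= n; i++) if (!(w[i] in seen)) print w[i] }
    '
}

# Inside nixos-enter, one might run:
#   missing_mounts /boot /boot/efis/nvme-Samsung_SSD_980_AAA-part1 < /proc/mounts
```

Any path it prints is one that the GRUB install step expects but that is not mounted inside the chroot; this shows that a mount is missing, not why.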

Having read this
https://grahamc.com/blog/nixos-on-zfs
I wonder if I might be better off starting again and simplifying things (this was my first install of NixOS, and all my important data is stored on other devices).
