Bootloader (Grub) Repair with ZFS root install

I have a nixos build on an encrypted ZFS root using the following guide:

https://openzfs.github.io/openzfs-docs/Getting%20Started/NixOS/Root%20on%20ZFS.html

I managed to overwrite grub while exploring another distro.

Looking around, I found this, which has suggestions for recovering the bootloader. I can mount the boot pool, but when I import the root pool (with a successful key load), I cannot seem to see the pool.

I must admit I don’t like Grub, but certainly don’t understand it enough to go it alone so I need a guide.

Can anyone help me understand what I am doing wrong? While I would like to understand the fix, I am happy just copying scripts at the moment to get back into my system.

TIA

John

Oops - should have added this for the bootloader: Bootloader - NixOS Wiki

Some assumptions, please clarify any differences:

  • you’ve booted nixos install media off some other media (usb etc) when taking the steps above and having these issues
  • you’re running manual commands (zpool import bpool, zpool import -l rpool, etc)
  • bpool imports successfully, as you’ve said
  • you’re using zfs encryption (not, say, luks)
  • rpool has some kind of issue, but you are prompted for key load as part of pool import

So it’s not clear to me what you can’t “see” about the pool. Can you run zpool status to confirm the pool exists, and zfs list -o space to confirm your data is there?

I suspect the issue is mounting the datasets in the right place, because at least the base OS datasets are mountpoint=legacy and need to be manually mounted. You might also have some mount ordering issues if some other datasets are defg

Happy to talk through more, but in general the install guide you have is close to what you want anyway. Basically, you want to:

  • boot the install media
  • skip damaging steps like partitioning and creating zfs pools
  • instead, import the existing pools in the same way; in particular with zpool import -R /mnt ...
  • mount at least the filesystem root, nix store, and boot/efi partitions, the same as during installation
  • redo (at least) the boot loader installation that nixos-install does. You could update your channel in the running installer system, and just rerun the install, really, but a smaller subset would be something like:
   ln -sfn /proc/mounts /mnt/etc/mtab
   nixos-enter --root /mnt
   NIXOS_INSTALL_BOOTLOADER=1 nixos-rebuild switch

The latter article you linked on the bootloader covers that in more detail. Clearly the issue is in the middle bit, but I’m not clear on what state you’re in at that point.
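
Put together, the steps in the list above could look roughly like this from the installer shell. This is only a sketch: the pool names follow the guide, ESP_DEV and ESP_DIR are placeholders you have to fill in yourself, and the run/DRY_RUN wrapper is an addition for illustration so the commands can be printed before anything is executed. If your datasets use mountpoint=legacy, they would have to be mounted by hand (mount -t zfs ...) before the EFI step.

```shell
#!/usr/bin/env bash
# Sketch only. Pool names follow the guide (rpool, bpool); ESP_DEV and
# ESP_DIR are placeholders for your own EFI partition and its mount point.
ESP_DEV="${ESP_DEV:-/dev/disk/by-id/CHANGEME-part1}"
ESP_DIR="${ESP_DIR:-/mnt/boot/efis/CHANGEME-part1}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

reinstall_bootloader() {
  # Import under /mnt, root pool first so /mnt exists before /mnt/boot.
  # -l asks for the encryption keys during the import.
  run zpool import -f -R /mnt -l rpool || return 1
  run zpool import -f -R /mnt bpool || return 1
  # Mount one EFI system partition where the installed system expects it.
  run mount -t vfat "$ESP_DEV" "$ESP_DIR" || return 1
  # Same commands as in the post above, run inside the installed system.
  run ln -sfn /proc/mounts /mnt/etc/mtab || return 1
  run nixos-enter --root /mnt -c 'NIXOS_INSTALL_BOOTLOADER=1 nixos-rebuild switch'
}
```

Running DRY_RUN=1 reinstall_bootloader first only prints the five commands; once they look right for your layout, call reinstall_bootloader without DRY_RUN.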

Thanks for the comments. The assumptions are correct. My issue turned out to be that the import did not ask for a key for the encrypted rpool - I’d obviously picked up some bad info from somewhere and was expecting that! Finding the zfs load-key command fixed that.

nixos-enter works.
NIXOS_INSTALL_BOOTLOADER=1 nixos-rebuild switch will rebuild, then fail on updating the GRUB 2 menu
(NIXOS_INSTALL_BOOTLOADER=1 /nix/var/nix/profiles/system/bin/switch-to-configuration boot gives the same result…as expected)

My output is:

updating GRUB 2 menu…
mount: /boot/efis/*: can’t find in /etc/fstab.
mount: /boot/efi: special device /boot/efis/nvme-Samsung_SSD_980_AAA-part1 does not exist.
dmesg(1) may have more information after failed mount system call.
installing the GRUB 2 boot loader on /dev/disk/by-id/nvme-Samsung_SSD_980_AAA…
Installing for i386-pc platform.
/nix/store/9zhx9svcp271w0vpr91nli8q3bhh88np-grub-2.06/sbin/grub-install: error: unknown filesystem.
/nix/store/qy924qlwcr93kx61mn4wb2a3nn49c7iv-install-grub.pl: installation of GRUB on /dev/disk/by-id/nvme-Samsung_SSD_980_AAA failed: No such file or directory

The disk is part of a 3 disk mirror - the others (BBB and CCC) are not mentioned in the output. The partition is mounted:

/dev/nvme0n1p1 on /mnt/boot/efis/nvme-Samsung_SSD_AAA-part1 type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)

Strangely, /dev/nvme1n1p1 and /dev/nvme1n1p1 are mounted similarly, but I also have the following:

/dev/nvme1n1p1

dmesg was no help whatsoever. So at least I have got to the “I hate GRUB and don’t understand it” phase.

Yes. Perhaps it helps to clarify that, technically, it is not the pool that is encrypted, it’s a dataset-level property. Typically, and in the install instructions, the root dataset is encrypted, and all the others are expected to inherit from that. But you could, if you wanted, have an unencrypted child dataset. It’s also important to understand that some pool metadata is not encrypted (in the same way as if you stacked on top of LUKS, for example). This has some consequences, and some advantages, that are covered in the zfs manpages.

Yes, or the zpool import -l flag which implies that, but again may depend on your mount setup.
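
To make the difference concrete, here is a rough sketch of both routes. It assumes the root pool is called rpool and is imported under /mnt as in the guide; the function names and the DRY_RUN wrapper are made up for illustration.

```shell
#!/usr/bin/env bash
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Route 1: import first, then load keys and mount as separate steps.
unlock_after_import() {
  run zpool import -f -R /mnt rpool || return 1
  # Prompts for the key of every encryption root in the imported pools.
  run zfs load-key -a || return 1
  # Mounts what can be mounted automatically; datasets with
  # mountpoint=legacy or canmount=noauto are skipped.
  run zfs mount -a
}

# Route 2: let the import request the keys itself.
unlock_during_import() {
  run zpool import -f -R /mnt -l rpool
}
```

Either way, zfs list -o name,keystatus,mounted afterwards shows whether the keys are available and what actually got mounted.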

As for the GRUB issues after this point, it would probably help if you could show:

  • your partitioning
  • your nixos config, especially all the boot.* and fileSystems.* entries
  • mount properties of zfs filesystems zfs list -o space,mounted,canmount,mountpoint (at least the datasets relevant for install)
  • actual mounts df -h at the time you run the nixos-enter

Also, this catches my eye as somewhat suspicious or unexpected:

Many thanks for continuing to help.
My 3 disks have the following partition structure:
p5 - grub2.core.img
p1 - fat32 EFI (boot, esp)
p2 - zfs bpool
p4 - unknown (swap)
p3 - zfs rpool

When I enter nixos-enter and import bpool, zfs list -o space,mounted,canmount,mountpoint results:

NAME                 AVAIL  USED   USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  MOUNTED  CANMOUNT  MOUNTPOINT
bpool                3.50G  126M   0B        96K     0B             126M       no       off       /mnt/boot
bpool/nixos          3.50G  125M   0B        96K     0B             125M       no       off       none
bpool/nixos/root     3.50G  125M   0B        125M    0B             0B         yes      on        /mnt/boot
rpool                2.02T  3.06T  0B        192K    0B             3.06T      no       off       /mnt
rpool/nixos          2.02T  3.06T  0B        192K    0B             3.06T      no       off       /mnt
rpool/nixos/home     2.02T  2.99T  168K      658G    0B             2.35T      no       on        /home
rpool/nixos/root     2.02T  69.3G  869M      68.5G   0B             0B         yes      on        /mnt
rpool/nixos/var      2.02T  2.00G  0B        192K    0B             2.00G      no       off       /var
rpool/nixos/var/lib  2.02T  1.23G  408K      1.23G   0B             0B         no       on        /var/lib
rpool/nixos/var/log  2.02T  789M   840K      788M    0B             0B         no       on        /var/log

df -h results:

Filesystem        Size  Used  Avail  Use%  Mounted on
rpool/nixos/root  2.1T  69G   2.1T   4%    /
devtmpfs          3.2G  0     3.2G   0%    /dev
tmpfs             32G   8.0K  32G    1%    /dev/shm
tmpfs             16G   8.0K  16G    1%    /run
tmpfs             32G   456K  32G    1%    /run/wrappers
bpool/nixos/root  3.7G  125M  3.6G   4%    /mnt/boot

I’ve had to stop investigating any more for the moment, but the error:

mount: /boot/efi: special device /boot/efis/nvme-Samsung_SSD_980_AAA-part1 does not exist

looks as though it is because the UUIDs have changed for the disks, so they are not being mounted.
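
One way to test that guess is to check whether the paths the configuration refers to still exist at the moment the bootloader install runs. A small sketch; the function name is made up, and the paths in the usage note are simply the ones from the error message above.

```shell
#!/usr/bin/env bash
# Print every given path that does not exist (a dangling symlink counts
# as missing). Returns 1 if at least one path is missing, 0 otherwise.
missing_paths() {
  local p rc=0
  for p in "$@"; do
    if [ ! -e "$p" ]; then
      printf 'missing: %s\n' "$p"
      rc=1
    fi
  done
  return "$rc"
}
```

For example, inside nixos-enter, missing_paths /boot/efis/nvme-Samsung_SSD_980_AAA-part1 /dev/disk/by-id/nvme-Samsung_SSD_980_AAA-part1 would show whether it is the mount point or the device link that is absent, and ls -l /dev/disk/by-id lists the names that are really there.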

Having read this
https://grahamc.com/blog/nixos-on-zfs
I wonder if I might be better off starting again and simplifying things (this was my first install of NixOS, and all my important data is stored on other devices).

Hello. I’m the author of the OpenZFS guide. You should have just opened an issue at our repo instead.

Anyway, I’m documenting the solution for posterity. The bootloader, which you presumably had overwritten, is of no importance and can be generated/obtained easily elsewhere. The all-important initrd images are all stored in the boot zfs pool, therefore perfectly safe and sound as long as the ZFS pool was not overwritten.

First, download the GRUB rescue image from my repo, and then write the image to a disk. If you don’t want to use my prebuilt binary, you can also build from the flake in that repo.

Boot the computer from it.
Now the NixOS boot menu should be loaded. You are done.

Although I’m not a GRUB fanatic like Gordon Matzigkeit, I still appreciate GNU GRUB as a well engineered piece of software. I don’t really get the hate against it, which I’ve also encountered elsewhere on the internet.

After booting into NixOS, execute

nixos-rebuild boot --install-bootloader

to reinstall GRUB. Whatever bootloader is currently inside /boot/efis/disk1/EFI/BOOT/BOOTX64.efi will be overwritten, so make sure to back up that file first.
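
A minimal sketch of that backup step, assuming a root shell on the booted system. The backup_file helper and its timestamp suffix are made up for illustration; the path is the one from the post above.

```shell
#!/usr/bin/env bash
# Copy a file into a backup directory (default: next to the original)
# with a timestamp suffix, and print where the copy went.
backup_file() {
  local src="$1" dir="${2:-$(dirname "$1")}" dst
  dst="$dir/$(basename "$src").bak.$(date +%Y%m%d%H%M%S)"
  cp -- "$src" "$dst" && printf '%s\n' "$dst"
}

# On the booted NixOS system, as root:
#   backup_file /boot/efis/disk1/EFI/BOOT/BOOTX64.efi /root
#   nixos-rebuild boot --install-bootloader
```

Passing a second argument such as /root keeps the copy off the EFI partition, which is usually small.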

@ne9z - welcome and thanks for the solution.
I didn’t realise the OpenZFS project was as broad reaching as bootloader issues as well…

I followed your link (and beyond) and there is more to GRUB than I realised - I have been a “user” since the very beginning, and had many GRUB issues over the years (most of them due to Windows!). I think I have become stuck in my bootloader ways and so we drifted apart.

As I look to unlocking ZFS remotely (via ssh at startup or similar), I am not finding systemd-boot an easy alternative either, and I am not wedded to any particular technology or approach.

I’ll be looking again at GRUB, and suggest others who have changed the NixOS default boot do so as well - there may be some surprises :)

> @ne9z - welcome and thanks for the solution.
> I didn’t realise the OpenZFS project was as broad reaching as bootloader issues as well…

No, the OpenZFS project definitely does not have anything to do with
bootloader issues: it is just that I, being interested in writing Root
on ZFS guides, inevitably have to deal with bootloaders.

I found GRUB to be the most suitable solution, because it can boot from
a ZFS pool, while systemd-boot does not support anything other than fat32.

To be able to natively boot from ZFS is crucial for Root on ZFS: say you
have a server at a remote location. You want it to be able to boot,
even if one of the disks completely fails and becomes unreadable. GRUB
is able to do that, with a redundant ZFS boot pool. Guess what happens
with fat32 (and systemd-boot).

> I followed your link (and beyond) and there is more to GRUB than I
> realised - I have been a “user” since the very beginning, and had many
> GRUB issues over the years (most of them due to Windows!). I think I
> have become stuck in my bootloader ways and so we drifted apart.

Windows is easy to boot – if you know how to do it, and assuming you are
using UEFI. It is known that Windows likes to overwrite the fallback
bootloader location with its own bootloader (Windows Boot Manager):

   esp/EFI/BOOT/BOOTX64.EFI

With that in mind, all you have to do is tell NixOS not to install GRUB
to that location (this is the default in the Root on ZFS guide, but
unsuitable if you are using any other OS). That can be done with the
updated, flake-based guide by setting this option to false.

Then you can choose whether to boot NixOS, Windows, or some other system
from the UEFI firmware boot menu. This has nothing to do with GRUB
anymore. You can also directly manipulate EFI boot entries with
efibootmgr:

View all current entries:

  efibootmgr

Creating an entry for archlinux:

  efibootmgr --create --gpt --disk /dev/disk/by-id/ata-MYDISK --part 1 \
    --loader "\EFI\archlinux\grubx64.efi" --label "ArchLinux"

You can also set which bootloader should be default, which should be
booted only once on the next boot, or even configure netboot if your
motherboard firmware supports it. See man page for details.
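
As a rough sketch of the first two of those, efibootmgr has --bootnext for a one-time boot and --bootorder for the permanent order. The wrapper functions, the DRY_RUN switch and the entry numbers in the usage note are only examples; read the real numbers from the plain efibootmgr output.

```shell
#!/usr/bin/env bash
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Boot the given entry once, on the next restart only.
boot_once() {
  run efibootmgr --bootnext "$1"
}

# Set the permanent boot order; the first entry in the list is the default.
set_boot_order() {
  run efibootmgr --bootorder "$1"
}
```

For instance, boot_once 0003 would start entry Boot0003 on the next reboot, and set_boot_order 0003,0000 would make it the default ahead of Boot0000.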

> As I look to unlocking ZFS remotely (via ssh at startup or similar), I
> am not finding systemd-boot an easy alternative either, and I am not
> wedded to any particular technology or approach.

Remote unlock does not have anything to do with bootloaders; it is
mainly the job of the initrd. This was documented in the guide but later
removed in favor of brevity.

> I’ll be looking again at GRUB, and suggest others who have changed the
> Nixos default boot do so as well - there may be some surprises :)

I hope the updated, flake-based guide will be easier to follow for new
users. Still, it is recommended to learn Nix via Nix Pills to be able to
take full advantage of the Nix ecosystem.

Update: I added bpool auto-discovery to the GRUB rescue image. Now, to recover from a missing, broken, or destroyed bootloader, there are three steps:

  1. First, download the GRUB rescue image from my repo.
    If you don’t want to use my prebuilt binary, you can also build from the flake in that repo.

  2. Write the image to a disk.

  3. Boot the computer from it.

Now the NixOS boot menu should be loaded. You are done.
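
For step 2, writing the image could look something like the sketch below. Both arguments are placeholders: the image file name is hypothetical, and the target has to be the whole USB stick or disk (check with lsblk first), because dd overwrites it without asking.

```shell
#!/usr/bin/env bash
# Write an image file to a target device (or file) and flush it to disk
# before dd returns. $1 is the image, $2 is the target.
write_image() {
  dd if="$1" of="$2" bs=4M conv=fsync
}

# Example, as root (both names are placeholders):
#   write_image grub-rescue.img /dev/sdX
```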