ZFS issues upon selecting a previous NixOS version at the boot menu

I run encrypted ZFS with NixOS on my desktop. I recently bought some hard drives to use as bulk media storage, as my phone and laptop were running out of space. I read up on the ZFS documentation, ran sudo zpool create storage mirror /dev/sda /dev/sdb (unencrypted intentionally; it’s on my todo list to switch my desktop to be fully unencrypted so I can reboot it remotely), and everything seemed to go well. ZFS auto-mounted the pool at /storage without any explicit configuration from me. This was all fine and dandy and kept working before and after reboots.
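For reference, right after creating it the pool looked healthy; checks along these lines (nothing beyond the standard commands) showed both disks in the mirror and the default mountpoint:

    zpool status storage        # both disks listed under a mirror vdev, state ONLINE
    zfs get mountpoint storage  # the pool's root dataset defaults to /storage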

This morning I was attempting to upgrade NixOS, but hit several problems (broken package builds, and, more annoyingly, bugs in nixos-rebuild). In the process of trying to get builds to work, I rolled back to an earlier NixOS generation from the boot menu (likely one from before I had set up the ZFS configuration) and rebuilt (without the broken package) from there. The rebuild (and upgrade) succeeded. A little while later, I noticed /storage was empty. This freaked me out quite a bit.

I started trying to figure out why ZFS didn’t load my drives. Nothing came up immediately in any of the ZFS systemd service statuses. While fdisk recognized that the physical drives existed, zpool status, zfs list, and zfs mount showed no sign of them. I looked around to see if anyone else had hit this and could not find anything, though it’s a little hard to search for. Then I tried adding /storage to fileSystems in my hardware configuration file: fileSystems."/storage" = { device = "storage"; fsType = "zfs"; };. This was a very bad idea. Upon running nixos-rebuild --upgrade my desktop crashed to tell me “You are in emergency mode.”
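(For completeness, that entry sat in my hardware configuration roughly like this; the exact file split is just how I happen to organize things:)

    fileSystems."/storage" = {
      device = "storage";   # the pool/dataset name, not a /dev path
      fsType = "zfs";
    };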

So, some stuff I could really use some help with:

  1. Why does ZFS not recognize that my two drives at /dev/sda and /dev/sdb should form a mirrored pool? fdisk certainly recognizes that these drives exist, but zfs list, zpool status, etc. show no indication that they do.

  2. What was it that likely happened to get my drives into this unreadable state to begin with? Was it rolling back to an earlier boot version from the boot menu? Was it rebuilding the system? Was it upgrading the system?

  3. What did adding the drives to fileSystems do that made my computer immediately choke and die?

  4. How can I avoid this in the future, recover my data now, and get ZFS to auto-mount my mirrored storage drives at /storage?

Any help or suggestions for debugging would be appreciated. I’m really very worried I’ve accidentally corrupted the data on these drives.

The first thing to try is rebooting. Sometimes the relevant kernel modules don’t work properly between a rebuild and a reboot (I haven’t seen that cause exactly your symptoms, but it’s worth a try).

Oooohkay. Previously I was able to reboot “fine”, only without /storage working. Now I am getting dropped into emergency mode on rebooting. systemd is telling me Failed to mount /storage and dropping me into an emergency shell to check systemctl status storage.mount, which appears… fine?

Damn. I don’t know what’s going on.


I selected an old generation from the boot menu and was able to detect and import my data with sudo zpool import. It’s okay! Phew! Now to figure out all the other problems…
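(For anyone finding this later, the recovery amounted to roughly the following, with the pool named storage as above:)

    sudo zpool import            # with no arguments, scans and lists pools available to import
    sudo zpool import storage    # imports the pool; datasets with a ZFS mountpoint get mounted again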


So, it doesn’t really make sense to me that you were able to reboot and have /storage show up fine before all of this craziness started. NixOS doesn’t automatically import every pool on the system at boot, so the pool wouldn’t show up in zpool list out of the box. For the NixOS ZFS module to know to import it at boot, you have to either have one of the pool’s datasets in fileSystems, or have the pool listed in boot.zfs.extraPools. Without either of those, rebooting should have resulted in /storage not mounting.
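The extraPools route is the simpler of the two; a minimal sketch, assuming the pool keeps its default (non-legacy) mountpoints and is named storage as in your post:

    # configuration.nix
    boot.zfs.extraPools = [ "storage" ];

With that, the pool is imported at boot and zfs-mount.service mounts its datasets according to their mountpoint properties.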

The problem with putting a dataset in fileSystems like you did is that it ought to have mountpoint=legacy, or else it can cause a race condition with zfs-mount.service. Usually this isn’t an issue and it’ll boot OK anyway, albeit with the wrong mount settings, since a non-legacy mountpoint gets mounted via fstab instead of zfs-mount.service. But it can lead to full system crashes during boot.

So, the recommendation is to either use legacy mountpoints and have the datasets in fileSystems, or make sure to leave them out of fileSystems and put the pool name in boot.zfs.extraPools.
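If you go the legacy route instead, a rough sketch (assuming everything lives on the pool’s root dataset, storage, as in your setup) is to run zfs set mountpoint=legacy storage once on the live system, and then declare the mount much like you already tried:

    fileSystems."/storage" = {
      device = "storage";   # dataset name; with mountpoint=legacy this no longer races zfs-mount.service
      fsType = "zfs";
    };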


Have you set the mountpoint property of all the datasets on storage to legacy? If not, you can get race conditions (which, by the way, you might not have noticed with your previous setup, because waiting to unlock encrypted datasets can sometimes avoid the race).
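A quick way to check, assuming the pool is imported as storage:

    zfs get -r mountpoint storage   # shows the mountpoint property for the pool and every dataset under it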

This is likely the answer. Personally I use extraPools for everything except boot-critical pools (e.g. /nix).