Since updating ZFS recently, one of my volumes fails to mount at boot

I have a machine with two ZFS pools: the root volume (rpool, on an SSD) and a larger data volume (store, on a pair of mirrored HDDs).

When booting, rpool mounts just fine. However, store does not. All of the dependent ZFS mounts obviously also fail, as do the services that depend on them. Thus, systemd drops me into the emergency shell.

However, in order to proceed, all I need to do is a quick:

# mount -a
# systemctl default

and the system finishes coming up as normal. (Note, though, that systemctl restart store.mount still fails at this point.) Looking at what happened once the system finished booting, I see:

# journalctl -b -u store.mount
-- Journal begins at Sat 2019-09-28 01:31:53 PDT, ends at Mon 2021-01-18 18:49:44 PST. --
Jan 18 03:11:35 braid systemd[1]: Mounting /store...
Jan 18 03:11:35 braid mount[2584]: filesystem '/store' cannot be mounted, unable to open the dataset
Jan 18 03:11:35 braid systemd[1]: store.mount: Mount process exited, code=exited, status=1/FAILURE
Jan 18 03:11:35 braid systemd[1]: store.mount: Failed with result 'exit-code'.
Jan 18 03:11:35 braid systemd[1]: Failed to mount /store.

This, combined with the fact that the system used to work just fine (with, IIRC, zfs-0.8.x) until I updated it, leads me to believe there might be a bug in NixOS’ dependency logic or something like that.

Does anyone have any wisdom for me? I’m a long-time ZFS user, but I haven’t seen a problem like this before.

FWIW, my mountpoints are all legacy, except for rpool - is this right?:

# zfs list
NAME                             USED  AVAIL     REFER  MOUNTPOINT
rpool                            110G  4.51G      192K  none
rpool/home                      58.9G  4.51G     58.9G  legacy
rpool/root                      50.9G  4.51G      192K  none
rpool/root/nixos                50.9G  4.51G     50.9G  legacy
store                           2.49T   151G     17.6G  legacy
store/applications               124G   151G      124G  legacy
store/configfiles               16.1G   151G     16.1G  legacy
store/containers                6.56G   151G      516M  legacy
...

and my hardware-configuration.nix is up to date.
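
For reference, the store entries in there are just the standard generated legacy-mountpoint form; from memory, the one for /store is roughly:

fileSystems."/store" = {
  device = "store";
  fsType = "zfs";
};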

I think you need to add a neededForBoot = true; to the filesystems that you need for booting.
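
A rough sketch of what I mean (untested, guessing at your /store entry):

fileSystems."/store" = {
  device = "store";
  fsType = "zfs";
  neededForBoot = true;
};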

In the emergency shell, you can do lsmod | grep zfs to see if the needed modules were loaded.

But it’s unusual that you can just mount manually and then continue.

Well. Only the rpool is really needed for booting (it contains my root filesystem). The store pool only contains filesystems backing containers and such, which the host system doesn’t rely on. It’s just that systemd considers the system degraded if anything in the default target fails.

Also, the booting-specific parts (i.e. rpool) seem to work fine, even without a neededForBoot = true; anywhere!

Is that a new field? Also, where does it go - in hardware-configuration.nix, next to device and fsType?

I’ve got the same issue (recently, on nixos-unstable). I don’t need any of the volumes to boot, but the boot fails and I get dropped into the emergency console.
I think the issue started around the 6th of January or so.
I have currently locked the nixos-unstable channel to the version from before that to get my system booting.

There is only one volume that mounts successfully, and that is a volume that doesn’t use a legacy mountpoint (maybe that’s a workaround for now, but then it isn’t possible to mount them declaratively in NixOS, is it?).
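
Concretely that would be something like this, using the OP’s store dataset purely as an example (it moves the dataset off the legacy mountpoint so ZFS’s own mount machinery handles it, instead of a fileSystems entry):

# zfs set mountpoint=/store store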

Maybe worth an issue on GitHub?

I’ve had this issue for a while. My workaround was to add options = [ "nofail" ]; to any filesystem that lives on the pool that has trouble importing at boot:

fileSystems."/data/media/archive" = {
    device = "datatank/media/archive";
    fsType = "zfs";
    options = [ "nofail" ];
  };
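
As far as I understand it, nofail just means the mount isn’t treated as required for boot, so if it fails systemd still reaches the default target instead of dropping into the emergency shell.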

Then I import the pool manually later.

Does “for a while” mean it is fixed for you now without this option?
I’m asking because I’d like to avoid mounting the filesystems manually; I’d rather temporarily use ZFS’s own mounting system.

But it would be interesting to find out what the underlying cause is. Maybe a ZFS version update?:
https://github.com/NixOS/nixpkgs/commit/e44a4abdbeb479a6cff041a80be9ec9e2cfa78ce#diff-6b938a2bb5aecd0c3f770bd30c95987c65fa12bd8189bfbadad3b82857cc2ea3

Well I guess I’m trying it now.

I just tested the commits from before and after this update, and indeed this is what causes the issue.

No, sorry: it’s more of a workaround so I can boot with the failing pools, in case anyone was interested.

I would also like to avoid manually importing the pools. Fortunately, nixos-rebuild test takes care of the mounts automatically once the pool is imported, so it’s just zpool import datatank and a rebuild test for me.
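
I.e. the whole recovery after such a boot is just:

# zpool import datatank
# nixos-rebuild test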

FWIW, I’ve created an issue:
https://github.com/NixOS/nixpkgs/issues/110376

Sorry about the delay. The issue was resolved for me after updating again.

Thanks for the input!