Troubles with ZFS and debugging a systemd service

Hello!

A short while ago I manually entered commands and created a zpool and dataset. It was working fine, but I screwed up, and after jumping around generations and rebooting, the zpool and dataset were gone. So I thought it wise to create them via a systemd service (code here), and I have a few questions…

  1. Is this how others are handling their pools?
  2. How can I get feedback from that script into journalctl?
  3. journalctl did capture a “permission denied” error on the first echo line. What?? That’s /tmp/setup-zfs.log

N.1: it feels strange that they can vanish, but at the same time it’s strange to me to create them on boot/switch, because the directory can still exist (this happened to me today).

N.2: echo was not outputting to journalctl, which is why I ended up appending to /tmp/setup-zfs.log.
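For reference, this is roughly what the service looks like (a trimmed-down sketch with placeholder names and disk paths, not the exact script linked above):

systemd.services.setup-zfs = {
  description = "Create zpool and dataset if missing";
  wantedBy = [ "multi-user.target" ];
  after = [ "zfs.target" ];
  serviceConfig.Type = "oneshot";
  path = [ pkgs.zfs ];
  script = ''
    # this is the echo that gets "permission denied"
    echo "setup-zfs starting" >> /tmp/setup-zfs.log
    if ! zpool list pool-01 > /dev/null 2>&1; then
      zpool create pool-01 /dev/disk/by-id/PLACEHOLDER
      zfs create pool-01/data
    fi
  '';
};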

I don’t see how jumping around between generations would make a zpool vanish, so you lost me there.

zpools vanishing sounds like a severe issue; that should definitely not occur. I wouldn’t try to work around it. I would definitely try to understand how it could ever occur. Possibly malfunctioning hardware?


zpools don’t just randomly vanish - they’re very stateful, and unless you’re using something like Disko, you should not need to “manage” them in any way outside of manually invoking zpool create... exactly once (per each new pool).

I think you’re trying to fix the wrong issue here - you need to find out why the pool vanished in the first place. Can you give a more precise description of what you did? Step by step, ideally with commands and/or config changes.

Thank you both for your replies. This is my first time working with ZFS, and I think what happened is that in one of the generations I did not mount the dataset with a fileSystems."<name>" block, which, if I’m not mistaken, takes care of auto-importing the pool during boot. So because of the restarts I could not see the pool’s auto-mount point (/pool-01).
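I mean a block like this (the names here are placeholders, not my exact config):

fileSystems."/pool-01" = {
  device = "pool-01/data";  # placeholder dataset name
  fsType = "zfs";           # this is also what triggers the pool import at boot
};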

I don’t think I ran any zpool commands other than zpool status, but I do remember that status gave me a “no pools found” message, which prompted me to try to recreate the pool and to think it had vanished.

I use

boot.zfs.extraPools = ["pool-01"];

Since I have a lot of filesystems and it’s easier than defining all of them: boot.zfs.extraPools


OK, I’m running into a situation at least similar to before.
I created the pool and dataset by entering these commands:

sudo zpool create tank /dev/sda /dev/sdb /dev/sdc
sudo zpool add tank log /dev/sdd
sudo zpool add tank cache /dev/sde
sudo zfs create tank/apps

Then it seemed to work fine. I made some changes to configuration.nix, eventually added these lines, and upon rebuild switch I got this error in journalctl:

aug 10 16:42:31 titan systemd[1]: Dependency failed for Local File Systems.
aug 10 16:42:31 titan systemd[1]: Dependency failed for /mnt/apps.
aug 10 16:42:31 titan systemd[1]: zfs-import.target: Job zfs-import.target/start failed with result ‘dependency’.
aug 10 16:42:31 titan systemd[1]: Dependency failed for ZFS pool import target.
aug 10 16:42:31 titan systemd[1]: Failed to start Import ZFS pool “tank”.
aug 10 16:42:31 titan systemd[1]: zfs-import-tank.service: Failed with result ‘exit-code’.
aug 10 16:42:31 titan systemd[1]: zfs-import-tank.service: Main process exited, code=exited, status=1/FAILURE
aug 10 16:42:31 titan zfs-import-tank-start[26376]: Try running ‘modprobe zfs’ as root to manually load them.
aug 10 16:42:31 titan zfs-import-tank-start[26376]: The ZFS modules cannot be auto-loaded.
zfs-import-tank-start[26043]: Pool tank in state MISSING, waiting

Now my system is broken, but I don’t mind - this is a new machine and I’m fine with reinstalling NixOS if needed. I still have one ssh connection open (new ones are refused), so I will try to get the pool working, and maybe rebuild switch will work.

For reference, my configuration.nix looks (partly) like this:

boot = {
  loader = {
    systemd-boot.enable = true;
    efi.canTouchEfiVariables = true;
  };
  supportedFilesystems = [ "zfs" ];
};
networking = {
  hostName = "titan";
  hostId = "00000001";
  networkmanager = {
    enable = true;
  };
};
fileSystems = {
  "/mnt/apps" = {
    device = "tank/apps";
    fsType = "zfs";
  };
};
environment.systemPackages = with pkgs; [
  vim
  wget
  zfs
];

Now, with a pretty minimalist configuration.nix, I am getting this error during rebuild switch:

building '/nix/store/a7i7jvvd91z9blq74fbn2fyqb93zb3px-nixos-system-titan-24.05.3407.12bf09802d77.drv'...
Failed to connect to bus: Connection refused
'/nix/store/iq3m0pnic6kbazw2splxhlwb5jgxcl9x-system-path/bin/busctl --json=short call org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager ListUnitsByPatterns asas 0 0' exited with value 1 at /nix/store/wd458gxy5qhwsk6q7wbskp55i3dsp5q7-nixos-system-titan-24.05.3407.12bf09802d77/bin/switch-to-configuration line 145.
warning: error(s) occurred while switching to the new configuration

full configuration.nix:

{ config, lib, pkgs, ... }:

{
  imports =
    [ 
      ./hardware-configuration.nix
    ];

  boot = {
    loader = {
      systemd-boot.enable = true;
      efi.canTouchEfiVariables = true;
    };
  };

  time.timeZone = "Europe/Stockholm";

  i18n = {
    defaultLocale = "en_US.UTF-8";

    extraLocaleSettings = {
      LC_ADDRESS = "sv_SE.UTF-8";
      LC_IDENTIFICATION = "sv_SE.UTF-8";
      LC_MEASUREMENT = "sv_SE.UTF-8";
      LC_MONETARY = "sv_SE.UTF-8";
      LC_NAME = "sv_SE.UTF-8";
      LC_NUMERIC = "sv_SE.UTF-8";
      LC_PAPER = "sv_SE.UTF-8";
      LC_TELEPHONE = "sv_SE.UTF-8";
      LC_TIME = "sv_SE.UTF-8";
    };
  };
  console.keyMap = "sv-latin1";
  networking = {
    hostName = "titan";
    hostId = "00000001";
    networkmanager = {
      enable = true;
    };
  };

  users.users.purefan = {
    isNormalUser = true;
    description = "purefan";
    extraGroups = [ "networkmanager" "wheel" ];
    packages = with pkgs; [ ];
    openssh = {
      authorizedKeys = {
        keys = [
          "some key"
        ];
      };
    };
  };

  services = {
    openssh = {
      enable = true;
      settings = {
        PasswordAuthentication = true;
      };
    };
  };

  environment.systemPackages = with pkgs; [
    vim 
    wget
    zfs
  ];

  system.stateVersion = "24.05"; # Did you read the comment?
}

Could it be related to this machine not having any desktop environment? It’s meant to be a server only, so I didn’t install one.

Not sure if it’s the cause, but the only thing that I have that you are missing is

boot.kernelPackages = config.boot.zfs.package.latestCompatibleLinuxPackages;

But if that was the problem I’d expect for things to fail after a reboot, not just after a switch.

OK, some better insight: after reinstalling NixOS fresh, I just enabled ssh and added the zfs package:

boot.supportedFilesystems = [ "zfs" ];
networking.hostName = "titan"; # Define your hostname.

# Enable networking
networking.networkmanager.enable = true;
networking.hostId = "00000002";

environment.systemPackages = with pkgs; [
  vim
  wget
  zfs
];

nixos-rebuild switch + reboot and no errors, but of course no zpool was detected.

Then I tried to add the filesystem:

fileSystems = {
  "/mnt/apps" = {
    device = "tank/apps";
    fsType = "zfs";
  };
};

which, even then, seemed like it shouldn’t work (since zpool list doesn’t show anything).

Now I get the same busctl error from before and I’m basically locked out. I will reinstall the OS again, and this time I will make a new zpool, hopefully overwriting the previous one.

This is perhaps where the confusion is coming from - neither of those will show anything unless a pool is imported. To list available but not imported pools, run zpool import (without any parameters).
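In other words, roughly:

zpool list          # only shows pools that are already imported
zpool status        # likewise, only imported pools
zpool import        # no arguments: scans the disks and lists pools that could be imported
zpool import tank   # actually imports the pool by name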

As for the dbus error - I don’t think it’s related to ZFS.

For reference, this is the only ZFS-related bit in my config:

  fileSystems."/" = {
    device = "tank/os/nixos";
    fsType = "zfs";
  };

  boot.fs.requestEncryptionCredentials = true;

  zfs.autoScrub.enable = true;

No other settings should be needed; using fsType = "zfs" should enable all the necessary settings.

There’s also the thing with networking.hostId. I see you changed it from 01 to 02.

It needs to be static and set (i.e. applied with nixos-rebuild switch) before you create and import the pool for the first time. You can still import the pool if you change/set the hostId after pool creation, but you may need to force-import it once. IIRC the NixOS boot scripts do a force import, but I’m not 100% sure, so try that manually first; maybe it will help.
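Something along these lines (pool name as in your case):

zpool import -f tank   # one-time forced import after the hostId change
zpool export tank      # optional: clean export so the next boot can import it normally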

Another problem I see is here:

 zfs-import-tank-start[26043]: Pool tank in state MISSING, waiting

I’ve never seen this error, but it suggests there is indeed something wrong with your zpool. Can you post exactly how you created it and what the command output was?

For now, I’ll assume you did this the same way for all tries:

If you had a zpool on those devices before, zfs may refuse to create the new zpool, saying something about an “existing filesystem present”. You should either wipe the first few MB of the disks or use -f with zpool create.
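For example (device names are placeholders - double-check them before wiping anything):

zpool create -f tank <devices...>   # force past the old label, or:
wipefs -a /dev/sdX                  # clear old filesystem signatures from a disk, or:
zpool labelclear -f /dev/sdX        # ZFS’s own way to drop a stale pool label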

Also, unrelated: you really don’t want to lay out your zpool like this, not unless all of your sd* devices are in fact some kind of hardware RAID devices. With this setup, if you lose any of sd[abc], you lose the entire pool, and if you lose the log device, you could possibly lose some data in transit. It also doesn’t really make sense to create a log device unless that device has significantly higher performance than the “storage” devices, and even then you’re probably not going to see the benefits unless your workload is synchronous (i.e. databases, VMs, …). It almost never makes sense to use the cache device either - L2ARC works rather differently than people usually think. You are much more likely to benefit from using a special device instead - look it up :wink:
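For comparison, a layout with some redundancy would look more like this (device names are placeholders, and /dev/disk/by-id/ paths are generally preferable to /dev/sdX):

# raidz1: the pool survives losing any one of the three data disks
zpool create tank raidz1 /dev/disk/by-id/diskA /dev/disk/by-id/diskB /dev/disk/by-id/diskC

# a special vdev holds metadata (and optionally small blocks); mirror it,
# because losing it means losing the whole pool
zpool add tank special mirror /dev/disk/by-id/fastA /dev/disk/by-id/fastB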


This is perhaps where the confusion is coming from - neither of those will show anything unless a pool is imported. To list available but not imported pools, run zpool import (without any parameters).

As for the dbus error - I don’t think it’s related to ZFS.

This bit cleared things up for me! The only difference is that I’m doing boot.zfs.extraPools and not the fileSystems way; I don’t know if there is a better way, but I’m really happy this is working after a few reboots.

I’m also very thankful for the clarification about the dbus error, because that was throwing me off.

All the best!

The difference between fileSystems and extraPools shouldn’t cause the import to fail like you were seeing, so I’m rather surprised that made any difference.

That said, a subsequent issue would have been that having a fileSystems entry for a non-legacy ZFS dataset can lead to confusing race conditions with zfs-mount.service (which mounts ZFS datasets that have a regular mountpoint=/a/path property - the default). Basically, NixOS sees any fileSystems entries that have fsType = "zfs"; and generates a systemd service to import the pool, along with lines in /etc/fstab to mount the datasets. For extraPools it just creates the import service. The import service should be fine (though this is the part you were seeing errors with), but then fstab and zfs-mount.service would have a race condition on actually mounting the datasets in the right order.
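The usual way around that race, if you do want explicit fileSystems entries, is to give those datasets legacy mountpoints so that only the generated fstab manages them:

zfs set mountpoint=legacy tank/apps

The fileSystems."/mnt/apps" entry itself can stay exactly as it was.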

This is one of a few caveats with ZFS that I should really add to the nixos wiki page…

@ElvishJerricco good points, thanks for describing that coherently!

As for “shouldn’t make a difference”, I think the OP probably did a lot of random trial-and-error stuff (good for you @purefan, I’m sure you learned a lot).

I’m just happy to have another ZFS user out there :slight_smile: