Is anyone using systemd to monitor their zpool status
and alert if an error is found? Currently I just manually check zpool status
from time to time, but I’d like to automate that.
I can highly recommend Prometheus. The prometheus-node-exporter will let you know if there is an issue with any filesystem (IMHO, you want to know about a filesystem error of any kind, whether or not it is ZFS), in addition to providing ZFS-specific metrics.
Do you have a NixOS configuration you would be willing to share for this?
Thanks, I’m checking it out, but this is just for my desktop workstation, so Prometheus may be overkill. The only filesystem I have on my workstation is ZFS, plus FAT32 for the EFI boot partition. All my important data is on ZFS, so that’s primarily what I’m concerned about.
I would go for an easier way: create a simple shell script that runs `zpool status`, redirect the output to a file, `grep` the file for an error, and if one is found, send an e-mail. Then use cron to run that at certain intervals: https://search.nixos.org/options?channel=21.11&show=services.cron.enable&from=0&size=50&sort=relevance&type=packages&query=cron
Yes, I am older
Is it possible to configure all that in `configuration.nix`? I like keeping all this stuff there so it’s automatic on rebuilds and new installations. For example, I have a systemd Tailscale service configured in `configuration.nix`, and was thinking I could do something similar to check for ZFS errors.
A lot can be done in `configuration.nix`, for example:
now I am trying this:
```nix
# Enable the cron service
services.cron = {
  enable = true;
  systemCronJobs = [
    # note: in a double-quoted Nix string the regex backslashes must be
    # doubled (\\s), otherwise Nix swallows them before cron sees the job
    "*/10 * * * * jane ${pkgs.zfs}/bin/zpool status 2>&1 | grep -ozP 'state:\\sONLINE\\n(.*\\n.*){1,}errors:\\sNo\\sknown\\sdata\\serrors\\n' || ${pkgs.zfs}/bin/zpool status 2>&1 | swaks --body -"
  ];
};
```
`swaks` is used for sending mail, so you do not need a local mail server; see https://jetmore.org/john/code/swaks/faq.html. Use a `.swaksrc` file with the `--to` and `--from` mail addresses defined, but you can also define `--to` and `--from` on the command line, just like the `--body` option.
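For reference, per the swaks FAQ linked above, a `.swaksrc` in your home directory simply lists default options, one per line, as they would appear on the command line (the addresses and server below are placeholders):

```
--to jane@example.com
--from jane@example.com
--server smtp.example.com
```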
For clarification: the commands after the `||` are only executed if the `grep` command returns non-zero, that is, when there is no exact match, and the `-` at the end lets the piping work.
(edited because of my wrong `zpool` return value assumption)
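The `||` behaviour is easy to verify in any shell; this sketch uses `echo` as a stand-in for `zpool status` and prints a message where `swaks` would run:

```shell
#!/bin/sh
# grep exits 0 when it finds a match, so the command after || does NOT run.
# Without a match grep exits non-zero, and the command after || DOES run.

echo 'state: ONLINE'   | grep -q 'state: ONLINE' || echo 'would send mail'  # prints nothing
echo 'state: DEGRADED' | grep -q 'state: ONLINE' || echo 'would send mail'  # prints "would send mail"
```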
On a side note, if you need inspiration, see what people are doing with crons in NixOS; there are all manner of cron examples out there!
> see what people are doing with crons in nixOS

Replacing them with systemd services and timers?
I’ve yet to discover kool kid systemd chops… but I guess systemd is here to stay, so I for one welcome our new overlords…
OK, I searched around a little, also read your suggestion, and came to this; sort of translated the cron job to systemd…
```nix
systemd.timers.zpool-check = {
  description = "check zpool status timer";
  wantedBy = [ "timers.target" ];
  partOf = [ "zpool-check.service" ];
  timerConfig = {
    OnCalendar = "*:0/10:0";
  };
};
systemd.services.zpool-check = {
  description = "check zpool status service";
  wantedBy = [ "multi-user.target" ];
  serviceConfig.Type = "oneshot";
  script = with pkgs; ''
    ${pkgs.swaks}/bin/swaks --body "test systemd service" --from jane@jungle.nl --to jane@jungle.nl
    out=$( ${pkgs.zfs}/bin/zpool status 2>&1 ) || echo $out | ${pkgs.swaks}/bin/swaks --from jane@jungle.nl --to jane@jungle.nl --body -
  '';
};
```
Hmm, still a little to do; my assumption that `zpool status` would return non-zero on error seems to be false…
```
[jane@nixos:~]$ zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:00:00 with 69 errors on Mon Jan 17 01:37:38 2022
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       DEGRADED     0     0     0
          sda2      DEGRADED     0     0   252  too many errors

errors: List of errors unavailable: permission denied

errors: 27 data errors, use '-v' for a list

[jane@nixos:~]$ echo $?
0
[jane@nixos:~]$
```
Thanks to https://datto.engineering/post/causing-zfs-corruption for helping me out!
This is a start:

```nix
services.prometheus.exporters.node.enable = true;
```
We use Prometheus extensively, so we have a lot of configuration around that; it isn’t super easy to extract meaningful snippets from it.
@M12, you can drop `systemd.timers.zpool-check` completely and add `startAt = "*:0/10:0";` to `systemd.services.zpool-check`. This gives you the timer “for free”.
God, I hate it when Unix commands don’t use the standard way to return success/failure with `$?`. Very annoying.
OK, for now it is:

```nix
systemd.timers.zpool-check = {
  description = "check zpool status timer";
  wantedBy = [ "timers.target" ];
  partOf = [ "zpool-check.service" ];
  timerConfig = {
    OnCalendar = "*:0/10:0";
  };
};
systemd.services.zpool-check = {
  description = "check zpool status service";
  wantedBy = [ "multi-user.target" ];
  serviceConfig.Type = "oneshot";
  script = with pkgs; ''
    # ${pkgs.swaks}/bin/swaks --body "test systemd service" --from jane@jungle.nl --to jane@jungle.nl
    ${pkgs.zfs}/bin/zpool status 2>&1 | grep -ozP 'state:\sONLINE\n(.*\n.*){1,}errors:\sNo\sknown\sdata\serrors\n' || ${pkgs.zfs}/bin/zpool status 2>&1 | ${pkgs.swaks}/bin/swaks --from jane@jungle.nl --to jane@jungle.nl --body -
  '';
};
```
That sends an e-mail if the words `state: ONLINE` AND `errors: No known data errors` are NOT both present in the output of the `zpool status` command…
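The pattern can be exercised without a real pool; this sketch feeds hand-written stand-ins for healthy and degraded `zpool status` output through the same `grep -ozP` match (GNU grep; `-z` treats the whole output as one record so the pattern can span lines):

```shell
#!/bin/sh
# Hand-written stand-ins for `zpool status` output.
healthy='  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:01 with 0 errors
errors: No known data errors'

degraded='  pool: rpool
 state: DEGRADED
  scan: scrub repaired 0B in 00:00:00 with 69 errors
errors: 27 data errors, use -v for a list'

# Same regex as in the service; -q gives just the exit status.
check() {
  printf '%s\n' "$1" | grep -qzP 'state:\sONLINE\n(.*\n.*){1,}errors:\sNo\sknown\sdata\serrors\n' \
    && echo 'no mail' || echo 'would send mail'
}

check "$healthy"    # prints "no mail"
check "$degraded"   # prints "would send mail"
```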
And the shorter version, thanks to @peterhoeg:
```nix
systemd.services.zpool-check = {
  description = "check zpool status service";
  wantedBy = [ "multi-user.target" ];
  serviceConfig.Type = "oneshot";
  startAt = "*:0/10:0";
  script = with pkgs; ''
    # ${pkgs.swaks}/bin/swaks --body "test systemd service" --from jane@jungle.nl --to jane@jungle.nl
    ${pkgs.zfs}/bin/zpool status 2>&1 | grep -ozP 'state:\sONLINE\n(.*\n.*){1,}errors:\sNo\sknown\sdata\serrors\n' || ${pkgs.zfs}/bin/zpool status 2>&1 | ${pkgs.swaks}/bin/swaks --from jane@jungle.nl --to jane@jungle.nl --body -
  '';
};
```
The script line starting with `#` can be used to test the e-mail sending and the timer, by removing the `#`.
Disclaimer: I am not responsible for any problems or data loss caused by using this; that is your own choice.
Thank you all, very much appreciated!
And if you are worried that some changes in `zpool status` are not picked up by these two checks (see above), then maybe this is something more secure, using `shasum` to detect any change in the output. First calculate your hash:
```
jane@nixos ~> zpool status | shasum
eb7333f252ea5e39d0759b9aae9e4f7026035cb7  -
```
Use that value in your `grep` argument; the result is true when the output is the same:
```
jane@nixos ~> zpool status | shasum | grep 'eb7333f252ea5e39d0759b9aae9e4f7026035cb7  -'
eb7333f252ea5e39d0759b9aae9e4f7026035cb7  -
```
Here I changed the hash in the `grep` argument on purpose (the deliberately wrong side), and now a false is returned:
```
jane@nixos ~> zpool status | shasum | grep 'eb7333f252ea5e39d0759b9aae9e4f7026035cb6  -'
jane@nixos ~ [0|0|1]>
```
Of course, when you change something in your zpool, you have to update the `shasum` value in the test, but I am sure you will get e-mail alerts if you forget.
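The hash comparison itself can be tried with any input; this sketch uses stand-in strings instead of `zpool status` (and `sha1sum`, which prints the same digest format as `shasum`):

```shell
#!/bin/sh
# Stand-ins for the pool status before and after something changed.
before='state: ONLINE'
after='state: DEGRADED'

# Record the known-good digest once, as done interactively in the post.
good=$(printf '%s\n' "$before" | sha1sum)

# Same output, same digest: grep matches, so the alert branch is skipped.
printf '%s\n' "$before" | sha1sum | grep -qF "$good" || echo 'would send mail'  # prints nothing

# Changed output, changed digest: grep fails, so the alert branch runs.
printf '%s\n' "$after" | sha1sum | grep -qF "$good" || echo 'would send mail'   # prints "would send mail"
```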
You can use `zpool status -x` to at least simplify the output.