Is anyone using systemd to monitor their zpool status
and alert if an error is found? Currently I just manually check zpool status
from time to time, but I’d like to automate that.
I can highly recommend Prometheus. The prometheus-node-exporter will let you know if there is an issue with any filesystem (IMHO, you want to know about a filesystem error of any kind, whether or not it is ZFS), in addition to providing ZFS-specific metrics.
Do you have a NixOS configuration you would be willing to share for this?
Thanks, I’m checking it out, but this is just for my desktop workstation, so Prometheus may be overkill. The only filesystem I have on my workstation is ZFS, plus FAT32 for the EFI boot partition. All my important data is on ZFS, so that’s primarily what I’m concerned about.
I would go for an easier way: create a simple shell script that runs `zpool status`, redirect the output to a file, `grep` the file for an error, and if one is found, send an e-mail. Then use cron to run that at certain intervals: https://search.nixos.org/options?channel=21.11&show=services.cron.enable&from=0&size=50&sort=relevance&type=packages&query=cron
Yes, I am older
Is it possible to configure all that in `configuration.nix`? I like keeping all this stuff there so it’s automatic on rebuilds and new installations. For example, I have a systemd Tailscale service configured in `configuration.nix`, and was thinking I could do something similar to check for ZFS errors.
A lot can be done in `configuration.nix`, for example:
now I am trying this:
```nix
# Enable the cron service
services.cron = {
  enable = true;
  systemCronJobs = [
    # note: in a double-quoted Nix string the regex backslashes must be
    # doubled (\\s), otherwise Nix swallows them before cron sees the job
    "*/10 * * * * jane ${pkgs.zfs}/bin/zpool status 2>&1 | grep -ozP 'state:\\sONLINE\\n(.*\\n.*){1,}errors:\\sNo\\sknown\\sdata\\serrors\\n' || ${pkgs.zfs}/bin/zpool status 2>&1 | swaks --body -"
  ];
};
```
`swaks` is used for sending mail, so you do not need a local mail server; see https://jetmore.org/john/code/swaks/faq.html. Use a `.swaksrc` file with the `--to` and `--from` mail addresses defined, but you can also define `--to` and `--from` on the command line, just like the `--body` option.
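For reference, per the swaks FAQ linked above, a `.swaksrc` in your home directory simply lists default options, one per line, as they would appear on the command line (the addresses and server below are placeholders):

```
--to jane@example.com
--from jane@example.com
--server smtp.example.com
```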
For clarification: the commands after the `||` are only executed if the `grep` command returns non-zero, that is, when there is no exact match, and the `-` at the end lets the piping work.
(edited because of my wrong `zpool` return value assumption)
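The `||` behaviour is easy to verify in any shell; this sketch uses `echo` as a stand-in for `zpool status` and prints a message where `swaks` would run:

```shell
#!/bin/sh
# grep exits 0 when it finds a match, so the command after || does NOT run.
# Without a match grep exits non-zero, and the command after || DOES run.

echo 'state: ONLINE'   | grep -q 'state: ONLINE' || echo 'would send mail'  # prints nothing
echo 'state: DEGRADED' | grep -q 'state: ONLINE' || echo 'would send mail'  # prints "would send mail"
```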
On a side note, if you need inspiration, see what people are doing with crons in NixOS; there are all manner of cron examples out there!
> see what people are doing with crons in nixOS

Replacing them with systemd services and timers?
I’ve yet to discover kool kid systemd chops… but I guess systemd is here to stay, so I for one welcome our new overlords…
OK, I searched around a little, also read your suggestion, and came to this; sort of translated the cron job to systemd…
```nix
systemd.timers.zpool-check = {
  description = "check zpool status timer";
  wantedBy = [ "timers.target" ];
  partOf = [ "zpool-check.service" ];
  timerConfig = {
    OnCalendar = "*:0/10:0";
  };
};
systemd.services.zpool-check = {
  description = "check zpool status service";
  wantedBy = [ "multi-user.target" ];
  serviceConfig.Type = "oneshot";
  script = with pkgs; ''
    ${pkgs.swaks}/bin/swaks --body "test systemd service" --from jane@jungle.nl --to jane@jungle.nl
    out=$( ${pkgs.zfs}/bin/zpool status 2>&1 ) || echo $out | ${pkgs.swaks}/bin/swaks --from jane@jungle.nl --to jane@jungle.nl --body -
  '';
};
```
Hmm, still a little to do; my assumption that `zpool status` would return non-zero on error seems to be false…
```
[jane@nixos:~]$ zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:00:00 with 69 errors on Mon Jan 17 01:37:38 2022
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       DEGRADED     0     0     0
          sda2      DEGRADED     0     0   252  too many errors

errors: List of errors unavailable: permission denied

errors: 27 data errors, use '-v' for a list

[jane@nixos:~]$ echo $?
0
[jane@nixos:~]$
```
Thanks to https://datto.engineering/post/causing-zfs-corruption for helping me out!
This is a start:

```nix
services.prometheus.exporters.node.enable = true;
```
We use Prometheus extensively, so we have a lot of configuration around that; it isn’t super easy to extract meaningful snippets from it.
@M12, you can drop `systemd.timers.zpool-check` completely and add `startAt = "*:0/10:0";` to `systemd.services.zpool-check`. This gives you the timer “for free”.
God, I hate it when Unix commands don’t use the standard way to return success/failure with `$?`. Very annoying.
OK, for now it is:

```nix
systemd.timers.zpool-check = {
  description = "check zpool status timer";
  wantedBy = [ "timers.target" ];
  partOf = [ "zpool-check.service" ];
  timerConfig = {
    OnCalendar = "*:0/10:0";
  };
};
systemd.services.zpool-check = {
  description = "check zpool status service";
  wantedBy = [ "multi-user.target" ];
  serviceConfig.Type = "oneshot";
  script = with pkgs; ''
    # ${pkgs.swaks}/bin/swaks --body "test systemd service" --from jane@jungle.nl --to jane@jungle.nl
    ${pkgs.zfs}/bin/zpool status 2>&1 | grep -ozP 'state:\sONLINE\n(.*\n.*){1,}errors:\sNo\sknown\sdata\serrors\n' || ${pkgs.zfs}/bin/zpool status 2>&1 | ${pkgs.swaks}/bin/swaks --from jane@jungle.nl --to jane@jungle.nl --body -
  '';
};
```
That sends an e-mail if the words `state: ONLINE` AND `errors: No known data errors` are NOT both present in the output of the `zpool status` command…
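The pattern can be exercised without a real pool; this sketch feeds hand-written stand-ins for healthy and degraded `zpool status` output through the same `grep -ozP` match (GNU grep; `-z` treats the whole output as one record so the pattern can span lines):

```shell
#!/bin/sh
# Hand-written stand-ins for `zpool status` output.
healthy='  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:01 with 0 errors
errors: No known data errors'

degraded='  pool: rpool
 state: DEGRADED
  scan: scrub repaired 0B in 00:00:00 with 69 errors
errors: 27 data errors, use -v for a list'

# Same regex as in the service; -q gives just the exit status.
check() {
  printf '%s\n' "$1" | grep -qzP 'state:\sONLINE\n(.*\n.*){1,}errors:\sNo\sknown\sdata\serrors\n' \
    && echo 'no mail' || echo 'would send mail'
}

check "$healthy"    # prints "no mail"
check "$degraded"   # prints "would send mail"
```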
And the shorter version, thanks to @peterhoeg:
```nix
systemd.services.zpool-check = {
  description = "check zpool status service";
  wantedBy = [ "multi-user.target" ];
  serviceConfig.Type = "oneshot";
  startAt = "*:0/10:0";
  script = with pkgs; ''
    # ${pkgs.swaks}/bin/swaks --body "test systemd service" --from jane@jungle.nl --to jane@jungle.nl
    ${pkgs.zfs}/bin/zpool status 2>&1 | grep -ozP 'state:\sONLINE\n(.*\n.*){1,}errors:\sNo\sknown\sdata\serrors\n' || ${pkgs.zfs}/bin/zpool status 2>&1 | ${pkgs.swaks}/bin/swaks --from jane@jungle.nl --to jane@jungle.nl --body -
  '';
};
```
The script line starting with `#` can be used to test the e-mail sending and the timer, by removing the `#`.
Disclaimer: I am not responsible for any problems or data loss caused by using this; that is your own choice.
Thank you all, very much appreciated!
And if you are worried that some changes in `zpool status` are not picked up by these two checks (see above), then maybe this is something more secure, using `shasum` to detect any change in the output. First calculate your hash:
```
jane@nixos ~> zpool status | shasum
eb7333f252ea5e39d0759b9aae9e4f7026035cb7  -
```
Use that value in your `grep` argument; the result is true when the output is the same:
```
jane@nixos ~> zpool status | shasum | grep 'eb7333f252ea5e39d0759b9aae9e4f7026035cb7  -'
eb7333f252ea5e39d0759b9aae9e4f7026035cb7  -
```
Here I changed the hash in the `grep` argument on purpose (the deliberately wrong side), and now a false is returned:
```
jane@nixos ~> zpool status | shasum | grep 'eb7333f252ea5e39d0759b9aae9e4f7026035cb6  -'
jane@nixos ~ [0|0|1]>
```
Of course, when you change something in your zpool, you have to update the `shasum` value in the test, but I am sure you will get e-mail alerts if you forget.
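The hash comparison itself can be tried with any input; this sketch uses stand-in strings instead of `zpool status` (and `sha1sum`, which prints the same digest format as `shasum`):

```shell
#!/bin/sh
# Stand-ins for the pool status before and after something changed.
before='state: ONLINE'
after='state: DEGRADED'

# Record the known-good digest once, as done interactively in the post.
good=$(printf '%s\n' "$before" | sha1sum)

# Same output, same digest: grep matches, so the alert branch is skipped.
printf '%s\n' "$before" | sha1sum | grep -qF "$good" || echo 'would send mail'  # prints nothing

# Changed output, changed digest: grep fails, so the alert branch runs.
printf '%s\n' "$after" | sha1sum | grep -qF "$good" || echo 'would send mail'   # prints "would send mail"
```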
You can use `zpool status -x` to at least simplify the output.