Does anyone have any recommended monitoring tools used to monitor your NixOS servers? My need extends a tiny bit further than something like https://uptimerobot.com/ - would be nice to be able to alert on high cpu usage, memory usage, etc.
NixOS has good support for Prometheus, and I use it on some of my servers. Essentially you install Prometheus and Alertmanager on a “check” server and then the node exporter on the servers you need to monitor. It’s not super easy, but it’s powerful.
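For anyone wanting a starting point, the split described above might look roughly like the sketch below. It is only an illustration: the attribute names `monitoredHost` and `checkServer` and the host name `server1.example.com` are made up, and each inner attrset would live in the configuration of the respective machine.

```nix
{
  # Hypothetical module for each server you want to monitor.
  monitoredHost = {
    services.prometheus.exporters.node = {
      enable = true;
      # The exporter listens on port 9100 by default.
      openFirewall = true;
    };
  };

  # Hypothetical module for the "check" server.
  checkServer = {
    services.prometheus = {
      enable = true;

      # Scrape the node exporter on each monitored host.
      scrapeConfigs = [
        {
          job_name = "node";
          static_configs = [
            { targets = [ "server1.example.com:9100" ]; } # assumed host name
          ];
        }
      ];

      # Tell Prometheus where to send firing alerts (Alertmanager's default port).
      alertmanagers = [
        { static_configs = [ { targets = [ "localhost:9093" ]; } ]; }
      ];

      alertmanager = {
        enable = true;
        # Minimal placeholder config; a real receiver needs e-mail, webhook, etc.
        configuration = {
          route.receiver = "default";
          receivers = [ { name = "default"; } ];
        };
      };
    };
  };
}
```

Opening the exporter port to the whole network is a simplification here; restricting it to the check server, or to a VPN interface, is probably the safer choice.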
We use prometheus extensively at $DAY_JOB and then I use 80% of that to monitor things at home as well (obviously not at the same scale…)
I really like the separation of data capture and alerting as well as visualizing things with grafana. Cannot recommend it strongly enough.
We used sensu before, but this is so much better.
i’ve had good success with the zabbix … give it a whirl.
Thanks for the suggestions! I found https://christine.website/blog/prometheus-grafana-loki-nixos-2020-11-20 (a great guide), but I’d like a simpler way to access grafana from my own computer. Any suggestions there? I don’t want to expose grafana on the internet. SSH port forwarding or something?
In my experience Tailscale has been a convenient option for those kinds of scenarios where you don’t want to open up ports to the internet at large.
There’s also services.icingaweb2.enable and its forks that see a fair bit of use and have a bunch of options available, though I still need to get around to configuring one of them to try personally.
I like the minimalism of monit.
We use Grafana alerts + Prometheus + Loki and are happy with it so far.
It comes with automatic metrics for CPU, memory, temperature, and so on, so you just need to set your alert thresholds.
grafana supports authentication either by itself or through oauth. For work we authenticate against Azure AD. Personally, I just use wireguard and then don’t expose grafana externally.
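A minimal sketch of keeping grafana off the public internet, assuming a NixOS release recent enough to have the `services.grafana.settings` options (older releases used differently named options). Binding to the loopback address means it is only reachable through a tunnel such as the SSH port forwarding asked about earlier, or through a VPN if you bind to the VPN interface address instead.

```nix
{
  services.grafana = {
    enable = true;
    settings.server = {
      # Listen on loopback only, so nothing is reachable from outside.
      http_addr = "127.0.0.1";
      http_port = 3000;
    };
  };

  # From your own computer, an SSH tunnel then gives access, for example:
  #   ssh -L 3000:127.0.0.1:3000 user@server   (user@server is a placeholder)
  # after which grafana is available at http://localhost:3000.
}
```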
To add to nh2’s comment about grafana alerts: we looked at that too, but decided against it because Prometheus’s Alertmanager lets you group alerts, which means getting 1 notification saying 35 instances are down rather than 35 notifications…
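The grouping mentioned above is controlled by the `route` section of the Alertmanager config. A rough sketch, where the receiver name `default` and the timing values are just illustrative assumptions:

```nix
{
  services.prometheus.alertmanager = {
    enable = true;
    configuration = {
      route = {
        receiver = "default";
        # Alerts sharing the same alert name are bundled into one notification,
        # so 35 instances going down produce a single message.
        group_by = [ "alertname" ];
        # Wait briefly so alerts that fire together end up in the same batch.
        group_wait = "30s";
        # Minimum time before sending an update about a group that changed.
        group_interval = "5m";
      };
      # Placeholder receiver; add e-mail, webhook, etc. settings as needed.
      receivers = [ { name = "default"; } ];
    };
  };
}
```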
Thank you for all the suggestions! I’ve gone with prometheus and grafana. Now I just need to figure out wireguard/tailscale
I’d also suggest netdata for its simple setup.
Prometheus + telegraf is my current favorite. Telegraf can handle tons of things that you would usually set up dedicated exporters for in one binary: https://github.com/Mic92/dotfiles/blob/c3449c4f7562857c54e99a5bd4bb3d14219ac68b/nixos/modules/telegraf.nix
It can even do blackbox exporter like functionality: https://github.com/Mic92/dotfiles/blob/c3449c4f7562857c54e99a5bd4bb3d14219ac68b/nixos/modules/telegraf.nix
The prometheus query language is very generic, so you don’t have to repeat yourself when setting up alerts: https://github.com/Mic92/dotfiles/blob/c3449c4f7562857c54e99a5bd4bb3d14219ac68b/nixos/modules/telegraf.nix
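To illustrate the "don’t repeat yourself" point, here is a sketch of one rule that covers every scraped host. Note the assumptions: it uses the node exporter metric `node_cpu_seconds_total` rather than telegraf’s metric names, and the alert name, the 90% threshold and the 10 minute duration are made up for the example. It is not taken from the linked config.

```nix
{
  services.prometheus.rules = [
    # Each list entry is the text of a rule file; JSON is accepted since it is valid YAML.
    (builtins.toJSON {
      groups = [
        {
          name = "node";
          rules = [
            {
              alert = "HighCpuUsage"; # hypothetical alert name
              # Averaging per instance means the one rule applies to all hosts.
              expr = ''100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90'';
              # Only fire if the condition holds for 10 minutes.
              for = "10m";
              labels.severity = "warning";
              annotations.summary = "CPU usage above 90% on {{ $labels.instance }}";
            }
          ];
        }
      ];
    })
  ];
}
```

Adding a new host to the scrape config is then enough for it to be covered by the alert, with no per-host rule.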
I tested netdata, zabbix, influxdb, grafana and icinga2 before; I stuck with prometheus.
My generic telegraf setup is now also available in srvos: https://github.com/numtide/srvos/blob/bf8ce44e0d1a380565c51bd6a707a75ac21c1a9a/nixos/mixins/telegraf.nix
FYI: This thread seduced me into trying Monit, thanks for the hint! After some initial issues, it was a real pleasure to set up to monitor my laptop and home server. If you are interested, I posted a summary of my setup. Cheers!