Explaining modern server monitoring stacks for self-hosting with NixOS

I don’t know how to refresh the Open Graph information in Discourse to fix the typo :sweat_smile:

Add some random query string to the link?

It worked! That was a good idea :smiley:

TIL about collectd. It’s probably time to revisit metrics :slight_smile: Thanks!

There is also Telegraf as an agent, very minimal.

And netdata, which can also work standalone, providing SO MANY METRICS without any configuration pain if you don’t want to centralize metrics.

I’m using plain telegraf with prometheus without a dashboard. Years of operation taught me a lesson: I don’t have a spare display for staring at the metrics, and a handful of alerts for the most critical events is more than enough.
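
For readers who want to try a similar dashboard-less setup on NixOS, here is a minimal sketch. It is not the poster's actual configuration: the chosen inputs, the listen address and port, the job name, and the alert (name, expression, threshold) are all assumptions for illustration.

```nix
{ ... }:
{
  # Telegraf collects host metrics and exposes them over HTTP for scraping.
  services.telegraf = {
    enable = true;
    extraConfig = {
      inputs = {
        cpu = { };
        mem = { };
        disk = { };
      };
      # Address and port are assumptions; adjust to your setup.
      outputs.prometheus_client = {
        listen = "127.0.0.1:9273";
      };
    };
  };

  services.prometheus = {
    enable = true;
    scrapeConfigs = [
      {
        job_name = "telegraf"; # hypothetical job name
        static_configs = [ { targets = [ "127.0.0.1:9273" ]; } ];
      }
    ];
    # One illustrative rule; the alert name and threshold are hypothetical.
    rules = [
      ''
        groups:
          - name: critical
            rules:
              - alert: DiskAlmostFull
                expr: disk_used_percent{path="/"} > 90
                for: 10m
      ''
    ];
  };
}
```

Prometheus only evaluates the rule; actually getting notified would additionally need something like Alertmanager, which this sketch leaves out.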

Metrics are often useful to understand a problem. They don’t replace alerting.

Wow, netdata is a lot more powerful than when I tested it ~8 years ago. It now has persistent data storage on disk, instead of keeping everything in memory :sweat_smile:

And it recently got machine-learning-based anomaly detection. I’m curious to see if it gives good results. My issue with netdata is that you can’t access the logs if the system is down :smiley:

  services.netdata.enable = true;
  services.netdata.config = {
      global = {
          "page cache size" = 32; # max 32MB of memory
          "update every" = 30; # 30s interval
      };
      ml = {
          "enabled" = "yes"; # machine learning
      };
  };

Thanks for sharing various options for server monitoring in a clear way! The second option - VictoriaMetrics + vmagent - can be simplified further by removing vmagent from the configuration and letting VictoriaMetrics scrape the metrics itself - see these docs. This should reduce memory usage by another 13MB. Also try playing with the -memory.allowedBytes command-line option of VictoriaMetrics if you need to reduce memory usage even more. Here is the -help description for the related -memory.allowedPercent option:

-memory.allowedPercent float
     Allowed percent of system memory VictoriaMetrics caches may occupy.
     See also -memory.allowedBytes. Too low a value may increase cache miss rate usually resulting in higher CPU and disk IO usage.
     Too high a value may evict too much data from OS page cache which will result in higher disk IO usage
     (default 60)
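
As a rough sketch of what this could look like on NixOS (an assumption-laden example, not a tested configuration): VictoriaMetrics scrapes targets itself via `-promscrape.config`, and the cache limit is lowered from the default. The scrape target, job name, file name, and the value 30 are hypothetical and only there for illustration.

```nix
{ pkgs, ... }:
let
  # Hypothetical scrape config; the target assumes a local node exporter.
  scrapeConfig = pkgs.writeText "promscrape.yml" ''
    scrape_configs:
      - job_name: node
        static_configs:
          - targets: [ "127.0.0.1:9100" ]
  '';
in
{
  services.victoriametrics = {
    enable = true;
    extraOptions = [
      # Let VictoriaMetrics scrape directly, so vmagent is not needed.
      "-promscrape.config=${scrapeConfig}"
      # Illustrative value; the default is 60 per the help text above.
      "-memory.allowedPercent=30"
    ];
  };
}
```

As the help text warns, too low a value may increase the cache miss rate, so it is worth watching CPU and disk IO after changing it.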

Wow cool! I’m going to try, thank you :slight_smile:
