NixOS service not using patched versions of Slurm binaries

Hi all, first time on here so feel free to tell me if I’ve broken some etiquette rules.

For the past month I have been trying to get prometheus-slurm-exporter packaged for Nix, but it has yet to be merged. Personally I'm only interested in the package, but it was not going to be merged without NixOS system integration tests as well. So in order to provide those I ended up refactoring the prometheus-slurm-exporter code base to 1. raise errors when subprocesses fail, 2. have actual unit tests and 3. separate the unit tests from the system tests that use the Slurm binaries. Having done that, I started working on the NixOS system integration tests last week and wrote a systemd service for prometheus-slurm-exporter for them to run.

The issue is that the subprocesses that prometheus-slurm-exporter runs fail because there is no /etc/slurm.conf available on the system. Looking at the source code that defines the Slurm service, nixos/modules/services/computing/slurm/slurm.nix, it seems like the Slurm commands are put in wrappers that prepend SLURM_CONF="${etcSlurm}/slurm.conf" to each command invocation. This obviously works when sinfo is run from a shell that has PATH set up the way Nix wants it, but the Go service prometheus-slurm-exporter seems to be missing this PATH trickery.
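For context, the wrapping works roughly like the following. This is a simplified sketch of the pattern, not the module's actual code; wrappedSinfo is just an illustrative name and etcSlurm stands for the derivation holding the generated slurm.conf:

# A sketch of the wrapper pattern (illustrative names, not slurm.nix verbatim):
wrappedSinfo = pkgs.writeShellScriptBin "sinfo" ''
  export SLURM_CONF="${etcSlurm}/slurm.conf"
  exec ${pkgs.slurm}/bin/sinfo "$@"
'';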

So does anyone know how to set up PATH for Go processes, or how to get hold of the ${etcSlurm} path so that I can patch the prometheus-slurm-exporter source to prepend it to all subprocess invocations, or some other way to address this issue?

And another question: is it possible to locate the prometheus-slurm-exporter source code during a NixOS test run and run make systemtest against it?

The nixpkgs pull request can be found here: prometheus: Added package prometheus-slurm-exporter by Rovanion · Pull Request #112010 · NixOS/nixpkgs · GitHub

Unfortunately there is no straightforward way to access ${etcSlurm} the way the module is written at the moment. However, the wrapped binaries are on the system path once you turn on slurm. The prometheus-slurm-exporter service should be able to pick them up from the system path.
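In the test's machine configuration, enabling the client should be enough to get the wrapped commands onto the system path (a minimal sketch):

# Enabling the slurm client puts the wrapped commands (sinfo, squeue, ...)
# into environment.systemPackages, i.e. onto the system path:
services.slurm.client.enable = true;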


What do I do if they don't? Below are the relevant lines from nix-build nixos/tests/prometheus-exporters.nix -A slurm.

slurm: waiting for success: curl -sSf localhost:9341/metrics | grep 'slurm_cpus_idle' | grep -qv ' 0$'
slurm # [    7.139514] prometheus-slurm-exporter[766]: 2021/03/10 07:41:51 main.CPUsGetMetrics: The command 'sinfo -h -o %C' failed with the following error:
slurm # [    7.141484] prometheus-slurm-exporter[766]: sinfo: error: resolve_ctls_from_dns_srv: res_nsearch error: Host name lookup failure
slurm # [    7.142946] curl: (52) Empty reply from server
slurm # prometheus-slurm-exporter[766]: sinfo: error: fetch_config: DNS SRV lookup failed
slurm # [    7.153901] prometheus-slurm-exporter[766]: sinfo: error: _establish_config_source: failed to fetch config
slurm # [    7.155871] prometheus-slurm-exporter[766]: sinfo: fatal: Could not establish a configuration source
slurm # [    7.159427] prometheus-slurm-exporter[766]: 2021/03/10 07:41:51 main.CPUsGetMetrics: The subprocess sinfo terminated with: exit status 1

The DNS errors above are due to sinfo trying to look up the slurm config via a DNS SRV record. It does this if neither /etc/slurm.conf nor /run/slurm.conf is present. Is it possible to interactively connect to the VM that runs the tests? I could try running strace on the prometheus-slurm-exporter service to see which files it actually tries to access.

Looking at the code from the PR: you have the line Environment = [ "PATH=${pkgs.slurm}/bin/" ]; in the service definition, which feeds the unwrapped slurm binaries to your exporter service. If you remove that line, the service should be able to pick up the wrapped binaries from the system path.
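Condensed, the relevant part of the service definition in the PR looks something like this (a from-memory sketch, not a verbatim quote; the exporter attribute name is whatever the PR introduces):

# Sketch of the exporter's service definition as it stands in the PR:
systemd.services.prometheus-slurm-exporter = {
  wantedBy = [ "multi-user.target" ];
  serviceConfig = {
    ExecStart = "${pkgs.prometheus-slurm-exporter}/bin/prometheus-slurm-exporter";
    # Pins PATH to the *unwrapped* slurm package, whose binaries do not
    # know where the generated slurm.conf lives:
    Environment = [ "PATH=${pkgs.slurm}/bin/" ];
  };
};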

The system path is usually not passed to systemd units.

However, doesn’t telegraf work in a similar way? It requires helper binaries to collect data, and those are made available via environment.systemPackages and not via a telegraf-service-specific PATH statement (at least that’s how I use it).

I do not use telegraf, though I do not expect anything I have in environment.systemPackages to be available in any unit automatically.

I also checked 3 random services that I had on my machine… None of them had /run/current-system/sw/bin in PATH.

I checked as described here: How do I show the environment variables of a systemd unit? - Unix & Linux Stack Exchange

With that line removed, the exporter can’t find any binaries at all, patched or not:

slurm # [   11.329823] prometheus-slurm-exporter[913]: 2021/03/10 12:07:52 main.GetTotalGPUs: Failed to start sinfo: exec: "sinfo": executable file not found in $PATH

Sorry, you are right. telegraf does use systemd.services.telegraf.path to get the job done.
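The analogous fix for the exporter would look like the line below, if the wrapped slurm package were reachable from outside the module — which it currently is not, so wrappedSlurm here is hypothetical:

# Hypothetical: wrappedSlurm is internal to the slurm module today,
# so there is nothing to put here yet.
systemd.services.prometheus-slurm-exporter.path = [ wrappedSlurm ];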

Anyway, the cleanest solution would be to fix the slurm module and expose the path to the slurm config files via the module. This may also be useful for other services. I will open a PR later today.
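Roughly, the idea is to turn the module-internal etcSlurm derivation into an option other modules can read, something along these lines (a sketch of the approach, not the actual patch; the option name is an assumption):

# Sketch: expose the generated config directory as an option so that
# other modules can refer to it (names are assumptions, not the PR):
options.services.slurm.etcSlurm = lib.mkOption {
  type = lib.types.path;
  internal = true;
  description = "Directory containing the generated slurm.conf.";
};
# ...and in the config section of the module:
config.services.slurm.etcSlurm = etcSlurm;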


The PR: nixos/slurm: expose to path config files by markuskowa · Pull Request #115839 · NixOS/nixpkgs · GitHub

So after that patch has landed, I think I will have to patch prometheus-slurm-exporter to forward SLURM_CONF to its subprocesses, and after that we should be done with this.
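On the NixOS side the wiring would then presumably be as simple as this (a sketch; it assumes the option ends up reachable as config.services.slurm.etcSlurm, which may not match what the PR finally names it):

# Sketch: point the exporter at the generated slurm.conf via the option
# exposed by the slurm module (option name is an assumption):
systemd.services.prometheus-slurm-exporter.environment = {
  SLURM_CONF = "${config.services.slurm.etcSlurm}/slurm.conf";
};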