Nvidia-fabricmanager undefined symbols

Hello,

I am configuring NixOS 25.11 on a server with several A100 GPUs.

These GPUs require nvidia-fabricmanager to be running, so my understanding is that I need hardware.nvidia.datacenter.enable = true.

Right now, I have this:

nixpkgs.config.nvidia.acceptLicense = true;

hardware.nvidia = {
  open = false;
  datacenter.enable = true;
  nvidiaPersistenced = true;
  nvidiaSettings = false;
};
hardware.graphics.enable = true;

But during nixos-rebuild switch, I get a bunch of undefined symbols for nvidia-fabricmanager:

the following new units were started: nvidia-persistenced.service
warning: the following units failed: nvidia-fabricmanager.service
× nvidia-fabricmanager.service - Start NVIDIA NVLink Management
     Loaded: loaded (/etc/systemd/system/nvidia-fabricmanager.service; enabled; preset: ignored)
     Active: failed (Result: exit-code) since Sat 2025-12-20 16:45:35 WET; 5s ago
 Invocation: 1a9b517bb5d2402ea68ab4d05ff67b21
    Process: 12849 ExecStart=/nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager -c /nix/store/nv8agj4yys737f704scl6macw6vz5a5b-fabricmanager.conf (code=exited, status=127)
         IP: 0B in, 0B out
         IO: 0B read, 0B written
   Mem peak: 1.7M
        CPU: 6ms

Dec 20 16:45:35 compute01 nv-fabricmanager[12849]: /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager: /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager: no version information available (required by /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager)
Dec 20 16:45:35 compute01 nv-fabricmanager[12849]: /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager: /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager: no version information available (required by /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager)
...
...
Dec 20 16:45:35 compute01 nv-fabricmanager[12849]: /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager: symbol lookup error: /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager: undefined symbol:
Dec 20 16:45:35 compute01 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=127/n/a
Dec 20 16:45:35 compute01 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Dec 20 16:45:35 compute01 systemd[1]: Failed to start Start NVIDIA NVLink Management.
Command 'systemd-run -E LOCALE_ARCHIVE -E NIXOS_INSTALL_BOOTLOADER --collect --no-ask-password --pipe --quiet --service-type=exec --unit=nixos-rebuild-switch-to-configuration /nix/store/i4mj5wsxnad6ryj98b2qrx8d78rp248v-nixos-system-compute01-25.11.1948.c6f52ebd45e5/bin/switch-to-configuration switch' returned non-zero exit status 4.

I have tried setting hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.dc, but the error is the same. If I use hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.production instead, nvidia.nix seems to be incompatible with that package, because I get errors like:

error: lib.meta.getExe': The first argument is of type set, but it should be a derivation instead.

What is strange is that my configuration.nix is very simple (the rest of the file is networking and similar settings) and I never change the kernel or anything, yet I cannot find anyone else reporting such undefined symbols.

By the way, running ldd on /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager prints the same “no version information available” messages as before.
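For anyone who wants to dig further into what the binary fails to resolve, the following may help. It is only a sketch: `unresolved_symbols` and `show_dynamic_entries` are hypothetical helper names, and it assumes glibc's `ldd` and binutils' `readelf` are on the PATH.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting a dynamically linked ELF binary.

# Print unresolved libraries and symbols. `ldd -r` asks the loader to
# process data and function relocations, so missing symbols are reported too.
unresolved_symbols() {
  command -v ldd >/dev/null || { echo "ldd not available" >&2; return 1; }
  ldd -r "$1" 2>&1 | grep -E 'not found|undefined symbol' || true
}

# Print the NEEDED, RPATH and RUNPATH entries of the dynamic section,
# which show what the binary asks for and where it will look for it.
show_dynamic_entries() {
  readelf -d "$1" | grep -E 'NEEDED|RPATH|RUNPATH' || true
}
```

Both helpers take the path of the binary as their only argument, for example the nv-fabricmanager store path above. Comparing the output with that of a binary that does start (nvidia-persistenced, say) might show which library is being picked up from the wrong place.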

I managed to work around this. I am posting this in case it helps others.

I fixed the problem by packaging nvidia-fabricmanager myself. I suspect something is wrong with the official fabricmanager.nix, since its binary complains about undefined symbols, but I am still a Nix novice.

I created a nv-fabricmanager.nix:

{ stdenv, fetchurl, autoPatchelfHook, glibc, libpciaccess, numactl, nvidiaPackage }:

stdenv.mkDerivation rec {
  pname = "nvidia-fabricmanager";
  version = "570.172.08";
  src = fetchurl {
    url = "https://developer.download.nvidia.com/compute/nvidia-driver/redist/fabricmanager/linux-x86_64/fabricmanager-linux-x86_64-${version}-archive.tar.xz";
    sha256 = "07j9jvsci0b3lz14bkca5q5cmma9ld36yhmdk4c4fkamwk6wl94d";
  };

  nativeBuildInputs = [ autoPatchelfHook ];
  buildInputs = [ glibc libpciaccess numactl nvidiaPackage ];
  installPhase = ''
    mkdir -p $out
    cp -r * $out
    sed -i "s|/usr/share/nvidia/nvswitch|$out/share/nvidia/nvswitch|g" "$out/etc/fabricmanager.cfg"
  '';
}

And then in my configuration.nix:

# NVIDIA drivers: the A100s also need nvidia-fabricmanager; without it,
# PyTorch reports an error at `return torch._C._cuda_getDeviceCount() > 0`.
# nvidia-fabricmanager is normally installed by hardware.nvidia.datacenter.enable = true.

# unfortunately, the official nvidia-fabricmanager does not seem to work, so
# we use our own nvidia-fabricmanager package.

systemd.services.nvidia-fabricmanager.enable = lib.mkForce false;

systemd.services.my-nvidia-fabricmanager = let
  fabricmanager = pkgs.callPackage ./nv-fabricmanager.nix { nvidiaPackage = config.hardware.nvidia.package;  };
in {
  description = "NVIDIA NVLink Fabric Manager";
  wantedBy = [ "multi-user.target" ];
  after = [ "multi-user.target" ];
  serviceConfig = {
    ExecStart = "${fabricmanager}/bin/nv-fabricmanager -c ${fabricmanager}/etc/fabricmanager.cfg";
    Restart = "always";
    Type = "forking";
  };
};

boot.kernelModules = [ "nvidia" "nvidia_uvm" "nvidia_drm" ];  # unsure if necessary
nixpkgs.config.nvidia.acceptLicense = true;
hardware.nvidia = {
  open = false;
  package = pkgs.linuxPackages.nvidiaPackages.dc;
  datacenter.enable = true;
  nvidiaPersistenced = true;
  nvidiaSettings = false;
};
hardware.graphics.enable = true;

I have not yet been able to test this suspicion, but I think one issue with the official fabricmanager.nix is that the lib/*.so files are not being patched with patchelf:

for d in include lib; do
  mv $d $out/.
done

I think patchelf, which is applied to bin/* just before this, should also be applied to lib/* here.
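One way to test this suspicion would be to look at the RPATH/RUNPATH of the bundled libraries and compare it with that of the binary. A minimal sketch, assuming `patchelf` is on the PATH (for example via `nix-shell -p patchelf`); `print_lib_rpaths` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Print the RPATH/RUNPATH of every shared object under <prefix>/lib.
# An empty value after the colon suggests the file was never patched.
print_lib_rpaths() {
  local prefix="$1" f
  for f in "$prefix"/lib/*.so*; do
    [ -e "$f" ] || continue   # the glob matched nothing
    printf '%s: %s\n' "$f" "$(patchelf --print-rpath "$f")"
  done
}
```

Calling it with the store path of the official package, e.g. `print_lib_rpaths /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08`, should list one line per library. I have not run this against that package, so treat it as a starting point rather than a confirmed diagnosis.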

Appreciate this. Hopefully a fix gets merged soon.