Nvidia-fabricmanager undefined symbols

Hello,

I am configuring NixOS 25.11 on a server with several A100 GPUs.

These GPUs require nvidia-fabricmanager to be running, so my understanding is that I need hardware.nvidia.datacenter.enable = true.

Right now, I have this:

nixpkgs.config.nvidia.acceptLicense = true;

hardware.nvidia = {
  open = false;
  datacenter.enable = true;
  nvidiaPersistenced = true;
  nvidiaSettings = false;
};
hardware.graphics.enable = true;

But during nixos-rebuild switch, I get a bunch of undefined symbols for nvidia-fabricmanager:

the following new units were started: nvidia-persistenced.service
warning: the following units failed: nvidia-fabricmanager.service
× nvidia-fabricmanager.service - Start NVIDIA NVLink Management
     Loaded: loaded (/etc/systemd/system/nvidia-fabricmanager.service; enabled; preset: ignored)
     Active: failed (Result: exit-code) since Sat 2025-12-20 16:45:35 WET; 5s ago
 Invocation: 1a9b517bb5d2402ea68ab4d05ff67b21
    Process: 12849 ExecStart=/nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager -c /nix/store/nv8agj4yys737f704scl6macw6vz5a5b-fabricmanager.conf (code=exited, status=127)
         IP: 0B in, 0B out
         IO: 0B read, 0B written
   Mem peak: 1.7M
        CPU: 6ms

Dec 20 16:45:35 compute01 nv-fabricmanager[12849]: /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager: /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager: no version information available (required by /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager)
Dec 20 16:45:35 compute01 nv-fabricmanager[12849]: /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager: /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager: no version information available (required by /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager)
...
...
Dec 20 16:45:35 compute01 nv-fabricmanager[12849]: /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager: symbol lookup error: /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager: undefined symbol:
Dec 20 16:45:35 compute01 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=127/n/a
Dec 20 16:45:35 compute01 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Dec 20 16:45:35 compute01 systemd[1]: Failed to start Start NVIDIA NVLink Management.
Command 'systemd-run -E LOCALE_ARCHIVE -E NIXOS_INSTALL_BOOTLOADER --collect --no-ask-password --pipe --quiet --service-type=exec --unit=nixos-rebuild-switch-to-configuration /nix/store/i4mj5wsxnad6ryj98b2qrx8d78rp248v-nixos-system-compute01-25.11.1948.c6f52ebd45e5/bin/switch-to-configuration switch' returned non-zero exit status 4.

I have tried setting hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.dc, but the error is the same. If I instead use hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.production, I get errors like the one below, so I guess nvidia.nix is incompatible with that package:

error: lib.meta.getExe': The first argument is of type set, but it should be a derivation instead.

What is strange is that my configuration.nix is very simple (the rest of the file is networking and similar settings) and I never change the kernel or anything, yet I cannot find anyone else complaining about such undefined symbols.

By the way, if I run ldd /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager, I get the same “no version information available” messages as before.

I managed to work around this. I am posting this in case it helps others.

I fixed the problem by packaging nvidia-fabricmanager myself. I think there is something wrong with the official fabricmanager.nix, because its binary complains about undefined symbols, but I am still a novice with Nix.

I created a nv-fabricmanager.nix:

{ stdenv, fetchurl, autoPatchelfHook, glibc, libpciaccess, numactl, nvidiaPackage }:

stdenv.mkDerivation rec {
  pname = "nvidia-fabricmanager";
  version = "570.172.08";
  src = fetchurl {
    url = "https://developer.download.nvidia.com/compute/nvidia-driver/redist/fabricmanager/linux-x86_64/fabricmanager-linux-x86_64-${version}-archive.tar.xz";
    sha256 = "07j9jvsci0b3lz14bkca5q5cmma9ld36yhmdk4c4fkamwk6wl94d";
  };

  nativeBuildInputs = [ autoPatchelfHook ];
  buildInputs = [ glibc libpciaccess numactl nvidiaPackage ];
  installPhase = ''
    mkdir -p $out
    cp -r * $out
    sed -i "s|/usr/share/nvidia/nvswitch|$out/share/nvidia/nvswitch|g" "$out/etc/fabricmanager.cfg"
  '';
}

And then in my configuration.nix:

# nvidia drivers: for A100, we also need nvidia-fabricmanager, otherwise
# PyTorch fails at: return torch._C._cuda_getDeviceCount() > 0
# nvidia-fabricmanager is installed by hardware.nvidia.datacenter.enable=true

# unfortunately, the official nvidia-fabricmanager does not seem to work, so
# we use our own nvidia-fabricmanager package.

systemd.services.nvidia-fabricmanager.enable = lib.mkForce false;

systemd.services.my-nvidia-fabricmanager = let
  fabricmanager = pkgs.callPackage ./nv-fabricmanager.nix { nvidiaPackage = config.hardware.nvidia.package;  };
in {
  description = "NVIDIA NVLink Fabric Manager";
  wantedBy = [ "multi-user.target" ];
  after = [ "multi-user.target" ];
  serviceConfig = {
    ExecStart = "${fabricmanager}/bin/nv-fabricmanager -c ${fabricmanager}/etc/fabricmanager.cfg";
    Restart = "always";
    Type = "forking";
  };
};

boot.kernelModules = [ "nvidia" "nvidia_uvm" "nvidia_drm" ];  # unsure if necessary
nixpkgs.config.nvidia.acceptLicense = true;
hardware.nvidia = {
  open = false;
  package = pkgs.linuxPackages.nvidiaPackages.dc;
  datacenter.enable = true;
  nvidiaPersistenced = true;
  nvidiaSettings = false;
};
hardware.graphics.enable = true;

I have not yet been able to test this suspicion, but I think one issue with the official fabricmanager.nix is that the lib/*.so files are moved into place without being patched with patchelf:

for d in include lib;do
  mv $d $out/.
done

I think the patchelf step that is applied earlier to bin/* should also be applied here to lib/*.
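
To make the idea concrete, here is a minimal, untested sketch of what patching lib/*.so could look like. The file name patch-libs.nix, the choice of stdenv.cc.cc.lib as the only runtime dependency, and the idea of appending the snippet to an installPhase are all my assumptions, not what the official fabricmanager.nix does:

```nix
# patch-libs.nix (hypothetical helper, not part of nixpkgs)
# Evaluates to a shell snippet that can be appended to an installPhase
# once the lib directory has been moved into $out.
{ lib, stdenv, patchelf }:

let
  # Assumption: the bundled libraries need libstdc++/libgcc_s at run time.
  rpath = lib.makeLibraryPath [ stdenv.cc.cc.lib ];
in
''
  for so in "$out"/lib/*.so*; do
    # Skip symlinks; the real file they point to is patched on its own.
    if [ -L "$so" ]; then continue; fi
    # Shared libraries have no interpreter, so only the RPATH is set.
    ${patchelf}/bin/patchelf --set-rpath "${rpath}:$out/lib" "$so"
  done
''
```

It could be pulled in with something like import ./patch-libs.nix { inherit lib stdenv patchelf; } and concatenated onto the existing installPhase string. In my own nv-fabricmanager.nix above I use autoPatchelfHook instead.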

Appreciate this, hopefully a fix gets merged soon.

I’ve got a PR open that fixes this. The explicitly defined phase list was dropped from the fabricmanager module. Adding them back made some extra phases run that broke the build. Things still built successfully since there was no check phase defined though. I added that in too and threw in a few minor changes. Should be fixed in 25.11 after this gets merged to master and gets back-ported!
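
For anyone using the nv-fabricmanager.nix workaround from earlier in this thread in the meantime, a similar safety net can be sketched there: an install check that runs the binary, so that a broken one fails the build instead of the service. This is only an illustration and not the code from the PR. The file name nv-fabricmanager-checked.nix is hypothetical, and it assumes that nv-fabricmanager --version exits successfully inside the build sandbox without any GPU present:

```nix
# nv-fabricmanager-checked.nix (hypothetical wrapper, not the PR's code)
{ callPackage, nvidiaPackage }:

(callPackage ./nv-fabricmanager.nix { inherit nvidiaPackage; }).overrideAttrs (_: {
  # installCheckPhase runs after fixupPhase, so it sees the binary
  # as it will actually be installed (after autoPatchelfHook has run).
  doInstallCheck = true;
  installCheckPhase = ''
    runHook preInstallCheck
    # Assumption: --version needs no GPU. A symbol lookup error at
    # startup makes this command, and therefore the build, fail.
    "$out/bin/nv-fabricmanager" --version
    runHook postInstallCheck
  '';
})
```

In configuration.nix this would replace the existing call, for example pkgs.callPackage ./nv-fabricmanager-checked.nix { nvidiaPackage = config.hardware.nvidia.package; }.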
