Nvidia-fabricmanager undefined symbols

rpcruz · December 20, 2025, 5:11pm

Hello,

I am configuring NixOS 25.11 on a server with several A100 GPUs.

These GPUs require nvidia-fabricmanager to be running, so my understanding is that I need hardware.nvidia.datacenter.enable=true.

Right now, I have this:

nixpkgs.config.nvidia.acceptLicense = true;

hardware.nvidia = {
  open = false;
  datacenter.enable = true;
  nvidiaPersistenced = true;
  nvidiaSettings = false;
};
hardware.graphics.enable = true;

But during nixos-rebuild switch, I get a bunch of undefined symbols for nvidia-fabricmanager:

the following new units were started: nvidia-persistenced.service
warning: the following units failed: nvidia-fabricmanager.service
× nvidia-fabricmanager.service - Start NVIDIA NVLink Management
     Loaded: loaded (/etc/systemd/system/nvidia-fabricmanager.service; enabled; preset: ignored)
     Active: failed (Result: exit-code) since Sat 2025-12-20 16:45:35 WET; 5s ago
 Invocation: 1a9b517bb5d2402ea68ab4d05ff67b21
    Process: 12849 ExecStart=/nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager -c /nix/store/nv8agj4yys737f704scl6macw6vz5a5b-fabricmanager.conf (code=exited, status=127)
         IP: 0B in, 0B out
         IO: 0B read, 0B written
   Mem peak: 1.7M
        CPU: 6ms

Dec 20 16:45:35 compute01 nv-fabricmanager[12849]: /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager: /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager: no version information available (required by /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager)
Dec 20 16:45:35 compute01 nv-fabricmanager[12849]: /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager: /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager: no version information available (required by /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager)
...
...
Dec 20 16:45:35 compute01 nv-fabricmanager[12849]: /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager: symbol lookup error: /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager: undefined symbol:
Dec 20 16:45:35 compute01 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=127/n/a
Dec 20 16:45:35 compute01 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Dec 20 16:45:35 compute01 systemd[1]: Failed to start Start NVIDIA NVLink Management.
Command 'systemd-run -E LOCALE_ARCHIVE -E NIXOS_INSTALL_BOOTLOADER --collect --no-ask-password --pipe --quiet --service-type=exec --unit=nixos-rebuild-switch-to-configuration /nix/store/i4mj5wsxnad6ryj98b2qrx8d78rp248v-nixos-system-compute01-25.11.1948.c6f52ebd45e5/bin/switch-to-configuration switch' returned non-zero exit status 4.

I have tried setting hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.dc, but the error is the same. If I use hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.production, then I guess nvidia.nix is incompatible with those because I get errors like:

error: lib.meta.getExe': The first argument is of type set, but it should be a derivation instead.

What is strange is that my configuration.nix is incredibly simple (the rest of the file is networking stuff and so on). I never change the kernel or anything, yet I cannot find anyone else complaining about such undefined symbols.

By the way, if I do ldd /nix/store/9h003hg7a5gg7vi76ad8pf6fpxxcfhxj-fabricmanager-570.172.08/bin/nv-fabricmanager, it’s the same messages as before with “no version information available”.

rpcruz · December 21, 2025, 10:49am

I managed to work around this. I am posting this in case it helps others.

I fixed the problem by packaging nvidia-fabricmanager myself. I think there is something wrong with fabricmanager.nix, because its binary complains about undefined symbols, but I am still a novice with Nix.

I created a nv-fabricmanager.nix:

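# Repackages NVIDIA's prebuilt Fabric Manager archive.
{ stdenv, fetchurl, autoPatchelfHook, glibc, libpciaccess, numactl, nvidiaPackage }:

stdenv.mkDerivation rec {
  pname = "nvidia-fabricmanager";
  version = "570.172.08";  # should match the installed driver version
  src = fetchurl {
    url = "https://developer.download.nvidia.com/compute/nvidia-driver/redist/fabricmanager/linux-x86_64/fabricmanager-linux-x86_64-${version}-archive.tar.xz";
    sha256 = "07j9jvsci0b3lz14bkca5q5cmma9ld36yhmdk4c4fkamwk6wl94d";
  };

  # autoPatchelfHook fixes the interpreter and RPATH of the ELF files in $out,
  # resolving their shared libraries from buildInputs.
  nativeBuildInputs = [ autoPatchelfHook ];
  buildInputs = [ glibc libpciaccess numactl nvidiaPackage ];
  installPhase = ''
    runHook preInstall
    mkdir -p $out
    cp -r * $out
    # Point the shipped config at the topology files inside the store path.
    sed -i "s|/usr/share/nvidia/nvswitch|$out/share/nvidia/nvswitch|g" "$out/etc/fabricmanager.cfg"
    runHook postInstall
  '';
}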

And then in my configuration.nix:

# nvidia drivers: the A100s also need nvidia-fabricmanager; without it PyTorch
# fails at `return torch._C._cuda_getDeviceCount() > 0`.
# nvidia-fabricmanager is normally installed by hardware.nvidia.datacenter.enable=true

# unfortunately, the official nvidia-fabricmanager does not seem to work, so
# we use our own nvidia-fabricmanager package.

systemd.services.nvidia-fabricmanager.enable = lib.mkForce false;

systemd.services.my-nvidia-fabricmanager = let
  fabricmanager = pkgs.callPackage ./nv-fabricmanager.nix { nvidiaPackage = config.hardware.nvidia.package;  };
in {
  description = "NVIDIA NVLink Fabric Manager";
  wantedBy = [ "multi-user.target" ];
  after = [ "multi-user.target" ];
  serviceConfig = {
    ExecStart = "${fabricmanager}/bin/nv-fabricmanager -c ${fabricmanager}/etc/fabricmanager.cfg";
    Restart = "always";
    Type = "forking";
  };
};

boot.kernelModules = [ "nvidia" "nvidia_uvm" "nvidia_drm" ];  # unsure if necessary
nixpkgs.config.nvidia.acceptLicense = true;
hardware.nvidia = {
  open = false;
  package = pkgs.linuxPackages.nvidiaPackages.dc;
  datacenter.enable = true;
  nvidiaPersistenced = true;
  nvidiaSettings = false;
};
hardware.graphics.enable = true;

rpcruz · December 23, 2025, 5:03pm

I have not yet been able to test this suspicion, but I think one issue with the official fabricmanager.nix is that the lib/*.so files are not being patched with patchelf:

for d in include lib; do
  mv $d $out/.
done

I think patchelf, which is applied earlier to bin/*, should also be applied to lib/* here.

Esch · January 9, 2026, 6:16pm

Appreciate this, hopefully a fix gets merged soon.

ItsAzM8 · March 1, 2026, 7:20am

I’ve got a PR open that fixes this. The explicitly defined phase list was dropped from the fabricmanager derivation. Without it, some extra phases run and break the binary. The build still succeeded, though, because no check phase was defined. I added that in too and threw in a few minor changes. Should be fixed in 25.11 once this gets merged to master and back-ported!
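For readers unfamiliar with the mechanism, here is a minimal sketch of an explicit phase list combined with an install check. It is a toy derivation, not the nixpkgs fabricmanager.nix and not the PR's diff: the file name phases-sketch.nix, the hello script, and the version string are made up for illustration, and which default phase actually damages nv-fabricmanager is not established in this thread.

```nix
# phases-sketch.nix: illustrative only (hypothetical file and package name).
# Build with:
#   nix-build -E 'with import <nixpkgs> { }; callPackage ./phases-sketch.nix { }'
{ stdenv, writeText, runtimeShell }:

stdenv.mkDerivation {
  pname = "phases-sketch";
  version = "0.1";

  # Stand-in for a prebuilt vendor binary: a tiny script that prints a version.
  src = writeText "hello.sh" ''
    #!${runtimeShell}
    echo "phases-sketch 0.1"
  '';

  # Only the phases named here run. Default phases that are not listed, such
  # as fixupPhase (stripping, RPATH shrinking, shebang patching), are skipped.
  phases = [ "installPhase" "installCheckPhase" ];

  installPhase = ''
    runHook preInstall
    install -Dm755 $src $out/bin/hello
    runHook postInstall
  '';

  # A listed installCheckPhase is still skipped unless doInstallCheck is set.
  doInstallCheck = true;
  installCheckPhase = ''
    runHook preInstallCheck
    # Fail the build unless the installed program runs and reports its version.
    test "$($out/bin/hello)" = "phases-sketch 0.1"
    runHook postInstallCheck
  '';
}
```

The point of the install check is that a binary which cannot start fails the build instead of failing later at service start.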

github.com/NixOS/nixpkgs

Fixed nv-fabricmanager Build Phases

master ← ItsAzM8:fix-nv-fabricmanager-build-phases

opened 06:25AM - 07 Feb 26 UTC

ItsAzM8

+21 -3

Ran into this issue when trying to move over to NixOS 25.11 from 25.05 while using [hardware.nvidia.datacenter.enable](https://search.nixos.org/options?channel=unstable&query=datacenter). Seems that something going on in one of the build phases is causing the following to happen when the `nv-fabricmanager` binary is called: [nv-fabricmanager.log](https://github.com/user-attachments/files/25142867/nv-fabricmanager.log). The exact same output is logged when checking the nv-fabricmanager binary with ldd. There is a part of the installPhase that checks the binaries with ldd, but it just checks `grep -vqz "not found"`, which doesn't appear in the above. This means the derivation still gets evaluated successfully, but the built binary is in a broken state.

## Things done

* Revert build phases back to how they were in NixOS 25.05.
* Enabled the `installCheckPhase`. This solves the issue without change as it runs the binary and checks to see if the version gets logged.
* Enabled the checkPhase and pulled the ldd check from the installPhase into the checkPhase. Makes more sense to be in the checkPhase imo.

- Built on platform:
  - [x] x86_64-linux
  - [ ] aarch64-linux
  - [ ] x86_64-darwin
  - [ ] aarch64-darwin
- Tested, as applicable:
  - [ ] [NixOS tests] in [nixos/tests].
  - [ ] [Package tests] at `passthru.tests`.
  - [ ] Tests in [lib/tests] or [pkgs/test] for functions and "core" functionality.
- [ ] Ran `nixpkgs-review` on this PR. See [nixpkgs-review usage].
- [x] Tested basic functionality of all binary files, usually in `./result/bin/`.
- Nixpkgs Release Notes
  - [ ] Package update: when the change is major or breaking.
- NixOS Release Notes
  - [ ] Module addition: when adding a new NixOS module.
  - [ ] Module update: when the change is significant.
- [x] Fits [CONTRIBUTING.md], [pkgs/README.md], [maintainers/README.md] and other READMEs.

[NixOS tests]: https://nixos.org/manual/nixos/unstable/index.html#sec-nixos-tests
[Package tests]: https://github.com/NixOS/nixpkgs/blob/master/pkgs/README.md#package-tests
[nixpkgs-review usage]: https://github.com/Mic92/nixpkgs-review#usage
[CONTRIBUTING.md]: https://github.com/NixOS/nixpkgs/blob/master/CONTRIBUTING.md
[lib/tests]: https://github.com/NixOS/nixpkgs/blob/master/lib/tests
[maintainers/README.md]: https://github.com/NixOS/nixpkgs/blob/master/maintainers/README.md
[nixos/tests]: https://github.com/NixOS/nixpkgs/blob/master/nixos/tests
[pkgs/README.md]: https://github.com/NixOS/nixpkgs/blob/master/pkgs/README.md
[pkgs/test]: https://github.com/NixOS/nixpkgs/blob/master/pkgs/test
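The PR description notes that the existing check only looks for "not found", which never appears in the broken binary's ldd output. One way to catch both messages is sketched below as a standalone check derivation. This is an assumption-laden illustration, not what the PR does: ldd-check.nix and the `pkg` and `binName` parameters are hypothetical names, and it assumes a glibc-based Linux system where `glibc.bin` provides `ldd`.

```nix
# ldd-check.nix: hypothetical helper, not part of nixpkgs.
# Builds successfully only if ldd reports no linker problems for the binary.
{ runCommand, glibc, pkg, binName }:

runCommand "ldd-check-${binName}" { nativeBuildInputs = [ glibc.bin ]; } ''
  # Capture stdout and stderr; ldd's own exit status is ignored on purpose,
  # the decision is made on the text it prints.
  report=$(ldd ${pkg}/bin/${binName} 2>&1 || true)
  echo "$report"

  # Match both failure messages seen in this thread, without a pipeline.
  case "$report" in
    *"not found"*|*"no version information available"*)
      echo "ldd reported linker problems for ${binName}" >&2
      exit 1
      ;;
  esac

  touch $out
''
```

It could be called in the same style as the workaround earlier in the thread, for example `pkgs.callPackage ./ldd-check.nix { pkg = fabricmanager; binName = "nv-fabricmanager"; }`, where `fabricmanager` is whichever Fabric Manager derivation you want to verify.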