`nvidia-ctk` shows GPU but podman doesn't find it for passthrough

I just updated a server with an NVIDIA T400 GPU to the latest nixos-unstable, rev 08f22084e6085d19bcfb4be30d1ca76ecb96fe54 (it hadn’t been updated in a month or so, and I unfortunately don’t know the previous rev). The system runs a podman container with nvidia-container-toolkit for GPU passthrough, and this broke after the update.

The GPU shows up on the host machine and is found by nvidia-ctk:

# nvidia-smi -L
GPU 0: NVIDIA T400 4GB (UUID: GPU-cc33bb28-454b-5df7-0f48-13948ad02329)
# nvidia-ctk cdi list
INFO[0000] Found 2 CDI devices                          
nvidia.com/gpu=0
nvidia.com/gpu=all

However, trying to run a container with the GPU passed through yields an error (same result with --gpus=all):

# podman run --rm -it --device=nvidia.com/gpu=all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
Error: setting up CDI devices: unresolvable CDI devices nvidia.com/gpu=all

The nvidia-container-toolkit-cdi-generator.service systemd unit also runs without errors and the generated JSON file looks fine, containing mentions of nvidia.com/gpu=0 and nvidia.com/gpu=all.
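To see which spec files actually exist on disk, something like the sketch below might help. The directory list is an assumption pieced together from the paths that come up in this thread, and list_cdi_specs is a made-up helper name, not part of nvidia-ctk or podman.

```shell
# Hypothetical helper: print every CDI spec file found in the given directories.
list_cdi_specs() {
  for dir in "$@"; do
    # Skip directories that do not exist on this machine.
    if [ ! -d "$dir" ]; then
      continue
    fi
    for spec in "$dir"/*.json "$dir"/*.yaml; do
      # An unmatched glob stays literal, so keep only paths that really exist.
      if [ -e "$spec" ]; then
        printf '%s\n' "$spec"
      fi
    done
  done
  return 0
}

# Assumed candidate directories; adjust to your setup.
list_cdi_specs /etc/cdi /run/cdi /var/run/cdi
```

If the spec only shows up under a directory the container runtime does not search, that would explain why nvidia-ctk lists the devices while podman cannot resolve them.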

Relevant configuration:

services.xserver.videoDrivers = [ "nvidia" ];

hardware.graphics.enable = true;

hardware.nvidia = {
  open = true;
  modesetting.enable = true;
  package = config.boot.kernelPackages.nvidiaPackages.production;
};

hardware.nvidia-container-toolkit.enable = true;
systemd.services.nvidia-container-toolkit-cdi-generator.environment.LD_LIBRARY_PATH = "${lib.getLib config.hardware.nvidia.package}/lib";

(Don’t ask me what that last line is for, I don’t remember.)

Does anyone have any ideas why this isn’t working?

Were you able to figure this out? I have the same problem, unfortunately.

EDIT:

I found a workaround. It seems that Podman doesn’t check /var/run/cdi (the same directory as /run/cdi, since /var/run is a symlink to /run) for CDI spec files, even though nvidia-ctk places them there. Instead, Podman looks in paths like /etc/cdi.

To work around this, I created a symlink in /etc/cdi that points to the spec file:

environment.etc."cdi/nvidia-container-toolkit.json".source = "/run/cdi/nvidia-container-toolkit.json";
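After rebuilding, a quick way to confirm the link is in place might look like the sketch below. check_cdi_link is a hypothetical helper name, and it only checks that the path resolves to a readable file; it does not validate the spec contents.

```shell
# Hypothetical helper: succeed only if the path exists and its target is readable.
# A dangling symlink (for example, if the generator has not run yet) fails the -r test.
check_cdi_link() {
  link=$1
  if [ -r "$link" ]; then
    printf 'ok: %s -> %s\n' "$link" "$(readlink -f "$link")"
  else
    printf 'missing or dangling: %s\n' "$link" >&2
    return 1
  fi
}
```

Calling it as check_cdi_link /etc/cdi/nvidia-container-toolkit.json and then re-running the podman command from the first post should show whether the device now resolves.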

Thanks for the workaround! I’ve unfortunately not found a fix for this yet. I applied the workaround and it seems like the server won’t boot for whatever reason (probably unrelated) - I’ll have to check it out in person tomorrow when I’m there.

EDIT: False alarm, just took very long. Everything works now with your workaround!