Nvidia GPU support in Podman and CDI (nvidia-ctk)

I’m trying to run a Podman container with GPU support but nothing is working.

The official Nvidia docs recommend using CDI with Podman, but it seems the nvidia-ctk tool does not exist in nixpkgs (?). How would I go about installing it manually?

My NixOS config (relevant snippet):

  virtualisation.podman = {
    enableNvidia = true;
  };
  hardware.opengl = {
    enable = true;
    driSupport = true;
    driSupport32Bit = true;
  };
  services.xserver.videoDrivers = [ "nvidia" ];
  hardware.nvidia = {
    modesetting.enable = true;
    powerManagement.enable = false;
    powerManagement.finegrained = false;
    open = true;
  };

The driver is set up properly on the host:

$ nvidia-smi
Thu Nov 30 18:25:16 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1660        Off | 00000000:01:00.0 Off |                  N/A |
|  0%   44C    P8               5W / 130W |      1MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Podman run command:

$ sudo podman run --rm -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:12.1.0-base-ubi8 nvidia-smi
Error: crun: executable file `nvidia-smi` not found in $PATH: No such file or directory: OCI runtime attempted to invoke a command that was not found

And the suggested run command fails due to missing CDI spec:

$ sudo podman run --rm -e NVIDIA_VISIBLE_DEVICES=all --device=nvidia.com/gpu=all --security-opt=label=disable nvidia/cuda:12.1.0-base-ubi8 nvidia-smi
Error: setting up CDI devices: unresolvable CDI devices nvidia.com/gpu=all

Thanks
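
A note on the "unresolvable CDI devices" error above: it generally means that no CDI spec defining nvidia.com/gpu=all was found. The standard CDI spec directories are /etc/cdi and /var/run/cdi, so a first check is whether any spec file exists there at all. A minimal sketch of that check follows; the function name cdi_spec_files is made up for this post, and the directory defaults are an assumption about your setup.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list CDI spec files (*.yaml, *.json) in the given
# directories. With no arguments it falls back to /etc/cdi and /var/run/cdi,
# the standard CDI spec locations (an assumption about the local setup).
cdi_spec_files() {
  [ "$#" -gt 0 ] || set -- /etc/cdi /var/run/cdi
  local dir f
  for dir in "$@"; do
    for f in "$dir"/*.yaml "$dir"/*.json; do
      # An unmatched glob stays literal, so only print paths that exist.
      if [ -e "$f" ]; then
        printf '%s\n' "$f"
      fi
    done
  done
  return 0
}
```

Calling cdi_spec_files with no arguments and getting no output would be consistent with the error: there is no spec for Podman to resolve the device name against.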

Did you ever get this figured out?

Nope. Been fiddling with all kinds of different commands to no avail. At this point, I am not even sure whether it’s a NixOS or a Podman issue.

Pull Request #278969 in NixOS/nixpkgs ("nvidia-container-toolkit: 1.9.0 -> 1.15.0-rc.3" by aaronmondal) added nvidia-ctk; please test. Pull Request #284507 in NixOS/nixpkgs ("NixOS: Add support for CDI" by ereslibre) implements CDI support and will be merged soon.

Thanks! I installed the latest nvidia-podman, but I am still facing the same issue. I guess it’s not solely a CDI issue…

$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
...

$ nvidia-ctk cdi list
INFO[0000] Found 2 CDI devices
nvidia.com/gpu=0
nvidia.com/gpu=all

$ sudo podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu nvidia-smi -L
Error: crun: executable file `nvidia-smi` not found in $PATH: No such file or directory: OCI runtime attempted to invoke a command that was not found

I just tried

❯ nvidia-ctk cdi generate --output nvidia.yaml
...
❯ grep nvidia-smi nvidia.yaml 
  - containerPath: /run/current-system/sw/bin/nvidia-smi
    hostPath: /run/current-system/sw/bin/nvidia-smi

So I guess nvidia-smi is detected and included in the spec, just not added to the container's PATH; you can test this by invoking it via the absolute path. Note that proper CDI integration is being worked on in the second PR (the one from ereslibre), while the first one only added the nvidia-ctk tool.
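
To find the absolute path to try, it can help to pull the containerPath entries out of the generated spec. This is only a sketch: the function name cdi_container_paths is made up for this post, and it assumes the YAML layout seen in the grep output above, with one containerPath key per line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every containerPath value in a CDI spec that
# contains the given substring (for example "nvidia-smi").
# Assumes the YAML written by `nvidia-ctk cdi generate`, where each mount
# has a line such as "  - containerPath: /some/path".
cdi_container_paths() {
  local spec="$1" needle="$2"
  # Drop leading blanks/dashes and the key, keep only the path, then filter.
  sed -n 's/^[[:space:]-]*containerPath:[[:space:]]*//p' "$spec" | grep -F -- "$needle"
}
```

With the spec from above, cdi_container_paths nvidia.yaml nvidia-smi should print the path to pass as the container command in place of the bare nvidia-smi.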

Ah, got it! Using the full path gets a bit further, but I am guessing something PATH-related is still missing inside the container:

sudo podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu /run/current-system/sw/bin/nvidia-smi -L
{"msg":"exec container process (missing dynamic library?) `/run/current-system/sw/bin/nvidia-smi`: No such file or directory","level":"error","time":"2024-02-18T18:28:48.057437Z"}

How would I go about trying out the unmerged PR? Can I point my Flake-based config to the PR branch? Thanks.

Yes, that’d be much appreciated!

Can I point my Flake-based config to the PR branch? Thanks.

Sure! You could have used github:ereslibre/nixpkgs/containers-cdi or https://github.com/NixOS/nixpkgs/tarball/pull/284507/merge. Now virtualisation.containers.cdi is available on the master branch!
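
For anyone who wants to try some other unmerged PR the same way, the tarball URL pattern above can be combined with a one-off input override instead of editing flake.nix. The following is a sketch under assumptions: the helper names are invented for this post, the flake input is assumed to be called nixpkgs, and the second function only prints the nixos-rebuild command rather than running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the tarball URL for the merge commit of a
# nixpkgs pull request, following the URL pattern quoted above.
nixpkgs_pr_tarball() {
  printf 'https://github.com/NixOS/nixpkgs/tarball/pull/%s/merge\n' "$1"
}

# Print (do not run) a rebuild command that overrides one flake input for a
# single invocation. The input name defaults to "nixpkgs", which is an
# assumption about how the flake is set up.
pr_rebuild_command() {
  local pr="$1" input="${2:-nixpkgs}"
  printf 'nixos-rebuild test --flake . --override-input %s %s\n' \
    "$input" "$(nixpkgs_pr_tarball "$pr")"
}
```

Running pr_rebuild_command 284507 prints a command you can inspect before executing it; the override applies only to that invocation rather than changing the input in flake.nix.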

I gave it a shot (on master), but overriding the updated module(s) was not a smooth process. So I’ll just wait for CDI support to hit unstable.

This is what I ended up with in case anyone stumbles across this thread:

# Define new overlay in flake.nix
inputs = {
  # ...
  nixpkgs-master.url = "github:NixOS/nixpkgs";
};
outputs = { nixpkgs-master, ... }: {
  overlay-master = final: prev: {
    master = nixpkgs-master.legacyPackages.${prev.system};
  }; 
};

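# configuration.nix
# Note: outputs and nixpkgs-master are not standard module arguments;
# they have to be passed in from flake.nix (e.g. via specialArgs).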
{ outputs, nixpkgs-master, ... }:
{
  # Apply overlay
  nixpkgs.overlays = [ outputs.overlay-master ];

  # Override new modules by importing them manually
  imports = [
    "${nixpkgs-master}/nixos/modules/virtualisation/containers.nix"
    "${nixpkgs-master}/nixos/modules/services/hardware/nvidia-container-toolkit-cdi-generator"
  ];
  # Disable the stock containers module so the copy imported from master replaces it.
  disabledModules = [ "virtualisation/containers.nix" ];
}

And the error I gave up on is below. It’s complaining about nvidia-container-toolkit missing for some reason.

error:
       … while calling the 'head' builtin

         at /nix/store/16a8jg8zn1xd5a1b9jmwmafr007dmzfx-source/lib/attrsets.nix:922:11:

          921|         || pred here (elemAt values 1) (head values) then
          922|           head values
             |           ^
          923|         else

       … while evaluating the attribute 'value'

         at /nix/store/16a8jg8zn1xd5a1b9jmwmafr007dmzfx-source/lib/modules.nix:807:9:

          806|     in warnDeprecation opt //
          807|       { value = builtins.addErrorContext "while evaluating the option `${showOption loc}':" value;
             |         ^
          808|         inherit (res.defsFinal') highestPrio;

       (stack trace truncated; use '--show-trace' to show the full trace)

       error: attribute 'nvidia-container-toolkit' missing

       at /nix/store/1pnvz1fq5q9zkmcry7salalvsirkpkfv-source/nixos/modules/services/hardware/nvidia-container-toolkit-cdi-generator/cdi-generate.nix:30:5:

           29| function cdiGenerate {
           30|   ${pkgs.nvidia-container-toolkit}/bin/nvidia-ctk cdi generate \
             |     ^
           31|     --format json \

Oops, the fix was actually quite simple: just add the new nvidia-container-toolkit from unstable to the overlay. This allows the new CDI module to use the package.

overlay-unstable = final: prev: {
  unstable = nixpkgs-unstable.legacyPackages.${prev.system};
  nvidia-container-toolkit = nixpkgs-unstable.legacyPackages.${prev.system}.nvidia-container-toolkit;
}; 

And with that, my GPU is visible inside the container!

$ sudo podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu nvidia-smi -L
GPU 0: NVIDIA GeForce GTX 1660 (UUID: GPU-xxxxx)

Jellyfin (running as an OCI container) sees it too, which has been the goal all along.

Thank you for the help, @SergeK!

https://nixpk.gs/pr-tracker.html?pr=284507