I’m trying to run a Podman container with GPU support but nothing is working.
The official NVIDIA docs recommend using CDI with Podman, but it seems the nvidia-ctk tool does not exist in nixpkgs(?). How would I go about installing it manually?
$ nvidia-smi
Thu Nov 30 18:25:16 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1660        Off | 00000000:01:00.0 Off |                  N/A |
|  0%   44C    P8               5W / 130W |      1MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=======================================================================================|
|  No running processes found                                                            |
+---------------------------------------------------------------------------------------+
Podman run command:
$ sudo podman run --rm -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:12.1.0-base-ubi8 nvidia-smi
Error: crun: executable file `nvidia-smi` not found in $PATH: No such file or directory: OCI runtime attempted to invoke a command that was not found
And the suggested run command fails due to a missing CDI spec:
$ sudo podman run --rm -e NVIDIA_VISIBLE_DEVICES=all --device=nvidia.com/gpu=all --security-opt=label=disable nvidia/cuda:12.1.0-base-ubi8 nvidia-smi
Error: setting up CDI devices: unresolvable CDI devices nvidia.com/gpu=all
Thanks! I installed the latest nvidia-podman, but I am still facing the same issue. I guess it’s not solely a CDI issue…
$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
...
$ nvidia-ctk cdi list
INFO[0000] Found 2 CDI devices
nvidia.com/gpu=0
nvidia.com/gpu=all
$ sudo podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu nvidia-smi -L
Error: crun: executable file `nvidia-smi` not found in $PATH: No such file or directory: OCI runtime attempted to invoke a command that was not found
So I guess it’s detected, just not added to PATH; you can test that by using the absolute path. Note that the proper CDI integration is being worked on in the second PR (the one from ereslibre), while the previous one added the nvidia-ctk tool.
Ah, got it! Using the full path gets a bit further, but I am guessing something is still missing inside the container:
$ sudo podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu /run/current-system/sw/bin/nvidia-smi -L
{"msg":"exec container process (missing dynamic library?) `/run/current-system/sw/bin/nvidia-smi`: No such file or directory","level":"error","time":"2024-02-18T18:28:48.057437Z"}
How would I go about trying out the unmerged PR? Can I point my Flake-based config to the PR branch? Thanks.
Sure! You could have used github:ereslibre/nixpkgs/containers-cdi or https://github.com/NixOS/nixpkgs/tarball/pull/284507/merge. Now virtualisation.containers.cdi is available on the master branch!
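For completeness, here is a minimal sketch of what pointing a flake at that branch looks like (the host name and module layout are illustrative, and I am going from memory on the option path the PR added):

{
  inputs = {
    # nixpkgs from the PR branch (or use the merge tarball URL above)
    nixpkgs.url = "github:ereslibre/nixpkgs/containers-cdi";
  };

  outputs = { self, nixpkgs, ... }: {
    nixosConfigurations.myhost = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      modules = [
        ./configuration.nix
        {
          # Generate the NVIDIA CDI spec automatically
          virtualisation.containers.cdi.dynamic.nvidia.enable = true;
        }
      ];
    };
  };
}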
And the error I gave up on is below; it’s complaining that the nvidia-container-toolkit attribute is missing for some reason.
error:
… while calling the 'head' builtin
at /nix/store/16a8jg8zn1xd5a1b9jmwmafr007dmzfx-source/lib/attrsets.nix:922:11:
921| || pred here (elemAt values 1) (head values) then
922| head values
| ^
923| else
… while evaluating the attribute 'value'
at /nix/store/16a8jg8zn1xd5a1b9jmwmafr007dmzfx-source/lib/modules.nix:807:9:
806| in warnDeprecation opt //
807| { value = builtins.addErrorContext "while evaluating the option `${showOption loc}':" value;
| ^
808| inherit (res.defsFinal') highestPrio;
(stack trace truncated; use '--show-trace' to show the full trace)
error: attribute 'nvidia-container-toolkit' missing
at /nix/store/1pnvz1fq5q9zkmcry7salalvsirkpkfv-source/nixos/modules/services/hardware/nvidia-container-toolkit-cdi-generator/cdi-generate.nix:30:5:
29| function cdiGenerate {
30| ${pkgs.nvidia-container-toolkit}/bin/nvidia-ctk cdi generate \
| ^
31| --format json \
Oops, the fix was actually quite simple: just add the new nvidia-container-toolkit from unstable to the overlay. This allows the new CDI module to use the package.
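In case someone hits the same thing, the overlay bit looks roughly like this, assuming an unstable nixpkgs input is already in scope (the unstable name is from my config, adjust as needed):

# Overlay: take nvidia-container-toolkit from the unstable input so that
# the CDI module's reference to pkgs.nvidia-container-toolkit resolves.
final: prev: {
  inherit (unstable.legacyPackages.${prev.system}) nvidia-container-toolkit;
}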