nvidia.nix config (for nvidia drivers):
{ config, lib, pkgs, inputs, ... }:
{
  # NixOS stable (24.05) still calls the graphics toggle "hardware.opengl"
  hardware.opengl = {
    enable = true;
  };
  # On nixos-unstable it has been renamed:
  #hardware.graphics.enable = true;

  services.xserver.enable = true;
  services.xserver.videoDrivers = ["nvidia"];

  hardware.nvidia = {
    modesetting.enable = true;
    powerManagement.enable = false;
    powerManagement.finegrained = false;
    open = true;
    nvidiaSettings = false;
    # package = pkgs.linuxPackages_6_10.nvidiaPackages.beta;
    # package = config.boot.kernelPackages.nvidiaPackages.latest;
  };
}
docker.nix (to enable docker and the nvidia runtime):
{ config, lib, pkgs, ... }:
{
  virtualisation.docker = {
    enable = true;
    enableOnBoot = true;
    # Nvidia Docker (deprecated)
    #enableNvidia = true;
  };

  hardware.nvidia-container-toolkit.enable = true;

  # libnvidia-container does not support cgroups v2 (prior to 1.8.0)
  # https://github.com/NVIDIA/nvidia-docker/issues/1447
  #systemd.enableUnifiedCgroupHierarchy = false;
}
When I run:
$ docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
Thanks for the help! Not gonna lie, I've been stuck on this for a while…
Hello @SpidFightFR!
With hardware.nvidia-container-toolkit.enable = true, the Container Device Interface (CDI) is used instead of the nvidia runtime wrappers. With CDI you have to specify the devices with the --device argument instead of the --gpus one, like this:
$ docker run --rm --device nvidia.com/gpu=all ubuntu:latest nvidia-smi
Also, you will need at least Docker 25 on NixOS 24.05:
virtualisation.docker.package = pkgs.docker_25;
We are going to update the documentation to make this clearer and less error-prone. Please let us know what would have helped you identify the difference in your case; it will surely help other users.
Hey @ereslibre, hope you're doing well.
Thank you so much for your quick answer!
Indeed, it now works, thank you!
$ docker run --rm --device nvidia.com/gpu=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
Fri Aug 30 17:30:47 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78 Driver Version: 550.78 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000000:00:10.0 Off | N/A |
| 0% 38C P8 8W / 170W | 2MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Final docker config:
docker.nix:
{ config, lib, pkgs, ... }:
{
  virtualisation.docker = {
    enable = true;
    enableOnBoot = true;
    package = pkgs.docker_25;
    # Nvidia Docker (deprecated)
    #enableNvidia = true;
  };

  hardware.nvidia-container-toolkit.enable = true;

  # libnvidia-container does not support cgroups v2 (prior to 1.8.0)
  # https://github.com/NVIDIA/nvidia-docker/issues/1447
  #systemd.enableUnifiedCgroupHierarchy = false;
}
Final nvidia.nix:
{ config, lib, pkgs, inputs, ... }:
{
  # NixOS stable (24.05) still calls the graphics toggle "hardware.opengl"
  hardware.opengl = {
    enable = true;
    driSupport = true;
    driSupport32Bit = true;
  };
  # On nixos-unstable it has been renamed:
  #hardware.graphics.enable = true;

  services.xserver.enable = true;
  services.xserver.videoDrivers = ["nvidia"];

  hardware.nvidia = {
    modesetting.enable = true;
    powerManagement.enable = false;
    powerManagement.finegrained = false;
    open = true;
    nvidiaSettings = false;
    # package = pkgs.linuxPackages_6_10.nvidiaPackages.beta;
    # package = config.boot.kernelPackages.nvidiaPackages.latest;
  };
}
By the way, speaking of NVIDIA, I'm aiming for the most minimal server possible…
Do you think it's possible to load the NVIDIA drivers without an Xorg session?
So far I use services.xserver.videoDrivers = ["nvidia"];, but if I could ditch the Xorg session while still loading the module, it would be lovely!
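For what it's worth, the rough (and untested) idea I have in mind is something along these lines, assuming the driver still gets pulled in via the videoDrivers option even with the X server itself disabled:
# Untested sketch: keep "nvidia" in videoDrivers (that is what pulls the
# driver module in), but leave the X server itself disabled on a headless box.
services.xserver.enable = false;
services.xserver.videoDrivers = [ "nvidia" ];
hardware.nvidia.modesetting.enable = true;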
best regards, Spid
I'm having the same issue and your answer does help, but I'm using Docker Compose and I'm not sure how to declare the device in the YAML.
@ereslibre I have followed all the instructions above on my NixOS system, including your updated documentation.
When I run systemctl status nvidia-container-toolkit-cdi-generator.service I can see that the service has started and is running, and I can see the CDI spec at /var/run/cdi/nvidia-container-toolkit.json. However, the service is giving me a number of warnings, see below.
However, when I run docker run --rm -it --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi -L I get the following error: docker: Error response from daemon: could not select device driver "cdi" with capabilities: [].
Output of systemctl status nvidia-container-toolkit-cdi-generator.service:
nvidia-container-toolkit-cdi-generator.service - Container Device Interface (CDI) for Nvidia generator
Loaded: loaded (/etc/systemd/system/nvidia-container-toolkit-cdi-generator.service; enabled; preset: enabled)
Active: active (exited) since Wed 2024-10-16 19:48:45 IST; 15h ago
Invocation: a3d181fff3c54e15b9e3ace2e97f0231
Process: 1348 ExecStart=/nix/store/sqnqlkdsgh9q713b34fyd7j5cwjpaqf9-nvidia-cdi-generator/bin/nvidia-cdi-generator (code=exited, status=0/SUCCESS)
Main PID: 1348 (code=exited, status=0/SUCCESS)
IP: 0B in, 0B out
Mem peak: 38.9M
CPU: 70ms
Oct 16 19:48:45 nixos-olan nvidia-cdi-generator[1357]: time="2024-10-16T19:48:45+01:00" level=info msg="Selecting /nix/store/0ylgjrffgsw8fa8y2xrzpdq1zymq3>
Oct 16 19:48:45 nixos-olan nvidia-cdi-generator[1357]: time="2024-10-16T19:48:45+01:00" level=warning msg="Could not locate nvidia-smi: pattern nvidia-smi>
Oct 16 19:48:45 nixos-olan nvidia-cdi-generator[1357]: time="2024-10-16T19:48:45+01:00" level=warning msg="Could not locate nvidia-debugdump: pattern nvid>
Oct 16 19:48:45 nixos-olan nvidia-cdi-generator[1357]: time="2024-10-16T19:48:45+01:00" level=warning msg="Could not locate nvidia-persistenced: pattern n>
Oct 16 19:48:45 nixos-olan nvidia-cdi-generator[1357]: time="2024-10-16T19:48:45+01:00" level=warning msg="Could not locate nvidia-cuda-mps-control: patte>
Oct 16 19:48:45 nixos-olan nvidia-cdi-generator[1357]: time="2024-10-16T19:48:45+01:00" level=warning msg="Could not locate nvidia-cuda-mps-server: patter>
Oct 16 19:48:45 nixos-olan nvidia-cdi-generator[1357]: time="2024-10-16T19:48:45+01:00" level=warning msg="Could not locate nvidia/xorg/nvidia_drv.so: pat>
Oct 16 19:48:45 nixos-olan nvidia-cdi-generator[1357]: time="2024-10-16T19:48:45+01:00" level=warning msg="Could not locate nvidia/xorg/libglxserver_nvidi>
Oct 16 19:48:45 nixos-olan nvidia-cdi-generator[1357]: time="2024-10-16T19:48:45+01:00" level=info msg="Generated CDI spec with version 0.5.0"
Oct 16 19:48:45 nixos-olan systemd[1]: Finished Container Device Interface (CDI) for Nvidia generator.
Hello @eoinlane!
Those warnings are not something critical, and I can see them as well on systems that are working fine.
Are you using rootless Docker? What version of Docker do you have? What version of nixpkgs do you have? Does it work with Podman? (podman run --rm -it --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi -L)
Hey @ereslibre
Thanks for your response. Per the instructions above, I am using Docker version 25, rootless:
virtualisation.docker.rootless = {
  enable = true;
  package = pkgs.docker_25;
  setSocketVariable = true;
};
I'm using nixpkgs.url = "github:nixos/nixpkgs?ref=nixos-unstable";
I have not tried with Podman, and I am unfamiliar with that application.
Thanks
So your issue has to do with Docker rootless. Currently, Docker rootless does not copy the CDI specs into the namespace where dockerd is executed. I have opened a PR for Docker here: Dockerd rootless: make {/etc,/var/run}/cdi available by ereslibre · Pull Request #48541 · moby/moby · GitHub.
I also created a nixpkgs PR here: dockerd rootless: include patch to read /etc/cdi and /var/run/cdi by ereslibre · Pull Request #344005 · NixOS/nixpkgs · GitHub, but it is now outdated, given that I changed the moby PR and the hash no longer matches. In any case, you can probably add a nixpkgs overlay to include that patch (a rough sketch follows below this list), or:
- Use rootful docker
- Use podman (not rootful) (nix run nixpkgs#podman -- run --rm -it --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi -L)
However, eventually, once Moby merges my PR, a new Moby release is published (or we include the patch ourselves), and we package that new version (or the current version with the patch once it is accepted in moby/moby), your current configuration should work out of the box.
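For the overlay route, the general shape would be something like the following. This is only a rough sketch: the exact attribute you need to override, and where the patch has to land (the moby/dockerd part of the package), depend on how the docker package is structured in your nixpkgs revision, and ./moby-cdi-rootless.patch is just an illustrative local copy of the moby change:
# Rough sketch, not a drop-in: verify which derivation actually builds dockerd
# in your nixpkgs revision before attaching the patch.
nixpkgs.overlays = [
  (final: prev: {
    docker_25 = prev.docker_25.overrideAttrs (old: {
      patches = (old.patches or [ ]) ++ [ ./moby-cdi-rootless.patch ];
    });
  })
];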
I’m trying to reproduce the same exact error you are getting, but I cannot. With your setup, the error I get is (without the patch I submitted to moby, which fixes it):
docker: Error response from daemon: CDI device injection failed: unresolvable CDI devices nvidia.com/gpu=all.
However, yours reads:
docker: Error response from daemon: could not select device driver "cdi" with capabilities: [].
I do wonder, do you have virtualisation.docker.enableNvidia set? If that is the case, please disable it. Regarding Docker rootless, everything I wrote in Nvidia docker container runtime doesn't detect my gpu - #10 by ereslibre still applies.
@ereslibre thanks again for all your help. I went with the rootless option, as I need to use docker compose, and it is detecting my GPU:
id=GPU-d14f1399-d337-2a82-dba9-c86f82d21a25 library=cuda variant=v12 compute=8.9 driver=12.6 name="NVIDIA GeForce RTX 4060" total="7.7 GiB" available="6.2 GiB"
For my own education I need to understand more about what Moby is, but for now I have access to my GPU for building a local LLM.
I’m glad it worked! Please, let me know what you changed just in case we can improve documentation somehow.
You can read more about Moby here
@eoinlane Just a note: you can use docker compose with rootful Docker as well; you only need to add the user that needs access to the Docker socket to the docker group.
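A minimal sketch of that rootful setup (the user name alice is just a placeholder):
# Rootful Docker; members of the "docker" group can talk to the daemon socket,
# so docker compose works for that user without rootless mode.
virtualisation.docker.enable = true;
users.users.alice.extraGroups = [ "docker" ];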
Hi everyone! I followed the instructions from @SpidFightFR (I basically replicated the example nix files from late August 2024) and @ereslibre, and successfully managed to get my GPU detected (see below). However, when I try to run my own Docker container, the system cannot "select device driver "nvidia" with capabilities: [[gpu]]". I have tried several different scenarios, but I still do not understand why Docker Compose fails to access the GPU when the test container works. Is there anything wrong with my Docker Compose configuration for GPU access? Are there any additional NixOS configurations I need to enable for Docker Compose to access the GPU?
Any help would be appreciated!
System Information
- NixOS Version: 24.05.20241116.e8c38b7 (Uakari)
- Docker Version: 25.0.6
- NVIDIA Driver Version: 550.78
- CUDA Version: 12.2.140
- GPU: NVIDIA GeForce GTX 1050
Issue Description
I can see my GPU when running nvidia-smi through a test container, but my actual Docker container fails to access the GPU.
Working Test
The following command works successfully:
docker run --rm -it --device=nvidia.com/gpu=all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
Output shows the GPU is correctly detected:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78 Driver Version: 550.78 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce GTX 1050 ... Off | 00000000:01:00.0 Off | N/A |
| N/A 34C P8 N/A / ERR! | 2MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
Current Issue
When running my container with docker compose up, I get this error:
Error response from daemon: could not select device driver "nvidia" with capabilities: [[gpu]]
Configuration
I tried both the commented and the uncommented variants, with and without explicitly setting the environment variables (below).
Docker Compose Configuration
pipeline:
  image: '${DOCKER_IMAGE_PIPELINE?Variable not set}:${TAG-latest}'
  build:
    target: development
    context: ./pipeline
    args:
      - PIPELINE_BASE_IMAGE=${PIPELINE_BASE_IMAGE}
  environment:
    - CONFIG_PATH=${PIPELINE_CONFIG_PATH}
  runtime: ${DOCKER_RUNTIME:-runc}
  restart: always
  deploy:
    resources:
      reservations:
        devices:
          - driver: ${GPU_DRIVER:-none}
            #- driver: nvidia
            count: ${GPU_COUNT:-0}
            # count: all
            capabilities: ['${GPU_CAPABILITIES:-none}']
            # capabilities: [gpu]
Environment Variables
GPU_DRIVER=nvidia
GPU_COUNT=all
PIPELINE_BASE_IMAGE=ubuntu:22.04
NixOS Configuration (mostly from @SpidFightFR in this thread)
{
  nixpkgs.config.allowUnfree = true;
  hardware.nvidia-container-toolkit.enable = true;
  environment.systemPackages = with pkgs; [
    cudaPackages.cudatoolkit
  ];
  hardware.opengl = {
    enable = true;
    driSupport = true;
    driSupport32Bit = true;
  };
  services.xserver.videoDrivers = [ "nvidia" ];
  hardware.nvidia = {
    modesetting.enable = true;
    powerManagement.enable = false;
    open = false;
    nvidiaSettings = true;
    package = config.boot.kernelPackages.nvidiaPackages.stable;
    prime = {
      offload = {
        enable = true;
        enableOffloadCmd = true;
      };
      intelBusId = "PCI:0:2:0";
      nvidiaBusId = "PCI:1:0:0";
    };
  };
  virtualisation.docker = {
    enable = true;
    enableOnBoot = true;
    package = pkgs.docker;
  };
}
Hello @Traktorbek!
You need to adapt your docker-compose file so that it uses the CDI driver, as documented in the Nixpkgs Reference Manual. In your case, it should be along these lines:
pipeline:
  image: '${DOCKER_IMAGE_PIPELINE?Variable not set}:${TAG-latest}'
  build:
    target: development
    context: ./pipeline
    args:
      - PIPELINE_BASE_IMAGE=${PIPELINE_BASE_IMAGE}
  environment:
    - CONFIG_PATH=${PIPELINE_CONFIG_PATH}
  runtime: ${DOCKER_RUNTIME:-runc}
  restart: always
  deploy:
    resources:
      reservations:
        devices:
          - driver: cdi
            device_ids:
              - nvidia.com/gpu=all
Note that this will expose all your GPUs to the container. If they are identified by their IDs and you wanted to expose only IDs 0 and 1, you could do:
pipeline:
  image: '${DOCKER_IMAGE_PIPELINE?Variable not set}:${TAG-latest}'
  build:
    target: development
    context: ./pipeline
    args:
      - PIPELINE_BASE_IMAGE=${PIPELINE_BASE_IMAGE}
  environment:
    - CONFIG_PATH=${PIPELINE_CONFIG_PATH}
  runtime: ${DOCKER_RUNTIME:-runc}
  restart: always
  deploy:
    resources:
      reservations:
        devices:
          - driver: cdi
            device_ids:
              - nvidia.com/gpu=0
              - nvidia.com/gpu=1
You are my hero! Thank you!
Hi! I am also trying to run a docker compose file rootless with my GPU, but something with the CDI device injection keeps failing. It says:
docker: Error response from daemon: CDI device injection failed: unresolvable CDI devices nvidia.com/gpu=all.
Even on a simple command like docker run --rm -it --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi.
I tried to search this issue but couldn’t find much.
This is the docker-compose.yaml I am trying to run:
services:
  panoptic_slam:
    image: "panoptic_slam:latest"
    container_name: panoptic_slam_sys
    environment:
      DISPLAY: $DISPLAY
      PATH: $PATH
      NVIDIA_DRIVER_CAPABILITIES: all
      NVIDIA_VISIBLE_DEVICES: void
    volumes:
      - /tmp/.X11-unix:/tmp/.X11-unix
      - ~/.Xauthority:/root/.Xauthority
      - /dev/bus/usb:/dev/bus/usb
      - ../Dataset:/home/panoptic_slam/Dataset
      - ../Output:/home/panoptic_slam/Output
    device_cgroup_rules:
      - 'c 189:* rmw'
    network_mode: "host"
    privileged: true
    tty: true
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              capabilities: [gpu]
              device_ids:
                - nvidia.com/gpu=all
I added the capabilities: [gpu] under driver: cdi, otherwise the command would fail in the validation step with the error:
validating /path/docker-compose.yaml: services.panoptic_slam.deploy.resources.reservations.devices.0 capabilities is required
And here is my config with the stuff related to docker:
hardware.nvidia-container-toolkit.enable = true;
virtualisation.docker.enable = true;
virtualisation.docker.rootless = {
  enable = true;
  setSocketVariable = true;
};
users.users.locochoco.extraGroups = [ "docker" ];
And the versions of everything:
- Docker version 27.3.1, build v27.3.1
- NixOS 25.05.20241126.af51545 (Warbler) x86_64
- NVIDIA-SMI 565.57.01
- CUDA Version: 12.7
- NVIDIA GeForce GTX 1660
Any ideas as to why it cannot resolve the CDI device for the GPU?
Hello @locochoco!
I only just noticed your message; sorry I missed it.
This was an upstream bug fixed by Dockerd rootless: make {/etc,/var/run}/cdi available by ereslibre · Pull Request #48541 · moby/moby · GitHub and backported to multiple Docker versions (25.0.x, 26.1.y, 27.z).
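Once you run a Docker package whose point release carries that backport, your original rootless setup should work as-is. As a sketch (pkgs.docker_27 is only an example here; check that the packaged release in your nixpkgs actually includes the fix):
# Sketch under the assumption that the chosen Docker package already contains
# the backported rootless CDI fix from moby PR #48541.
virtualisation.docker.rootless = {
  enable = true;
  package = pkgs.docker_27;
  setSocketVariable = true;
};
hardware.nvidia-container-toolkit.enable = true;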