Nvidia Docker container runtime doesn't detect my GPU

nvidia.nix config (for nvidia drivers):

{ config, lib, pkgs, inputs, ... }:

{
        #"NixOS" Stable - 24.05 still calls the graphical togable option as "OpenGL"
        hardware.opengl = {
                enable = true;
        };

        #For nixos-unstable, they renamed it
        #hardware.graphics.enable = true;

        services.xserver.enable = true;
        services.xserver.videoDrivers = ["nvidia"];

        hardware.nvidia = {
                modesetting.enable = true;
                powerManagement.enable = false;
                powerManagement.finegrained = false;

                open = true;

                nvidiaSettings = false;

#               package = pkgs.linuxPackages_6_10.nvidiaPackages.beta;
#               package = config.boot.kernelPackages.nvidiaPackages.latest;

        };
}

docker.nix (to enable docker and the nvidia runtime):

{ config, lib, pkgs, ... }:
{
  virtualisation.docker = {
      enable = true;
      enableOnBoot = true;
      # Nvidia Docker (deprecated)
      #enableNvidia = true;
  };

  hardware.nvidia-container-toolkit.enable = true;
  # libnvidia-container does not support cgroups v2 (prior to 1.8.0)
  # https://github.com/NVIDIA/nvidia-docker/issues/1447
  #systemd.enableUnifiedCgroupHierarchy = false;
}

When I run:

$ docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

Thanks for the help! Not gonna lie, I've been stuck on this for a while…

Hello @SpidFightFR!

With hardware.nvidia-container-toolkit.enable = true;, the Container Device Interface (CDI) is used instead of the nvidia runtime wrappers.

With CDI you have to specify the devices with the --device argument instead of the --gpus one, like this:

$ docker run --rm --device nvidia.com/gpu=all ubuntu:latest nvidia-smi
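
As a quick sanity check, the CDI spec that the NixOS module generates lives at /var/run/cdi/nvidia-container-toolkit.json; you can list the device names it defines with a plain grep, or with nvidia-ctk (the toolkit's own CLI), assuming you have it in your PATH:

# Device names defined in the generated CDI spec:
$ grep '"name"' /var/run/cdi/nvidia-container-toolkit.json
# Or, if nvidia-ctk is available:
$ nvidia-ctk cdi list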

Also, on NixOS 24.05 you will need at least Docker 25:

virtualisation.docker.package = pkgs.docker_25;

We are going to update the documentation to make this clearer and less error-prone. Please let us know what would have helped you identify the difference in your case; it will surely help other users.

Hey @ereslibre, hope you're doing well.

Thank you so much for your quick answer!
Indeed, it works now, thank you!

$ docker run --rm --device nvidia.com/gpu=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
Fri Aug 30 17:30:47 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:00:10.0 Off |                  N/A |
|  0%   38C    P8              8W /  170W |       2MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Final docker config:
docker.nix:

{ config, lib, pkgs, ... }:
{
  virtualisation.docker = {
      enable = true;
      enableOnBoot = true;
      package = pkgs.docker_25;
      # Nvidia Docker (deprecated)
      #enableNvidia = true;
  };

  hardware.nvidia-container-toolkit.enable = true;
  # libnvidia-container does not support cgroups v2 (prior to 1.8.0)
  # https://github.com/NVIDIA/nvidia-docker/issues/1447
  #systemd.enableUnifiedCgroupHierarchy = false;
}

Final nvidia.nix:

{ config, lib, pkgs, inputs, ... }:

{
        # NixOS stable (24.05) still names the graphics toggle "hardware.opengl"
        hardware.opengl = {
                enable = true;
                driSupport = true;
                driSupport32Bit = true;
        };

        #For nixos-unstable, they renamed it
        #hardware.graphics.enable = true;

        services.xserver.enable = true;
        services.xserver.videoDrivers = ["nvidia"];

        hardware.nvidia = {
                modesetting.enable = true;
                powerManagement.enable = false;
                powerManagement.finegrained = false;

                open = true;

                nvidiaSettings = false;

#               package = pkgs.linuxPackages_6_10.nvidiaPackages.beta;
#               package = config.boot.kernelPackages.nvidiaPackages.latest;

        };
}

By the way, speaking of Nvidia: I'm aiming for the most minimal server possible…
Do you think it's possible to load the Nvidia drivers without an Xorg session?

So far I use services.xserver.videoDrivers = ["nvidia"];, but if I could ditch the Xorg session while still loading the module, it would be lovely!

Best regards, Spid

I'm having the same issue, and your answer does help, but I'm using Docker Compose and I'm not sure how to declare the device in YAML.

I have opened docs: update the CUDA section with how to use the `nvidia-container-toolkit` by ereslibre · Pull Request #344188 · NixOS/nixpkgs · GitHub, which documents this and other bits. Have a look at it and let me know if you have further issues!

@ereslibre I have followed all the instructions above on my Nix system, including your updated documentation above.
When I run systemctl status nvidia-container-toolkit-cdi-generator.service I can see that the service has started and is running, and the CDI spec is present at /var/run/cdi/nvidia-container-toolkit.json. However, the service gives me a number of warnings (see below).
However, when I run docker run --rm -it --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi -L I get the following error: docker: Error response from daemon: could not select device driver "cdi" with capabilities: [].

Output of systemctl status nvidia-container-toolkit-cdi-generator.service:
nvidia-container-toolkit-cdi-generator.service - Container Device Interface (CDI) for Nvidia generator
     Loaded: loaded (/etc/systemd/system/nvidia-container-toolkit-cdi-generator.service; enabled; preset: enabled)
     Active: active (exited) since Wed 2024-10-16 19:48:45 IST; 15h ago
 Invocation: a3d181fff3c54e15b9e3ace2e97f0231
    Process: 1348 ExecStart=/nix/store/sqnqlkdsgh9q713b34fyd7j5cwjpaqf9-nvidia-cdi-generator/bin/nvidia-cdi-generator (code=exited, status=0/SUCCESS)
   Main PID: 1348 (code=exited, status=0/SUCCESS)
         IP: 0B in, 0B out
   Mem peak: 38.9M
        CPU: 70ms

Oct 16 19:48:45 nixos-olan nvidia-cdi-generator[1357]: time="2024-10-16T19:48:45+01:00" level=info msg="Selecting /nix/store/0ylgjrffgsw8fa8y2xrzpdq1zymq3>
Oct 16 19:48:45 nixos-olan nvidia-cdi-generator[1357]: time="2024-10-16T19:48:45+01:00" level=warning msg="Could not locate nvidia-smi: pattern nvidia-smi>
Oct 16 19:48:45 nixos-olan nvidia-cdi-generator[1357]: time="2024-10-16T19:48:45+01:00" level=warning msg="Could not locate nvidia-debugdump: pattern nvid>
Oct 16 19:48:45 nixos-olan nvidia-cdi-generator[1357]: time="2024-10-16T19:48:45+01:00" level=warning msg="Could not locate nvidia-persistenced: pattern n>
Oct 16 19:48:45 nixos-olan nvidia-cdi-generator[1357]: time="2024-10-16T19:48:45+01:00" level=warning msg="Could not locate nvidia-cuda-mps-control: patte>
Oct 16 19:48:45 nixos-olan nvidia-cdi-generator[1357]: time="2024-10-16T19:48:45+01:00" level=warning msg="Could not locate nvidia-cuda-mps-server: patter>
Oct 16 19:48:45 nixos-olan nvidia-cdi-generator[1357]: time="2024-10-16T19:48:45+01:00" level=warning msg="Could not locate nvidia/xorg/nvidia_drv.so: pat>
Oct 16 19:48:45 nixos-olan nvidia-cdi-generator[1357]: time="2024-10-16T19:48:45+01:00" level=warning msg="Could not locate nvidia/xorg/libglxserver_nvidi>
Oct 16 19:48:45 nixos-olan nvidia-cdi-generator[1357]: time="2024-10-16T19:48:45+01:00" level=info msg="Generated CDI spec with version 0.5.0"
Oct 16 19:48:45 nixos-olan systemd[1]: Finished Container Device Interface (CDI) for Nvidia generator.

Hello @eoinlane!

Those warnings are not critical; I can see them as well on systems that are working fine.

Are you using rootless Docker? What version of Docker do you have? What version of nixpkgs do you have? Does it work with Podman? (podman run --rm -it --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi -L)

Hey @ereslibre

Thanks for your response. Per the instructions above, I am using version 25 of Docker, rootless:

virtualisation.docker.rootless = {
  enable = true;
  package = pkgs.docker_25;
  setSocketVariable = true;
};

I'm using nixpkgs.url = "github:nixos/nixpkgs?ref=nixos-unstable";

and I have not tried Podman, as I am unfamiliar with that application.

Thanks

So your issue has to do with Docker rootless. Currently, Docker rootless does not copy the CDI specs into the namespace where dockerd is executed. I have opened a PR for Docker here: Dockerd rootless: make {/etc,/var/run}/cdi available by ereslibre · Pull Request #48541 · moby/moby · GitHub.

I also created a nixpkgs PR here: dockerd rootless: include patch to read /etc/cdi and /var/run/cdi by ereslibre · Pull Request #344005 · NixOS/nixpkgs · GitHub, but it is now outdated, given that I changed the moby PR and the hash no longer matches. In any case, you can probably add a nixpkgs overlay to include that patch, or:

  1. Use rootful docker
  2. Use podman (not rootful) (nix run nixpkgs#podman -- run --rm -it --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi -L)

Eventually, though, once Moby merges my PR and there is a new Moby release (or we include the patch ourselves), and we package that new version (or the current version with the patch, once it gets accepted in moby/moby), your current configuration should work out of the box.

I'm trying to reproduce the exact same error you are getting, but I cannot. With your setup (without the patch I submitted to Moby, which fixes it), the error I get is:

docker: Error response from daemon: CDI device injection failed: unresolvable CDI devices nvidia.com/gpu=all.

However, yours read:

docker: Error response from daemon: could not select device driver "cdi" with capabilities: [].

I do wonder: do you have virtualisation.docker.enableNvidia set? If that's the case, please disable it. Regarding Docker rootless, everything I wrote in Nvidia docker container runtime doesn't detect my gpu - #10 by ereslibre still applies.

@ereslibre thanks again for all your help. I went with the rootless option, as I need to use Docker Compose, and it is detecting my GPU:

id=GPU-d14f1399-d337-2a82-dba9-c86f82d21a25 library=cuda variant=v12 compute=8.9 driver=12.6 name="NVIDIA GeForce RTX 4060" total="7.7 GiB" available="6.2 GiB"

For my own education I need to understand more about what Moby is :slight_smile: but for now I have access to my GPU for building a local LLM.

I'm glad it worked! Please let me know what you changed, just in case we can improve the documentation somehow.

You can read more about Moby here :slight_smile:

@eoinlane Just a note: you can use Docker Compose with rootful Docker as well; you only need to add the user that needs access to the Docker socket to the docker group.
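
For reference, a minimal sketch of that setup in NixOS, assuming a hypothetical user named "alice":

# Rootful Docker, with the user added to the "docker" group
virtualisation.docker.enable = true;
users.users.alice.extraGroups = [ "docker" ];  # "alice" is a placeholder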

Hi everyone! I followed the instructions from @SpidFightFR (I basically replicated the example Nix files from late August 2024) and @ereslibre, and successfully got my GPU detected (see below). However, when I try to run my own Docker container, it fails with could not select device driver "nvidia" with capabilities: [[gpu]]. I have tried several different scenarios, but I still do not understand why Docker Compose fails to access the GPU when the test container works. Is there anything wrong with my Docker Compose configuration for GPU access? Are there any additional NixOS options I need to enable for Docker Compose to access the GPU?

Any help would be appreciated!

System Information

  • NixOS Version: 24.05.20241116.e8c38b7 (Uakari)
  • Docker Version: 25.0.6
  • NVIDIA Driver Version: 550.78
  • CUDA Version: 12.2.140
  • GPU: NVIDIA GeForce GTX 1050

Issue Description

I can see my GPU when running nvidia-smi through a test container, but my actual Docker container fails to access the GPU.

Working Test

The following command works successfully:

docker run --rm -it --device=nvidia.com/gpu=all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

Output shows the GPU is correctly detected:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1050 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   34C    P8             N/A / ERR!  |       2MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

Current Issue

When running my container with docker compose up, I get this error:

Error response from daemon: could not select device driver "nvidia" with capabilities: [[gpu]]

Configuration

I tried both the commented-out and the uncommented variants, with and without explicitly setting the environment variables (below).

Docker Compose Configuration

pipeline:
  image: '${DOCKER_IMAGE_PIPELINE?Variable not set}:${TAG-latest}'
  build:
    target: development
    context: ./pipeline
    args:
      - PIPELINE_BASE_IMAGE=${PIPELINE_BASE_IMAGE}
  environment:
    - CONFIG_PATH=${PIPELINE_CONFIG_PATH}
  runtime: ${DOCKER_RUNTIME:-runc}
  restart: always
  deploy:
     resources:
       reservations:
         devices:
          - driver: ${GPU_DRIVER:-none}
          #- driver: nvidia
            count: ${GPU_COUNT:-0}
          #  count: all
            capabilities: ['${GPU_CAPABILITIES:-none}']
          #  capabilities: [gpu]

Environment Variables

GPU_DRIVER=nvidia
GPU_COUNT=all
PIPELINE_BASE_IMAGE=ubuntu:22.04

NixOS Configuration (mostly from @SpidFightFR in this thread)

{
  nixpkgs.config.allowUnfree = true;
  hardware.nvidia-container-toolkit.enable = true;

  environment.systemPackages = with pkgs; [
    cudaPackages.cudatoolkit
  ];

  hardware.opengl = {
    enable = true;
    driSupport = true;
    driSupport32Bit = true;
  };

  services.xserver.videoDrivers = [ "nvidia" ];

  hardware.nvidia = {
    modesetting.enable = true;
    powerManagement.enable = false;
    open = false;
    nvidiaSettings = true;
    package = config.boot.kernelPackages.nvidiaPackages.stable;
    prime = {
      offload = {
        enable = true;
        enableOffloadCmd = true;
      };
      intelBusId = "PCI:0:2:0";
      nvidiaBusId = "PCI:1:0:0";
    };
  };

  virtualisation.docker = {
    enable = true;
    enableOnBoot = true;
    package = pkgs.docker;
  };
}

Hello @Traktorbek!

You need to adapt your docker-compose file so that it uses the CDI driver, as documented in the Nixpkgs Reference Manual. In your case, it should be along these lines:

pipeline:
  image: '${DOCKER_IMAGE_PIPELINE?Variable not set}:${TAG-latest}'
  build:
    target: development
    context: ./pipeline
    args:
      - PIPELINE_BASE_IMAGE=${PIPELINE_BASE_IMAGE}
  environment:
    - CONFIG_PATH=${PIPELINE_CONFIG_PATH}
  runtime: ${DOCKER_RUNTIME:-runc}
  restart: always
  deploy:
    resources:
      reservations:
        devices:
        - driver: cdi
          device_ids:
          - nvidia.com/gpu=all

Note that this will expose all your GPUs to the container. If they are identified by their IDs and you wanted to expose only IDs 0 and 1, you could do:

pipeline:
  image: '${DOCKER_IMAGE_PIPELINE?Variable not set}:${TAG-latest}'
  build:
    target: development
    context: ./pipeline
    args:
      - PIPELINE_BASE_IMAGE=${PIPELINE_BASE_IMAGE}
  environment:
    - CONFIG_PATH=${PIPELINE_CONFIG_PATH}
  runtime: ${DOCKER_RUNTIME:-runc}
  restart: always
  deploy:
    resources:
      reservations:
        devices:
        - driver: cdi
          device_ids:
          - nvidia.com/gpu=0
          - nvidia.com/gpu=1
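
If you want to keep the environment-variable parameterization from your original file, the same shape can be driven by your own variables (a sketch; GPU_DRIVER and GPU_DEVICE_ID are your variables, and you would set GPU_DRIVER=cdi in your .env):

  deploy:
    resources:
      reservations:
        devices:
        - driver: ${GPU_DRIVER:-cdi}        # default to the CDI driver
          device_ids:
          - ${GPU_DEVICE_ID:-nvidia.com/gpu=all}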

You are my hero! Thank you! :heart_eyes: :muscle:

Hi! I am also trying to run a Docker Compose file rootless with my GPU, but something in the CDI device injection keeps failing. It says:

docker: Error response from daemon: CDI device injection failed: unresolvable CDI devices nvidia.com/gpu=all.

Even on a simple command like docker run --rm -it --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi.
I tried searching for this issue but couldn't find much.
This is the docker-compose.yaml I am trying to run:

services:
  panoptic_slam:
    image: "panoptic_slam:latest"
    container_name: panoptic_slam_sys
    environment:
      DISPLAY: $DISPLAY
      PATH: $PATH
      NVIDIA_DRIVER_CAPABILITIES: all
      NVIDIA_VISIBLE_DEVICES: void
    volumes:
      - /tmp/.X11-unix:/tmp/.X11-unix
      - ~/.Xauthority:/root/.Xauthority
      - /dev/bus/usb:/dev/bus/usb
      - ../Dataset:/home/panoptic_slam/Dataset
      - ../Output:/home/panoptic_slam/Output
    device_cgroup_rules:
      - 'c 189:* rmw'
    network_mode: "host"
    privileged: true
    tty: true
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              capabilities: [gpu]
              device_ids:
                - nvidia.com/gpu=all

I added the capabilities: [gpu] under driver: cdi, since otherwise the command would fail validation with the error:

validating /path/docker-compose.yaml: services.panoptic_slam.deploy.resources.reservations.devices.0 capabilities is required

And here is my config with the stuff related to docker:

  hardware.nvidia-container-toolkit.enable = true;
  virtualisation.docker.enable = true;
  virtualisation.docker.rootless = {
    enable = true;
    setSocketVariable = true;
  };
  users.users.locochoco.extraGroups = [ "docker" ];

And the versions of everything:

  • Docker version 27.3.1, build v27.3.1
  • NixOS 25.05.20241126.af51545 (Warbler) x86_64
  • NVIDIA-SMI 565.57.01
  • CUDA Version: 12.7
  • NVIDIA GeForce GTX 1660

Any ideas why it can't resolve the CDI device for the GPU?

Hello @locochoco!

I just noticed your message; sorry I missed it.

This was an upstream bug fixed by Dockerd rootless: make {/etc,/var/run}/cdi available by ereslibre · Pull Request #48541 · moby/moby · GitHub and backported to multiple Docker versions (25.0.x, 26.1.y, 27.z).
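
If you want to check whether your daemon is on a release that includes the backport, the standard Docker CLI can print the server version directly:

# Prints just the dockerd (server) version, e.g. 27.3.1
$ docker version --format '{{.Server.Version}}'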