I have the following NixOS configuration:

```nix
{ config, pkgs, ... }:
{
  nixpkgs.config.allowUnfree = true;

  imports = [
    ./nvidia.nix
  ];

  environment.systemPackages = with pkgs; [
    docker-compose
    # nvidia-docker
  ];

  virtualisation.docker = {
    enable = true;
    enableNvidia = true;
    extraOptions = "--default-runtime=nvidia";
  };
}
```
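To confirm that the `extraOptions = "--default-runtime=nvidia"` setting actually reached the Docker daemon, it may help to inspect the daemon's view of its runtimes. This is a diagnostic sketch, not part of the setup above; it assumes the Docker CLI is on the path and the daemon is running:

```shell
# Show which runtimes the daemon knows about and which one is the default.
# If the option took effect, the output should include a "Runtimes:" line
# listing "nvidia" and a "Default Runtime: nvidia" line.
docker info | grep -iE 'runtime'
```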
and `./nvidia.nix`:

```nix
{ config, pkgs, ... }:
{
  # Nvidia specific
  nixpkgs.config.allowUnfree = true;

  environment.systemPackages = with pkgs; [
    # cudaPackages_12.cudatoolkit
  ];

  # Some programs need SUID wrappers, can be configured further or are
  # started in user sessions.
  # programs.mtr.enable = true;
  # programs.gnupg.agent = {
  #   enable = true;
  #   enableSSHSupport = true;
  # };

  # List services that you want to enable:

  # Enable the OpenSSH daemon.
  # services.openssh.enable = true;

  # Open ports in the firewall.
  # networking.firewall.allowedTCPPorts = [ ... ];
  # networking.firewall.allowedUDPPorts = [ ... ];
  # Or disable the firewall altogether.
  # networking.firewall.enable = false;

  # This value determines the NixOS release from which the default
  # settings for stateful data, like file locations and database versions
  # on your system were taken. It's perfectly fine and recommended to leave
  # this value at the release version of the first install of this system.
  # Before changing this value read the documentation for this option
  # (e.g. man configuration.nix or on https://nixos.org/nixos/options.html).
  # system.stateVersion = "unstable"; # Did you read the comment?

  # REGION NVIDIA / CUDA

  # Enable OpenGL
  hardware.opengl = {
    enable = true;
    driSupport = true;
    driSupport32Bit = true;
  };

  # Load nvidia driver for Xorg and Wayland
  services.xserver.videoDrivers = [ "nvidia" ];

  # See https://nixos.wiki/wiki/Nvidia#CUDA_and_using_your_GPU_for_compute
  hardware.nvidia = {
    prime = {
      offload = {
        enable = true;
        enableOffloadCmd = true;
      };
      # Make sure to use the correct Bus ID values for your system!
      amdgpuBusId = "PCI:6:0:0";
      nvidiaBusId = "PCI:1:0:0";
    };

    # Modesetting is required.
    modesetting.enable = true;

    # Nvidia power management. Experimental, and can cause sleep/suspend to fail.
    powerManagement.enable = true;

    # Fine-grained power management. Turns off GPU when not in use.
    # Experimental and only works on modern Nvidia GPUs (Turing or newer).
    powerManagement.finegrained = false;

    # Use the NVidia open source kernel module (not to be confused with the
    # independent third-party "nouveau" open source driver).
    # Support is limited to the Turing and later architectures. Full list of
    # supported GPUs is at:
    # https://github.com/NVIDIA/open-gpu-kernel-modules#compatible-gpus
    # Only available from driver 515.43.04+
    # Currently alpha-quality/buggy, so false is currently the recommended setting.
    open = false;

    # Enable the Nvidia settings menu, accessible via `nvidia-settings`.
    nvidiaSettings = true;

    # Optionally, you may need to select the appropriate driver version
    # for your specific GPU.
    package = config.boot.kernelPackages.nvidiaPackages.stable;
  };
  # ENDREGION
}
```
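Since the PRIME block warns about using the correct Bus IDs, one way to double-check them on the host is to list the PCI graphics devices. A hedged aside, assuming `lspci` (from pciutils) is available:

```shell
# List VGA/3D controllers with their PCI addresses. An address like
# "01:00.0" corresponds to "PCI:1:0:0" in the NixOS option; note that
# lspci prints hexadecimal values while the Nix option expects decimal,
# so e.g. "0c:00.0" would become "PCI:12:0:0".
lspci | grep -E 'VGA|3D'
```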
and `docker-compose.yaml`:

```yaml
services:
  test:
    image: nvidia/cuda:12.3.0-runtime-ubuntu22.04
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
```
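As a cross-check against the compose `deploy.resources.reservations.devices` request, the same image can be started directly with the standard `--gpus` flag; the behaviour should match the compose service (a sketch, not part of the original setup):

```shell
# Run the same image with explicit GPU access and no compose indirection,
# removing the container afterwards.
docker run --rm --gpus all nvidia/cuda:12.3.0-runtime-ubuntu22.04 nvidia-smi
```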
Within the container I get the following output:

```
test-1 |
test-1 | ==========
test-1 | == CUDA ==
test-1 | ==========
test-1 |
test-1 | CUDA Version 12.3.0
test-1 |
test-1 | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
test-1 |
test-1 | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
test-1 | By pulling and using the container, you accept the terms and conditions of this license:
test-1 | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
test-1 |
test-1 | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
test-1 |
test-1 | WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
test-1 | Use the NVIDIA Container Toolkit to start this container with GPU support; see
test-1 | https://docs.nvidia.com/datacenter/cloud-native/ .
test-1 |
test-1 | Mon Dec 11 17:48:10 2023
test-1 | +---------------------------------------------------------------------------------------+
test-1 | | NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
test-1 | |-----------------------------------------+----------------------+----------------------+
test-1 | | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
test-1 | | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
test-1 | |                                         |                      |               MIG M. |
test-1 | |=========================================+======================+======================|
test-1 | |   0  NVIDIA GeForce RTX 3070 ...    Off | 00000000:01:00.0  On |                  N/A |
test-1 | | N/A   50C    P8             18W / 115W  |     42MiB / 8192MiB  |      0%      Default |
test-1 | |                                         |                      |                  N/A |
test-1 | +-----------------------------------------+----------------------+----------------------+
test-1 |
test-1 | +---------------------------------------------------------------------------------------+
test-1 | | Processes:                                                                            |
test-1 | |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
test-1 | |        ID   ID                                                             Usage      |
test-1 | |=======================================================================================|
test-1 | +---------------------------------------------------------------------------------------+
test-1 exited with code 0
```
How is it that nvidia-smi works within the container, yet the warning above says the NVIDIA driver was not detected? Note that the nvidia-smi output matches what I get outside the container.
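One way to narrow this down might be to bypass the image's entrypoint (which is what prints the warning) and inspect what the runtime actually mounted into the container. A diagnostic sketch, assuming the `test` service name from the compose file above:

```shell
# Start the test service with a plain shell instead of the NVIDIA entrypoint
# and look at what a driver check could plausibly find inside the container.
docker compose run --rm --entrypoint /bin/bash test -c '
  command -v nvidia-smi                 # is the binary mounted into PATH?
  ls /dev/nvidia* 2>&1                  # device nodes injected by the runtime
  ls /proc/driver/nvidia/version 2>&1   # kernel driver interface
'
```

If nvidia-smi and the device nodes are present but whatever path the entrypoint's check inspects is not, that would explain the warning appearing alongside a working nvidia-smi.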