Hello! Is there any way to access GPUs inside a nix generated docker container?
Thank you.
What do you mean? I've never done that, but it should be no different than for any other docker image…
Sorry, let me give some more details. I have the following Nix file:
# cuda-docker.nix
{
  pkgs ? import <nixpkgs> {},
  pkgsLinux ? import <nixpkgs> {system = "x86_64-linux";},
}:
pkgs.dockerTools.buildImage {
  name = "cuda-env-docker";
  tag = "latest";
  copyToRoot = pkgs.buildEnv {
    name = "image-root";
    pathsToLink = ["/bin"];
    paths = with pkgs; [
      git
      gitRepo
      gnupg
      autoconf
      curl
      procps
      gnumake
      utillinux
      m4
      gperf
      unzip
      cudatoolkit
      linuxPackages.nvidia_x11
      libGLU
      libGL
      xorg.libXi
      xorg.libXmu
      freeglut
      xorg.libXext
      xorg.libX11
      xorg.libXv
      xorg.libXrandr
      zlib
      ncurses5
      stdenv.cc
      binutils
    ];
  };
  config = {
    Env = with pkgs; [
      "PATH=/bin/"
      "CUDA_PATH=${cudatoolkit}"
      "LD_LIBRARY_PATH=${cudatoolkit.lib}/lib"
      "EXTRA_LDFLAGS=-L/lib -L${pkgs.linuxPackages.nvidia_x11}/lib"
      "EXTRA_CCFLAGS=-I/usr/include"
    ];
    Cmd = ["/bin/nvidia-smi"];
  };
}
based on the CUDA tutorial from nixos.org. When I run
docker load < $(NIXPKGS_ALLOW_UNFREE=1 nix-build cuda-docker.nix)
docker run -it --rm --gpus all cuda-env-docker:latest
I get
Failed to initialize NVML: Driver/library version mismatch
I am actually able to reproduce the issue outside docker, but only if I use ${pkgs.linuxPackages.nvidia_x11.bin}/bin/nvidia-smi instead of /usr/bin/nvidia-smi (I'm on Ubuntu, not NixOS).
On NixOS you have to set virtualisation.docker.enableNvidia to true.
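Roughly, that looks like this in a NixOS configuration (just a sketch; the usual host-side Nvidia driver setup is omitted):

# configuration.nix sketch, NixOS only
{ config, pkgs, ... }: {
  virtualisation.docker.enable = true;
  virtualisation.docker.enableNvidia = true;
}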
I am not using NixOS. Could you show me how to specify that in the cuda-docker.nix file that I posted earlier?
Also, I found this blog post. It has a lot of useful information, although the suggested solution did not solve my issue.
The problem is that you're trying to use the userspace component from Nixpkgs, at whatever version is latest there. You'd need to match the major version of nvidia_x11 to the version your host OS taints your kernel with.
Pretty sure Nvidia provides a shitty solution for their terrible unstable driver interface for containers specifically, and you'd probably be better off running with that.
If you get rid of Nixpkgs’ Nvidia userspace components, follow the blog post w.r.t. environment variables and make sure your docker host can do the Nvidia crap, that would probably work.
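By "the Nvidia crap" I mean Nvidia's container toolkit on the host. On Ubuntu that's roughly something like this (a sketch, assuming Nvidia's apt repository is already configured; the CUDA image tag is just an example):

sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker   # wires the nvidia runtime into dockerd
sudo systemctl restart docker
# sanity check that a plain container can see the GPU at all:
docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi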
Thanks for the response! I have a minimal working version here:
{
  pkgs ? import <nixpkgs> {},
  pkgsLinux ? import <nixpkgs> {system = "x86_64-linux";},
}:
pkgs.dockerTools.buildImage {
  name = "cuda-env-docker";
  tag = "latest";
  copyToRoot = pkgs.buildEnv {
    name = "image-root";
    pathsToLink = ["/bin"];
    paths = with pkgs; [
      cudatoolkit
      linuxKernel.packages.linux_5_10.nvidia_x11
    ];
  };
  config = {
    Env = ["LD_LIBRARY_PATH=/usr/lib64/"];
    Cmd = ["/bin/nvidia-smi"];
  };
}
Using the same run command:
❯ docker load < $(NIXPKGS_ALLOW_UNFREE=1 nix-build cuda-docker.nix)
docker run -it --rm --gpus all cuda-env-docker:latest
The image cuda-env-docker:latest already exists, renaming the old one with ID sha256:3e98a02cc260f76c17caa7381648746231f6cc48cdaf149c5414e1324a29e3ad to empty string
Loaded image: cuda-env-docker:latest
Sun Aug 21 16:08:23 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 510.85.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
Curiously, this also works with linuxPackages.nvidia_x11 instead of linuxKernel.packages.linux_5_10.nvidia_x11. I think the key element that I was missing before was LD_LIBRARY_PATH=/usr/lib64/.
You can/should remove any of Nixpkgs' nvidia_x11 from your container env. Docker provides the Nvidia userspace, CUDA etc.; that's what you're hacking into processes using LD_LIBRARY_PATH. Introducing another one from Nixpkgs with a version that doesn't match your kernel's is asking for trouble. Run nvidia-smi outside the container if you need to monitor.
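Concretely, for your minimal image that would mean something like this (a sketch; it assumes the runtime keeps mounting the driver libraries under /usr/lib64 the way it did in your test):

{ pkgs ? import <nixpkgs> {} }:
pkgs.dockerTools.buildImage {
  name = "cuda-env-docker";
  tag = "no-driver";
  copyToRoot = pkgs.buildEnv {
    name = "image-root";
    pathsToLink = ["/bin"];
    paths = with pkgs; [ cudatoolkit ];   # no nvidia_x11 baked into the image
  };
  config = {
    # point the loader at the libraries the nvidia runtime mounts at run time
    Env = ["LD_LIBRARY_PATH=/usr/lib64/"];
  };
}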
Right, I actually noticed that when I did this, jax was no longer able to access the GPU, and I ended up removing it in the application that I'm working on. Any suggestions on how to test GPU access without importing a big library like jax into the container?
In my actual application, I found that it was necessary to set LD_LIBRARY_PATH = "${nvidia_x11}/lib"; in my mkShell (in order for jax to access GPUs). Is this a bad idea, and if so, is there a better way to accomplish this?
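For reference, the shell in question looks roughly like this (a sketch, not my exact flake; nvidia_x11 is just pkgs.linuxPackages.nvidia_x11 bound elsewhere, and the Python inputs are simplified):

{ pkgs ? import <nixpkgs> {} }:
pkgs.mkShell {
  buildInputs = with pkgs; [ python3 cudatoolkit ];
  # points jax at Nixpkgs' driver userspace rather than whatever the host provides
  LD_LIBRARY_PATH = "${pkgs.linuxPackages.nvidia_x11}/lib";
}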
This is far outside what I'm experienced with, but what I can tell you is that you're going to run into trouble if you mismatch the kernel module and userspace parts of the Nvidia driver, which is what the snippet you wrote will do.
Does jax not work when LD_LIBRARY_PATH is set to the docker-provided libs?
Right, that's no problem. Could you give me a quick definition of the terms "kernel module" and "userspace" in this context? Does that relate to the CUDA version and driver version?
Does jax not work when LD_LIBRARY_PATH is set to the docker-provided libs?
What are the docker-provided libs? Are you talking about using an nvidia base image like nvidia/cuda:11.7.1-devel-ubuntu20.04?
Also, for clarity, this is what I ended up with in my application:
packages.default = dockerTools.buildImage {
  name = "ppo";
  tag = "latest";
  copyToRoot = buildEnv {
    name = "image-root";
    pathsToLink = ["/bin" "ppo"];
    paths = buildInputs ++ [./ppo];
  };
  config = {
    Env = with pkgs; [
      "PYTHONFAULTHANDLER=1"
      "PYTHONBREAKPOINT=ipdb.set_trace"
      "LD_LIBRARY_PATH=/usr/lib64/"
      "PATH=/bin:$PATH"
    ];
    Cmd = ["${bash}/bin/bash"];
  };
};
AFAIK, the Nvidia driver has two components: A kernel driver that does KMS and communicates with the hardware and a userspace component that uses the kernel module as a proxy and implements APIs like VK or OGL and, of course, facilitates CUDA. These come from the same package but are installed in different places. These two components must have matching versions.
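If it helps, you can see both sides on the host with something like this (a sketch using the usual Linux paths):

cat /proc/driver/nvidia/version   # version of the loaded kernel module
nvidia-smi | head -n 4            # the "NVIDIA-SMI x.y" banner comes from the userspace/NVML side,
                                  # while "Driver Version" is what the kernel module reports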
I don't mean the base image. In the example you linked, it actually uses no base image at all. I mean the Docker nvidia-runtime. AFAIUI, it provides the userspace component from the system to containers (by mounting the libraries inside) so that they don't have to provide the Nvidia driver themselves. That avoids the driver version mismatch problem and makes the containers less bloated.
Nix packages don't use these libraries by default though, so you have to add them to the library path.
Thanks for the explanation! I do not know a lot about the low-level components that power these things.
Does jax not work when LD_LIBRARY_PATH is set to the docker-provided libs?
How would I find that path?
As I said, this is not my field of expertise, but the example from the blog post you linked has all the environment variables you should need in it, and they sound sensible.
I'd expect the runtime to mount its path under /lib64. I'd recommend you simply go exploring inside a container with nvidia-runtime enabled.
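Something along these lines, for instance (a sketch; the base image and paths are just examples):

docker run --rm -it --gpus all ubuntu:22.04 bash
# then, inside the container, look for where the runtime mounted the driver libraries:
find / -name 'libnvidia-ml.so*' 2>/dev/null
nvidia-smi   # also injected by the runtime, handy as a quick smoke test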
Ok. Well, thank you for your help and for the explanations.