Hello! Is there any way to access GPUs inside a nix generated docker container?
Thank you.
What do you mean? I've never done that, but it should be no different than for any other docker image…
Sorry, let me give some more details. I have the following Nix file:
# cuda-docker.nix
{
  pkgs ? import <nixpkgs> {},
  pkgsLinux ? import <nixpkgs> {system = "x86_64-linux";},
}:
pkgs.dockerTools.buildImage {
  name = "cuda-env-docker";
  tag = "latest";
  copyToRoot = pkgs.buildEnv {
    name = "image-root";
    pathsToLink = ["/bin"];
    paths = with pkgs; [
      git
      gitRepo
      gnupg
      autoconf
      curl
      procps
      gnumake
      utillinux
      m4
      gperf
      unzip
      cudatoolkit
      linuxPackages.nvidia_x11
      libGLU
      libGL
      xorg.libXi
      xorg.libXmu
      freeglut
      xorg.libXext
      xorg.libX11
      xorg.libXv
      xorg.libXrandr
      zlib
      ncurses5
      stdenv.cc
      binutils
    ];
  };
  config = {
    Env = with pkgs; [
      "PATH=/bin/"
      "CUDA_PATH=${cudatoolkit}"
      "LD_LIBRARY_PATH=${cudatoolkit.lib}/lib"
      "EXTRA_LDFLAGS=-L/lib -L${pkgs.linuxPackages.nvidia_x11}/lib"
      "EXTRA_CCFLAGS=-I/usr/include"
    ];
    Cmd = ["/bin/nvidia-smi"];
  };
}
based on the CUDA tutorial from nixos.org. When I run
docker load < $(NIXPKGS_ALLOW_UNFREE=1 nix-build cuda-docker.nix)
docker run -it --rm --gpus all cuda-env-docker:latest
I get
Failed to initialize NVML: Driver/library version mismatch
I am actually able to reproduce the issue outside docker, but only if I use ${pkgs.linuxPackages.nvidia_x11.bin}/bin/nvidia-smi instead of /usr/bin/nvidia-smi (I'm on Ubuntu, not NixOS).
On NixOS you have to set virtualisation.docker.enableNvidia to true.
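Roughly, that looks like this in a NixOS configuration (just a sketch; the usual host-side Nvidia driver setup is omitted):

# configuration.nix sketch, NixOS only
{ config, pkgs, ... }: {
  virtualisation.docker.enable = true;
  virtualisation.docker.enableNvidia = true;
}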
I am not using NixOS. Could you show me how to specify that in the cuda-docker.nix file that I posted earlier?
Also, I found this blog post. It has a lot of useful information, although the suggested solution did not solve my issue.
The problem is that you're trying to use the userspace component from Nixpkgs, at whatever version is latest there. You'd need to match the major version of nvidia_x11 to the version your host OS taints your kernel with.
Pretty sure Nvidia provides a shitty solution for their terrible unstable driver interface for containers specifically, and you'd probably be better off running with that.
If you get rid of Nixpkgs’ Nvidia userspace components, follow the blog post w.r.t. environment variables and make sure your docker host can do the Nvidia crap, that would probably work.
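By "the Nvidia crap" I mean Nvidia's container toolkit on the host. On Ubuntu that's roughly something like this (a sketch, assuming Nvidia's apt repository is already configured; the CUDA image tag is just an example):

sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker   # wires the nvidia runtime into dockerd
sudo systemctl restart docker
# sanity check that a plain container can see the GPU at all:
docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi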
Thanks for the response! I have a minimal working version here:
{
  pkgs ? import <nixpkgs> {},
  pkgsLinux ? import <nixpkgs> {system = "x86_64-linux";},
}:
pkgs.dockerTools.buildImage {
  name = "cuda-env-docker";
  tag = "latest";
  copyToRoot = pkgs.buildEnv {
    name = "image-root";
    pathsToLink = ["/bin"];
    paths = with pkgs; [
      cudatoolkit
      linuxKernel.packages.linux_5_10.nvidia_x11
    ];
  };
  config = {
    Env = ["LD_LIBRARY_PATH=/usr/lib64/"];
    Cmd = ["/bin/nvidia-smi"];
  };
}
Using the same run command:
❯ docker load < $(NIXPKGS_ALLOW_UNFREE=1 nix-build cuda-docker.nix)
docker run -it --rm --gpus all cuda-env-docker:latest
The image cuda-env-docker:latest already exists, renaming the old one with ID sha256:3e98a02cc260f76c17caa7381648746231f6cc48cdaf149c5414e1324a29e3ad to empty string
Loaded image: cuda-env-docker:latest
Sun Aug 21 16:08:23 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 510.85.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
Curiously, this also works with linuxPackages.nvidia_x11 instead of linuxKernel.packages.linux_5_10.nvidia_x11. I think the key element that I was missing before was LD_LIBRARY_PATH=/usr/lib64/.
You can/should remove any of Nixpkgs' nvidia_x11 from your container env. Docker provides the Nvidia userspace, CUDA etc.; that's what you're hacking into processes using LD_LIBRARY_PATH. Introducing another one from Nixpkgs with a version that doesn't match your kernel's is asking for trouble. Run nvidia-smi outside the container if you need to monitor.
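Concretely, for your minimal image that would mean something like this (a sketch; it assumes the runtime keeps mounting the driver libraries under /usr/lib64 the way it did in your test):

{ pkgs ? import <nixpkgs> {} }:
pkgs.dockerTools.buildImage {
  name = "cuda-env-docker";
  tag = "no-driver";
  copyToRoot = pkgs.buildEnv {
    name = "image-root";
    pathsToLink = ["/bin"];
    paths = with pkgs; [ cudatoolkit ];   # no nvidia_x11 baked into the image
  };
  config = {
    # point the loader at the libraries the nvidia runtime mounts at run time
    Env = ["LD_LIBRARY_PATH=/usr/lib64/"];
  };
}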
Right, I actually noticed that when I did this, jax was no longer able to access the GPU, and I ended up removing it in the application that I'm working on. Any suggestions on how to test GPU access without importing a big library like jax into the container?
In my actual application, I found that it was necessary to set LD_LIBRARY_PATH = "${nvidia_x11}/lib"; in my mkShell (in order for jax to access GPUs). Is this a bad idea, and if so, is there a better way to accomplish this?
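For reference, the shell in question looks roughly like this (a sketch, not my exact flake; nvidia_x11 is just pkgs.linuxPackages.nvidia_x11 bound elsewhere, and the Python inputs are simplified):

{ pkgs ? import <nixpkgs> {} }:
pkgs.mkShell {
  buildInputs = with pkgs; [ python3 cudatoolkit ];
  # points jax at Nixpkgs' driver userspace rather than whatever the host provides
  LD_LIBRARY_PATH = "${pkgs.linuxPackages.nvidia_x11}/lib";
}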
This is far outside what I'm experienced with, but what I can tell you is that you're going to run into trouble if you mismatch the kernel module and userspace parts of the Nvidia driver, which is what the snippet you wrote will do.
Does jax not work when LD_LIBRARY_PATH is set to the docker-provided libs?
Right, that's no problem. Could you give me a quick definition of the terms "kernel module" and "userspace" in this context? Does that relate to the CUDA version and driver version?
Does jax not work when LD_LIBRARY_PATH is set to the docker-provided libs?
What are the docker-provided libs? Are you talking about using an nvidia base image like nvidia/cuda:11.7.1-devel-ubuntu20.04?
Also, for clarity, this is what I ended up with in my application:
packages.default = dockerTools.buildImage {
  name = "ppo";
  tag = "latest";
  copyToRoot = buildEnv {
    name = "image-root";
    pathsToLink = ["/bin" "ppo"];
    paths = buildInputs ++ [./ppo];
  };
  config = {
    Env = with pkgs; [
      "PYTHONFAULTHANDLER=1"
      "PYTHONBREAKPOINT=ipdb.set_trace"
      "LD_LIBRARY_PATH=/usr/lib64/"
      "PATH=/bin:$PATH"
    ];
    Cmd = ["${bash}/bin/bash"];
  };
};
AFAIK, the Nvidia driver has two components: A kernel driver that does KMS and communicates with the hardware and a userspace component that uses the kernel module as a proxy and implements APIs like VK or OGL and, of course, facilitates CUDA. These come from the same package but are installed in different places. These two components must have matching versions.
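If it helps, you can see both sides on the host with something like this (a sketch using the usual Linux paths):

cat /proc/driver/nvidia/version   # version of the loaded kernel module
nvidia-smi | head -n 4            # the "NVIDIA-SMI x.y" banner comes from the userspace/NVML side,
                                  # while "Driver Version" is what the kernel module reports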
I don't mean the base image. In the example you linked, it actually uses no base image at all. I mean the Docker nvidia-runtime. AFAIUI, it provides the userspace component from the system to containers (by mounting the libraries inside) so that they don't have to provide the Nvidia driver themselves. That avoids the driver version mismatch problem and makes the containers less bloated.
Nix packages don't use these libraries by default though, so you have to add them to the library path.
Thanks for the explanation! I do not know a lot about the low-level components that power these things.
Does jax not work when LD_LIBRARY_PATH is set to the docker-provided libs?
How would I find that path?
As I said, this is not my field of expertise, but the example from the blog post you linked has all the environment variables you should need in it, and they sound sensible.
I'd expect the runtime to mount its path under /lib64. I'd recommend you simply go exploring inside a container with nvidia-runtime enabled.
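Something along these lines, for instance (a sketch; the base image and paths are just examples):

docker run --rm -it --gpus all ubuntu:22.04 bash
# then, inside the container, look for where the runtime mounted the driver libraries:
find / -name 'libnvidia-ml.so*' 2>/dev/null
nvidia-smi   # also injected by the runtime, handy as a quick smoke test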
Ok. Well, thank you for your help and for the explanations.