Has anyone had any success building a Docker container with dockerTools and cudatoolkit, and deploying it to Kubernetes?
We have successfully built a container that can run nvcc and print out the right details, but the GPU is not detected by pytorch or tensorflow. We have the NVIDIA/k8s-device-plugin working on the cluster, and we are able to use Debian images that can detect the GPU.
We don’t want to consider NixOS or NixOps at the present time; we are happy with our containers and Kubernetes cluster (and nixpkgs as well!).
Feels like we’re missing something small to get this working.
You don’t necessarily need cudatoolkit in the container if you only want to run tensorflow or pytorch.
I presume you have built tensorflow and pytorch for the containers with cudaSupport enabled?
The resulting shared objects inside the pytorch or tensorflow installations should have /run/opengl-driver/lib in their RUNPATH when you run readelf -d on them.
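If the image itself is built with Nix, something along these lines is roughly what I mean (an untested sketch; cudaSupport and allowUnfree are ordinary nixpkgs config flags, and the python attribute may be pytorch or torch depending on your nixpkgs revision):

```nix
# Untested sketch, assuming the whole image is built with Nix.
# `cudaSupport` and `allowUnfree` are standard nixpkgs config flags;
# the attribute may be `pytorch` or `torch` depending on your nixpkgs.
let
  pkgs = import <nixpkgs> {
    config = {
      allowUnfree = true; # the CUDA licenses are unfree
      cudaSupport = true; # build packages against CUDA where supported
    };
  };
  pythonEnv = pkgs.python3.withPackages (ps: [ ps.pytorch ]);
in
pkgs.dockerTools.buildLayeredImage {
  name = "pytorch-cuda";
  contents = [ pythonEnv ];
}
```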
Most likely you need to create symlinks in the container so that /run/opengl-driver/lib/libcuda.so points to the libcuda that gets mapped into the container by the k8s-device-plugin.
Note that libcuda.so is not part of the CUDA toolkit, but part of the NVIDIA driver (nvidia_x11 in nixpkgs). You also do not need the CUDA toolkit on the host machine, only the driver. But, yes, you definitely want to use libcuda.so from the host system’s driver. Otherwise it is very likely that you will get a mismatch between libcuda.so and the host kernel driver.
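One way to get that symlink is to bake it into the image at build time. This is only a sketch, and /usr/lib64 is an assumption about where the device plugin mounts the host driver libraries, so check what actually shows up inside a running pod:

```nix
# Sketch: create the symlink when the image is built. /usr/lib64 is an
# assumption about where the device plugin mounts the host's driver
# libraries; verify this inside a running pod before relying on it.
let
  pkgs = import <nixpkgs> { };
in
pkgs.dockerTools.buildImage {
  name = "cuda-base";
  # extraCommands runs in the unpacked image root during the build,
  # so paths here are relative to the image's /.
  extraCommands = ''
    mkdir -p run/opengl-driver
    ln -s /usr/lib64 run/opengl-driver/lib
  '';
}
```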
I encountered a couple of rough edges that we could address:
tensorflow-bin incorrectly depends on nvidia_x11 when it should just assume that the host has the driver + libcuda.so installed. I manually removed this dependency in my local nixpkgs checkout.
The generated Docker image is several GB in size. This is mostly from cudatoolkit and cudnn being in the closure, which pulls in the CUDA profilers, compilers, tools, and static libraries when the dynamic libraries would suffice. We could solve that in nixpkgs by moving the static libraries into a separate output and separating the cudatoolkit tooling from the runtime libraries (libcudart.so, libcublas.so, libcublasLt.so, libcufft.so, libcurand.so, libcusolver.so, libcusparse.so).
Ideally I’d like a declarative way of linking the libraries that nvidia-docker maps into the container from the host into the spot where nixpkgs expects them (/run/opengl-driver/lib); a rough sketch of what I have in mind follows below.
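One possible shape for that (a hypothetical helper, not something nixpkgs provides today): a derivation that contains nothing but the indirection, listed in contents like any other package. The /usr/lib64 target is again an assumption about the device plugin’s mount point:

```nix
# Hypothetical helper: a derivation containing only the
# /run/opengl-driver/lib indirection, so the link can be declared in
# `contents` instead of scripted afterwards. /usr/lib64 is an assumption
# about where the device plugin mounts libcuda.so on the host.
let
  pkgs = import <nixpkgs> { };
  hostDriverDir = "/usr/lib64";
  openglDriverLink = pkgs.runCommand "opengl-driver-link" { } ''
    mkdir -p $out/run/opengl-driver
    ln -s ${hostDriverDir} $out/run/opengl-driver/lib
  '';
in
pkgs.dockerTools.buildLayeredImage {
  name = "ml-base";
  contents = [ openglDriverLink ];
}
```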
Digging through this, you’ve really made me realize how non-hermetic CUDA is.
I’m not actually using tensorflow or pytorch from nixpkgs. Our ML dependencies are part of a larger dependency closure that is managed inside Bazel, and we aren’t using Nix for the overall build, only for providing certain packages (base Docker images in this instance).
It seems that even in a build fully managed by Nix, there would be challenges around the way the NVIDIA/k8s-device-plugin works. It really is a curious way of mounting device drivers and libraries into the container. I guess it makes sense.
I can say that I’ve got this working for pytorch so far. Given that the system is non-hermetic anyway, I’ve taken to modifying LD_LIBRARY_PATH. This is probably the antithesis of Nix! But I think dynamic linking is how NVIDIA intends for this to work. I don’t think there is an easy way around it, and running scripts to symlink files feels worse than using LD_LIBRARY_PATH.
What I’ve discovered so far:
Yes, as mentioned above, pytorch is bundled with CUDA (the wheel is large as a result), so cudatoolkit and cudnn are not required in the Docker image. I understand tensorflow is different, but I have yet to confirm that and get it working in my environment.
The environment variable NVIDIA_DRIVER_CAPABILITIES=compute,utility is important: it determines which .so files the device plugin mounts into the container.
On my system, the device plugin mounts the .so files into /usr/lib64, so setting LD_LIBRARY_PATH=/usr/lib64 works for me so far.
The device plugin also mounts nvidia-smi (the tool for monitoring the GPU) at /usr/bin/nvidia-smi. That binary looks for the dynamic linker at /lib64/ld-linux-x86-64.so.2, so I created a symlink with ln -s ${glibc.out}/lib64/ld-linux-x86-64.so.2 /lib64/ld-linux-x86-64.so.2 and the executable worked.
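Pulling those three points together, a dockerTools sketch of the image could look roughly like this (untested as written; the /usr/lib64 and glibc paths are simply what worked on my cluster, so treat them as assumptions to verify on yours):

```nix
# Rough sketch combining the observations above into one image definition.
# /usr/lib64 and the glibc lib64 path are assumptions taken from my
# cluster; check them on your own setup.
let
  pkgs = import <nixpkgs> { };
in
pkgs.dockerTools.buildImage {
  name = "ml-runtime";
  extraCommands = ''
    # nvidia-smi, as mounted by the device plugin, expects the dynamic
    # linker at /lib64/ld-linux-x86-64.so.2
    mkdir -p lib64
    ln -s ${pkgs.glibc.out}/lib64/ld-linux-x86-64.so.2 lib64/ld-linux-x86-64.so.2
  '';
  config.Env = [
    # controls which .so files the device plugin mounts into the container
    "NVIDIA_DRIVER_CAPABILITIES=compute,utility"
    # where (on my cluster) those .so files end up
    "LD_LIBRARY_PATH=/usr/lib64"
  ];
}
```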
I’ll report back once I’ve tried my tensorflow using cudatoolkit from nixpkgs.
Yes, it’s not great, but this is what NVIDIA does in their images, and it is what the plugin will do to any container that happens to have a /usr/lib64 on the host. It is truly crazy!
It would be amazing if Nix solved this problem! I think there would need to be a way to control where nvidia-docker or the k8s-device-plugin mounts the .so files. Then nixpkgs-provided tensorflow should be OK, because it can search /run/opengl-driver/lib. Since I’m not patchelf’ing my wheels in the container (they are not managed by Nix), my only option is to place the few dynamic libraries I have on LD_LIBRARY_PATH. Most of my dependencies are packaged statically. It is only openssl, the cert bundles, and these CUDA driver libraries that are loaded dynamically. Of course, I can never be sure unless I use Nix to patch all dependencies. I think that will take a lot of convincing! And a lot of work! Maybe some day!