Has anyone had any success building a Docker container with dockerTools and cudatoolkit, and deploying it to Kubernetes?
We have successfully built a container that can run nvcc and print out the right details, but the GPU is not detected by pytorch or tensorflow. We have the NVIDIA/k8s-device-plugin working on the cluster, and we are able to use Debian images that can detect the GPU.
We don’t want to consider NixOS or NixOps at the present time; we are happy with our containers and Kubernetes cluster (and nixpkgs as well!).
Feels like we’re missing something small to get this working.
You don’t necessarily need cudatoolkit in the container if you only want to run tensorflow or pytorch.
I presume you have built tensorflow and pytorch for the containers with cudaSupport enabled?
The resulting shared objects inside the pytorch or tensorflow installations should have /run/opengl-driver/lib in their RUNPATH when you run readelf -d on them.
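If the image itself is built with Nix, something along these lines is roughly what I mean (an untested sketch; cudaSupport and allowUnfree are ordinary nixpkgs config flags, and the python attribute may be pytorch or torch depending on your nixpkgs revision):

```nix
# Untested sketch, assuming the whole image is built with Nix.
# `cudaSupport` and `allowUnfree` are standard nixpkgs config flags;
# the attribute may be `pytorch` or `torch` depending on your nixpkgs.
let
  pkgs = import <nixpkgs> {
    config = {
      allowUnfree = true; # the CUDA licenses are unfree
      cudaSupport = true; # build packages against CUDA where supported
    };
  };
  pythonEnv = pkgs.python3.withPackages (ps: [ ps.pytorch ]);
in
pkgs.dockerTools.buildLayeredImage {
  name = "pytorch-cuda";
  contents = [ pythonEnv ];
}
```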
Most likely you need to create symlinks in the container so that /run/opengl-driver/lib/libcuda.so points to the libcuda that gets mapped into the container by the k8s-device-plugin.
Note that libcuda.so is not part of the CUDA toolkit, but part of the NVIDIA driver (nvidia_x11 in nixpkgs). You also do not need the CUDA toolkit on the host machine, only the driver. But, yes, you definitely want to use libcuda.so from the host system’s driver. Otherwise it is very likely that you will get a mismatch between libcuda.so and the host kernel driver.
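One way to get that symlink is to bake it into the image at build time. This is only a sketch, and /usr/lib64 is an assumption about where the device plugin mounts the host driver libraries, so check what actually shows up inside a running pod:

```nix
# Sketch: create the symlink when the image is built. /usr/lib64 is an
# assumption about where the device plugin mounts the host's driver
# libraries; verify this inside a running pod before relying on it.
let
  pkgs = import <nixpkgs> { };
in
pkgs.dockerTools.buildImage {
  name = "cuda-base";
  # extraCommands runs in the unpacked image root during the build,
  # so paths here are relative to the image's /.
  extraCommands = ''
    mkdir -p run/opengl-driver
    ln -s /usr/lib64 run/opengl-driver/lib
  '';
}
```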
I encountered a couple of rough edges that we could address:
tensorflow-bin incorrectly depends on nvidia_x11 when it should just assume that the host has the driver + libcuda.so installed. I manually removed this dependency in my local nixpkgs checkout.
The generated Docker image is several GB in size. This is mostly from cudatoolkit and cudnn being in the closure, which pulls in the CUDA profilers, compilers, tools, and static libraries when the dynamic libraries would suffice. We could solve that in nixpkgs by moving the static libraries into a separate output and separating the cudatoolkit tooling from the runtime libraries (libcudart.so, libcublas.so, libcublasLt.so, libcufft.so, libcurand.so, libcusolver.so, libcusparse.so).
Ideally I’d like a declarative way of linking the libraries that nvidia-docker maps into the container from the host into the spot where nixpkgs expects them (/run/opengl-driver/lib); a rough sketch of what I have in mind follows below.
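One possible shape for that (a hypothetical helper, not something nixpkgs provides today): a derivation that contains nothing but the indirection, listed in contents like any other package. The /usr/lib64 target is again an assumption about the device plugin’s mount point:

```nix
# Hypothetical helper: a derivation containing only the
# /run/opengl-driver/lib indirection, so the link can be declared in
# `contents` instead of scripted afterwards. /usr/lib64 is an assumption
# about where the device plugin mounts libcuda.so on the host.
let
  pkgs = import <nixpkgs> { };
  hostDriverDir = "/usr/lib64";
  openglDriverLink = pkgs.runCommand "opengl-driver-link" { } ''
    mkdir -p $out/run/opengl-driver
    ln -s ${hostDriverDir} $out/run/opengl-driver/lib
  '';
in
pkgs.dockerTools.buildLayeredImage {
  name = "ml-base";
  contents = [ openglDriverLink ];
}
```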
Digging through this, you’ve really made me realize how non-hermetic CUDA is.
I’m not actually using tensorflow or pytorch from nixpkgs. Our ML dependencies are part of a larger dependency closure that is managed inside Bazel, and we aren’t using Nix for the overall build, only for providing certain packages (base Docker images in this instance).
It seems that even in a build fully managed by Nix, there would be challenges around the way the NVIDIA/k8s-device-plugin works. It really is a curious way of mounting device drivers and libraries into the container. I guess it makes sense.
I can say that I’ve got this working for pytorch so far. Given that the system is non-hermetic anyway, I’ve taken to modifying LD_LIBRARY_PATH. This is probably the antithesis of Nix! But I think dynamic linking is how NVIDIA intends for this to work. I don’t think there is an easy way around it, and running scripts to symlink files feels worse than using LD_LIBRARY_PATH.
What I’ve discovered so far:
Yes, as mentioned above, pytorch is bundled with CUDA (the wheel is large as a result), so cudatoolkit and cudnn are not required in the Docker image. I understand tensorflow is different, but I have yet to confirm that and get it working in my environment.
The environment variable NVIDIA_DRIVER_CAPABILITIES=compute,utility is important: it determines which .so files the device plugin mounts into the container.
On my system, the device plugin mounts the .so files into /usr/lib64, so setting LD_LIBRARY_PATH=/usr/lib64 works for me so far.
The device plugin also mounts nvidia-smi (the tool for monitoring the GPU) at /usr/bin/nvidia-smi. That binary looks for the dynamic linker at /lib64/ld-linux-x86-64.so.2, so I created a symlink with ln -s ${glibc.out}/lib64/ld-linux-x86-64.so.2 /lib64/ld-linux-x86-64.so.2 and the executable worked.
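Pulling those three points together, a dockerTools sketch of the image could look roughly like this (untested as written; the /usr/lib64 and glibc paths are simply what worked on my cluster, so treat them as assumptions to verify on yours):

```nix
# Rough sketch combining the observations above into one image definition.
# /usr/lib64 and the glibc lib64 path are assumptions taken from my
# cluster; check them on your own setup.
let
  pkgs = import <nixpkgs> { };
in
pkgs.dockerTools.buildImage {
  name = "ml-runtime";
  extraCommands = ''
    # nvidia-smi, as mounted by the device plugin, expects the dynamic
    # linker at /lib64/ld-linux-x86-64.so.2
    mkdir -p lib64
    ln -s ${pkgs.glibc.out}/lib64/ld-linux-x86-64.so.2 lib64/ld-linux-x86-64.so.2
  '';
  config.Env = [
    # controls which .so files the device plugin mounts into the container
    "NVIDIA_DRIVER_CAPABILITIES=compute,utility"
    # where (on my cluster) those .so files end up
    "LD_LIBRARY_PATH=/usr/lib64"
  ];
}
```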
I’ll report back once I’ve tried my tensorflow using cudatoolkit from nixpkgs.
Yes, it’s not great, but this is what NVIDIA does in their images, and it is what the plugin will do to any container that happens to have a /usr/lib64 on the host. It is truly crazy!
It would be amazing if Nix solved this problem! I think there would need to be a way to control where nvidia-docker or the k8s-device-plugin mounts the .so files. Then nixpkgs-provided tensorflow should be OK, because it can search /run/opengl-driver/lib. Since I’m not patchelf’ing my wheels in the container (they are not managed by Nix), my only option is to place the few dynamic libraries I have on LD_LIBRARY_PATH. Most of my dependencies are packaged statically. It is only openssl, the cert bundles, and these CUDA driver libraries that are loaded dynamically. Of course, I can never be sure unless I use Nix to patch all dependencies. I think that will take a lot of convincing! And a lot of work! Maybe some day!