Broken Nvidia GPU Acceleration In Docker Containers

I want to run the Ollama WebUI Docker Compose container with offloading to the GPU, but I can't seem to get it to work. This is my current Docker config:

  virtualisation.docker = {
    enable = true;
    enableOnBoot = true;
    enableNvidia = true;
    extraOptions = "--default-runtime=nvidia";
  };

But when I run docker run --gpus all nvidia/cuda:11.6.2-cudnn8-runtime-ubuntu20.04 nvidia-smi, I get this error:

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: /nix/store/rlkbyypmi5xdy72sf7fb4kkfx5qrk5rl-nvidia-docker/bin/nvidia-container-runtime did not terminate successfully: exit status 125: unknown.
ERRO[0000] error waiting for container:

If I remove extraOptions = "--default-runtime=nvidia"; the container runs, but nvidia-smi then complains that no driver was loaded and acceleration is disabled.

I have looked through many other topics without success, so I am opening my own. Any help would be much appreciated.

Not sure if this would be of any help, but I’m running Tabby in a Docker container with my Nvidia GPU.

Here are the parts of my config that should be relevant:

  virtualisation = {
    docker = {
      enable = true;
      enableNvidia = true;
    };

    oci-containers = {
      backend = "docker";

      containers.tabby = {
        autoStart = true;
        image = "tabbyml/tabby";
        ports = [ "8080:8080" ];
        volumes = [ "/tabby:/data" ];
        extraOptions = [ "--gpus=all" ];

        cmd = [
          "serve"
          "--model=TabbyML/StarCoder-1B"
          "--device=cuda"
        ];
      };
    };
  };

  hardware = {
    nvidia = {
      modesetting.enable = true;
      open = false;
      nvidiaSettings = true;
      package = config.boot.kernelPackages.nvidiaPackages.stable;

      powerManagement = {
        enable = false;
        finegrained = false;
      };

      prime = {
        # Bus ID of the Intel GPU.
        intelBusId = "PCI:0:2:0";
        # Bus ID of the NVIDIA GPU.
        nvidiaBusId = "PCI:1:0:0";

        # Set sync mode (https://nixos.wiki/wiki/Nvidia#Optimus_PRIME_Option_B:_Sync_Mode)
        offload.enable = false;
        sync.enable = true;
      };
    };
  };

The only issue I have with this setup is that whenever I wake from sleep mode, I have to run the following two commands for the container to gain access to the GPU again:

sudo modprobe --remove nvidia-uvm
sudo modprobe nvidia-uvm
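
In principle this could be automated so it runs on every resume. A minimal sketch using NixOS's powerManagement.resumeCommands (an assumption on my part, I have not tested it with this suspend path):

  # Reload nvidia-uvm after resume so containers regain GPU access.
  powerManagement.resumeCommands = ''
    ${pkgs.kmod}/bin/modprobe --remove nvidia-uvm
    ${pkgs.kmod}/bin/modprobe nvidia-uvm
  '';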

Hope this can help you solve your issue.

The container is external, so this would not work: I can't set its config in my configuration.nix.

I ran into this same issue (exit status 125 when trying to attach GPUs to Docker containers). After some testing, I tried running the container with Podman instead, which worked on the first try without issues. The two settings I needed on unstable were:

  virtualisation.podman.enable = true;
  virtualisation.containers.cdi.dynamic.nvidia.enable = true;
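
With CDI, the GPU is requested through a device name rather than --gpus, e.g. podman run --rm --device nvidia.com/gpu=all nvidia/cuda:11.6.2-cudnn8-runtime-ubuntu20.04 nvidia-smi (same test image as above).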

For what it's worth, I also had to make changes to my NixOS config to be able to use VS Code's Dev Containers extension to build software inside NVIDIA-published Docker images. Of note, that included switching to Docker 25 so I could manually enable CDI.
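
A minimal sketch of those pieces (assuming the docker_25 package and the dynamic CDI option mentioned above are available on your channel, not a verbatim copy of my config):

  virtualisation.docker = {
    enable = true;
    # CDI support landed in Docker 25, still behind a feature flag at the time.
    package = pkgs.docker_25;
    daemon.settings.features.cdi = true;
  };
  # Generate the NVIDIA CDI spec, as in the Podman setup above.
  virtualisation.containers.cdi.dynamic.nvidia.enable = true;

Containers then request the GPU with the CDI syntax, e.g. docker run --device nvidia.com/gpu=all, instead of --gpus all.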