K3S + Nvidia container runtime on 25.05

I can't figure out how to get GPU support working in k3s.

I followed

But this seems to be a bit outdated, or something is missing: there is now nvidia-container-toolkit, but that does not provide

/run/current-system/sw/bin/nvidia-container-runtime

It has

/nix/store/cqdvc05s2g8cq6y1v69qyz52ixi8a7hv-nvidia-container-toolkit-1.17.8-tools/bin/nvidia-container-runtime-hook
/nix/store/cqdvc05s2g8cq6y1v69qyz52ixi8a7hv-nvidia-container-toolkit-1.17.8-tools/bin/.nvidia-container-runtime-wrapped
/nix/store/cqdvc05s2g8cq6y1v69qyz52ixi8a7hv-nvidia-container-toolkit-1.17.8-tools/bin/nvidia-container-runtime
/nix/store/cqdvc05s2g8cq6y1v69qyz52ixi8a7hv-nvidia-container-toolkit-1.17.8-tools/bin/.nvidia-container-runtime-hook-wrapped
/nix/store/cqdvc05s2g8cq6y1v69qyz52ixi8a7hv-nvidia-container-toolkit-1.17.8-tools/bin/nvidia-container-runtime.legacy
/nix/store/cqdvc05s2g8cq6y1v69qyz52ixi8a7hv-nvidia-container-toolkit-1.17.8-tools/bin/nvidia-container-runtime.cdi

though, but I can't write that in the TOML, as the store path would change.

My current config looks like this. It does not include

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]

as the k3s readme above states.

  hardware.nvidia = {
    open = true;
    package = config.boot.kernelPackages.nvidiaPackages.stable;
    nvidiaSettings = true;
  };

  services.xserver = {
    enable = false;
    videoDrivers = [ "nvidia" ];
  };

  nixpkgs.config.allowUnfreePredicate = pkg: builtins.elem (lib.getName pkg) [
    "nvidia-x11"
    "nvidia-settings"
  ];

  hardware.nvidia-container-toolkit.enable = true;
  hardware.nvidia-container-toolkit.mount-nvidia-executables = true; 

  environment.systemPackages = with pkgs; [
    nvidia-container-toolkit 
    runc
  ];

  systemd.services = {
    #nvidia-container-toolkit-cdi-generator = {
    #  # hack as with `--library-search-path`, `nvidia-ctk` won't find the libs
    #  environment.LD_LIBRARY_PATH = "${config.hardware.nvidia.package}/lib";
    #};
    k3s-containerd-setup = {
      # `virtualisation.containerd.settings` has no effect on k3s' bundled containerd.
      serviceConfig.Type = "oneshot";
      requiredBy = ["k3s.service"];
      before = ["k3s.service"];
      script = ''
        mkdir -p /var/lib/rancher/k3s/agent/etc/containerd
        cat << EOF > /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
        {{ template "base" . }}
        
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
            privileged_without_host_devices = false
            runtime_engine = ""
            runtime_root = ""
            runtime_type = "io.containerd.runc.v2"
        
        EOF
      '';
    };
  };

I also applied

apiVersion: node.k8s.io/v1
handler: nvidia
kind: RuntimeClass
metadata:
  labels:
    app.kubernetes.io/component: gpu-operator
  name: nvidia

installed the nvdp/nvidia-device-plugin Helm chart with runtimeClassName=nvidia, and deployed

apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all

which is crashing.

My understanding is that a piece is missing: the NVIDIA runtime is not actually available to, or detected by, the containerd/runc bundled with k3s.
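If that is right, one idea would be to add the missing options block and let Nix interpolate the runtime path, so nothing is hard-coded in the TOML. This is only an untested sketch, not a known-working config. It assumes the package exposes a `tools` output (guessed from the `-tools` suffix of the store path above) and that `BinaryName` is the option containerd's runc v2 shim uses to pick the runtime binary. The `nvidiaRuntime` name is made up for illustration.

```nix
{ pkgs, lib, ... }:
let
  # Assumption: `tools` is the output that holds the runtime binaries,
  # as the `...-nvidia-container-toolkit-1.17.8-tools` path suggests.
  nvidiaRuntime = "${lib.getOutput "tools" pkgs.nvidia-container-toolkit}/bin/nvidia-container-runtime";
in
{
  systemd.services.k3s-containerd-setup = {
    serviceConfig.Type = "oneshot";
    requiredBy = [ "k3s.service" ];
    before = [ "k3s.service" ];
    script = ''
      mkdir -p /var/lib/rancher/k3s/agent/etc/containerd
      cat << EOF > /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
      {{ template "base" . }}

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
        runtime_type = "io.containerd.runc.v2"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
        # Nix substitutes the current store path when the system is built,
        # so the template is regenerated whenever the package changes.
        BinaryName = "${nvidiaRuntime}"
      EOF
    '';
  };
}
```

Because the path is resolved at build time, a rebuild after a toolkit update should rewrite the template with the new store path the next time the setup service runs before k3s.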

Does anyone have a working config on 25.05 with k3s that they can share?

Nvm, I solved it myself.

Basically, most of the documentation is outdated or no longer works.

Just in case someone is facing the same problem, this is the best you can get at the moment:
