I can't figure out how to get GPU support working in k3s.
I followed
But this seems to be a bit outdated, or something is missing: there is now nvidia-container-toolkit, but it does not provide
/run/current-system/sw/bin/nvidia-container-runtime
It has
/nix/store/cqdvc05s2g8cq6y1v69qyz52ixi8a7hv-nvidia-container-toolkit-1.17.8-tools/bin/nvidia-container-runtime-hook
/nix/store/cqdvc05s2g8cq6y1v69qyz52ixi8a7hv-nvidia-container-toolkit-1.17.8-tools/bin/.nvidia-container-runtime-wrapped
/nix/store/cqdvc05s2g8cq6y1v69qyz52ixi8a7hv-nvidia-container-toolkit-1.17.8-tools/bin/nvidia-container-runtime
/nix/store/cqdvc05s2g8cq6y1v69qyz52ixi8a7hv-nvidia-container-toolkit-1.17.8-tools/bin/.nvidia-container-runtime-hook-wrapped
/nix/store/cqdvc05s2g8cq6y1v69qyz52ixi8a7hv-nvidia-container-toolkit-1.17.8-tools/bin/nvidia-container-runtime.legacy
/nix/store/cqdvc05s2g8cq6y1v69qyz52ixi8a7hv-nvidia-container-toolkit-1.17.8-tools/bin/nvidia-container-runtime.cdi
but I can't hard-code that store path in the TOML, as it would change.
My current config is below. It does not include the
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
section that the k3s readme above describes:
hardware.nvidia = {
  open = true;
  package = config.boot.kernelPackages.nvidiaPackages.stable;
  nvidiaSettings = true;
};
services.xserver = {
  enable = false;
  videoDrivers = [ "nvidia" ];
};
nixpkgs.config.allowUnfreePredicate = pkg: builtins.elem (lib.getName pkg) [
  "nvidia-x11"
  "nvidia-settings"
];
hardware.nvidia-container-toolkit.enable = true;
hardware.nvidia-container-toolkit.mount-nvidia-executables = true;
environment.systemPackages = with pkgs; [
  nvidia-container-toolkit
  runc
];
systemd.services = {
  #nvidia-container-toolkit-cdi-generator = {
  #  # hack as with `--library-search-path`, `nvidia-ctk` won't find the libs
  #  environment.LD_LIBRARY_PATH = "${config.hardware.nvidia.package}/lib";
  #};
  k3s-containerd-setup = {
    # `virtualisation.containerd.settings` has no effect on k3s' bundled containerd.
    serviceConfig.Type = "oneshot";
    requiredBy = ["k3s.service"];
    before = ["k3s.service"];
    script = ''
      mkdir -p /var/lib/rancher/k3s/agent/etc/containerd
      cat << EOF > /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
      {{ template "base" . }}
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
      privileged_without_host_devices = false
      runtime_engine = ""
      runtime_root = ""
      runtime_type = "io.containerd.runc.v2"
      EOF
    '';
  };
};
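One way around the changing store path might be to let Nix interpolate it into the template at build time, so the TOML never contains a hand-written hash. The following is an untested sketch, not a known-working config. It assumes that the binary lives in the `tools` output of `pkgs.nvidia-container-toolkit` (which the store paths listed above suggest) and that `BinaryName` is the option the readme's `options` section sets. The name `nvidiaRuntime` is made up for illustration.

```nix
{ lib, pkgs, ... }:
let
  # Assumption: bin/nvidia-container-runtime is in the `tools` output,
  # as the /nix/store/...-tools/bin paths above suggest.
  # `nvidiaRuntime` is a hypothetical helper name.
  nvidiaRuntime =
    "${lib.getOutput "tools" pkgs.nvidia-container-toolkit}/bin/nvidia-container-runtime";
in
{
  systemd.services.k3s-containerd-setup = {
    serviceConfig.Type = "oneshot";
    requiredBy = [ "k3s.service" ];
    before = [ "k3s.service" ];
    script = ''
      mkdir -p /var/lib/rancher/k3s/agent/etc/containerd
      cat << EOF > /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
      {{ template "base" . }}
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
      privileged_without_host_devices = false
      runtime_engine = ""
      runtime_root = ""
      runtime_type = "io.containerd.runc.v2"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      # Nix substitutes the store path here before the shell ever sees the script.
      BinaryName = "${nvidiaRuntime}"
      EOF
    '';
  };
}
```

Because the path is resolved when the system is built, the template would be regenerated with the new store path after each rebuild. Whether pointing `BinaryName` at this binary is enough to stop the pod from crashing is not something this sketch establishes.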
I also applied this RuntimeClass:
apiVersion: node.k8s.io/v1
handler: nvidia
kind: RuntimeClass
metadata:
  labels:
    app.kubernetes.io/component: gpu-operator
  name: nvidia
Then I installed the nvdp/nvidia-device-plugin Helm chart with runtimeClassName=nvidia and deployed
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:nbody
      args: ["nbody", "-gpu", "-benchmark"]
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
This pod is crashing.
My understanding is that a piece is missing: the NVIDIA runtime is not actually available to, or detected by, the k3s containerd/runc.
Does anyone have a working config for k3s on NixOS 25.05 that they can share?