I’m trying to run k3s with NVIDIA support, following the newly updated docs here.
The device plugin itself is working: I can exec into the device-plugin pod and run commands like nvidia-smi. However, when I try to run a CUDA-based container image, or anything else that requires a GPU, I get the following error:
failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: fork/exec /usr/bin/nvidia-ctk: no such file or directory: unknown
Current device-plugin config:
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  interval: 1h0m0s
  chart:
    spec:
      chart: nvidia-device-plugin
      sourceRef:
        kind: HelmRepository
        name: nvidia
        namespace: kube-system
      version: '>=0.15.0 <0.16.0'
  values:
    config:
      map:
        default: |-
          version: v1
          flags:
            migStrategy: 'none'
            failOnInitError: true
            nvidiaDriverRoot: '/run/current-system/sw/bin'
            plugin:
              passDeviceSpecs: false
              deviceListStrategy: envvar
              deviceIDStrategy: uuid
    migStrategy: none
    failOnInitError: true
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
    nvidiaDriverRoot: '/run/current-system/sw/bin'
    gdsEnabled: false
    mofedEnabled: false
    compatWithCPUManager: false
    allowDefaultNamespace: false
    runtimeClassName: nvidia
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                    - sheol
    gfd:
      enabled: true
Here is the pod I’m trying to run just to test:
apiVersion: v1
kind: Pod
metadata:
  name: vector-add-gpu-healthcheck
  namespace: kube-system
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda-vector-add
      image: nvidia/cuda:12.0.0-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
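For context, runtimeClassName: nvidia in both manifests assumes that a RuntimeClass object with that name exists in the cluster. The snippet below is only a minimal sketch of what such an object typically looks like, not a copy of what is deployed here. The handler value nvidia is an assumption: it has to match whatever runtime name the containerd config generated by k3s actually registers.

```yaml
# Minimal sketch of the RuntimeClass that the pod and device plugin refer to.
# Assumption: containerd exposes a runtime handler called "nvidia".
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia      # referenced by runtimeClassName in the manifests above
handler: nvidia     # must match the runtime name in the containerd config
```

If the handler name in the containerd config differs, the pod fails to be created at all, so the hook error above suggests the runtime class itself is being resolved.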
My NixOS configuration for k3s can be found here.
docker version 1.15.0-rc.3
runc version 1.1.12
spec: 1.0.2-dev
go: go1.22.3
libseccomp: 2.5.5
k3s version v1.30.0+k3s1 (14549535)
Has anyone had any luck standing up their own GPU-ready pods?