Trying to get K3s + Nvidia to play nice

BriianPowell · June 3, 2024, 6:14am

I’m trying to run k3s with NVIDIA support, following the newly updated docs here

I’ve got everything working as far as the device plugin goes: I can exec into the device-plugin pod and run commands like nvidia-smi. However, when I try to run a CUDA-based container image, or anything else that requires a GPU, I get the following error:

failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: fork/exec /usr/bin/nvidia-ctk: no such file or directory: unknown

Current device-plugin config:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  interval: 1h0m0s
  chart:
    spec:
      chart: nvidia-device-plugin
      sourceRef:
        kind: HelmRepository
        name: nvidia
        namespace: kube-system
      version: '>=0.15.0 <0.16.0'
  values:
    config:
      map:
        default: |-
          version: v1
          flags:
            migStrategy: 'none'
            failOnInitError: true
            nvidiaDriverRoot: '/run/current-system/sw/bin'
            plugin:
              passDeviceSpecs: false
              deviceListStrategy: envvar
              deviceIDStrategy: uuid
    migStrategy: none
    failOnInitError: true
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
    nvidiaDriverRoot: '/run/current-system/sw/bin'
    gdsEnabled: false
    mofedEnabled: false
    compatWithCPUManager: false
    allowDefaultNamespace: false
    runtimeClassName: nvidia
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                    - sheol
    gfd:
      enabled: true

Here is the pod I’m trying to run just to test:

apiVersion: v1
kind: Pod
metadata:
  name: vector-add-gpu-healthcheck
  namespace: kube-system
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda-vector-add
      image: nvidia/cuda:12.0.0-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
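
For context, runtimeClassName: nvidia in the pod above only resolves if a RuntimeClass object with that name exists in the cluster. A minimal sketch of that object, assuming the containerd handler is also called nvidia (the name k3s is documented to register when it detects the NVIDIA container runtime at startup; not verified on this host):

```yaml
# Assumption: containerd on the node exposes a runtime handler named "nvidia".
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia      # referenced by runtimeClassName in the pod and the device-plugin values
handler: nvidia     # must match the runtime name in the containerd config
```

RuntimeClass is cluster-scoped, so no namespace is set. The object only selects the runtime; it does not change where the runtime's hook looks for nvidia-ctk.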

My NixOS configuration for k3s can be found here

docker version 1.15.0-rc.3
runc version 1.1.12
spec: 1.0.2-dev
go: go1.22.3
libseccomp: 2.5.5
k3s version v1.30.0+k3s1 (14549535)

Has anyone had any luck standing up their own GPU-ready pods?