I stopped toying with this (I will get back to it soon). However, when I had this error, it meant that some binaries from the nvidia-docker package were not in PATH.
Unfortunately no, I have now tried with versions 1.9.0, 1.13.1, and 1.12.1.
nvidia-smi works with a k3s ctr run --gpus 0, but the nvidia-container-runtime binary is now (in v1.12.1 and v1.13.1) failing to load libcuda, which results in a missing symbol error.
k3s ctr run --rm -t --gpus 0 --runc-binary=nvidia-container-runtime docker.io/nvidia/cuda:11.4.0-base-ubuntu20.04 cuda
ctr: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/cuda23/log.json: no such file or directory): nvidia-container-runtime did not terminate successfully: exit status 127: /nix/store/qm28zv7kyl60pxhf8xyp33c1m5dr6jzz-nvidia-k3s/bin/nvidia-container-runtime: symbol lookup error: /nix/store/qm28zv7kyl60pxhf8xyp33c1m5dr6jzz-nvidia-k3s/bin/nvidia-container-runtime: undefined symbol: cuDriverGetVersion: unknown
Very puzzling, since libcuda is in the /tmp/ld.so.cache, and 1.9.0 was not having this issue.
ldconfig -C /tmp/ld.so.cache --print-cache | grep cuda
libcudadebugger.so.1 (libc6,x86-64) => /tmp/nvidia-libs/libcudadebugger.so.1
libcudadebugger.so (libc6,x86-64) => /tmp/nvidia-libs/libcudadebugger.so
libcuda.so.1 (libc6,x86-64) => /tmp/nvidia-libs/libcuda.so.1
libcuda.so (libc6,x86-64) => /tmp/nvidia-libs/libcuda.so
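This check can be scripted. As a minimal sketch, the hypothetical helper cache_has_lib below reads ldconfig --print-cache output on stdin and succeeds only if the given soname is listed. It is not part of any NVIDIA tooling and only assumes the output format shown above.

```shell
# cache_has_lib SONAME
# Reads `ldconfig --print-cache` output on stdin and exits 0 if SONAME is the
# first field of any entry, 1 otherwise. Hypothetical helper, for illustration.
cache_has_lib() {
  awk -v soname="$1" '$1 == soname { found = 1 } END { exit !found }'
}

# Usage (cache path taken from the post above):
#   ldconfig -C /tmp/ld.so.cache --print-cache | cache_has_lib libcuda.so.1 \
#     && echo "libcuda.so.1 is in the cache"
```

Matching on the first field rather than grepping for "cuda" avoids false positives from entries such as libcudadebugger.so.1.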
And nvidia-container-cli is having no issues:
nvidia-container-cli -k -d log
cat log | grep libcuda
I0519 22:42:56.134093 3346551 nvc_info.c:174] selecting /nix/store/30x7mhkxv6ghf8893d6lhd5jiplxh897-nvidia-x11-525.89.02-5.15.96/lib/libcudadebugger.so.525.89.02
I0519 22:42:56.134210 3346551 nvc_info.c:174] selecting /nix/store/30x7mhkxv6ghf8893d6lhd5jiplxh897-nvidia-x11-525.89.02-5.15.96/lib/libcuda.so.525.89.02
W0519 22:42:56.134862 3346551 nvc_info.c:404] missing compat32 library libcuda.so
W0519 22:42:56.134869 3346551 nvc_info.c:404] missing compat32 library libcudadebugger.so
But there is no load of libcuda occurring:
$ LD_DEBUG=libs nvidia-container-runtime 2>&1 | grep "find library"
3375337: find library=libdl.so.2 [0]; searching
3375337: find library=libc.so.6 [0]; searching
3375344: find library=libdl.so.2 [0]; searching
3375344: find library=libc.so.6 [0]; searching
3375337: find library=libdl.so.2 [0]; searching
3375337: find library=libpthread.so.0 [0]; searching
3375337: find library=libc.so.6 [0]; searching
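To make the absence easier to see, the LD_DEBUG output can be reduced to the unique set of libraries the dynamic loader searched for. This is a minimal sketch; searched_libs is a hypothetical helper that only assumes the "find library=NAME" format printed above.

```shell
# searched_libs
# Reads LD_DEBUG=libs output on stdin and prints each library name the
# loader searched for, once, sorted. Hypothetical helper, for illustration.
searched_libs() {
  sed -n 's/.*find library=\([^ ]*\) .*/\1/p' | sort -u
}

# Usage:
#   LD_DEBUG=libs nvidia-container-runtime 2>&1 | searched_libs
```

For the output above this lists only libc.so.6, libdl.so.2 and libpthread.so.0, with no libcuda entry, which is consistent with the loader never being asked for libcuda at startup.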
$ strace nvidia-container-runtime 2>&1 | rg 'openat\(.*, "/nix/store/(.*)",.*' -r '$1'
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/glibc-hwcaps/x86-64-v3/libdl.so.2
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/glibc-hwcaps/x86-64-v2/libdl.so.2
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/tls/haswell/x86_64/libdl.so.2
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/tls/haswell/libdl.so.2
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/tls/x86_64/libdl.so.2
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/tls/libdl.so.2
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/haswell/x86_64/libdl.so.2
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/haswell/libdl.so.2
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/x86_64/libdl.so.2
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/libdl.so.2
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/libc.so.6
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/gconv/gconv-modules.cache
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/gconv/gconv-modules
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/gconv/gconv-modules.d
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/gconv/gconv-modules.d/gconv-modules-extra.conf
4y4jdqg9s8sw4f56n7lqy59azi8lgp5z-container-toolkit-container-toolkit-1.12.1/lib/glibc-hwcaps/x86-64-v3/libdl.so.2
4y4jdqg9s8sw4f56n7lqy59azi8lgp5z-container-toolkit-container-toolkit-1.12.1/lib/glibc-hwcaps/x86-64-v2/libdl.so.2
4y4jdqg9s8sw4f56n7lqy59azi8lgp5z-container-toolkit-container-toolkit-1.12.1/lib/tls/haswell/x86_64/libdl.so.2
4y4jdqg9s8sw4f56n7lqy59azi8lgp5z-container-toolkit-container-toolkit-1.12.1/lib/tls/haswell/libdl.so.2
4y4jdqg9s8sw4f56n7lqy59azi8lgp5z-container-toolkit-container-toolkit-1.12.1/lib/tls/x86_64/libdl.so.2
4y4jdqg9s8sw4f56n7lqy59azi8lgp5z-container-toolkit-container-toolkit-1.12.1/lib/tls/libdl.so.2
4y4jdqg9s8sw4f56n7lqy59azi8lgp5z-container-toolkit-container-toolkit-1.12.1/lib/haswell/x86_64/libdl.so.2
4y4jdqg9s8sw4f56n7lqy59azi8lgp5z-container-toolkit-container-toolkit-1.12.1/lib/haswell/libdl.so.2
4y4jdqg9s8sw4f56n7lqy59azi8lgp5z-container-toolkit-container-toolkit-1.12.1/lib/x86_64/libdl.so.2
4y4jdqg9s8sw4f56n7lqy59azi8lgp5z-container-toolkit-container-toolkit-1.12.1/lib/libdl.so.2
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/glibc-hwcaps/x86-64-v3/libdl.so.2
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/glibc-hwcaps/x86-64-v2/libdl.so.2
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/tls/haswell/x86_64/libdl.so.2
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/tls/haswell/libdl.so.2
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/tls/x86_64/libdl.so.2
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/tls/libdl.so.2
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/haswell/x86_64/libdl.so.2
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/haswell/libdl.so.2
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/x86_64/libdl.so.2
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/libdl.so.2
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/libpthread.so.0
76l4v99sk83ylfwkz8wmwrm4s8h73rhd-glibc-2.35-224/lib/libc.so.6
This is with the patch on nvidia-container-toolkit
preBuild = ''
  # Point the toolkit at its own binaries in the store instead of /usr/bin
  substituteInPlace go/src/github.com/NVIDIA/nvidia-container-toolkit/internal/config/config.go \
    --replace '/usr/bin' '${placeholder "out"}/bin'
  # Read the ld cache and config from /tmp instead of /etc
  sed -i -e "s@/etc/ld.so.cache@/tmp/ld.so.cache@" -e "s@/etc/ld.so.conf@/tmp/ld.so.conf@" \
    go/src/github.com/NVIDIA/nvidia-container-toolkit/internal/ldcache/ldcache.go \
    go/src/github.com/NVIDIA/nvidia-container-toolkit/cmd/nvidia-ctk/hook/update-ldcache/update-ldcache.go
'';
Thanks for the tip on NVIDIA’s containerized driver solution. I was considering using Kata Containers, but that would be a pivot away from NixOS, which is what I’m trying to avoid.
Ok so I took the very un-nix hammer approach and it worked! I have the current nixpkgs-unstable nvidia-container-toolkit derivation in my overlay and statically linked libcuda and libnvidia-ml into the go binaries. This is with v1.12.1.
ldflags = [ "-s" "-w" "-extldflags" "'-L${unpatched-nvidia-driver}/lib -lcuda -lnvidia-ml'" ];
Where unpatched-nvidia-driver is @eadwu’s builder swap, though I’m not sure if that matters in this case since these libs don’t link to any other nvidia libs, and the ones they do link should be in whatever container is running.
ldd /nix/store/30x7mhkxv6ghf8893d6lhd5jiplxh897-nvidia-x11-525.89.02-5.15.96/lib/libcuda.so.525.89.02
linux-vdso.so.1 (0x00007ffd5fc4e000)
libm.so.6 => /nix/store/xnk2z26fqy86xahiz3q797dzqx96sidk-glibc-2.37-8/lib/libm.so.6 (0x00007f3405733000)
libc.so.6 => /nix/store/xnk2z26fqy86xahiz3q797dzqx96sidk-glibc-2.37-8/lib/libc.so.6 (0x00007f340554d000)
libdl.so.2 => /nix/store/xnk2z26fqy86xahiz3q797dzqx96sidk-glibc-2.37-8/lib/libdl.so.2 (0x00007f3405548000)
libpthread.so.0 => /nix/store/xnk2z26fqy86xahiz3q797dzqx96sidk-glibc-2.37-8/lib/libpthread.so.0 (0x00007f3405543000)
librt.so.1 => /nix/store/xnk2z26fqy86xahiz3q797dzqx96sidk-glibc-2.37-8/lib/librt.so.1 (0x00007f340553e000)
/nix/store/xnk2z26fqy86xahiz3q797dzqx96sidk-glibc-2.37-8/lib64/ld-linux-x86-64.so.2 (0x00007f34074f5000)
ldd /nix/store/30x7mhkxv6ghf8893d6lhd5jiplxh897-nvidia-x11-525.89.02-5.15.96/lib/libnvidia-ml.so.1
linux-vdso.so.1 (0x00007ffdde5fd000)
libpthread.so.0 => /nix/store/xnk2z26fqy86xahiz3q797dzqx96sidk-glibc-2.37-8/lib/libpthread.so.0 (0x00007faa7a5b8000)
libm.so.6 => /nix/store/xnk2z26fqy86xahiz3q797dzqx96sidk-glibc-2.37-8/lib/libm.so.6 (0x00007faa79520000)
libdl.so.2 => /nix/store/xnk2z26fqy86xahiz3q797dzqx96sidk-glibc-2.37-8/lib/libdl.so.2 (0x00007faa7a5b3000)
libc.so.6 => /nix/store/xnk2z26fqy86xahiz3q797dzqx96sidk-glibc-2.37-8/lib/libc.so.6 (0x00007faa7933a000)
/nix/store/xnk2z26fqy86xahiz3q797dzqx96sidk-glibc-2.37-8/lib64/ld-linux-x86-64.so.2 (0x00007faa7a5bf000)
Furthermore, I had to twiddle with /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl quite a bit and landed on this:
[plugins.opt]
path = "{{ .NodeConfig.Containerd.Opt }}"
[plugins.cri]
stream_server_address = "127.0.0.1"
stream_server_port = "10010"
# ---- added for gpu
enable_selinux = {{ .NodeConfig.SELinux }}
enable_unprivileged_ports = true
enable_unprivileged_icmp = true
# end added for gpu
{{- if .IsRunningInUserNS }}
disable_cgroup = true
disable_apparmor = true
restrict_oom_score_adj = true
{{end}}
{{- if .NodeConfig.AgentConfig.PauseImage }}
sandbox_image = "{{ .NodeConfig.AgentConfig.PauseImage }}"
{{end}}
{{- if not .NodeConfig.NoFlannel }}
[plugins.cri.cni]
bin_dir = "{{ .NodeConfig.AgentConfig.CNIBinDir }}"
conf_dir = "{{ .NodeConfig.AgentConfig.CNIConfDir }}"
{{end}}
[plugins.cri.containerd]
default_runtime_name = "runc"
# ---- added for GPU support
# https://github.com/k3s-io/k3s/issues/4391#issuecomment-1202986597
snapshotter = "overlayfs"
disable_snapshot_annotations = true
[plugins.cri.containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
# ---- added for GPU support
[plugins.cri.containerd.runtimes.nvidia]
runtime_type = "io.containerd.runc.v2"
runtime_root = ""
runtime_engine = ""
privileged_without_host_devices = false
[plugins.cri.containerd.runtimes.nvidia.options]
BinaryName = "@nvidia-container-runtime@"
SystemdCgroup = true
{{ if .PrivateRegistryConfig }}
{{ if .PrivateRegistryConfig.Mirrors }}
[plugins.cri.registry.mirrors]{{end}}
{{range $k, $v := .PrivateRegistryConfig.Mirrors }}
[plugins.cri.registry.mirrors."{{$k}}"]
endpoint = [{{range $i, $j := $v.Endpoints}}{{if $i}}, {{end}}{{printf "%q" .}}{{end}}]
{{end}}
{{range $k, $v := .PrivateRegistryConfig.Configs }}
{{ if $v.Auth }}
[plugins.cri.registry.configs."{{$k}}".auth]
{{ if $v.Auth.Username }}username = "{{ $v.Auth.Username }}"{{end}}
{{ if $v.Auth.Password }}password = "{{ $v.Auth.Password }}"{{end}}
{{ if $v.Auth.Auth }}auth = "{{ $v.Auth.Auth }}"{{end}}
{{ if $v.Auth.IdentityToken }}identitytoken = "{{ $v.Auth.IdentityToken }}"{{end}}
{{end}}
{{ if $v.TLS }}
[plugins.cri.registry.configs."{{$k}}".tls]
{{ if $v.TLS.CAFile }}ca_file = "{{ $v.TLS.CAFile }}"{{end}}
{{ if $v.TLS.CertFile }}cert_file = "{{ $v.TLS.CertFile }}"{{end}}
{{ if $v.TLS.KeyFile }}key_file = "{{ $v.TLS.KeyFile }}"{{end}}
{{end}}
{{end}}
{{end}}
Wherein “@nvidia-container-runtime@” is substituted at rebuild with the full /nix/store path.
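The substitution step itself is simple. As a minimal sketch in shell (in the actual setup this happens during the NixOS rebuild, and the mechanism used there is not shown in this thread), the hypothetical render_containerd_template function below replaces the placeholder with an absolute path passed as an argument.

```shell
# render_containerd_template TEMPLATE RUNTIME_PATH
# Prints TEMPLATE with every @nvidia-container-runtime@ placeholder replaced
# by RUNTIME_PATH. Hypothetical helper, for illustration only.
render_containerd_template() {
  # '|' is the sed delimiter because store paths contain '/'.
  # Assumes RUNTIME_PATH contains no '|' or '&' characters.
  sed "s|@nvidia-container-runtime@|$2|g" "$1"
}

# Usage (the store path is elided on purpose; use the real one):
#   render_containerd_template config.toml.tmpl.in \
#     /nix/store/...-nvidia-k3s/bin/nvidia-container-runtime \
#     > /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
```

The Go template expressions in the file ({{ ... }}) pass through untouched, since only the @...@ placeholder is matched.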
Now I’m at a stage where the nvidia-device-plugin is running stably and shows GPUs on the node, but only libcuda.so.525.89.02 is being mounted into pod containers with runtimeClassName: nvidia, instead of libcuda.so.1, which is what nvidia-smi and friends are looking for. I’m looking to patch that with some kind of admission webhook policy, since one run of ldconfig in the container adds the libcuda.so.1 symlink and gets it all working, as opposed to reversing the static library linking and crossing my fingers that nvidia-container-runtime mounts appropriately.
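To confirm which symlinks are missing inside a container before and after running ldconfig, something like the following sketch can help. missing_soname_links is a hypothetical helper. It assumes the soname is always NAME.so.1, which holds for libcuda and libnvidia-ml here but not for libraries in general.

```shell
# missing_soname_links DIR
# For every fully versioned library in DIR (e.g. libcuda.so.525.89.02),
# print the NAME.so.1 link name if no such file or symlink exists in DIR.
# Hypothetical helper; assumes a soname major version of 1.
missing_soname_links() {
  dir="$1"
  for f in "$dir"/lib*.so.*.*; do
    [ -e "$f" ] || continue      # the glob matched nothing
    base=${f##*/}                # e.g. libcuda.so.525.89.02
    stem=${base%%.so.*}          # e.g. libcuda
    [ -e "$dir/$stem.so.1" ] || echo "$stem.so.1"
  done
}

# Usage inside the container; pass whichever directory the runtime mounted
# the driver libraries into, for example:
#   missing_soname_links /usr/lib/x86_64-linux-gnu
```

An empty result after running ldconfig would indicate that the links the tools look for are now in place.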
Thanks for the help!
My final solution to the ldconfig run issue inside the container was another patch of libnvidia-container/src/nvc_ldcache.c, essentially running ldconfig again but with the /etc/ld.so.conf that is mounted into the containers, which contains the paths expected by the nvidia-container-runtime.
$ kubectl exec -n kube-system -it nvidia-device-plugin-pz56s -- /bin/sh
$ cat /etc/ld.so.conf
include /etc/ld.so.conf.d/*.conf
$ cat /etc/ld.so.conf.d/*.conf
# libc default configuration
/usr/local/lib
/usr/local/nvidia/lib
/usr/local/nvidia/lib64
# Multiarch support
/usr/local/lib/x86_64-linux-gnu
/lib/x86_64-linux-gnu
/usr/lib/x86_64-linux-gnu
This patch works on top of @eadwu’s patch.
libnvc-ldcache-container-again.patch
diff --git a/src/nvc_ldcache.c b/src/nvc_ldcache.c
index db3b2f69..28e08d3b 100644
--- a/src/nvc_ldcache.c
+++ b/src/nvc_ldcache.c
@@ -356,6 +356,7 @@ int
nvc_ldcache_update(struct nvc_context *ctx, const struct nvc_container *cnt)
{
char **argv;
+ char **argv_container;
pid_t child;
int status;
bool drop_groups = true;
@@ -402,11 +403,18 @@ nvc_ldcache_update(struct nvc_context *ctx, const struct nvc_container *cnt)
if (limit_syscalls(&ctx->err) < 0)
goto fail;
+ argv_container = (char * []){argv[0], "-f", "/etc/ld.so.conf", "-C", "/etc/ld.so.cache", cnt->cfg.libs_dir, cnt->cfg.libs32_dir, NULL};
if (fd < 0)
execve(argv[0], argv, (char * const []){NULL});
else
fexecve(fd, argv, (char * const []){NULL});
error_set(&ctx->err, "process execution failed");
+ log_infof("executing %s again", argv_container[0]);
+ if (fd < 0)
+ execve(argv_container[0], argv_container, (char * const []){NULL});
+ else
+ fexecve(fd, argv_container, (char * const []){NULL});
+ error_set(&ctx->err, "process execution failed");
fail:
log_errf("could not start %s: %s", argv[0], ctx->err.msg);
(ctx->err.code == ENOENT) ? _exit(EXIT_SUCCESS) : _exit(EXIT_FAILURE);
libnvidia-container.nix
patches = [
# eadwu's patch
./libnvidia-container-ldcache.patch
# patch from above
./libnvc-ldcache-container-again.patch
# patch from nixpkgs
./inline-nvcgo-struct.patch
];
Edit: the admission webhook solution did not work because:
- initContainers can only operate on files in a shared volumeMount, which could work in this case but complicates matters significantly, since you’d have to mount all of the library volumes and the /etc subpaths containing ld.so.{cache,conf}
- lifecycle: postStart: exec: command: ["ldconfig"] just invites a race condition that fails often
Was wondering if anyone has a working example of this? Currently trying to integrate my Nvidia GPU with k3s + containerd and I’m hitting a wall, was hoping to find some guidance
My problem (like an idiot) was linuxPackages_hardened
Can someone post a working config? I am struggling to make it work.
I’ve received various requests over the past months for a working config, so I’ve decided to make the repo I have running this config public.
Please note the nixpkgs flake lock versions, as something may have changed and I have not needed to update the base config in some time.
Good luck!
Thanks! I’m not familiar with k3s, or runc, or nvidia-container-runtime… may I ask again, why do you have to link nvidia_x11 directly? Is it not an option to just take the userspace drivers from /run/opengl-driver/lib?
I’m sorry but I don’t think I understand the question. We’re patching (or rather unpatching the patchelf step) nvidia_x11 pkg so that the drivers work in the FHS containers in k3s which won’t have the /nix/store in them. The drivers need to be discovered twice via ldconfig - once by the nvidia container runtime on the host for mounting into the container, and once by the container itself. How would the userspace drivers inherit that property?
I can decompose the question into smaller ones:
- How do you decide, at build time, which nvidia_x11 to take? Because the container must mount the userspace drivers (including libcuda.so and libnvidia-ml.so) compatible (=precisely the same, or at least newer) with the kernel module used by the host machine at runtime.
- “… which won’t have the /nix/store in them”
But the drivers are still mounted impurely from the host, why not mount the /nix/store/xxxx...-nvidia_x11 &c paths as well?
- “The drivers need to be discovered twice via ldconfig”
This happens for some packages. If possible, the desirable way to approach this is to remove/make optional all of the references to ld.so.cache, and replace the “path inference” results with the predictable values. For a container running on a NixOS host, this path would be /run/opengl-driver/lib, plus targets of the symlinks therein. The drivers deployed there by NixOS are known to be compatible with the kernel used at runtime.
I’m asking because I think we should want to integrate the solution into nixpkgs. Thanks!
Ok! We’re bumping up against the edge of my knowledge here, but if I understand correctly:
- “How do you decide, at build time, which nvidia_x11 to take?”
I assume you’re talking about this line here:
unpatched-nvidia-driver = (super.pkgs.linuxKernel.packages.linux_5_15.nvidia_x11_production.overrideAttrs (oldAttrs: { builder = ../overlays/nvidia-builder.sh; }));
And indeed I chose that through trial and error. Which package is the standard one to install that will be guaranteed to be compatible with the kernel?
- “But the drivers are still mounted impurely from the host”
This is done by the nvidia-container-runtime. Where it chooses to mount the drivers in the container, regardless of how it finds them, is a bit deeper in the codebase than I looked. The discovery mechanism is via ldconfig, but currently we are pointing it at the drivers in /nix/store/xxx...-nvidia-x11 on the host here. Are you asking why /nix/store/xxx-nvidia-x11 is not mounted directly (at the same path) in the container?
- “the desirable way to approach this is to remove/make optional all of the references to ld.so.cache, and replace the “path inference” results with the predictable values”
I believe this would require a different patch of libnvidia-container so that it searches /run/opengl-driver/lib instead of relying on ldconfig while on the host. I’m not sure how deep that would go, but it may be possible. The search for ld.so.cache is hardcoded into the library here, a path altered by this patch and in common.h by this patch. I think this would fundamentally change libnvidia-container’s discovery mechanism. The difficulty there is that this code seems to run once on the host and once in the container, so getting it to use different discovery mechanisms depending on the context would take a more considerable understanding of the library, esp. since containers will not adhere to the NixOS paths. Alternatively, getting libnvidia-container to mount the NixOS paths, then update the container’s own ld.so.cache to reference those paths is a different challenge. For example, I’m not sure if the following search paths provided by ld.so.conf.d in the nvidia-device-plugin container are baked into the container, or altered by libnvidia-container at runtime.
$ kubectl exec -n kube-system -it nvidia-device-plugin-pz56s -- /bin/sh
$ cat /etc/ld.so.conf
include /etc/ld.so.conf.d/*.conf
$ cat /etc/ld.so.conf.d/*.conf
# libc default configuration
/usr/local/lib
/usr/local/nvidia/lib
/usr/local/nvidia/lib64
# Multiarch support
/usr/local/lib/x86_64-linux-gnu
/lib/x86_64-linux-gnu
/usr/lib/x86_64-linux-gnu
Would love to see a pure integration in nixpkgs! My feeling is that some deeper C skills than I possess would be necessary here.
rusty-jules has pretty much said everything but I’ll add what I can remember. I’ve since abandoned my NVIDIA integration in k3s since there were a host of other issues.
libnvidia-container has a list of files it looks for https://github.com/NVIDIA/libnvidia-container/blob/1eb5a30a6ad0415550a9df632ac8832bf7e2bbba/src/nvc_info.c#L51
Which then finds the library paths using the ldcache https://github.com/NVIDIA/libnvidia-container/blob/1eb5a30a6ad0415550a9df632ac8832bf7e2bbba/src/nvc_info.c#L220
Ideally, we’d be using the NVIDIA/CUDA driver pod from NVIDIA, but my attempts have failed due to layout of NixOS (don’t quite remember errors, but I’d imagine the classic problems with trying to use imperative programs).
2/3.
If you were to try to make it NixOS-friendly, a better alternative is to write it from scratch. By comparison, making do with the ldcache approach is a lot more prone to needing active maintenance as upstream changes.
fwiw using the cdi integration with containerd seems much better. I have a working setup, though I’m still using the nvidia ldcache shenanigans to generate the cdi spec via nvidia-ctk cdi generate, but this is patchable and/or writable from scratch. For what it’s worth it seems to work more or less out of the box, though I had to manually patch the output (with jq) to mount /run/opengl-driver into the sandbox to get a well-known LD_LIBRARY_PATH to work with, but the autogenerated spec (essentially nvidia-ctk cdi generate used on an ldconfig dump of /run/opengl-driver) actually managed to at least resolve every symlink into the store along with the relevant devices on its own.
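The jq step mentioned above might look roughly like this. It is a sketch, not the poster's actual command: it assumes the generated spec is available as JSON with a top-level containerEdits object (the layout the CDI specification uses), and the mount options are an assumption. add_opengl_driver_mount is a hypothetical name.

```shell
# add_opengl_driver_mount
# Reads a CDI spec (JSON) on stdin and writes it to stdout with one extra
# bind mount of /run/opengl-driver appended to the top-level containerEdits.
# Hypothetical helper; the mount options below are an assumption.
add_opengl_driver_mount() {
  jq '.containerEdits.mounts += [{
        hostPath: "/run/opengl-driver",
        containerPath: "/run/opengl-driver",
        options: ["ro", "nosuid", "nodev", "bind"]
      }]'
}

# Usage (file names are placeholders):
#   add_opengl_driver_mount < nvidia.json > nvidia-patched.json
```

Containers would presumably still need LD_LIBRARY_PATH (or an ldconfig entry) pointing at /run/opengl-driver/lib to pick the libraries up from that location.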
We definitely need to reach out to upstream and discuss a way for them to (1) stop relying on obscure glibc internals such as /etc/ld.so.cache, and (2) stop assuming that mounting the host system’s drivers is the right thing to do (generally speaking, it’s not, because the container image might come with a different libc).
The first step would be to introduce a static configuration option allowing the user to explicitly list the paths to the drivers, at build time or in a config file read at runtime. In principle, /etc/ld.so.cache already is that, except writing one affects more than just the ctk.
I’m planning to open an issue/PR eventually but it’ll probably take me ages, and I’ll be really happy if somebody goes ahead before then
I’ll also add that this doesn’t only affect k3s, but also for example apptainer/singularity which are important for the HPC stuff
Hi! Trying to update libnvidia-container and nvidia-container-toolkit in “apptainer: unbreak --nv” (NixOS/nixpkgs Pull Request #279235, by SomeoneSerge), but I’m mostly motivated by apptainer’s needs. Could somebody test these changes with containerd?
EDIT: “nvidia-container-toolkit: 1.9.0 -> 1.15.0-rc.3” (NixOS/nixpkgs Pull Request #278969, by aaronmondal) has been merged.
Hi! Now that nvidia-container-toolkit is available on nixpkgs-unstable, does it mean that we can just make it available in environment.systemPackages and it’s just Going To Work™️?
I’ll probably be testing it soon but I thought I’d ask if someone’s tried it already?