Hi, I’d like to use a Docker container with CUDA enabled for deep learning experiments on a NixOS host, but it seems that the containers can’t see the GPU that the NixOS host can see.
I have enabled all the options for virtualization, Docker, and docker-nvidia on the NixOS host (sketched below, after the output), to the point where the nvidia-smi command returns output showing that the graphics card is present:
$ nvidia-smi
Fri Dec 9 06:47:14 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.60.02    Driver Version: 510.60.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   33C    P8    20W / 270W |    272MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1391      G   ...xorg-server-1.20.13/bin/X      122MiB |
|    0   N/A  N/A     12415      G   ...02.0/bin/.firefox-wrapped      147MiB |
+-----------------------------------------------------------------------------+
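For context, this is roughly what I mean by enabling "all the options" above: a sketch of the relevant part of my configuration.nix, written from memory, so the exact attribute names may differ slightly from what I actually have:

{ config, pkgs, ... }:
{
  # NVIDIA driver on the host
  services.xserver.videoDrivers = [ "nvidia" ];
  hardware.opengl.enable = true;

  # Docker with the NVIDIA container runtime wired in
  virtualisation.docker.enable = true;
  virtualisation.docker.enableNvidia = true;
}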
Now, the gpu-jupyter repo mentions that this is a good command to check whether your GPU can be used from a Docker container:
docker run --gpus all nvidia/cuda:11.6.2-cudnn8-runtime-ubuntu20.04 nvidia-smi
In my case, the output of the command is:
docker: Error response from daemon: failed to create shim: OCI runtime create failed:
container_linux.go:380: starting container process caused: process_linux.go:545:
container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr:
nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.
ERRO[0000] error waiting for container: context canceled
By googling the error, I tried adding a number of cgroup-related options to the kernel, but none worked properly (an example of the kind of option I mean is sketched at the end of this post). I’d like to take a step back and properly understand what this error means, and ideally how to fix it. My overarching goal is to get a GPU-accelerated Docker container for deep learning.
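For reference, the cgroup-related kernel options I mean are along these lines, which various threads suggest for forcing the host back to the legacy cgroups v1 hierarchy. This is only a sketch of the sort of thing I experimented with; I’m not sure either variant is the right fix:

{ config, ... }:
{
  # Pass the systemd flag directly as a kernel parameter ...
  boot.kernelParams = [ "systemd.unified_cgroup_hierarchy=0" ];

  # ... or use the NixOS option that is supposed to do the same thing:
  # systemd.enableUnifiedCgroupHierarchy = false;
}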