Passing ROCm drivers into a Docker container on NixOS

So I am running InvokeAI in a Docker container, but it is very slow.
My AMD GPU is supposed to have 20 GB of VRAM, so I figured something must be wrong.
The docker run command I was told to use passes my GPU’s drivers, but I assume that is not where they live on NixOS; it may work for Debian- or Arch-based distros.

docker run --device /dev/kfd --device /dev/dri --publish 9090:9090 --name invokeai -d --volume ~/invokeai:/invokeai ghcr.io/invoke-ai/invokeai:v4.2.9-rocm

I’m talking about those --device flags.
InvokeAI’s documentation states that the rocm-smi command is supposed to work.
It does on my base system, without any errors (or so I believe):

========================================= ROCm System Management Interface =========================================
=================================================== Concise Info ===================================================
Device  [Model : Revision]    Temp    Power    Partitions      SCLK   MCLK     Fan  Perf  PwrCap       VRAM%  GPU%  
        Name (20 chars)       (Edge)  (Avg)    (Mem, Compute)                                                       
====================================================================================================================
0       [0x1002 : 0xcc]       35.0°C  11.0W    N/A, N/A        26Mhz  96Mhz    0%   auto  265.0W         7%   0%    
        0x1002                                                                                                      
1       [0x1002 : 0xc5]       41.0°C  21.096W  N/A, N/A        None   2400Mhz  0%   auto  Unsupported    5%   0%    
        0x1002                                                                                                      
====================================================================================================================
=============================================== End of ROCm SMI Log ================================================

But it doesn’t work inside my Docker container.

How should I approach this? It is pretty difficult to find information about it.
My assumption is that I need to find out where NixOS keeps the GPU device/driver files and then somehow pass them into Docker.
Alternatively, I may need to build my own image based on the already working one, but with something ROCm-related added to its build steps.
I’m sorry if this is obvious.
And thank you in advance for any sort of help.

OK, I don’t know if this is it, but replacing the --device flags with --privileged seems to increase its speed.
It may be a security risk, though.
I will try out more things before claiming something to be the solution to my originally stated issue.
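For reference, a less heavy-handed variant I’ve seen in AMD’s container documentation keeps the --device flags but adds the video/render groups and relaxes seccomp instead of granting full --privileged; whether those group names actually match what owns /dev/kfd and /dev/dri on a NixOS host is an assumption on my part:

# sketch: device passthrough without --privileged (group names are an assumption)
docker run --device /dev/kfd --device /dev/dri \
  --group-add video --group-add render \
  --security-opt seccomp=unconfined \
  --publish 9090:9090 --name invokeai -d \
  --volume ~/invokeai:/invokeai \
  ghcr.io/invoke-ai/invokeai:v4.2.9-rocm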

OK, it’s going to be lots of fun. Apparently, to have AMD GPU support I am supposed to rebuild the image. The issue is that, to my current knowledge, on NixOS I will be expected to use Nix’s own tooling for that, instead of docker build or docker compose build, which is what the creators expect people to use.

Also, if somebody encounters this thread, here is an interesting URL:

And I guess this one is also related, but specific to running this SD container:

The only way forward will be to learn the Nix way of building Docker images.
Well, the sooner I start, the quicker I’ll finish.

You’re not expected to do that; dockerTools is just a more flexible and reliable way to build images, but neither does NixOS depend on dockerTools, nor dockerTools on NixOS.

This says that you do not have to do anything special; just mount the respective /dev nodes into the container.

Strange, could be something with ulimits.

Verify that the device (/dev/... nodes) is accessible in the container; try running any simple ROCm application (rocm-smi too) with strace or LD_DEBUG=libs and inspect the outputs.
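Roughly along these lines, assuming the container is the one named invokeai from the earlier run command, and that strace even exists in the image (it may not):

# container name "invokeai" taken from the docker run above
docker exec -it invokeai ls -l /dev/kfd /dev/dri
# show which shared libraries get (or fail to get) loaded
docker exec -it invokeai env LD_DEBUG=libs rocm-smi
# trace which files the tool tries to open (strace availability is an assumption)
docker exec -it invokeai strace -f -e trace=openat rocm-smi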

I don’t have access to AMD hardware and am not familiar with the ecosystem, but I believe you shouldn’t need any such impure stuff with ROCm.

I’ve done much more digging. The issue is that the app inside the container tries to use CUDA, which won’t work because I use ROCm, and then it falls back to CPU only, which is painfully slow.

This is what I currently try to run, and its output:

sudo docker run --device /dev/kfd --device /dev/dri --publish 9090:9090 --volume ~/invokeai:/invokeai --env GPU_DRIVER=rocm --privileged ghcr.io/invoke-ai/invokeai:v4.2.9-rocm
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
[2024-09-08 22:31:58,149]::[InvokeAI]::INFO --> Patchmatch initialized
[2024-09-08 22:31:58,665]::[InvokeAI]::INFO --> Using torch device: CPU
[2024-09-08 22:31:58,932]::[InvokeAI]::INFO --> cuDNN version: 8902
[2024-09-08 22:31:58,945]::[uvicorn.error]::INFO --> Started server process [1]
[2024-09-08 22:31:58,945]::[uvicorn.error]::INFO --> Waiting for application startup.
[2024-09-08 22:31:58,945]::[InvokeAI]::INFO --> InvokeAI version 4.2.9
[2024-09-08 22:31:58,945]::[InvokeAI]::INFO --> Root directory = /invokeai
[2024-09-08 22:31:58,945]::[InvokeAI]::INFO --> Initializing database at /invokeai/databases/invokeai.db
[2024-09-08 22:31:59,034]::[uvicorn.error]::INFO --> Application startup complete.
[2024-09-08 22:31:59,035]::[uvicorn.error]::INFO --> Uvicorn running on http://0.0.0.0:9090 (Press CTRL+C to quit)

I believe that to really get it working, I need to rebuild the image with that --env GPU_DRIVER=rocm flag applied.
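One quick way to tell whether the bundled torch can see the GPU at all (rather than the env variable simply being ignored) is to ask it directly inside the container; the python executable name is an assumption (it may be python3), and torch.version.hip printing None would mean the image ships a CUDA/CPU-only torch build:

# run inside the container's shell
python -c 'import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())'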

And yeah, rocm-smi is not found inside the Docker container:

root@91021869ee03:/opt/invokeai# rocm-smi
bash: rocm-smi: command not found
root@91021869ee03:/opt/invokeai# 

Shall I make an image based on this image, but add a build step of my own that installs it? Am I thinking in the right direction?

My bad, I don’t think it’s an actual issue.

Docker started to cache errors on top of errors, so I began looking for other ways to solve this issue, and I’ve heard that this may be the way.

Yet it doesn’t work, but I think this is an issue caused by how the maintainers have built this image. I may try other versions of this Docker image; maybe I’m using a faulty one.

Ah, so the image ships software that doesn’t support ROCm :)

Indeed. This is the part that’d be easier to manage via dockerTools if this program were packaged nixpkgs-style. You could try your luck with something like nix bundle --bundler github:NixOS/bundlers#toDockerImage github:nixified-ai/flake#invokeai-amd, but with the disclaimer that it probably reuses some of the pip/PyPI artifacts (possibly via poetry2nix) and may not necessarily provide the same experience as nixpkgs.
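If anyone tries that route, the bundler should produce an ordinary image archive that docker load can pick up; the output link name and the resulting image tag are assumptions, so check what the command actually created:

nix bundle --bundler github:NixOS/bundlers#toDockerImage github:nixified-ai/flake#invokeai-amd
# output link name is an assumption; inspect what nix bundle produced before loading
docker load < ./result
docker run --device /dev/kfd --device /dev/dri --publish 9090:9090 <loaded-image-tag>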

rocm-smi

You need to include rocm-smi in the image (or mount it in from the host, like it’s unfortunately done with nvidia).
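A rough sketch of baking it in on top of the published image; whether the base image is Debian/Ubuntu-flavoured and whether its repositories carry a rocm-smi package at all are assumptions (AMD’s ROCm apt repository may be needed instead). Note that rocm-smi is only a monitoring tool, so installing it won’t by itself make torch use the GPU:

# sketch: derived image with rocm-smi added (package name and availability are assumptions)
docker build -t invokeai-rocm-smi - <<'EOF'
FROM ghcr.io/invoke-ai/invokeai:v4.2.9-rocm
RUN apt-get update && apt-get install -y --no-install-recommends rocm-smi && rm -rf /var/lib/apt/lists/*
EOF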