Trying to package tensorflow-rocm

I have an AMD RX 5500 and wanted to use it for casual ML, so here is the shell I made. To use it, open a terminal and run nix-shell, then create a Python virtual environment with virtualenv .venv and install the library with pip install tensorflow-rocm.
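
Spelled out, the steps look roughly like this (assuming the shell file below is saved as shell.nix):

# one-time setup
nix-shell
virtualenv .venv
. .venv/bin/activate
pip install tensorflow-rocm

It should then pass the following test: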

import tensorflow as tf

# should list at least one physical GPU if ROCm picks up the card
print(tf.config.list_physical_devices('GPU'))

But it does not work when I actually try to run an ML workload. I see the following errors in the kernel log:

kernel: amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
kernel: amdgpu: Pasid 0x8016 DQM create queue type 0 failed. ret -62
kernel: amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
kernel: amdgpu: Failed to evict process queues
kernel: amdgpu: Failed to quiesce KFD
kernel: amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
kernel: amdgpu: Didn't find vmid for pasid 0x8016

I think it is either because my GPU is not supported (but then why does CUDA work on all NVIDIA GPUs?), or because I am forcing a GFX version by setting the HSA_OVERRIDE_GFX_VERSION environment variable to an incorrect value.
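
For what it's worth, here is a quick way to check which ISA the runtime actually reports for the card (assuming rocminfo is available, e.g. from the rocminfo package):

# list the ISA/architecture names the ROCm runtime reports for each agent
rocminfo | grep -i gfx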

Here is the pip list output, for historical purposes.

Package                      Version
---------------------------- ----------
absl-py                      1.4.0
astunparse                   1.6.3
cachetools                   5.3.1
certifi                      2023.7.22
charset-normalizer           3.2.0
flatbuffers                  23.5.26
gast                         0.4.0
google-auth                  2.23.0
google-auth-oauthlib         1.0.0
google-pasta                 0.2.0
grpcio                       1.58.0
h5py                         3.9.0
idna                         3.4
jax                          0.4.14
keras                        2.12.0
libclang                     16.0.6
Markdown                     3.4.4
MarkupSafe                   2.1.3
ml-dtypes                    0.2.0
numpy                        1.23.5
oauthlib                     3.2.2
opt-einsum                   3.3.0
packaging                    23.1
pip                          23.2.1
protobuf                     4.24.3
pyasn1                       0.5.0
pyasn1-modules               0.3.0
requests                     2.31.0
requests-oauthlib            1.3.1
rsa                          4.9
scipy                        1.11.2
setuptools                   68.1.2
six                          1.16.0
tensorboard                  2.12.3
tensorboard-data-server      0.7.1
tensorflow-estimator         2.12.0
tensorflow-io-gcs-filesystem 0.34.0
tensorflow-rocm              2.12.0.560
termcolor                    2.3.0
typing_extensions            4.7.1
urllib3                      1.26.16
Werkzeug                     2.3.7
wheel                        0.41.2
wrapt                        1.14.1

And the shell file itself.

let
  pkgs = import <nixpkgs> {};
  inherit (pkgs) stdenv;
  amdgpuVersions = {
    gfx1030 = "10.3.0";
    gfx900 = "9.0.0";
    gfx906 = "9.0.6";
    gfx908 = "9.0.8";
    gfx90a = "9.0.a";
  };
  libs = pkgs.lib.makeLibraryPath (builtins.attrValues {
    inherit (pkgs.llvmPackages_rocm) libunwind;
    inherit
      (pkgs)
      rocm-runtime
      rocm-opencl-runtime
      rocm-comgr
      rocm-smi
      miopengemm
      rocblas
      ncurses
      sqlite
      libelf
      libdrm
      numactl
      rocrand
      hipfft
      miopen
      hip
      rccl
      ;
    inherit (stdenv.cc.cc) lib;
  });
  python = pkgs.python310;
in
  stdenv.mkDerivation {
    name = "dev-env";

    env = {
      LD_LIBRARY_PATH = libs;
      CUDA_PATH = pkgs.cudaPackages.cudatoolkit;
      CUDNN_PATH = pkgs.cudaPackages.cudnn;
      OCL_ICD_VENDORS = "${pkgs.rocm-opencl-icd}/etc/OpenCL/vendors/";
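      # NOTE: the RX 5500 is actually gfx1012, which is not in the list above;
      # forcing gfx908 here is a guess (see the discussion further down).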
      HSA_OVERRIDE_GFX_VERSION = amdgpuVersions.gfx908;
    };

    buildInputs = builtins.attrValues {
      python = python.withPackages (ps:
        builtins.attrValues {
          inherit
            (ps)
            virtualenv 
            ;
        });
    };

    shellHook = ''
      # the venv is created manually (virtualenv .venv); only activate it if it exists
      if [ -e .venv/bin/activate ]; then
        . ./.venv/bin/activate
      fi
    '';
  }

Maybe all this information will inspire someone, or a complete solution already exists and I just don't know about it :man_shrugging:.

Finally, does anyone recognize the following error?

       last 10 log lines:
       > Sourcing python-namespaces-hook
       > Sourcing python-catch-conflicts-hook.sh
       > unpacking sources
       > unpacking source archive /nix/store/jy636qwb5bzyag808miwzcm8v1gac7n4-source
       > source root is source
       > setting SOURCE_DATE_EPOCH to timestamp 315619200 of file source/tools/tf_env_collect.sh
       > patching sources
       > configuring
       > configure flags: --prefix=/nix/store/h76wdymkn3scp1swral7h85dkkqx9kn3-python3.10-tensorflow-rocm-2.13.0 --bindir=/nix/store/h76wdymkn3scp1swral7h85dkkqx9kn3-python3.10-tensorflow-rocm-2.13.0/bin --sbindir=/nix/store/h76wdymkn3scp1swral7h85dkkqx9kn3-python3.10-tensorflow-rocm-2.13.0/sbin --includedir=/nix/store/h76wdymkn3scp1swral7h85dkkqx9kn3-python3.10-tensorflow-rocm-2.13.0/include --oldincludedir=/nix/store/h76wdymkn3scp1swral7h85dkkqx9kn3-python3.10-tensorflow-rocm-2.13.0/include --mandir=/nix/store/h76wdymkn3scp1swral7h85dkkqx9kn3-python3.10-tensorflow-rocm-2.13.0/share/man --infodir=/nix/store/h76wdymkn3scp1swral7h85dkkqx9kn3-python3.10-tensorflow-rocm-2.13.0/share/info --docdir=/nix/store/h76wdymkn3scp1swral7h85dkkqx9kn3-python3.10-tensorflow-rocm-2.13.0/share/doc/python3.10-tensorflow-rocm --libdir=/nix/store/h76wdymkn3scp1swral7h85dkkqx9kn3-python3.10-tensorflow-rocm-2.13.0/lib --libexecdir=/nix/store/h76wdymkn3scp1swral7h85dkkqx9kn3-python3.10-tensorflow-rocm-2.13.0/libexec --localedir=/nix/store/h76wdymkn3scp1swral7h85dkkqx9kn3-python3.10-tensorflow-rocm-2.13.0/share/locale
       > /nix/store/vfdg65hiv4bwls48588msw8la7452w2q-stdenv-linux/setup: line 1299: ./configure: cannot execute: required file not found

It happens when I try to compile from source. My current guess: stdenv's generic configure phase finds TensorFlow's ./configure script, whose /usr/bin/env shebang does not resolve inside the Nix build sandbox, hence "cannot execute: required file not found".

let
  pkgs = import <nixpkgs> {};
  inherit (pkgs) stdenv;
  python = pkgs.python310;
in
  stdenv.mkDerivation {
    name = "dev-env";

    buildInputs = builtins.attrValues {
      python = python.withPackages (ps:
        builtins.attrValues {
          inherit
            (ps)
            virtualenv
            ;
          tensorflow-rocm = ps.buildPythonPackage rec {
            pname = "tensorflow-rocm";
            version = "2.13.0";
            src = pkgs.fetchFromGitHub {
              owner = "ROCmSoftwarePlatform";
              repo = "tensorflow-upstream";
              rev = "v${version}";
              sha256 = "sha256-Rq5pAVmxlWBVnph20fkAwbfy+iuBNlfFy14poDPd5h0=";
            };

            doCheck = false;
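            # Hypothesis (untested): skipping the generic configure phase, or
            # patching the ./configure shebang with patchShebangs, might get
            # past the "cannot execute" error shown above.
            dontConfigure = true;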
            nativeBuildInputs = [
              python
              pkgs.bazel
            ];
          };
        });
    };
  }

That is an interesting ROCm-related topic, finally :grin:

I used to run a 5500 XT 8GB not too long ago. And sadly yes, to my current knowledge there is no official support for this card in ROCm right now. However, there have been some attempts to hack something together, mainly by recompiling bits and pieces of ROCm for the right LLVM target architecture, but I never tried making use of these on NixOS.

LLVM supplies a table for all the GPU architecture targets here:

https://www.llvm.org/docs/AMDGPUUsage.html#amdgpu-processors

As you can see from this table, the RX 5500 and 5500 XT are gfx1012. The officially supported gfx1030 target corresponds to the Navi 21 architecture, found e.g. in the Radeon Pro W6800 and the Radeon RX 6800 and 6800 XT.

As for my former card and yours, which are Navi 14, you might want to look at what xuhuisheng did:

https://github.com/xuhuisheng/rocm-build/tree/master/navi14

You may also want to carefully read this GitHub issue:

https://github.com/RadeonOpenCompute/ROCm/issues/1735

Please let me know here if you find a working solution, or even a partial one. Maybe I can still put my old 5500 XT to some use with ROCm then :slight_smile:

Edit: Just to make this clear, because the topic mentions only tensorflow: the problem is not tensorflow, I’d say. Rather, ROCm itself and some of its component libraries need to be recompiled for the specific LLVM architecture to make this work, in principle. And even then there is no guarantee you will not get errors, crashes, or the like.

Edit: After reading the GitHub issue again myself, I noticed that someone has now apparently managed to build a Docker image for ROCm targeting gfx1012, with PyTorch available. A link to the dockerfile can be found in this comment:

https://github.com/RadeonOpenCompute/ROCm/issues/1735#issuecomment-1671128570

Thanks for the links! That’s interesting; maybe I’ll try to build ROCm with navi14 support sometime. In the meantime, I’ve created https://github.com/knightpp/nix-tensorflow-rocm, a flake version of tensorflow-rocm. It does not work yet because of a collision:

error: collision between `/nix/store/nz375fa4snlkwsmh651rykm1bl6xprpz-python3.10-tensorboard-2.11.0/bin/tensorboard' and `/nix/store/48s18kj0ckjwfwmr3c4qb840v0aa6g2y-python3.10-tensorflow-rocm-2.11.1.550/bin/tensorboard'

I think I’ll need to set ignoreCollisions = true, but that forces a rebuild of rocFFT (which is extremely slow) :sob:

I guess my hope was to motivate you to try the Docker image that the other guy has already built, and then give me and potentially others here some feedback on how it goes :smiley:

The dockerfile I mentioned contains all the instructions to compile ROCm for the correct target on Ubuntu, plus two patches that are apparently needed for the Navi 14 architecture. The guy has even built the image and put it on Docker Hub for anyone to pull, in case you don’t have enough RAM to compile all that.

In principle, you should be able to run ROCm on NixOS via this Docker image. So if it works, you should have pytorch-rocm available inside a container created from the image.

A side note: there is also a tensorflow-rocm Docker container.

https://rocm.docs.amd.com/en/latest/how_to/tensorflow_install/tensorflow_install.html

So, bottom line: I don’t see the added benefit of forcing all this into Nix when most of the resources are available via containerization. For tensorflow-rocm you could also just set up a Python project and install it via pip or poetry or whatever you like into a venv. That would evidently require a working ROCm setup on your machine; hence the Docker approach :slight_smile:

In the end I might try it myself at some point; I still have a 5500 XT and a 5700 XT which I no longer use. And I am in the process of building a little home server, so maybe that would be a nice project: making them work in a VM for Stable Diffusion or something. We’ll see.

Yeah, I agree, I got distracted; learning Nix is more fun than writing Python :man_shrugging:. I’ll try both the official container and the custom one and see which works.

The custom-built container docker.io/serhiin/rocm_gfx1012_pytorch:ubuntu2204_rocm543_pytorch21 does work with PyTorch on MNIST.
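
In case anyone wants to reproduce this, something along these lines should work (the device flags are the standard ones for ROCm in Docker; the exact invocation may differ from what I used):

# pass the GPU devices through to the container
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --security-opt seccomp=unconfined --group-add video \
  docker.io/serhiin/rocm_gfx1012_pytorch:ubuntu2204_rocm543_pytorch21 bash
# then, inside the container, a quick sanity check (assuming python3 is on PATH):
python3 -c 'import torch; print(torch.cuda.is_available())'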

But tensorflow-rocm from PyPI does not work in the container:

2023-09-16 15:54:23.455752: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2011] Ignoring visible gpu device (device: 0, name: AMD Radeon RX 5500, pci bus id: 0000:03:00.0) with AMDGPU version : gfx1012. The supported AMDGPU versions are gfx1030, gfx900, gfx906, gfx908, gfx90a.
2023-09-16 15:54:23.469216: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:508] ROCm Fusion is enabled.

I’m not sure whether I can force the .whl package to use the container’s ROCm libs.

If pytorch actually works properly, that is already a big step! You now have a usable platform for some deep learning experimentation! :rocket: :tada:

For tensorflow, I’d say you will have to recompile it to include support for gfx1012, which you can do in Docker as well. The dockerfiles appear to be on GitHub:

https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/ec6152a25de72ad2a2384f14323713e04583f27c/tensorflow/tools/tf_sig_build_dockerfiles/Dockerfile.rocm#L11

If I am not mistaken, you might just try setting the GPU_DEVICE_TARGETS environment variable to include gfx1012, and then try to build and run that build of tensorflow inside the container with working ROCm.
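
A rough sketch of what I mean (untested; it assumes appending gfx1012 to the target list in Dockerfile.rocm is enough):

# clone the ROCm TensorFlow fork
git clone https://github.com/ROCmSoftwarePlatform/tensorflow-upstream
cd tensorflow-upstream
# append gfx1012 to the GPU_DEVICE_TARGETS list in Dockerfile.rocm
sed -i 's/GPU_DEVICE_TARGETS="[^"]*/& gfx1012/' \
  tensorflow/tools/tf_sig_build_dockerfiles/Dockerfile.rocm
# build the image with the widened target list
docker build -t tf-rocm-gfx1012 \
  -f tensorflow/tools/tf_sig_build_dockerfiles/Dockerfile.rocm .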

If you search through the entire repo, you’ll see some other files that also reference these targets; you might set things there as well before recompiling.

So you will probably have to fork, or at least clone, the whole thing, make some changes, and take it from there.

But that is considerably more work than just playing with pytorch a bit first…

By the way, if you ever decide to make a Docker image for recompiling tensorflow, please share the dockerfile @knightpp :slight_smile: