CMake and "CUDA failure 35: CUDA driver version is insufficient for CUDA runtime version"

chrism · June 29, 2023, 3:39pm

Hi folks,

I’m having a fine old time trying to get a package to compile against CUDA.

This would be an easy bucket for somebody, but I’ve already read the thread “Compiling Cuda Kernels with nvcc on nixos no driver found . cudaGetDriverVersion returns 0 . cudaRuntime Error 35 driver version insufficient”, and I’ve used addOpenGLRunpath to add the rpath to everything in the “out” output in both postBuild and postFixup (I run autoPatchelf first, then addOpenGLRunpath). Alas, I still get the error when the package’s tests are run.

So although I’m pretty sure the error is lying to me (I’m on NVIDIA driver version 525.116.04 and it is marked as usable with CUDA 11.6), I suppose I have to be sure.

The project I’m messing with is onnxruntime, a massive pile from MS. nixpkgs has a 1.13.1 release, but I need 1.14.1, and I need it built with “tensorrt” support, which the nixpkgs one is not. I’ve managed to get it compiling seemingly fine, but it still hits the error when one of the tests that uses CUDA is executed, despite all my ELF patching.

The tests are kicked off via a command something like /nix/store/fqfi0m3fw3szj3n99r5n359579808bh6-cmake-3.25.3/bin/ctest --force-new-ctest-process . What I’d like to do is strace the offending test process to see if it actually does find libcuda.so.1 (the error, maddeningly, is apparently the same whether the driver library is not found or mismatching). But I’m not sure a) how to inject strace into the ctest invocation, or b) whether strace will work at all, given that ctest appears to create a new process for each test.

So my question is: does someone with cmake-fu and nix-fu have any suggestions about how to put an strace in here so I can see what’s happening?

The nix derivation I’m hacking on is at https://github.com/mcdonc/.nixconfig/blob/4fe7b64175e7d071721997e1975623fcb3a4883f/common/obs-backgroundremoval/stripped-onnxruntime.nix and it contains the meat of one of the errors at the top in a comment.

Thanks for any thoughts!

C

chrism · July 1, 2023, 6:17am

It turns out that the GNU dynamic linker respects an environment variable named LD_DEBUG. Setting LD_DEBUG=libs in the environment of the test process displays the paths each executable and shared lib searches for its components. The message was indeed lying to me; my runtime and driver versions are compatible, it just couldn’t find the driver (libcuda.so.1).

For whatever reason (likely for repeatability), just setting this in the environment of nix-build didn’t do the trick. I had to set it in the preCheck of the derivation I’m trying to build, e.g.

  preCheck = ''
    export LD_DEBUG=libs
  '';
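LD_DEBUG=libs is quite noisy, so when running a test binary by hand it can help to filter the loader output down to the one library of interest. Below is a minimal sketch of that idea. The function name trace_lib is made up for illustration, and it assumes a glibc-based system where the loader writes its debug lines to stderr.

```shell
#!/bin/sh
# Hypothetical helper (the name is an assumption, not part of any tool):
# run a command with LD_DEBUG=libs and keep only the output lines that
# mention the given library name.
# Usage: trace_lib <library-name> <command> [args...]
trace_lib() {
  lib="$1"
  shift
  # The glibc loader prints its debug lines on stderr, so merge stderr
  # into stdout before filtering. grep -F matches the name literally,
  # so the dots in "libcuda.so.1" are not treated as wildcards.
  LD_DEBUG=libs "$@" 2>&1 | grep -F -- "$lib"
}
```

Something like `trace_lib libcuda.so.1 ./some_test_binary` (the binary name is a placeholder) should then show which directories were tried for the driver library. If nothing is printed at all, the binary never asked the loader for that name at startup, which can happen when the library is opened later via dlopen with an absolute path.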

My derivation still isn’t working, but this question is now answered.

chrism · August 1, 2023, 3:14am

For the record, this turned out to be a Nix build sandboxing issue. Temporarily turning off sandboxing got me further.
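For anyone landing here later, this is roughly how sandboxing can be switched off for a single build. Treat it as a sketch under assumptions: the wrapper name build_unsandboxed and the NIX_BUILD variable are made up for illustration, and Nix only honours the override when the invoking user is allowed to change the setting (for example a trusted user). My guess at the cause is that the driver directory that addOpenGLRunpath points at is not visible inside the sandbox, but I have not verified that.

```shell
#!/bin/sh
# Hypothetical wrapper (names are assumptions): run nix-build with the
# "sandbox" setting overridden to false for this one invocation.
# NIX_BUILD can point at a different nix-build binary if needed.
NIX_BUILD="${NIX_BUILD:-nix-build}"

build_unsandboxed() {
  # --option NAME VALUE overrides a nix.conf setting for this run only.
  "$NIX_BUILD" --option sandbox false "$@"
}
```

Usage would be along the lines of `build_unsandboxed ./stripped-onnxruntime.nix`. Since unsandboxed builds are less reproducible, this is best kept as a temporary debugging aid rather than a permanent setting.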