Improving NixOS data science infrastructure: CI for MKL&CUDA

danieldk · August 25, 2020, 1:57pm

eadwu:

In case anyone comes across this: I’m not sure how strict a dependency this is, but it seems to prefer CUDA 10.1 (or at least one of the executables links against a CUDA 10.1 library):
    cudnn = pkgs.cudnn_cudatoolkit_10_1;
    cudatoolkit = pkgs.cudatoolkit_10_1;

They are now including CUDA in the prebuilt binaries, which makes it even easier to package. The only downside is that libtorch_cuda.so is now a 709 MiB binary ;).

danieldk · August 31, 2020, 2:36pm

We now have a derivation python3Packages.pytorch-bin with CUDA support:

https://github.com/NixOS/nixpkgs/pull/96669

This should help those who want to avoid the heavy build of python3Packages.pytorch. I also did a PR for libtorch-bin for the C++ API (which is also used by e.g. the Rust tch crate), so hopefully we’ll have that soon as well:

https://github.com/NixOS/nixpkgs/pull/96488
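For anyone who wants to try the binary package, here is a minimal sketch of a shell.nix. It is not taken from the PR. It assumes a nixpkgs revision that already contains python3Packages.pytorch-bin, and it assumes the package is marked unfree, hence allowUnfree.

```nix
# shell.nix: a sketch, assuming your nixpkgs already has pytorch-bin.
{ pkgs ? import <nixpkgs> {
    # Assumption: the prebuilt package is marked unfree.
    config.allowUnfree = true;
  }
}:

pkgs.mkShell {
  buildInputs = [
    # Python interpreter with the prebuilt PyTorch on its module path.
    (pkgs.python3.withPackages (ps: [ ps.pytorch-bin ]))
  ];
}
```

Inside nix-shell, running python -c "import torch; print(torch.cuda.is_available())" is a quick way to check whether CUDA is picked up.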

danieldk · September 5, 2020, 7:12am

Forgot to add: the upstream builds use MKL as their BLAS library. This should generally give better performance than multi-threaded OpenBLAS, which is our default system-wide BLAS and is therefore what python3Packages.pytorch uses by default. Multi-threaded OpenBLAS also does not work correctly if your application uses any kind of threading.

Unfortunately, on AMD Ryzen CPUs, MKL will use slower SSE kernels. You can force the use of AVX2 kernels with the MKL version that libtorch/PyTorch use, with export MKL_DEBUG_CPU_TYPE=5.
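If you would rather not export the variable by hand, one option is to set it in a development shell. This is a minimal sketch under assumptions: it relies on mkShell exporting extra attributes as environment variables, on python3Packages.pytorch-bin being available in your nixpkgs, and on that package being marked unfree. The variable only has an effect with MKL versions that still honor it.

```nix
# shell.nix: a sketch that scopes the MKL workaround to one shell.
{ pkgs ? import <nixpkgs> { config.allowUnfree = true; } }:

pkgs.mkShell {
  buildInputs = [ pkgs.python3Packages.pytorch-bin ];

  # Extra attributes are exported as environment variables in the shell,
  # so this is equivalent to: export MKL_DEBUG_CPU_TYPE=5
  MKL_DEBUG_CPU_TYPE = "5";
}
```

Scoping the variable to a shell keeps it from affecting unrelated programs that link against a different MKL.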

eadwu · October 16, 2020, 1:57am

I’d assume this thread is where most of the people affected would be.

If you’re on kernel 5.9, you’ll need to circumvent the GPL-condom to use nvidia-uvm for CUDA. I spent more time than I’d like debugging in the wrong places.

danieldk · October 23, 2020, 12:06pm

Thanks for the heads-up! I was wondering why I was getting CUDA error: unknown error errors. Some straceing revealed that /dev/nvidia-uvm could not be opened. Manual modprobeing showed an error that reminded me of your comment.

It’s annoying to run into these Linux ↔ NVIDIA licensing shenanigans when you are just trying to get work done.

brogos · October 23, 2020, 2:14pm

I haven’t tested it, but there is this patch: https://github.com/Frogging-Family/nvidia-all/blob/f1d3c6cf024945e7a477ed306bd173fa6b81d72d/patches/kernel-5.9.patch

Officially, we need to wait a month for NVIDIA to fix it; see the Phoronix article “NVIDIA Doesn't Expect To Have Linux 5.9 Driver Support For Another Month”.

danieldk · October 23, 2020, 6:24pm

Luckily NixOS makes it so easy to switch kernels, so it’s not a real problem to stick to a slightly older kernel.
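As a sketch of what pinning the kernel looks like in configuration.nix: the attribute name pkgs.linuxPackages_5_8 is an assumption about what your channel provides, so substitute whichever older kernel set exists in your nixpkgs.

```nix
# /etc/nixos/configuration.nix (fragment): a sketch, not a full config.
{ pkgs, ... }:

{
  # Pin the kernel package set to a release the NVIDIA driver supports.
  # Assumption: linuxPackages_5_8 exists in the channel you are on.
  boot.kernelPackages = pkgs.linuxPackages_5_8;
}
```

After nixos-rebuild boot and a reboot, uname -r should report the pinned kernel.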

alexv · November 17, 2020, 12:17am

I have tried to use the new BLAS/LAPACK infrastructure to build R with MKL, and the resulting R produces incorrect results for matrix multiplication (see my comment on NixOS/nixpkgs pull request #83888, “Add BLAS/LAPACK switching mechanism” by matthewbauer). I have opened NixOS/nixpkgs issue #104026, “R built with MKL computes incorrect results”, which has the code I used to build R.
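For readers unfamiliar with the switching mechanism, here is a minimal sketch of the general form: an overlay that overrides the blas and lapack providers. This is not the exact code from the issue, which is only available behind the link. The overlay name mklOverlay is made up for illustration, and allowUnfree is needed because MKL is unfree.

```nix
# r-mkl.nix: a sketch of the BLAS/LAPACK switching mechanism.
let
  # Hypothetical overlay name; swaps the system-wide providers for MKL.
  mklOverlay = self: super: {
    blas = super.blas.override { blasProvider = self.mkl; };
    lapack = super.lapack.override { lapackProvider = self.mkl; };
  };

  pkgs = import <nixpkgs> {
    overlays = [ mklOverlay ];
    config.allowUnfree = true; # MKL is unfree
  };
in
# Everything depending on blas/lapack is rebuilt, including R.
pkgs.R
```

Building this with nix-build r-mkl.nix rebuilds a large part of the package set, and given the report above, results from the resulting R should be checked against a known-good build.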

brogos · December 15, 2020, 10:45pm

@danieldk CUDA and OpenCL are working with kernel 5.9 in the latest version of the NVIDIA driver.

danieldk · December 16, 2020, 6:11am

It is. I switched to 5.9 a while ago (maybe 2 weeks?) and have been using CUDA a lot with Torch.