Most good libraries select optimized kernels based on the CPUID instruction, but there are exceptions. You can also build derivations with specific -march or -mtune flags; there was a recent discussion about this.
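For example, something along these lines (an untested sketch; the package name, the skylake target, and the overlay shape are just placeholders, and attribute names may differ between nixpkgs revisions):

```nix
# Overlay sketch: rebuild one package with specific -march/-mtune flags.
# "openblasSkylake" and the skylake target are illustrative choices.
self: super: {
  openblasSkylake = super.openblas.overrideAttrs (old: {
    # stdenv's cc-wrapper picks up NIX_CFLAGS_COMPILE for every compiler call
    NIX_CFLAGS_COMPILE =
      (old.NIX_CFLAGS_COMPILE or "") + " -march=skylake -mtune=skylake";
  });
}
```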
They distribute MKL:
https://docs.anaconda.com/mkl-optimizations/index.html
IANAL, but I think the MKL license would allow us to redistribute it. The only icky part is that it requires that MKL is not modified, and the question is whether patchelf'ing counts as modification. The other roadblock is that Hydra does not build non-free software packages.
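To make that question concrete, repackaging a prebuilt MKL tarball would look roughly like this (a hypothetical sketch; the URL, version, and library paths are placeholders, and a real expression would need the proper runtime dependencies):

```nix
{ lib, stdenv, fetchurl, autoPatchelfHook }:

stdenv.mkDerivation {
  pname = "mkl-prebuilt";
  version = "2020.1";  # placeholder version
  src = fetchurl {
    url = "https://example.invalid/mkl-2020.1.tar.gz";  # placeholder URL
    sha256 = lib.fakeSha256;
  };
  # autoPatchelfHook rewrites the rpaths of the shipped .so files so they
  # resolve inside the Nix store -- exactly the patchelf'ing in question.
  nativeBuildInputs = [ autoPatchelfHook ];
  installPhase = ''
    mkdir -p $out/lib
    cp -r lib/. $out/lib/
  '';
}
```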
It’s pretty much the same story with CUDA. The CUDA Toolkit is distributed through Anaconda:
https://anaconda.org/anaconda/cudatoolkit
Licensing is more complicated for the whole CUDA stack: as far as I remember, you cannot redistribute the driver libraries outside very specific circumstances (through the NVIDIA-provided Docker container). But I think most of the toolkit can be redistributed.
Same here. I strongly prefer Nix and I think it is better, but it requires an investment. It’s like writing machine learning code in Rust: I love it, but I would still recommend that most of my colleagues stick with Python + PyTorch for the time being. Doing machine learning in Rust is going off the beaten path, and as a result you have to do a lot of the plumbing yourself.
There are many paper cuts, but our current biggest problem [1] is that we do not have packages built against MKL and CUDA in our binary cache. This makes the user experience miserable. With conda or plain old pip, you are up and running in a few seconds. With nixpkgs, you are first in for one or two hours of builds. That is, if the packages build at all: since we don’t build them in Hydra, regressions are not always noticed.
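For reference, the kind of configuration that sends you down that local-build path looks roughly like this (a sketch from memory; option and override names may vary between nixpkgs revisions):

```nix
import <nixpkgs> {
  config = {
    allowUnfree = true;   # MKL and CUDA are unfree, so Hydra will not build them
    cudaSupport = true;   # ask for CUDA-enabled variants (e.g. of pytorch)
  };
  overlays = [
    (self: super: {
      # switch the default BLAS/LAPACK provider to MKL
      blas = super.blas.override { blasProvider = super.mkl; };
      lapack = super.lapack.override { lapackProvider = super.mkl; };
    })
  ];
}
```

None of the resulting store paths are in the public binary cache, so everything downstream of BLAS and CUDA gets rebuilt locally.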
CUDA is pretty much essential to any kind of high-performance numerical computing. MKL is still the dominant BLAS/LAPACK library, because it is faster in a lot of scenarios. E.g. a few months ago I benchmarked some of my transformer models again with various BLAS libraries; MKL was ~2 times faster than the best competitor (OpenBLAS), due to having specific optimizations (e.g. batched GEMM). Another issue is that OpenBLAS is not usable in some cases, because it cannot be used in multi-threaded applications.
Another issue is that even without MKL and CUDA, builds of some libraries frequently fail, because there are still some pre-SSE4.2 machines in the Hydra cluster. So even if you intend to use e.g. our PyTorch build without MKL or CUDA, it frequently has to be built locally because it is not in the binary cache.
[1] Speaking from my own perspective, of course; machine learning + NLP is what pays my bills.