Announcing the @NixOS/cuda-maintainers team and a call for maintainers!

samuela · March 8, 2022, 12:42am

I’m excited to announce the launch of the @NixOS/cuda-maintainers team. Our goal is to make Nix/NixOS the first choice for users of CUDA-related software, especially in machine learning and scientific computing.

Motivation

GPUs are essential in the machine learning and scientific computing world. However, installing and managing driver versions, CUDA toolkit versions, and related software is notoriously challenging. Navigating the litany of version constraints is a massive PITA. First-class support for Nvidia/CUDA and reproducible software toolchains is a massive selling point for many in the field, and has the potential to bring in new kinds of investment and users into the Nix/NixOS ecosystem!

Taking cue from the lovely @NixOS/darwin-maintainers and Marketing teams, it’s our mission to do the same for GPU-accelerated scientific computing!

Initial roadmap

Here are a few things on the immediate TODO list:

CI and caching infrastructure for packages depending on CUDA. Hydra currently does not build or cache any software that depends on CUDA. See this thread.
Create a test suite of GPU-enabled tests for software in Nixpkgs. The nix build environment does not allow any access to the GPU which means that we currently do not have any tests of actual GPU behavior.
Nvidia “data center” driver support in NixOS so that we can use V100/A100 GPUs. See this thread.
Upgrade the current cudatoolkit and cudatoolkit_11 versions in Nixpkgs. The cudatoolkit alias is especially outdated.

How can I help?

Ping @samuela to join the @NixOS/cuda-maintainers GitHub team.
If you have spare x86_64-linux cycles to run a GitHub Actions runner, let @samuela know. If you have an Nvidia GPU as well that’s even better!
If your academic lab or company would benefit from CUDA support and maintenance, please reach out re sponsorship. DM @samuela on Discourse!

SergeK · March 10, 2022, 2:37am

This is a great initiative!

ML/DL is a big market where Nix has a lot to offer: in handling complex builds (where the mainstream conda and docker just struggle), and in handling deployments (potentially beating LMod, docker, conda thanks to the immutable store and fast cache). What Nix needs to prevail in this domain in one word is “UX”.

Lack of binary cache, disabled GPU tests, frequent rebuilds of heavy packages, hardships in deploying cuda&graphics applications outside NixOS (/run/opengl-driver/lib), the complexity of overlaying or extending the python packageset are scaring potential users away. With proper leadership and coordination with other nixpkgs teams - I do not see any reasons we couldn’t address all of these issues.

Some additional concerns:

A place for immediate communication? There’s #datascience:nixos.org. I think it’s the same public, and it’s already discoverable
RE: frequent rebuilds. We’ll be interested in python packages a lot. One trait of python packages is that they have many propagatedBuildInputs which actually do not affect the build output. As a result one can trigger a pytorch rebuild by perturbing something irrelevant like python3Packages.pillow. This means that even if we setup a CI and binary cache, the user risks a cache-miss with any tiny overlay. This also means slow iterations. Note that these “propagated” dependencies are not fit for passthru.extras-require but demand for a new propagatedXXXInputs attribute, that would cause them to appear in the user’s PYTHONPATH automatically, but would not trigger a rebuild. This might also require special care for these packages’ .dist-infos, because these might be queried for by pip, but ultimately the intuition is: you want to rebuild pytorch/tf as seldom as possible.
RE: /run/opengl-driver/lib. We need nix run to run on other distributions and support graphics. I know NixGL exists but in practice, I only had need for it on NixOS (because of libc versions in different nixpkgs). On macos nix run .#some-opengl-demo simply worked. On archlinux and ubuntu I tried manually symlinking system’s libraries into /run/opengl-driver/lib and on all the machines I cared about - it worked. This is probably not stable, but it should work at least sometimes, and sometimes is better than never. I think we should try and spread the convention of picking out the graphics and cuda drivers into /run/opengl-driver/lib to other distributions. This could start as a PPA and an AUR package(s) that would ship a /etc/ld.so.conf.d/opengl-driver.conf and a systemd-tmpfiles config to maintain the symlinks

samuela · March 10, 2022, 6:15am

This is a great idea! I’m not on IRC, but if there’s already a bunch of people on it I’ll check it out.

SergeK:

RE: frequent rebuilds. We’ll be interested in python packages a lot. One trait of python packages is that they have many propagatedBuildInputs which actually do not affect the build output. As a result one can trigger a pytorch rebuild by perturbing something irrelevant like python3Packages.pillow . This means that even if we setup a CI and binary cache, the user risks a cache-miss with any tiny overlay. This also means slow iterations. Note that these “propagated” dependencies are not fit for passthru.extras-require but demand for a new propagatedXXXInputs attribute, that would cause them to appear in the user’s PYTHONPATH automatically, but would not trigger a rebuild. This might also require special care for these packages’ .dist-info s, because these might be queried for by pip , but ultimately the intuition is: you want to rebuild pytorch/tf as seldom as possible.

I agree that this is tricky. But I’m not sure that fixing it will be easy. Eg, if a package is changed by an overlay, I would want things that depend on it to run through their test suites and make sure they build with the new version. This mess is part of the reason that I made sure to separate jax/jaxlib and the impetus for this PR. For packages like JAX/TF/PyTorch I would be more than happy to adopt a practice of never putting them in propagatedBuildInputs.

Happy to have you on board! I’m excited for what we will accomplish!

NobbZ · March 10, 2022, 8:17am

This is not only true for propagated, but all kind of buildInputs, the closer you change something to the root of the dependency graph, the more has to be rebuilt, often not changing much but the references.

SergeK · March 10, 2022, 11:38pm

Eg, if a package is changed by an overlay, I would want things that depend on it to run through their test suites

Well, in this case it could be passthru.tests depending on these changing packages

I’m not on IRC

It’s matrix. Among the relevant ones there are at least:

samuela · March 13, 2022, 5:58am

Cool, ok just joined those two rooms/channels. We could also create one for cuda-maintainers or just use discourse DMs with all of us. The pro of discourse DMs is that it would also go to my email, and I’m afraid that I won’t check matrix often since I don’t use it otherwise.

As a quick progress update:

It looks like @illustris may have solved the V100/A100 driver issue here. But we’ll need to upstream any patches/document how it works.
I cleaned up the NVIDIA wiki page a bit and added a section for GPU compute use cases.
@mcwitt has a PR for exposing compute-sanitizer and nsys in cudatoolkit. He also uncovered a bug in cuda-memcheck in the process. We’re working on getting that merged.
Shout out to @kmittman for patiently explaining to me the intricacies of CUDA/driver packaging!

samuela · March 14, 2022, 11:59pm

Update: I’ve kicked off work on a GPU test suite for nix software: GitHub - samuela/cuda-nix-testsuite. Check it out and add your own tests!

SergeK · March 16, 2022, 11:50pm

Looking at

cuda-cudnn compatibility tables in cudnn: 8.3.0 -> 8.3.2 by samuela · Pull Request #164338 · NixOS/nixpkgs · GitHub
cudaSupport-conditional hashes in https://github.com/NixOS/nixpkgs/blob/2e80229c8180f05de3d119d726e5b12b05f0a9f9/pkgs/development/python-modules/jaxlib/default.nix#L218

cudaSupport- and platform-conditional hashes in https://github.com/NixOS/nixpkgs/blob/ed8d52174f0995f3baeba694f2edaf73d0598c0f/pkgs/development/python-modules/tensorflow/default.nix#L366

  # .../tensorflow/default.nix
  fetchAttrs = {
    # cudaSupport causes fetch of ncclArchive, resulting in different hashes
    sha256 = if cudaSupport then
      "sha256-+szc2mRoImwijzbj3nw6HmZp3DeRjjPRU5yC+5AEbkg="
    else
      if stdenv.isDarwin then
        "sha256-+bwIzp6t7gRJPcI8B5oyuf9z0AjCAyggUR7x+vv5kFs="
      else
        "sha256-5yOYmeGpJq4Chi55H7iblxyRXVktgnePtpYTPvBs538=";
  };

…one more line of work, in addition to validation in CI, should be reducing the maintenance cost via some sort of update scripts

disclaimer: this has already been brought up in #python:nixos.org

samuela · March 17, 2022, 9:14pm

Yes, auto-update scripts are always handy. FWIW jaxlib-bin already has something like that (https://github.com/NixOS/nixpkgs/blob/9f60e300d363c73d148d50bd6191950697cff01c/pkgs/development/python-modules/jaxlib/prefetch.sh). I suspect that the source builds of things like jaxlib, tensorflow, etc are going to be more annoying to automate based on the way that bazel builds work in nix atm.

SergeK · March 23, 2022, 2:09pm

We’ve now got #cuda:nixos.org (thanks @grahamc) where we can have a broader discussion and answer cuda-related questions without creating noise in github issues or this discourse thread (which we can reserve for updates)