Announcing the @NixOS/cuda-maintainers team and a call for maintainers!

I’m excited to announce the launch of the @NixOS/cuda-maintainers team. Our goal is to make Nix/NixOS the first choice for users of CUDA-related software, especially in machine learning and scientific computing.

Motivation

GPUs are essential in the machine learning and scientific computing world. However, installing and managing driver versions, CUDA toolkit versions, and related software is notoriously challenging. Navigating the litany of version constraints is a massive PITA. First-class support for Nvidia/CUDA and reproducible software toolchains is a massive selling point for many in the field, and has the potential to bring in new kinds of investment and users into the Nix/NixOS ecosystem!

Taking a cue from the lovely @NixOS/darwin-maintainers and Marketing teams, it’s our mission to do the same for GPU-accelerated scientific computing!

Initial roadmap

Here are a few things on the immediate TODO list:

  • CI and caching infrastructure for packages depending on CUDA. Hydra currently does not build or cache any software that depends on CUDA. See this thread.
  • Create a suite of GPU-enabled tests for software in Nixpkgs. The Nix build environment does not allow any access to the GPU, which means that we currently have no tests of actual GPU behavior.
  • Nvidia “data center” driver support in NixOS so that we can use V100/A100 GPUs. See this thread.
  • Upgrade the current cudatoolkit and cudatoolkit_11 versions in Nixpkgs. The cudatoolkit alias is especially outdated.

How can I help?

  • Ping @samuela to join the @NixOS/cuda-maintainers GitHub team.
  • If you have spare x86_64-linux cycles to run a GitHub Actions runner, let @samuela know. If you have an Nvidia GPU as well that’s even better!
  • If your academic lab or company would benefit from CUDA support and maintenance, please reach out re sponsorship. DM @samuela on Discourse!

This is a great initiative!

ML/DL is a big market where Nix has a lot to offer: in handling complex builds (where the mainstream conda and docker just struggle), and in handling deployments (potentially beating LMod, docker, conda thanks to the immutable store and fast cache). What Nix needs to prevail in this domain is, in one word, “UX”.

The lack of a binary cache, disabled GPU tests, frequent rebuilds of heavy packages, hardships in deploying CUDA and graphics applications outside NixOS (/run/opengl-driver/lib), and the complexity of overlaying or extending the Python package set are scaring potential users away. With proper leadership and coordination with other nixpkgs teams, I do not see any reason we couldn’t address all of these issues.

Some additional concerns:

  • A place for immediate communication? There’s #datascience:nixos.org. I think it’s the same audience, and it’s already discoverable.
  • RE: frequent rebuilds. We’ll be interested in Python packages a lot. One trait of Python packages is that they have many propagatedBuildInputs which do not actually affect the build output. As a result, one can trigger a pytorch rebuild by perturbing something irrelevant like python3Packages.pillow. This means that even if we set up CI and a binary cache, the user risks a cache miss with any tiny overlay. It also means slow iteration. Note that these “propagated” dependencies are not a fit for passthru.extras-require; they call for a new propagatedXXXInputs attribute that would make them appear in the user’s PYTHONPATH automatically without triggering a rebuild. This might also require special care for these packages’ .dist-info directories, because pip might query them, but ultimately the intuition is: you want to rebuild pytorch/tf as seldom as possible.
  • RE: /run/opengl-driver/lib. We need nix run to work on other distributions and support graphics. I know NixGL exists, but in practice I only needed it on NixOS (because of libc versions in different nixpkgs). On macOS, nix run .#some-opengl-demo simply worked. On Arch Linux and Ubuntu I tried manually symlinking the system’s libraries into /run/opengl-driver/lib, and it worked on all the machines I cared about. This is probably not stable, but it should work at least sometimes, and sometimes is better than never. I think we should try to spread the convention of exposing the graphics and CUDA drivers under /run/opengl-driver/lib to other distributions. This could start as a PPA and AUR package(s) that ship an /etc/ld.so.conf.d/opengl-driver.conf and a systemd-tmpfiles config to maintain the symlinks.
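
To make the rebuild concern from the second bullet concrete, here is a minimal sketch of the kind of “tiny overlay” in question. The file name and the specific tweak (disabling pillow’s tests) are illustrative assumptions, and whether pytorch actually lists pillow among its propagatedBuildInputs depends on the nixpkgs revision in use.

```nix
# overlay.nix (hypothetical): touches only pillow, nothing else.
final: prev: {
  python3 = prev.python3.override {
    packageOverrides = pyFinal: pyPrev: {
      pillow = pyPrev.pillow.overridePythonAttrs (old: {
        # Any change to the derivation changes its store hash,
        # even one that leaves the installed files identical.
        doCheck = false;
      });
    };
  };
}
```

With an overlay like this applied, every Python package that has pillow in its propagatedBuildInputs gets a new derivation hash as well, so it would no longer match anything a shared binary cache built from unmodified Nixpkgs and would be rebuilt locally.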

This is a great idea! I’m not on IRC, but if there’s already a bunch of people on it I’ll check it out.

I agree that this is tricky. But I’m not sure that fixing it will be easy. Eg, if a package is changed by an overlay, I would want things that depend on it to run through their test suites and make sure they build with the new version. This mess is part of the reason that I made sure to separate jax/jaxlib and the impetus for this PR. For packages like JAX/TF/PyTorch I would be more than happy to adopt a practice of never putting them in propagatedBuildInputs.

Happy to have you on board! I’m excited for what we will accomplish!


This is true not only for propagated inputs but for all kinds of buildInputs: the closer to the root of the dependency graph you change something, the more has to be rebuilt, often without changing much besides the references.

Eg, if a package is changed by an overlay, I would want things that depend on it to run through their test suites

Well, in this case it could be passthru.tests that depend on these changing packages.
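
As a rough sketch of what such a test could look like, the file below defines a standalone smoke test that a package could expose under passthru.tests. The file name, the test name, and the choice of pytorch plus pillow are assumptions for illustration only; this is not an existing test in Nixpkgs.

```nix
# smoke-test.nix (hypothetical): import-check a heavy package together
# with a dependency that tends to change underneath it.
{ runCommand, python3 }:

let
  # Python environment containing both packages.
  env = python3.withPackages (ps: [ ps.pytorch ps.pillow ]);
in
runCommand "pytorch-pillow-smoke-test" { } ''
  # The build fails if either import fails.
  ${env}/bin/python -c "import torch, PIL"
  touch $out
''
```

A package could then reference it with something like passthru.tests.pillow = callPackage ./smoke-test.nix { };. Because passthru attributes are not part of the derivation itself, adding or changing such a test does not change the package’s hash. On its own this only illustrates the mechanism: it does not avoid rebuilds as long as pillow remains in the package’s propagatedBuildInputs.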

I’m not on IRC

It’s Matrix. Among the relevant rooms there are at least:

Cool, ok, just joined those two rooms/channels. We could also create one for cuda-maintainers or just use Discourse DMs with all of us. The pro of Discourse DMs is that they would also go to my email, and I’m afraid that I won’t check Matrix often since I don’t use it otherwise.

As a quick progress update:

  • It looks like @illustris may have solved the V100/A100 driver issue here. But we’ll need to upstream any patches/document how it works.
  • I cleaned up the NVIDIA wiki page a bit and added a section for GPU compute use cases.
  • @mcwitt has a PR for exposing compute-sanitizer and nsys in cudatoolkit. He also uncovered a bug in cuda-memcheck in the process. We’re working on getting that merged.
  • Shout out to @kmittman for patiently explaining to me the intricacies of CUDA/driver packaging!

Update: I’ve kicked off work on a GPU test suite for Nix software: samuela/cuda-nix-testsuite on GitHub. Check it out and add your own tests!


Looking at

…one more line of work, in addition to validation in CI, should be reducing the maintenance cost via some sort of update scripts

disclaimer: this has already been brought up in #python:nixos.org

Yes, auto-update scripts are always handy. FWIW jaxlib-bin already has something like that (https://github.com/NixOS/nixpkgs/blob/9f60e300d363c73d148d50bd6191950697cff01c/pkgs/development/python-modules/jaxlib/prefetch.sh). I suspect that the source builds of things like jaxlib, tensorflow, etc are going to be more annoying to automate based on the way that bazel builds work in nix atm.

We’ve now got #cuda:nixos.org (thanks @grahamc) where we can have a broader discussion and answer CUDA-related questions without creating noise in GitHub issues or this Discourse thread (which we can reserve for updates).
