Python dll import

Hello,

On a non-nixos Linux system, the python installed via nix is not able to load a DLL, whereas the system python can.

More precisely,

galepage in 🌐 alya in ~ 
✦ ❮ find /usr -name libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
/usr/lib/i386-linux-gnu/libnvidia-ml.so.1

galepage in 🌐 alya in ~ 
✦ ❮ /usr/bin/python -c "from ctypes import CDLL; CDLL('libnvidia-ml.so.1')"

galepage in 🌐 alya in ~ 
✦ ❮ /home/galepage/.nix-profile/bin/python -c "from ctypes import CDLL; CDLL('libnvidia-ml.so.1')"               
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/nix/store/9srs642k875z3qdk8glapjycncf2pa51-python3-3.10.7/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libnvidia-ml.so.1: cannot open shared object file: No such file or directory

It seems that the two Pythons are not searching the same locations on the system…
Do you know why this happens?
The CDLL constructor relies on the system’s dlopen function, so I don’t get why they do not behave the same.


This makes sense: to increase reproducibility, Nix does not rely on /lib: everything must be in /nix/store/hash…. As explained here, CDLL uses dlopen to get the library, which itself searches several locations, including the rpath/runpath recorded in the loading binary and the directories listed in the LD_LIBRARY_PATH variable (not really recommended if you can avoid it)…
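To make the "first match wins" behaviour of a colon-separated search path concrete, here is a minimal Python sketch. It is only a simplified model of what the dynamic loader does with LD_LIBRARY_PATH: the real loader also consults runpaths, its cache and default directories, and checks that the file is a compatible ELF object. The function name find_in_search_path is hypothetical.

```python
import os


def find_in_search_path(soname, search_path):
    """Return the first 'directory/soname' that exists in a colon-separated
    search path, or None. Simplified model: empty entries are skipped and
    no ELF compatibility check is done (the real loader does more)."""
    for directory in search_path.split(":"):
        if not directory:
            continue
        candidate = os.path.join(directory, soname)
        if os.path.exists(candidate):
            return candidate
    return None
```

Calling find_in_search_path("libnvidia-ml.so.1", os.environ.get("LD_LIBRARY_PATH", "")) shows which file, if any, the environment variable alone would select.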

I don’t have time to try, but my guess is that you should install the library inside python3.withPackages (ps: with ps; [ … ]). If your library is not packaged already, something like this may work:

let
  myLib = pkgs.stdenv.mkDerivation {
    name = "my-lib"; # mkDerivation needs a name (or pname + version)
    src = ./.; # folder containing the .so files
    nativeBuildInputs = [
      pkgs.autoPatchelfHook # fixes up the library's own dependencies
    ];
    installPhase = ''
      mkdir -p $out/lib
      cp libnvidia-ml.so.1 $out/lib
    '';
  };
in pkgs.python3.withPackages (ps: with ps; [ myLib ])

Thank you for your reply.

Ok, this makes sense.

To give a bit more context, I am trying to package nvitop which, through the nvidia-ml-py Python library, loads the DLL libnvidia-ml.so.1.
The thing is that the latter is provided by the Nvidia driver directly.
Hence, on a non-NixOS system, I guess that nvidia-ml-py should be able to load the system library where it is already available.

In this setting, from what you say, changing $LD_LIBRARY_PATH seems to be the only way…
What do you think?

I confirm that the following is working:

galepage in 🌐 alya in ~ 
✦ ❯ LD_LIBRARY_PATH=/lib/x86_64-linux-gnu /home/galepage/.nix-profile/bin/python -c "from ctypes import CDLL; CDLL('libnvidia-ml.so.1')"

Arg… this may work as a quick and really dirty solution, but it can break in thousands of ways (for instance, python will surely also pick other libs from this folder)… And it will not be portable to other OSes that put the library somewhere else.

For a clean solution: I’m not an NVIDIA expert (and I don’t even have an NVIDIA card to test with), but I would first double-check whether this library can be packaged individually, outside of a per-OS driver (Debian does have a .deb for it, so it may even work if you just extract the .deb and copy the lib folder into your derivation). If not, then you may get inspired by cudatoolkit, which does something similar for CUDA. I’m not even sure how it works.


Ok, you are right.

I looked at the nccl package which also relies on libnvidia-ml.so.1.

This led me to discover addOpenGLRunpath.

Do you think that it could be a good solution for this problem?
Is it also suited to non-NixOS Linux platforms?

Thank you once again for your help!

The addOpenGLRunpath hook patches ELF binaries’ headers to teach the dynamic loader to look libraries up in /run/opengl-driver/lib, unless overridden by LD_LIBRARY_PATH. You have hacked more or less similar behaviour into nvidia-ml-py by looking up libnvidia-ml.so at that exact (absolute) location. When the user sets LD_LIBRARY_PATH before running python, the first CDLL call (CDLL("libnvidia-ml.so.1"), with just the base name) succeeds and the NixOS-specific one is never executed.
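The lookup order described here can be sketched in a few lines of Python. This is not a copy of the actual nixpkgs patch, only an illustration of "try the base name first, then fall back to the absolute NixOS location"; load_first and the loader parameter are hypothetical names introduced for the example.

```python
from ctypes import CDLL

# Absolute location used on NixOS for driver libraries (from the thread).
NIXOS_DRIVER_LIB = "/run/opengl-driver/lib/libnvidia-ml.so.1"


def load_first(candidates, loader=CDLL):
    """Try each candidate in order and return (name, handle) for the first
    one that loads, or (None, None) if none of them can be opened."""
    for name in candidates:
        try:
            return name, loader(name)
        except OSError:
            # Not found by the dynamic loader: try the next candidate.
            continue
    return None, None


if __name__ == "__main__":
    # Base name first (honours LD_LIBRARY_PATH), absolute path second.
    name, handle = load_first(["libnvidia-ml.so.1", NIXOS_DRIVER_LIB])
    print("loaded:", name)
```

With LD_LIBRARY_PATH pointing at a directory that contains the library, the first candidate succeeds and the absolute path is never tried.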

Setting LD_LIBRARY_PATH=/lib (or similar) is “dangerous” in that python will try to load all of its shared libraries from /lib (which holds uncontrolled, random revisions) rather than the exact versions from /nix/store. You might want to point LD_LIBRARY_PATH to a separate directory with symlinks to individual libraries (like libnvidia-ml.so.1 or libcuda.so).
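A minimal sketch of the "separate directory with symlinks" idea, in Python. The function name make_driver_lib_dir and the target directory are hypothetical; the library paths you pass in depend on where your distribution installs the driver.

```python
import os


def make_driver_lib_dir(target_dir, libraries):
    """Create target_dir and symlink only the listed libraries into it.
    Libraries that do not exist are skipped. Returns the links created."""
    os.makedirs(target_dir, exist_ok=True)
    linked = []
    for lib in libraries:
        if not os.path.exists(lib):
            continue
        link = os.path.join(target_dir, os.path.basename(lib))
        if os.path.lexists(link):
            os.remove(link)  # replace a stale link from a previous run
        # Keep the original file name, point at the resolved real file.
        os.symlink(os.path.realpath(lib), link)
        linked.append(link)
    return linked
```

For example, make_driver_lib_dir(os.path.expanduser("~/driver-libs"), ["/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1"]) followed by running python with LD_LIBRARY_PATH set to that directory exposes only the driver library, not the rest of the system's /usr/lib. The libc compatibility caveat still applies.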

Also note that python is going to crash if the libraries in /lib were built against a different version of libc than the one used in your revision of nixpkgs. In this case you probably can’t really avoid using nixGL (guibou/nixGL on GitHub: a wrapper tool for Nix OpenGL applications).


Yes, this is unsupported, and if it did work like this it would be considered impure and patched out/removed again.


@SergeK So how are packages like blender (which needs CUDA) dealing with this case on non-NixOS systems? As far as I know they don’t ask you to create a separate folder, symlink your /lib/ into this folder and override LD_LIBRARY_PATH.

I thought that cudatoolkit was somehow providing a “fake” library that would check if /run/opengl-driver/lib exists and otherwise would try to see if something exists in /lib… but I guess my understanding is wrong.


Indeed, I would need a little bit of help to properly handle this at the nvidia-ml-py package level.

Maybe what they do with nccl can help.

I suspect that people use nixGL? :thinking:

Indeed it works fine with nixGL on my non-NixOS system.
Isn’t there any way to upstream the fix in the package definition itself?

Hello again! I think this thread actually deserves a little more attention and a further refined explanation of how we’re failing.

Let’s first re-cap the current situation:

  • There’s a python library called nvidia-ml-py which does a dlopen() to load libnvidia-ml.so, one of the driver version-locked libraries[^1] that we get from linuxPackages.nvidia_x11 and deploy impurely at /run/opengl-driver/lib
  • We have nvidia-ml-py packaged in Nixpkgs with the following patch to make dlopen work: 0001-locate-libnvidia-ml.so.1-on-NixOS.patch (NixOS/nixpkgs on GitHub, at commit f00994e78cd39e6fc966f0c4103f908e63284780)
  • It works without an issue on NixOS, because it knows to first look up the impure path
  • The meaning of 0001-locate-libnvidia-ml.so.1-on-NixOS.patch is exactly that of addOpenGLRunpath: it makes discovery of impure system-dependent libraries work on NixOS, it still respects LD_LIBRARY_PATH, but it doesn’t try to achieve anything beyond that

However, on non-NixOS we begin to confuse people:

  • On non-NixOS systems this module only works if we manually set up LD_LIBRARY_PATH to specify the location where the driver-compatible libnvidia-ml.so is deployed. Without LD_LIBRARY_PATH the module won’t know to look in /usr/lib or other FHS locations. This is explicitly intended by the order of imports in the linked patch. IIRC solutions like nixGL do the work of setting up an LD_LIBRARY_PATH that is expected to work (i.e. link to a nixpkgs-compatible libc, and to a system-compatible driver), but don’t take my word for it
  • Confusion! If people install the python interpreter from a normie fhs distro and try to use it with our nvidia-ml-py python module, it suddenly works. Unlike with the nixpkgs-packaged python. You can see how this conveys a totally wrong message
  • The reason the fhs python interpreter works and ours doesn’t is that they are linked to different dynamic loaders: patchelf --print-interpreter /usr/bin/python3 will reveal that the fhs python is, technically, built in a platform-dependent way: it expects /lib64/ld-linux-x86-64.so.2 to exist, and that loader knows to (check against /etc/ld.so.conf and) search for libraries in FHS-style paths like /usr/lib. Nixpkgs python links to a loader from /nix/store/<hash>-... which does not expect /etc/ld.so.conf to exist
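If patchelf is not at hand, the interpreter path can also be read straight from the ELF headers. The following is a rough Python sketch of what patchelf --print-interpreter reports, under the assumption that the binary is a 64-bit little-endian ELF file (as on x86_64 Linux); read_interpreter is a hypothetical helper name.

```python
import struct

PT_INTERP = 3  # program header type that holds the dynamic loader path


def read_interpreter(path):
    """Return the dynamic loader path requested by a 64-bit little-endian
    ELF file, or None if the file is not such an ELF or has no PT_INTERP."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) < 64 or data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        return None
    (e_phoff,) = struct.unpack_from("<Q", data, 32)       # program header table offset
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 54)
    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        p_type, _p_flags, p_offset = struct.unpack_from("<IIQ", data, off)
        (p_filesz,) = struct.unpack_from("<Q", data, off + 32)
        if p_type == PT_INTERP:
            raw = data[p_offset:p_offset + p_filesz]
            return raw.rstrip(b"\0").decode()
    return None
```

Running read_interpreter on /usr/bin/python3 and on the python from ~/.nix-profile/bin should show the two different loaders discussed above, one under /lib64 and one under /nix/store.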

So much for the recap

At this point I would like to say explicitly that, in my impression, things today work rather smoothly on NixOS. All of the things we analyze above in this thread are internals that NixOS users do not really have to learn about. They can just be instructed to set hardware.opengl.enable = true (NOTE: this option is getting a better name, hopefully soon) and sometimes they need to add nixpkgs.config.cudaSupport = true (may get a better name in future too). I think this is pretty awesome and a great achievement for the NixOS/nixpkgs community.

…but having said that, maybe we should still entirely overhaul our approach to system-dependent shared libraries :upside_down_face:

Nixpkgs is extremely useful outside NixOS, but we send too many false signals when it comes to /run/opengl-driver/lib. Non-NixOS users shouldn’t have to learn any of the stuff above either

Having made impure shared library deployments mostly work on NixOS, and observing that naively linking to fhs distro-deployed libraries like libcuda.so through LD_LIBRARY_PATH quite often does work (again, unless the libc versions clash, which is a nixGL use-case), our next milestone should be to provide native nixpkgs support for discovering impure (system-dependent) libraries in these FHS distros. This also involves sending better error messages to users when things actually break for a good reason

Maybe alternatives

I kind of hope for other people to come up with ideas, but I’ll include one for a starter.

  • Maybe we could deprecate addOpenGLRunpath and (or?) build a more flexible linker instead.
    FHS-systems’ ld.so can be configured via /etc/ld.so.conf and via LD_LIBRARY_PATH, and they should even respect the runpaths we write to nix-built libraries.
  • We could maybe introduce a /etc/nix/ld.so.conf to accommodate the edge-cases we’re facing.
  • We could try and work with other FHS distros’ maintainers to establish a practice of collecting the driver-locked shared libraries in an isolated location (prior to merging to /usr/lib via symlinks)
  • Then nixpkgs could instruct non-NixOS users to set up LD_LIBRARY_PATH=/that/negotiated/location/for/system/dependend/libraries or to add that path to the hypothetical /etc/nix/ld.so.conf

In hindsight, the point about addOpenGLRunpath is not that relevant and has more to do with the tradeoff between more impurity versus more flexibility+fewer rebuilds, but I won’t edit the message now.


P.S. Acknowledgements to Kiskae for the “driver version-locked” terminology; I didn’t know how to formulate this myself


Wow, thank you very much @SergeK for this very complete recap.
In the specific case that is nvidia-ml-py, I guess that we have a few solutions:

  • Set $LD_LIBRARY_PATH so that nvidia-ml-py is able to find the libnvidia-ml.so.1 located in /usr/lib/x86_64/ on non-NixOS systems.
  • Add cudatoolkit (packaged in nixpkgs) and set $LD_LIBRARY_PATH to ${pkgs.cudatoolkit.lib}/lib (or similar) so that everything is properly self-contained. Indeed cudatoolkit seems to include libnvidia-ml.so in its output.

I don’t know which one is best. The opinion of confirmed nix-CUDA maintainers would be much appreciated :slight_smile:

Note: a nearly identical library has also been packaged. Hence, any future fix should be applied to both.

LD_LIBRARY_PATH is rather something for the end-user (or their operating system) to set. We can sometimes wrap executables to extend LD_LIBRARY_PATH (which is hacky), but in case of python modules we’d have to export LD_LIBRARY_PATH before we start the python interpreter. This is what we achieve with runpaths
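To illustrate why the variable has to be exported before the interpreter starts: the dynamic loader reads LD_LIBRARY_PATH when the process is launched, so changing os.environ inside an already running python does not affect later dlopen calls. A wrapper therefore has to build the environment first and start a child process with it. A minimal sketch, with the hypothetical helper name env_with_library_path:

```python
import os
import subprocess
import sys


def env_with_library_path(extra_dir, base_env=None):
    """Return a copy of the environment with extra_dir prepended to
    LD_LIBRARY_PATH. The caller's environment is left untouched."""
    env = dict(os.environ if base_env is None else base_env)
    current = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = extra_dir + (":" + current if current else "")
    return env


if __name__ == "__main__":
    # "/path/to/driver-libs" is a placeholder, not a real location.
    child_env = env_with_library_path("/path/to/driver-libs")
    code = "import os; print(os.environ['LD_LIBRARY_PATH'])"
    subprocess.run([sys.executable, "-c", code], env=child_env, check=True)
```

This is essentially what wrapping an executable does, which is also why it does not help for a python module that gets imported into an interpreter the user has already started.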

/usr/lib/x86_64/

I would expect that including this location in LD_LIBRARY_PATH would lead to failures. If the other OS deploys something innocuous like /usr/lib/x86_64/libpython.so or libz.so, then these would take priority over /nix/store/... paths recorded in Runpaths

This is why we’d prefer other distros to separate their driver-dependent libraries from everything else. Our approach, effectively, is to guarantee deterministic behaviour of the dynamic linkage whenever we can. For libnvidia-ml.so and libcuda.so this is not feasible (we could link to nvidia_x11 directly and force users to override nvidia_x11 with their config.kernelPackages.nvidia_x11, but that would render the binary cache useless). So we acknowledge a small selection of exceptions and throw them in /run/opengl-driver/lib. We also leave an escape hatch by the name of LD_LIBRARY_PATH (this is why we use runpaths over rpaths)

Indeed cudatoolkit seems to include libnvidia-ml.so in its output.

I’d need to check, but I suspect you’re talking about lib/stubs/libnvidia-ml.so. This is not the “real” libnvidia-ml.so, just a placeholder to pacify ld

Indeed, I would not be surprised if the libnvidia-ml.so provided by cudatoolkit was not enough.
I agree with your point on $LD_LIBRARY_PATH.
However, what is the practical solution you envision for this package (if any)?

I have an Ubuntu system with Nix and a GPU where I can test any idea for this.

I agree that it is an important problem to solve. My two cents, to also raise an issue with the current system on NixOS: if the system puts libraries built against a too-old glibc (or maybe too recent, I don’t remember) in /run/…, then the application will crash. We should maybe also consider this issue if we try to solve another related problem, perhaps by creating multiple versions of the drivers built against multiple glibc versions…

Also I don’t know if we could solve/mitigate the problem raised above by adding a wrapper that creates a temporary folder isolating the needed libraries.
