Mach-nix: Create python environments quick and easy

It’s interesting that the package indexes of other languages seem to have learned this lesson and made it easy to discover the dependency graph, while Python, which is hugely popular, hasn’t.

Reading your explanation, I realise how much work you’ve done. Well done :ok_hand:.

BTW, did you end up using any of these solutions internally:

?

Since PEP 517/518 there is support for backends other than setuptools for building a wheel. This means there are now tools for building wheels from projects that do not have a setup.py. Several backends already exist and are in use. Often these backends declare their dependencies somehow, but they don’t have to; it’s really up to the backend. Thus, the only reliable way to extract dependencies is from wheels. Clearly one does not always want to build a wheel, so in the case of setuptools, monkey-patching is a popular choice and should work fine.
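To illustrate why wheels are the reliable source: a wheel is a zip archive whose `*.dist-info/METADATA` file lists the dependencies as `Requires-Dist` headers, so no backend-specific logic is needed to read them. The following is a minimal sketch using only the standard library. The helper names and the demo archive are made up for illustration, and it assumes the first `METADATA` match is the top-level one.

```python
import io
import zipfile
from email.parser import Parser


def wheel_requirements(wheel_file):
    """Return the raw Requires-Dist entries from a wheel's METADATA file."""
    with zipfile.ZipFile(wheel_file) as zf:
        # Assumption: the first *.dist-info/METADATA entry is the package's own.
        metadata_name = next(
            n for n in zf.namelist() if n.endswith(".dist-info/METADATA")
        )
        text = zf.read(metadata_name).decode("utf-8")
    # METADATA uses RFC 822 style headers, so the email parser can read it.
    headers = Parser().parsestr(text)
    return headers.get_all("Requires-Dist") or []


def make_demo_wheel():
    """Build a tiny in-memory archive with wheel-like metadata (demo only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(
            "demo-1.0.dist-info/METADATA",
            "Metadata-Version: 2.1\n"
            "Name: demo\n"
            "Version: 1.0\n"
            "Requires-Dist: idna (<3,>=2.5)\n"
            "Requires-Dist: pysocks ; extra == 'socks'\n",
        )
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for requirement in wheel_requirements(make_demo_wheel()):
        print(requirement)
```

Note that the marker after the `;` is returned as plain text; evaluating it per platform and Python version is a separate step.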

In the past I did something similar to your nix-pypi-fetcher (DavHau/nix-pypi-fetcher on GitHub): I essentially downloaded the PyPI database into a git repo (FRidh/make-pypi-dump on GitHub, which extracts all the JSON from PyPI), with the idea that it could then feed a second stage that would extract all dependencies. That information would be very useful for other projects as well, if well described. The crucial part here is to consider dependencies per platform and Python version (or, more generally, the markers), which I see you have done.

Graphs with dependency information constructed from source packages won’t ever happen with Python, but forming graphs from wheels is possible.

The mentioned PEPs:

BTW there’s also:

Nice! Instead of

let pkgs = import <nixpkgs> { };
in pkgs.mkShell {
  buildInputs = let
    python-with-pkgs = pkgs.python3.withPackages (ps:
      with ps;
      [
        # Figure out what to put here based on requirements.txt
      ]);
  in [ python-with-pkgs ];
}

I can now put this shell.nix alongside requirements.txt:

let
  mach-nix = import (builtins.fetchGit {
    url = "https://github.com/DavHau/mach-nix/";
    ref = "master";
    rev = "fa2bb2d33fb9b9dc3113046e4fcc16088f56981a";
  });
in mach-nix.mkPythonShell {
  requirements = builtins.readFile ./requirements.txt;
}

And it works in conjunction with lorri, too.

I just checked that thread and none of these solutions are similar to what I use. All solutions proposed there require a full installation of the package, which is exactly what I wanted to avoid. Installing costs far more computational resources than my current solution, which is implemented and explained in more detail in the pypi-crawlers project.
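For readers wondering how dependencies can be read without installing anything: one general approach, along the lines of the monkey-patching idea mentioned earlier in the thread, is to run `setup.py` against a stub `setuptools` that only records the arguments passed to `setup()`. The sketch below is an illustration of that idea and not necessarily how pypi-crawlers does it. The function name and the example `setup.py` are hypothetical.

```python
import sys
import types


def capture_setup_kwargs(setup_py_source):
    """Run setup.py source with a stub `setuptools`; return setup()'s kwargs."""
    captured = {}

    def fake_setup(**kwargs):
        # Record the arguments instead of building or installing anything.
        captured.update(kwargs)

    stub = types.ModuleType("setuptools")
    stub.setup = fake_setup
    stub.find_packages = lambda *args, **kwargs: []

    saved = sys.modules.get("setuptools")
    sys.modules["setuptools"] = stub
    try:
        # WARNING: this executes arbitrary code from the package; sandbox it.
        code = compile(setup_py_source, "setup.py", "exec")
        exec(code, {"__name__": "__main__"})
    finally:
        # Restore whatever was registered before the stub was installed.
        if saved is None:
            del sys.modules["setuptools"]
        else:
            sys.modules["setuptools"] = saved
    return captured


EXAMPLE_SETUP_PY = """
from setuptools import setup, find_packages

setup(
    name="demo",
    version="1.0",
    packages=find_packages(),
    install_requires=["idna>=2.5", "urllib3"],
    extras_require={"socks": ["pysocks"]},
)
"""

if __name__ == "__main__":
    kwargs = capture_setup_kwargs(EXAMPLE_SETUP_PY)
    print(kwargs["install_requires"])
```

A real crawler would have to stub more of the setuptools surface and cope with setup.py files that import the package itself, which is presumably one reason extraction fails for some projects.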

Thanks, but I could not find any information about dependencies there. Can you?

I saw that the dump produced by your make-pypi-dump project got taken down by GitHub with the notice “Repository unavailable due to DMCA takedown.”

Do you think I need to worry about such complications regarding the data I’m publishing? I guess your data contained wrong information about some projects’ licenses, and that could have been the problem.

The only thing I’m publishing are project names, their dependency relations, and download URLs.
Could this lead to any trouble of that kind? It would be sad if the project broke because of that. I’d like to take whatever measures I can to prevent it.

Yes, but it seems you still need to crawl them; see the Libraries.io API documentation. I also wonder whether their search is capable of giving you all the PyPI projects there are…

It got taken down because a certain company had uploaded an apparently internal package. Even though they removed it, the package description was still in the PyPI database and thus in my dump.

Cool, I have now played around with their API a little bit. Their general collection of Python packages seems quite complete. But concerning the requirements, their data seems to be less complete than mine. First of all, they do not differentiate between install_requires, setup_requires, tests_require and extras_require. Information about markers or differences between Python versions is also missing. Checking the dependencies for requests, I noticed that they are missing the requirements idna and urllib3. I’m not sure how they mine their data. I also checked scipy, for which my crawler failed to extract the requirements, and they don’t have any data for it either.

As far as I know they collect their data from the wheels. Because wheels are already-built artifacts, they won’t contain setup_requires, since that is setuptools-specific and only needed for building the wheel. The same goes for tests_require, which upstream still wants to remove, but that issue has stalled.

I’m currently trying to understand what benefit supporting wheels would bring. For example, scipy and tensorflow are currently unsupported by mach-nix, but I’m not sure whether wheel support will help.
They both release manylinux{x} wheels. I’ve seen that manylinux wheels are supported since NixOS 20.03. I tested using their wheels with buildPythonPackage. They build, but then fail during import because they link against libstdc++.so.6. This proposal here seems to target these linking issues, but it has been closed; I’m not sure why. If this could be accomplished, it would be a really nice thing for mach-nix.

In general I assume that, with the current state of nixpkgs, wheels which include pre-compiled binaries will most likely not work, and therefore supporting wheels won’t help for those Python packages.

Apart from that, I’m aware that wheel is the current distribution standard. Therefore there are libraries out there which only release wheels and no sdist, even though they don’t contain any binaries. Wheel support in mach-nix would make these libraries available. It would be interesting to know how many such libraries there actually are. I assume the number is low. In nixpkgs 19.09, fewer than 10 libraries use wheels as their installation method. But of course nixpkgs might not represent the general situation well. Maybe it’s time for another PyPI crawling session :wink:

All in all, supporting wheels no longer sounds like as big an advantage as I originally thought.
It might be more beneficial to use nixpkgs itself as a provider for the resolver and take packages directly from there. Currently only sdist releases on PyPI are considered. That makes tensorflow drop out, even though it is available in nixpkgs and could easily be included. Of course, then only the specific version of tensorflow that is in nixpkgs would be available, and as soon as you specify anything else in your requirements.txt, it will fail again.

See the pull request “manylinux packages for Python” by FRidh (NixOS/nixpkgs #73866). Including a manylinux package or list of packages together with autoPatchelfHook should fix that issue.

Correct, they need to be patched with this method and additional libraries that are used need to be included as well.

Currently that count is very low, but given the new backends that number will go up. New packages I create (though they’re private/work) seldom use setuptools. Also, there are already widely used packages that have such packages as dependencies. An example of such a dependency is the entrypoints package, which includes a generated setup.py for compatibility reasons and is used in packages such as flake8 and nbconvert.

Thanks so much! That’s amazing! I have now managed to build tensorflow from a wheel, and it is certainly much, much faster than building from source. One of the problems of mach-nix currently is that build times can be very long. Should I consider using wheels by default wherever possible? Or are there any troubles ahead that I’m not seeing right now?

As you’ve seen already, wheels are build artifacts, thus they do not list build-time dependencies. That’s no problem if your users are fine using those pre-built wheels. In Nixpkgs we prefer source builds.

Then I should probably let the user decide whether sdist releases should be preferred or enforced.

Of course, wheel support might blow up the dataset for the dependency graph. To reduce the size, can I make the following two assumptions?

  1. Since my current dependency extraction uses setup.py and fails on anything else, I know that the build backend for all packages in my current database must be setuptools.
  2. If a package’s sdist release uses setuptools as its build backend, the requirements specified via install_requires are exactly the requirements of the wheel release for this package on PyPI.

If this is true, I don’t need to store any dependency information for the wheel release if I already know the dependencies of the sdist.
Also, I would only need to download and analyze wheels for packages which either don’t have an sdist or for which dependency extraction from the sdist failed.
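The selection rule described above can be written down in a few lines. This is only a sketch of the stated rule under the two assumptions in question; the `Candidate` record, its field names and the `wheel-only-pkg` entry are hypothetical and not taken from the actual database.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    has_sdist: bool
    # None means extraction from the sdist failed or was never attempted.
    sdist_requires: Optional[List[str]]


def needs_wheel_analysis(candidate):
    """A wheel is only analyzed when the sdist cannot supply the dependencies."""
    return (not candidate.has_sdist) or candidate.sdist_requires is None


def select_for_wheel_analysis(candidates):
    return [c.name for c in candidates if needs_wheel_analysis(c)]


CANDIDATES = [
    Candidate("requests", has_sdist=True, sdist_requires=["idna", "urllib3"]),
    Candidate("scipy", has_sdist=True, sdist_requires=None),  # extraction failed
    Candidate("wheel-only-pkg", has_sdist=False, sdist_requires=None),
]

if __name__ == "__main__":
    print(select_for_wheel_analysis(CANDIDATES))
```

An empty list for `sdist_requires` deliberately counts as "known, no dependencies", which is different from `None`.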

Why? With non-universal wheels there will indeed be many more artifacts, and thus your dataset would grow, but you could merge the results using markers, as is done in the source. That is tricky, and it relies on the assumption that all combinations of platform/version are covered. Given that the vast majority of packages are pure Python and thus ship universal wheels, I doubt it makes much difference.

That’s fine.

In the sdist/source, markers are often used to describe dependencies across platforms/versions. In non-universal wheels, the platform/version may be more restricted, and thus the dependencies are only correct for those platforms/versions. Also, setuptools (specifically the pkg_resources module) is often an undeclared dependency. Like poetry2nix, you could choose to always include it.

Correct. Using source will cover the entire platform/version range, however, the method for extracting those dependencies is backend-dependent.

Thanks again for the very helpful reply!
Merging dependency information from sdists and wheels seems unnecessarily complex and error-prone.
Therefore I’m going to treat wheels as separate nodes in the dependency graph.
Today I added all URLs for wheels to the nix-pypi-fetcher index, while we also optimized the format to reduce file size.
Therefore adding wheels just increased the uncompressed size from 330 MB to 435 MB while the compressed size even decreased from 136 MB to 120 MB.
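As a rough picture of what "wheels as separate nodes" could look like, the sketch below keys every node by name, version and artifact, so an sdist and each wheel keep their own requirement lists and nothing needs to be merged. The class, the `demo` package and its requirements are invented for illustration and say nothing about the real index format.

```python
class DependencyGraph:
    """Maps (name, version, artifact) to that artifact's requirement strings."""

    def __init__(self):
        self._nodes = {}

    def add(self, name, version, artifact, requires):
        # `artifact` is "sdist" or a wheel filename, so each wheel is its own node.
        self._nodes[(name, version, artifact)] = list(requires)

    def artifacts(self, name, version):
        return sorted(a for (n, v, a) in self._nodes if (n, v) == (name, version))

    def requires(self, name, version, artifact):
        return self._nodes[(name, version, artifact)]


graph = DependencyGraph()
# The sdist carries markers that cover every platform.
graph.add("demo", "1.0", "sdist",
          ["idna>=2.5", "pywin32; sys_platform == 'win32'"])
# The wheel's requirements are only valid for the platform it was built for.
graph.add("demo", "1.0", "demo-1.0-cp37-cp37m-manylinux1_x86_64.whl",
          ["idna>=2.5"])

if __name__ == "__main__":
    for artifact in graph.artifacts("demo", "1.0"):
        print(artifact, graph.requires("demo", "1.0", artifact))
```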

One more question arises. Having more than one wheel per package candidate in the index will increase the complexity in a lot of places. It would be simpler to store only one wheel per candidate.
Therefore I’d like to know if there are any benefits at all for the nix community to have multiple wheels available.

  • The Windows wheels are certainly useless.
  • What about macosx wheels, for example tensorflow-2.1.0-cp37-cp37m-macosx_10_11_x86_64.whl? Are they useful for the darwin platform under Nix?
  • Then there is i686. Looking at “Phasing out i686”, it doesn’t look like it’s going to be needed by anyone, right?
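The platform questions above come down to the tags encoded in the wheel filename, which follows the pattern `name-version[-build]-pythontag-abitag-platformtag.whl`. Here is a small sketch that splits a filename into its tags and applies one possible filter. The policy in `is_candidate_for_nix` (keep `any`, manylinux and macosx; drop Windows and i686) is only an assumption mirroring the bullets, and the filenames other than the tensorflow macosx one are illustrative.

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its name, version and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename: " + filename)
    parts = filename[: -len(".whl")].split("-")
    # Five parts normally, six when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel filename: " + filename)
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def is_candidate_for_nix(filename):
    """Assumed policy: keep 'any', manylinux and macosx; drop Windows and i686."""
    # A filename may carry several platform tags joined with '.'.
    for tag in parse_wheel_filename(filename)["platform"].split("."):
        if tag == "any":
            return True
        if tag.startswith(("manylinux", "macosx")) and not tag.endswith("i686"):
            return True
    return False


if __name__ == "__main__":
    for wheel in [
        "tensorflow-2.1.0-cp37-cp37m-macosx_10_11_x86_64.whl",
        "tensorflow-2.1.0-cp37-cp37m-win_amd64.whl",
        "requests-2.23.0-py2.py3-none-any.whl",
    ]:
        print(wheel, is_candidate_for_nix(wheel))
```

Even with such a filter, extension modules still leave one wheel per Python version and platform, so a candidate can map to several files.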

It’s needed if you want to use tensorflow with python 3.7. Can’t say whether it would actually work.

Multiple files are needed in the case of extension modules; there is no way around that unless one compiles from source. In the end it depends on what you want to support.

Right.