Mach-nix: Create python environments quickly and easily

Mach-nix makes the process of creating a python environment from a requirements.txt file a no-brainer.

Motivation: When I started using NixOS, I was amazed! (Thanks BTW!) But what I was missing was a quick way to set up a python environment from a requirements.txt file. There are amazing tools like pypi2nix which simplify the task a lot, but the whole experience is still far from perfect. Non-pythonic build inputs are not handled, which often means a long manual process of rebuilding over and over again and fixing errors by adding inputs. Compared to a simple pip install -r requirements.txt, this was still quite unpleasant. To regain that comfort, I started working on my own tool a few months ago. The whole problem turned out to be a bit more complex, but I kept working on it until it did what I wanted. I optimized the UI to be super simple. I'm confident this could be the perfect tool for nix beginners who quickly need to load a python environment, and also for anyone else, because it is just super simple to use.

The main differences from similar projects like pypi2nix are:

  • Speed: mach-nix resolves python dependencies via an internal graph and does not need to download and try-build packages first.
  • Reproducibility: If you run the same version of mach-nix against the same requirements.txt, it will always produce the same result.
  • No tool required: mach-nix can be called directly via a nix expression (see below).
  • Extra build inputs are handled: Many python packages require non-pythonic build inputs and are difficult to build from scratch. Mach-nix reuses the definitions from nixpkgs to build those packages smoothly.

Example usage to create a python derivation:

let
  mach-nix = import (builtins.fetchGit {
    url = "https://github.com/DavHau/mach-nix/";
    ref = "2.0.0";
  });
in
mach-nix.mkPython {
  requirements = ''
    pillow>=7.1.1
    numpy>=1.17
    requests<2.22
  '';
}
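
The resulting derivation can be consumed like any other nix package. A minimal sketch of a shell.nix that uses it (assuming the expression above is saved as python-env.nix; the file name is just an example):

let
  pkgs = import <nixpkgs> { };
  pythonEnv = import ./python-env.nix;  # evaluates to the mkPython derivation from above
in pkgs.mkShell {
  buildInputs = [ pythonEnv ];
}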

Be patient on first use. It needs to fetch the dependency graph, which is around 200 MB.

Alternatively, mach-nix can be installed and used as a cmdline tool. Its interface is similar to python's venv. It is meant to be newcomer friendly and does not require any knowledge about nix.

For more information please visit the github page: GitHub - DavHau/mach-nix: Create highly reproducible python environments
Always feel free to collaborate!

A big thanks goes to the guys from genesiscloud.com who supported me with the necessary computation resources to generate that massive pypi dependency graph. Check out their service! They offer cheap GPU based cloud computing. Currently 50% off due to beta.

There is still a lot to do to make this tool perfect. Suggestions / ideas / bugs / issues or any other feedback is very welcome!

EDIT: version 2.0.0 is out. See the changelog. More details in the announcement.

That's amazing.

I was thinking just this morning about a related concept: why do we need to write an almost identical buildPythonPackage or buildPythonApplication derivation every time we want a new Python package in Nixpkgs? Almost always this doesn't consist of anything specific besides putting in the requirements as specified by upstream. Since pypi makes that link by itself anyway, what's the point in regenerating that data on our end?

While I was merely thinking about it, you've done it - you automated the process of linking the dependency data from pypi and made it available to Nix. Plus, you took it one step further - the step of wrapping it nicely in a cleaner fetchpypi function. That's just one benefit this brings.

Perhaps your project is only an implementation of this idea for Python. Rust has crates.io, Golang has its module system where most modules are documented and (I think) indexed, etc.


I have some comments / questions:

  1. You wrote:
  • Extra build inputs are handled: Many python packages require non-pythonic build inputs and are difficult to build from scratch. Mach-nix reuses the definitions from nixpkgs to build those packages smoothly

Just to make sure: if there's a Python package available on PyPI that has an external dependency and is itself not packaged in Nixpkgs, Mach-nix isn't capable of knowing how that external dependency is named in Nixpkgs, right?

  2. Do you use PyPI's shas, or do you generate them yourself? They should be available according to this.
  3. GitHub - DavHau/pypi-crawlers: Collection of crawlers to mine pypi packages metadata and their dependencies, which you link to in the README, is a 404

I forgot to make the pypi-crawlers project public. Thanks for noticing. It should now be available.

Pypi actually does not provide any information about dependencies. Please correct me if I'm wrong, because that would make my project a lot simpler. As far as I have figured out, getting information about python dependencies is non-trivial. It was in fact one of the major points to tackle while developing mach-nix.

To sum it up: dependencies of a python package are only revealed during its installation itself. Regexing the setup.py for definitions like "install_requires =", or using the abstract syntax tree module, failed utterly, since there are too many variants of how these variables can be defined. Sometimes requirements are loaded from a txt file, etc. Also, over the years, python has introduced additional methods of specifying dependencies which are unrelated to setup.py. It's basically a jungle.

It appeared to me that there is no simple way of mining dependency information without executing the actual installation.

The current strategy of the crawler maintaining the dependency graph is to run all packages through a nix builder which fake-installs them through a patched python version. That means it executes each project's setup.py, which is untrusted code. Can I trust the nix sandbox? Should I add additional encapsulation layers?

I need to add that I'm only handling sdist python distributions so far and ignore wheels. Using wheels could make this process easier since they might contain some useful extra metadata. I started out handling only sdist because not all python packages on pypi have wheels, and also it seemed python packages in nixpkgs are usually built from sdist. I guess that's for a reason. Only a few of them use wheels. I need to learn more about wheels, so any information regarding this is welcome. I would like to support wheels in the future since some projects like tensorflow only release wheels.

That is correct. A package which requires external dependencies and is itself not specified in nixpkgs will fail. During this project I already started working on a tool with the goal of building a mapping from 'build error messages' to nix package attributes, to create a database of missing inputs. But it appeared that nearly all of these difficult-to-build python packages are already packaged in nixpkgs. Therefore I trashed that project again and decided to rather just rely on nixpkgs as a general base. In case my filter-bubble view is wrong and external dependencies are still a problem for many people, we could start working on a solution for that. But currently I have the feeling that other things would bring more benefit, like adding support for wheels, for example.

Indirectly I'm using the hashes coming from pypi. But as mach-nix itself runs as a nix builder, it cannot and should not do arbitrary api requests, since we cannot trust the integrity of the data coming from pypi. Therefore I built my own pypi fetcher tool, nix-pypi-fetcher. It includes a mapping from (pkg_name, pkg_version) to URL and sha256. The mapping is updated twice a day (by querying pypi). Mach-nix pins one specific version of that fetcher together with one specific version of the dependency graph.
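
Conceptually, one entry of that mapping is exactly what a fixed-output fetch needs, so the download can be verified against its pinned hash without any API call at build time. A hand-written sketch with placeholder values (this is not nix-pypi-fetcher's actual interface):

let
  pkgs = import <nixpkgs> { };
in pkgs.fetchurl {
  # hypothetical (pkg_name, pkg_version) entry: a pinned download URL plus its sha256
  url = "https://files.pythonhosted.org/packages/source/e/example-pkg/example-pkg-1.0.0.tar.gz";
  sha256 = pkgs.lib.fakeSha256;  # placeholder; the real mapping ships the actual hash
}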

It's interesting to think that other languages' dependency websites have learned the lesson, I guess, and made it easy to discover the dependency graph, while Python, which is super popular, hasn't.

Reading your explanation, I realise how much work you've done. Well done :ok_hand:.

BTW, have you eventually used any of these solutions internally:

?

Since PEP 517/518 there is now support for backends other than setuptools for building a wheel. This means there are now tools for building wheels from projects that do not have a setup.py. Several backends already exist and are in use. Often these backends declare their dependencies somehow, but they don't have to; it's really up to the backend. Thus, the only reliable way of extracting dependencies is from wheels. Clearly one does not always want to build a wheel, so indeed in the case of setuptools, monkey-patching is a popular choice and should work fine.

In the past I did something similar to your GitHub - DavHau/nix-pypi-fetcher: Pypi Fetcher for Nix with simplified interface. (contains hashes for all packages), essentially downloading the PyPI database into a git repo (GitHub - FRidh/make-pypi-dump: Extract all the JSON from PyPI) with the idea that it could then be used for a second stage that would extract all dependencies. That information would be very useful for other projects as well, if well described. The crucial part here is to consider dependencies per platform and Python version (or in general, the markers), which I see you have done.

Graphs with dependency information constructed from source packages won't ever happen with Python, but forming graphs from wheels is possible.

The mentioned PEPs:

BTW there's also:

Nice! Instead of

let pkgs = import <nixpkgs> { };
in pkgs.mkShell {
  buildInputs = let
    python-with-pkgs = pkgs.python3.withPackages (ps:
      with ps;
      [
        # Figure out what to put here based on requirements.txt
      ]);
  in [ python-with-pkgs ];
}

I can now put this shell.nix alongside requirements.txt:

let
  mach-nix = import (builtins.fetchGit {
    url = "https://github.com/DavHau/mach-nix/";
    ref = "master";
    rev = "fa2bb2d33fb9b9dc3113046e4fcc16088f56981a";
  });
in mach-nix.mkPythonShell {
  requirements = builtins.readFile ./requirements.txt;
}

And it works in conjunction with lorri, too.

I just checked that thread and none of those solutions are similar to what I use. All solutions proposed there require you to do a full installation of the package, which is exactly what I wanted to avoid. Installing costs far more computational resources than my current solution, which is implemented and explained in more detail in the pypi-crawlers project.

Thanks, but I could not find any information about dependencies there. Can you?

I saw that the actual dump produced by your make-pypi-dump project got taken down by github with the notice "Repository unavailable due to DMCA takedown".

Do you think I need to worry about such complications regarding the data I'm publishing? I guess your data contained wrong information about some project's license, and that could have been the problem.

The only things I'm publishing are project names, their dependency relations, and download URLs.
Could this lead to any trouble of that kind? It would be sad if the project broke because of that. I'd like to make sure to take whatever measures are necessary to prevent that.

Yes, but it seems you still need to crawl them: API Documentation - Libraries.io. I also wonder whether their search is capable of giving you all the pypi projects there are…

It got taken down because a certain company had uploaded an apparently internal package. Even though they removed it, the package description was still in the PyPI database and thus in my dump.

Cool, I have now played around with their API a little bit. Their general collection of python packages seems quite complete. But concerning the requirements, their data seems to be less complete than mine. First of all, they do not differentiate between install_requires, setup_requires, tests_require and extras_require. Information about markers or differentiation between python versions is also missing. Checking the dependencies for requests, I noticed that they are missing the requirements idna and urllib3. I'm not sure how they mine their data. I also checked scipy, for which my crawler failed to extract the requirements, and there they don't have the data at all either.

As far as I know they collect their data from the wheels. Because wheels are already produced artifacts, they won't contain setup_requires, because that is setuptools-specific for building the wheel. The same goes for tests_require, which upstream still wants to remove, but that issue has stalled.

I'm currently trying to understand what benefit supporting wheels would bring. For example, scipy and tensorflow are currently unsupported by mach-nix, but I'm not sure if wheel support will help.
They both release manylinux{x} wheels. I've seen that manylinux wheels are supported since nixos 20.03. I tested using their wheels with buildPythonPackage. They build, but then fail during import because they link against libstdc++.so.6. This proposal here seems to target these linking issues, but it has been closed; I'm not sure why. If this could be accomplished, it would be a really nice thing for mach-nix.

In general I assume that, with the current state of nixpkgs, wheels which include pre-compiled binaries will most likely not work, and therefore supporting wheels won't help for those python packages.

Apart from that, I'm aware that wheel is the current distribution standard. Therefore there are libraries out there which only release wheels and no sdist, even if they don't contain any binaries. For these libraries, wheel support in mach-nix would make them available. It would be interesting to know how high the number of such libraries actually is. I assume it's low. In nixpkgs 19.09 the number of libraries using wheels as an installation method is less than 10. But of course nixpkgs might not represent the general situation well. Maybe it's time for another pypi crawling session :wink:

All in all, supporting wheels doesn't sound like as big an advantage anymore as I originally thought.
Maybe it would be more beneficial to use nixpkgs itself as a provider for the resolver and take packages directly from there. Currently only sdist releases on pypi are considered. That makes tensorflow drop out, even though it is actually available in nixpkgs and could easily be included. Of course, then only the specific version of tensorflow that is in nixpkgs would be available, and as soon as you specify anything else in your requirements.txt, it would fail again.

See manylinux packages for Python by FRidh · Pull Request #73866 · NixOS/nixpkgs · GitHub. Including a manylinux package or list of packages and autoPatchelfHook should fix that issue.

Correct, they need to be patched with this method and additional libraries that are used need to be included as well.
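
For illustration, here is a minimal sketch of what a wheel-based build along those lines could look like (the package name, URL and hash are made-up placeholders, and exact attribute names may differ between nixpkgs versions):

{ pkgs ? import <nixpkgs> { } }:

pkgs.python3Packages.buildPythonPackage rec {
  pname = "examplepkg";   # hypothetical package shipping a manylinux wheel
  version = "1.0.0";
  format = "wheel";

  src = pkgs.fetchurl {
    # placeholder URL/hash for the manylinux wheel on pypi
    url = "https://files.pythonhosted.org/packages/cp38/e/examplepkg/examplepkg-1.0.0-cp38-cp38-manylinux1_x86_64.whl";
    sha256 = pkgs.lib.fakeSha256;
  };

  # autoPatchelfHook rewrites the wheel's ELF binaries to link against libraries from the nix store;
  # the manylinux package set provides libstdc++ and friends that such wheels expect to find
  nativeBuildInputs = [ pkgs.autoPatchelfHook ];
  buildInputs = pkgs.pythonManylinuxPackages.manylinux1;
}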

Currently that count is very low, but given the new backends, that number will go up. New packages I create (though they're private/work) seldom use setuptools. Also, there are already widely used packages that have such packages as dependencies. An example of such a dependency is the entrypoints package, which includes a generated setup.py for compatibility reasons and is used in packages such as flake8 and nbconvert.

Thanks so much! That's amazing! I have now managed to build tensorflow from a wheel, and it's for sure much, much faster than building from source. One of the problems of mach-nix currently is that build times can be very long. Should I consider using wheels by default wherever possible? Or is there any trouble ahead that I'm not seeing right now?

As you've seen already, wheels are build artifacts, thus they do not list build-time dependencies. That's no problem if your users are fine with using those pre-built wheels. In Nixpkgs we prefer source builds.

Then I should probably let the user decide whether sdist releases should be preferred/enforced.

Of course, wheel support might blow up the size of the dependency graph dataset. To reduce the size, can I make the following 2 assumptions?

  1. Since my current dependency extraction uses setup.py and fails on anything else, I know that the build backend for all packages in my current database must be setuptools.
  2. If a package's sdist release uses setuptools as its build backend, the requirements specified via install_requires are exactly the requirements of the wheel release of this package on pypi.

If this is true, I don't need to store any dependency information for the wheel release if I already know the dependencies of the sdist.
Also, I would only need to download and analyze wheels for packages which either don't have an sdist or for which dependency extraction failed on their sdist.