Mach-nix: Create python environments quickly and easily

Yes, but it seems you still need to crawl them: API Documentation - Libraries.io. I also wonder whether their search is capable of giving you all the PyPI projects there are…

It got taken down because a certain company had uploaded an apparently internal package. Even though they removed it, the package description was still in the PyPI database and thus in my dump.

Cool, I've now played around with their API a little bit. Their general collection of Python packages seems quite complete, but concerning the requirements, their data seems less complete than mine. First of all, they do not differentiate between install_requires, setup_requires, tests_require and extras_require. Information about markers or differentiation between Python versions is also missing. Checking the dependencies for requests, I noticed that they are missing the requirements idna and urllib3. I'm not sure how they mine their data. I also checked scipy, for which my crawler failed to extract the requirements, and they don't have that data at all either.

As far as I know, they collect their data from the wheels. Because wheels are already built artifacts, they won't contain setup_requires, which is setuptools-specific and only needed for building the wheel. The same goes for tests_require, which upstream still wants to remove, but that issue has stalled.
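For illustration, here is a minimal, purely hypothetical setup.py showing the four fields in question; only what ends up as Requires-Dist, i.e. install_requires plus extras_require, is recorded in a wheel's METADATA, while setup_requires and tests_require are lost:

from setuptools import setup

setup(
    name="example-pkg",                            # illustrative name
    version="0.1.0",
    install_requires=["requests>=2.20"],           # runtime deps -> Requires-Dist in the wheel METADATA
    setup_requires=["setuptools_scm"],             # only needed to build; not recorded in the wheel
    tests_require=["pytest"],                      # test-time only; also absent from the wheel
    extras_require={"socks": ["PySocks>=1.5.6"]},  # -> Requires-Dist with an 'extra == "socks"' marker
)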

I'm currently trying to understand what benefit supporting wheels would bring. For example, scipy and tensorflow are currently unsupported by mach-nix, but I'm not sure if wheel support will help.
They both release manylinux{x} wheels. I've seen that manylinux wheels are supported since NixOS 20.03. I tested using their wheels with buildPythonPackage. They build, but then fail during import because they link against libstdc++.so.6. This proposal seems to target these linking issues, but it has been closed; I'm not sure why. If this could be accomplished, it would be a really nice thing for mach-nix.

In general I assume that, with the current state of nixpkgs, wheels which include pre-compiled binaries will most likely not work, and therefore supporting wheels won't help for those Python packages.

Apart from that, I'm aware that wheel is the current distribution standard. Therefore there are libraries out there which only release wheels and no sdist, even though they don't contain any binaries. For these libraries, wheel support in mach-nix would make them available. It would be interesting to know how high the number of these libraries actually is; I assume it's low. In nixpkgs 19.09 the number of libraries using wheels as an installation method is less than 10, but of course nixpkgs might not represent the general situation well. Maybe it's time for another PyPI crawling session :wink:
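For a rough estimate without a full crawl, one could sample projects against the public PyPI JSON API; a minimal sketch (the helper name is made up, and it only looks at the files of the latest release):

import requests

def is_wheel_only(project: str) -> bool:
    """Return True if the latest release of `project` ships wheels but no sdist."""
    data = requests.get(f"https://pypi.org/pypi/{project}/json", timeout=10).json()
    types = {f["packagetype"] for f in data["urls"]}   # files of the latest release
    return "bdist_wheel" in types and "sdist" not in types

# count is_wheel_only(name) over a sample of project names to estimate the ratio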

All in all, supporting wheels doesn't sound like as big an advantage anymore as I originally thought.
Maybe it would be more beneficial to use nixpkgs itself as a provider for the resolver and take packages directly from there. Currently only sdist releases on PyPI are considered. That makes tensorflow drop out, even though it is actually available in nixpkgs and could easily be included. Of course, then only the specific version of tensorflow which is in nixpkgs would be available, and as soon as you specify anything else in your requirements.txt, it will fail again.

See the manylinux packages for Python by FRidh · Pull Request #73866 · NixOS/nixpkgs · GitHub. Including a manylinux package or list of packages and autoPatchelfHook should fix that issue.

Correct, they need to be patched with this method, and additional libraries that are used need to be included as well.

Currently that count is very low, but given the new backends that number will go up. New packages I create (though they’re private/work) seldom use setuptools. Also, there are already widely used packages that have such packages as dependencies. An example of such a dependency is the entrypoints package, which includes a generated setup.py for compatibility reasons and is used in packages such as flake8 and nbconvert.


Thanks so much! That's amazing! I now managed to build tensorflow from a wheel, and it's for sure much, much faster than building from source. One of the current problems of mach-nix is that build times can be very long. Should I consider using wheels by default wherever possible? Or are there any troubles ahead I'm not seeing right now?

As you’ve seen already, wheels are build artifacts, thus they do not list build-time dependencies. That’s no problem if your users are fine using those pre-built wheels. In Nixpkgs we prefer source builds.

Then I should probably let the user decide whether using sdist releases should be preferred/enforced.

Of course, wheel support might blow up the dataset for the dependency graph. To reduce the size, can I make the following two assumptions?

  1. Since my current dependency extraction uses setup.py and fails on anything else, I know that the build backend for all packages in my current database must be setuptools.
  2. If a package's sdist release uses setuptools as its build backend, the requirements specified via install_requires are exactly the requirements of the wheel release for this package on PyPI.

If this is true, I don't need to store any dependency information for the wheel release if I already know the dependencies of the sdist.
Also, I would only need to download and analyze wheels for packages which either don't have an sdist or for which dependency extraction failed on their sdist.
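Expressed as a rule over the (hypothetical) per-release records in my database, that logic would look roughly like this:

def needs_wheel_analysis(pkg) -> bool:
    """Decide whether the wheel of `pkg` must be downloaded and analyzed.
    `pkg` is a hypothetical record with .has_sdist and .sdist_deps
    (None if extraction from the sdist failed)."""
    if not pkg.has_sdist:
        return True    # the wheel is the only source of dependency info
    if pkg.sdist_deps is None:
        return True    # sdist extraction failed -> fall back to the wheel
    return False       # assumption 2: wheel deps == install_requires of the sdist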

Why? In the case of non-universal wheels there will indeed be many more artifacts, and thus your dataset size would indeed increase, but you could merge the results with markers, as is done in the source. That is tricky and relies on the assumption that all combinations of platform/version are provided/covered. Given that the vast majority of packages are pure Python and thus universal wheels, I doubt it makes much difference.

That’s fine.

In the sdist/source, markers are often used to describe dependencies across platforms/versions. In non-universal wheels, the platform/version may be more restricted, and thus the dependencies are only correct for those platforms/versions. Also, setuptools (specifically the pkg_resources module) is often an undeclared dependency. Like poetry2nix, you could choose to always include it.
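To make the marker point concrete, here is a small sketch using the packaging library (the requirement string is made up); such a marker survives in the metadata of an sdist or universal wheel, whereas the metadata of a non-universal wheel is only guaranteed to be correct for the platforms/versions it was built for:

from packaging.requirements import Requirement

# a marker-conditional dependency as it could appear in install_requires
# or in a wheel's Requires-Dist entries (names are illustrative)
req = Requirement('pywin32>=1.0; sys_platform == "win32"')

print(req.marker.evaluate({"sys_platform": "win32"}))  # True  -> dependency applies
print(req.marker.evaluate({"sys_platform": "linux"}))  # False -> dependency dropped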

Correct. Using the source will cover the entire platform/version range; however, the method for extracting those dependencies is backend-dependent.

Thanks again for the very helpful reply!
Merging dependency information from sdists and wheels seems unnecessarily complex and error-prone.
Therefore I'm going to treat wheels as separate nodes in the dependency graph.
Today I added all URLs for wheels to the nix-pypi-fetcher index, and we also optimized the format to reduce the file size.
As a result, adding wheels only increased the uncompressed size from 330 MB to 435 MB, while the compressed size even decreased from 136 MB to 120 MB.

One more question arises. Having more than one wheel per package candidate in the index will increase the complexity in a lot of places. It would be simpler to store only one wheel per candidate.
Therefore I’d like to know if there are any benefits at all for the nix community to have multiple wheels available.

  • The Windows wheels are for sure useless.
  • What about macosx wheels, for example tensorflow-2.1.0-cp37-cp37m-macosx_10_11_x86_64.whl? Are they useful for the darwin platform under nix?
  • Then there is i686. Looking at Phasing out i686, it doesn't look like it's going to be needed by anyone, right?

It’s needed if you want to use tensorflow with python 3.7. Can’t say whether it would actually work.

Multiple files are needed in the case of extension modules; there is no way around that, unless one compiles from source. In the end it depends on what you want to support.

Right.


Can I expect manylinux wheels + autoPatchelfHook to work under nix + darwin? If so, then there would be no need to include macosx wheels.

What exactly are extension modules? Could you provide an example of a package on PyPI having more than one .whl file that is relevant under nix (excluding the macosx ones)?

No, autoPatchelfHook only works on platforms using ELF binaries, so Linux, *BSD and Solaris. For dependencies on system libraries, patching might not be required, but that would make packages impure. In general, the equivalent of patchelf on macOS is install_name_tool. In theory it should be possible to extend autoPatchelfHook for that, but nobody has done so yet.

Have a look at zstd for compression.

╭─[~/pypi-deps-db]─[master]─joerg@eve──[zstd]
╰─ % zstd data/*.json -o data.zstd

╭─[~/pypi-deps-db]─[master]─joerg@eve──[zstd]
╰─ % du -sh data.zstd
19M     data.zstd
19M     total

19M vs the 120 MB you reported. It might even be worth storing it in that format locally. Otherwise there is also rocksdb, which supports zstd as a compression algorithm.

This looks very promising. Can’t wait to try it.

I forgot to consider that for manylinux releases there is one file per Python version. Therefore we obviously need to maintain multiple wheel files.
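The Python version is visible directly in the wheel filename tags (PEP 427: name-version[-build]-pythontag-abitag-platformtag.whl); a tiny sketch for splitting them apart:

def wheel_tags(filename: str):
    # the last three dash-separated parts are the compatibility tags
    # (an optional build tag before them is simply ignored here)
    parts = filename[:-len(".whl")].split("-")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return python_tag, abi_tag, platform_tag

print(wheel_tags("tensorflow-2.1.0-cp37-cp37m-macosx_10_11_x86_64.whl"))
# -> ('cp37', 'cp37m', 'macosx_10_11_x86_64'): CPython 3.7 only, macOS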

Yeah, zstd is great. I'm using it with btrfs. Most important feature when dealing with raw data :slight_smile:
But for the project, I'm not really sure how to implement better compression than GitHub's .tar.gz without losing the following benefits:

  • Human readability: Currently I'm pushing updates every 12h to GitHub. People can see what changed with every commit and easily search through all the data. If I compress the data, the commits will become ugly and most likely also inefficient (not sure about the last one).
  • GitHub's tarball feature: If I used anything other than GitHub, it would probably be much more difficult to provide a tarball for every single version of the database.

Please let me know if I'm overlooking something and there are better ways to distribute this data.

EDIT: One idea just came up. To save disk space for users of mach-nix, one could fetch the index and dependency-graph tar.gz files from GitHub and run them directly through a builder which unpacks the contained json files and repacks them on the fly with something efficient like lz4 or zstd. Instead of .json files one would then store .json.lz4 or .json.zst files in /nix/store.
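A rough sketch of what such a repacking step could do, assuming the third-party zstandard bindings (the function name and paths are just placeholders):

import os
import zstandard

def repack_json_dir(directory: str, level: int = 19) -> None:
    """Replace every .json file below `directory` with a .json.zst copy."""
    cctx = zstandard.ZstdCompressor(level=level)
    for root, _dirs, files in os.walk(directory):
        for fname in files:
            if not fname.endswith(".json"):
                continue
            path = os.path.join(root, fname)
            with open(path, "rb") as f:
                compressed = cctx.compress(f.read())
            with open(path + ".zst", "wb") as f:
                f.write(compressed)
            os.remove(path)

# e.g. run inside the builder after unpacking the GitHub tarball:
# repack_json_dir("./pypi-deps-db/data")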


We can host this on our nix-community infrastructure if you want: GitHub - Mic92/git-serve-zstd: Provide a repository export server for serving zstd tarballs via http
It basically takes a git repository as input and uses git-archive to produce zstd archives. With some caching up front, this solution would make mach-nix a lot more CI-friendly.

Thanks, that sounds great! This will decrease download times a lot. Still, the data will be unpacked and stored uncompressed on the client side, which is currently 1 GB+ including the dependency graph. We could either solve this on the client itself by repacking the json files, or we could modify the git-serve server to already deliver an archive containing .json.zst files. I'm not sure which would be better.

If I have a project with a pyproject.toml, is there any advantage to using your project over GitHub - nix-community/poetry2nix: Convert poetry projects to nix automagically [maintainer=@adisbladis]?