Mach-nix: Create python environments quick and easy

FRidh · April 29, 2020, 1:49pm

See the manylinux packages for Python by FRidh · Pull Request #73866 · NixOS/nixpkgs · GitHub. Including a manylinux package or list of packages and autoPatchelfHook should fix that issue.

Correct, they need to be patched with this method and additional libraries that are used need to be included as well.

Currently that count is very low, but given the new backends that number will go up. New packages I create (though they’re private/work) seldom use setuptools. Also, there are already widely used packages that have such packages as dependencies. An example of such a dependency is the entrypoints package, which includes a generated setup.py for compatibility reasons and is used in packages such as flake8 and nbconvert.

DavHau · April 29, 2020, 4:41pm

Thanks so much! That’s amazing! I now managed to build tensorflow from a wheel, and it’s certainly much faster than building from source. One of the problems of mach-nix currently is that build times can be very long. Should I consider using wheels by default wherever possible? Or are there any troubles ahead I’m not seeing right now?

FRidh · April 29, 2020, 5:11pm

As you’ve seen already, wheels are build artifacts, thus they do not list build-time dependencies. That’s no problem if your users are fine using those pre-built wheels. In Nixpkgs we prefer source builds.

DavHau · April 30, 2020, 3:22am

Then I should probably let the user decide whether sdist releases should be preferred/enforced.

Of course, wheel support might blow up the dataset for the dependency graph. To reduce the size, can I make the following two assumptions?

Since my current dependency extraction uses setup.py and fails on anything else, I know that the build backend for all packages in my current database must be setuptools.
If a package’s sdist release uses setuptools as build backend, the requirements specified via install_requires are exactly the requirements of the wheel release for this package on pypi.

If this is true, I don’t need to store any dependency information for the wheel release if I already know the dependencies of the sdist.
Also, I would only need to download and analyze wheels for packages which either don’t have an sdist or for which dependency extraction failed on their sdist.
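
For the wheels that would still need analyzing, the runtime dependencies are recorded as Requires-Dist headers in the METADATA file inside the wheel's .dist-info directory. The following is a minimal sketch of reading them with the standard library only; the package names and the METADATA text are made up for illustration, and this is not how mach-nix itself is known to do it.

```python
from email.parser import Parser

# Hypothetical METADATA content, as it could appear in a wheel's
# *.dist-info directory. The package names are illustrative only.
EXAMPLE_METADATA = """\
Metadata-Version: 2.1
Name: example-pkg
Version: 1.0.0
Requires-Dist: requests (>=2.20)
Requires-Dist: numpy ; python_version >= "3.6"
Requires-Dist: pywin32 ; sys_platform == "win32"
"""


def requires_dist(metadata_text):
    """Return the raw Requires-Dist entries of a wheel METADATA file."""
    headers = Parser().parsestr(metadata_text)
    return headers.get_all("Requires-Dist") or []


def split_marker(requirement):
    """Split 'name spec ; marker' into (requirement, marker or None)."""
    req, _, marker = requirement.partition(";")
    return req.strip(), (marker.strip() or None)


deps = [split_marker(r) for r in requires_dist(EXAMPLE_METADATA)]
for req, marker in deps:
    print(req, "|", marker)
```

Entries with a marker only apply on the platforms or Python versions the marker selects, so they would have to be stored together with the requirement.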

FRidh · April 30, 2020, 5:47am

Why? In the case of non-universal wheels there will indeed be many more artifacts, and thus your dataset size would increase, but you could merge the results with markers, as is done in the source. That is tricky, and relies on the assumption that all combinations of platform/version are provided/covered. Given that the vast majority of packages are pure Python and thus universal wheels, I doubt it makes much difference.

That’s fine.

In the sdist/source, markers are often used to describe dependencies across platform/version. In non-universal wheels, the platform/version may be more restricted, and thus the dependencies are only correct for those platforms/versions. Also, setuptools (specifically the pkg_resources module) is often an undeclared dependency. Like poetry2nix, you could choose to always include it.

Correct. Using source will cover the entire platform/version range, however, the method for extracting those dependencies is backend-dependent.
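
To illustrate what merging per-platform results with markers could look like, here is a rough sketch. It assumes the dependency sets of each platform-specific wheel are already known and keyed by their sys_platform value; the function name and the example data are hypothetical. It also shows the weakness mentioned above: a platform missing from the input is silently treated as not needing the dependency.

```python
def merge_with_markers(deps_by_platform):
    """Merge per-platform dependency sets into one list with markers.

    deps_by_platform maps a sys_platform value (e.g. "linux") to the set
    of requirement names found in the wheel built for that platform.
    """
    platforms = sorted(deps_by_platform)
    all_deps = set().union(*deps_by_platform.values())
    merged = []
    for dep in sorted(all_deps):
        present = [p for p in platforms if dep in deps_by_platform[p]]
        if len(present) == len(platforms):
            # Needed on every known platform: no marker required.
            merged.append(dep)
        else:
            marker = " or ".join(f'sys_platform == "{p}"' for p in present)
            merged.append(f"{dep} ; {marker}")
    return merged


# Illustrative input only, not taken from a real package.
example = {
    "linux": {"numpy", "six"},
    "darwin": {"numpy", "six", "appnope"},
    "win32": {"numpy", "six", "pywin32"},
}
merged = merge_with_markers(example)
print(merged)
```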

DavHau · May 1, 2020, 9:31am

Thanks again for the very helpful reply!
Merging dependency information from sdists and wheels seems unnecessarily complex and error-prone.
Therefore I’m going to treat wheels as separate nodes in the dependency graph.
Today I added all URLs for wheels to the nix-pypi-fetcher index, and we also optimized the format to reduce file size.
As a result, adding wheels only increased the uncompressed size from 330 MB to 435 MB, while the compressed size even decreased from 136 MB to 120 MB.

One more question arises. Having more than one wheel per package candidate in the index will increase the complexity in a lot of places. It would be simpler to store only one wheel per candidate.
Therefore I’d like to know if there are any benefits at all for the nix community to have multiple wheels available.

The Windows wheels are certainly useless.
What about macosx wheels, for example tensorflow-2.1.0-cp37-cp37m-macosx_10_11_x86_64.whl? Are they useful for the darwin platform under nix?
Then there is i686. Looking at Phasing out i686, it doesn’t look like it’s going to be needed by anyone, right?
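
For reference, the tags in a wheel filename follow the naming scheme from PEP 427 (name, version, optional build tag, Python tag, ABI tag, platform tag), so candidates can be filtered without downloading anything. The sketch below is only an illustration: the filter policy in usable_on_linux (keep pure-Python and manylinux x86_64 wheels) is an assumption, not a decision made in this thread, and the example_pkg filenames are made up.

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its components (PEP 427 naming scheme)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        name, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected wheel filename: {filename}")
    return {
        "name": name,
        "version": version,
        "build": build,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def usable_on_linux(filename):
    """Assumed policy: keep pure-Python and manylinux x86_64 wheels."""
    # A platform tag may be a dot-separated set of several tags.
    platforms = parse_wheel_filename(filename)["platform"].split(".")
    return any(
        p == "any" or (p.startswith("manylinux") and p.endswith("x86_64"))
        for p in platforms
    )


tags = parse_wheel_filename("tensorflow-2.1.0-cp37-cp37m-macosx_10_11_x86_64.whl")
print(tags["python"], tags["abi"], tags["platform"])
```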

FRidh · May 1, 2020, 9:47am

It’s needed if you want to use tensorflow with python 3.7. Can’t say whether it would actually work.

Multiple files are needed in the case of extension modules; there is no way around that, unless one compiles from source. In the end it depends on what you want to support.

Right.

DavHau · May 1, 2020, 10:05am

Can I expect that manylinux wheels + autoPatchelfHook will work under nix + darwin? If so, then there would be no need to include macosx wheels.

What exactly are extension modules? Could you provide an example of a package on pypi having more than one .whl file being relevant under nix (excluding the macosx ones)?

Mic92 · May 1, 2020, 3:01pm

No, autoPatchelfHook only works on platforms using ELF binaries, so Linux, *BSD, Solaris. For dependencies on system libraries, patching might not be required, but would make packages impure. In general the equivalent of patchelf on macOS is install_name_tool. In theory it should be possible to extend autoPatchelfHook for that, but nobody has done so yet.

Mic92 · May 1, 2020, 3:46pm

Have a look at zstd for compression.

╭─[~/pypi-deps-db]─[master]─joerg@eve──[zstd]
╰─ % zstd data/*.json -o data.zstd

╭─[~/pypi-deps-db]─[master]─joerg@eve──[zstd]
╰─ % du -sh data.zstd
19M     data.zstd
19M     total

19M vs the 120MB you reported. It might even be worth storing it in that format locally. Otherwise there is also rocksdb, which supports zstd as a compression algorithm.

louwers · May 1, 2020, 7:21pm

This looks very promising. Can’t wait to try it.

DavHau · May 2, 2020, 4:12am

I forgot to consider that for manylinux releases there is one file per python version. Therefore obviously we need to maintain multiple wheel files.

Yeah, zstd is great. I’m using it with btrfs. It’s the most important feature when dealing with raw data.
But for the project I’m not really sure how to implement better compression than that of github’s .tar.gz without losing the following benefits:

Human readability: Currently I’m pushing updates every 12h to github. People can see what changed with every commit and easily search through all data. If I compress the data, then commits will become ugly and most likely also inefficient (not sure about the latter).
github’s tarball feature: If I used anything other than github, it would probably be much more difficult to provide a tarball for every single version of the database.

Please let me know if I’m overlooking something and there are better ways to distribute this data.

EDIT: One idea just came up. To save disk space for users of mach-nix, one could fetch the index and dependency-graph tar.gz files from github and run them directly through a builder which unpacks the containing json files and repacks them on the fly with something efficient like lz4 or zstd. Instead of .json files one would store .json.lz4 files in /nix/store.
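
The repacking idea could look roughly like the sketch below. Note the swap: it uses lzma (.xz) from the Python standard library as a stand-in, because zstd and lz4 need third-party bindings in most Python versions; the function names and the directory layout are hypothetical, and a real builder would run inside a Nix derivation rather than a temporary directory.

```python
import json
import lzma
import tempfile
from pathlib import Path


def repack_json_dir(src_dir, dst_dir):
    """Compress every *.json file in src_dir into dst_dir as *.json.xz."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for json_file in sorted(src.glob("*.json")):
        target = dst / (json_file.name + ".xz")
        target.write_bytes(lzma.compress(json_file.read_bytes()))
        written.append(target)
    return written


def load_compressed_json(path):
    """Read one repacked file back without unpacking it to disk."""
    with lzma.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


# Small round trip on made-up data.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    src.mkdir()
    (src / "a.json").write_text(
        json.dumps({"requests": ["urllib3"]}), encoding="utf-8"
    )
    packed = repack_json_dir(src, Path(tmp) / "dst")
    packed_names = [p.name for p in packed]
    roundtrip = load_compressed_json(packed[0])
    print(packed_names, roundtrip)
```

Keeping one compressed file per JSON file, rather than one big archive, means a consumer only has to decompress the files it actually reads.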

Mic92 · May 2, 2020, 12:48pm

We can host this on our nix-community infrastructure if you want: GitHub - Mic92/git-serve-zstd: Provide a repository export server for serving zstd tarballs via http
Basically takes a git repository as an input and uses git-archive to produce zstd archives. With some caching up front this solution would make mach-nix a lot more CI friendly.

DavHau · May 3, 2020, 3:53am

Thanks, that sounds great! This will decrease download times a lot. Still the data will be unpacked and stored uncompressed on the client side which currently is 1GB+ including the dependency graph. We could either solve this on the client itself by repacking the json files or we modify the git serve server to already deliver an archive containing .json.zstd files. Not sure what would be better.

Mic92 · May 4, 2020, 3:49pm

If I have a project with a pyproject.toml, do I have any advantages of using your project over using GitHub - nix-community/poetry2nix: Convert poetry projects to nix automagically [maintainer=@adisbladis] ?

FRidh · May 4, 2020, 4:07pm

You should not need to build when locking because dependencies have already been extracted from “builds”.

DavHau · May 5, 2020, 4:04am

pyproject.toml is nothing poetry specific as you may know. It doesn’t lock dependencies either.
Neither poetry2nix nor mach-nix can handle a project with only a pyproject.toml.

poetry2nix requires pyproject.toml + poetry.lock, while mach-nix only reads requirements.txt.

If the project contains all 3 files, you could choose between the tools, but I would go for poetry2nix as it uses the lock file and produces an environment which is closer to what the author specified.

If you only have pyproject.toml and nothing else, you might be able to create the poetry.lock with poetry, if you are lucky, and then use poetry2nix.

I think pyproject.toml integration for mach-nix would be a great thing to avoid that extra step. Also you might prefer the dependency resolution of mach-nix over the one from poetry, because it allows you to build from source instead of wheels.

EDIT: Another thing you might consider when choosing between poetry2nix and mach-nix is how extra (non-python) dependencies are handled. Mach-nix takes that information from nixpkgs, while poetry2nix provides its own set of overrides. Since I have never used poetry2nix so far, I cannot say which works better.

adisbladis · May 5, 2020, 10:20am

Also you might prefer the dependency resolution of mach-nix over the one from poetry, because it allows you to build from source instead of wheels.

This is wrong. Poetry2nix builds from sources (sdist) by default but allows you to opt in to preferring wheels, either on a granular per-package basis or for the whole build (the latter is on poetry2nix master but not in a release).

Mach-nix takes that information from nixpkgs, while poetry2nix provides its own set of overrides.

It’s pretty much impossible to solve an environment correctly and pick packages from nixpkgs at the same time.
I don’t think you can make a python2nix solution without making your own override overlay (or re-use the one from poetry2nix).

Poetry2nix falls back on nixpkgs in some very limited cases, mainly around pyproject.toml build-system but also in some cases where a dependency specification from upstream was incomplete and we supplement the graph via an override.

DavHau · May 5, 2020, 11:00am

I should have taken a closer look at your tool before making such a statement. Thanks for the correction.

Then I either solved the impossible or my tool is horribly broken. Why do you think it doesn’t work?

I don’t really ‘pick’ python packages from nixpkgs without modifying them, btw. Mach-nix generates an overlay over the existing python packages in nixpkgs which replaces sources to upgrade versions, adds missing python build inputs, or generates completely new python packages in case a package doesn’t exist at all. I must mention that it needed some extra tweaks to prevent package collisions, which are very likely otherwise.
Of course, if you require a very recent version of some python package which needs non-python build inputs that are not defined in nixpkgs for that package, then mach-nix will fail. But those are rare cases, I think. Still, they might need to be fixed in the future to gain full compatibility.

I would love to hear your opinion about it. I’m sure I can benefit from your experience.

FRidh · May 5, 2020, 5:47pm

We really need to improve the Python builder to either require separate arguments for Python and non-Python packages, or perform the splitting in the builder (this can be done but seems backward) and offer attributes in the passthru for each.

Given that the poetry2nix overrides are basically taken as-is from Nixpkgs, I disagree. You could override the Nixpkgs expressions, and just extend dependencies to be on the safe side, taking deps from both Nixpkgs and your tool. It’s ugly, causing you at times to have more deps than strictly needed, but it is possible.

The question is of course what to do with other custom things, like patches. But this you will never get correct because it’s so version-dependent. Of course tests should be disabled as well. Makes me wonder if that would be a good idea for poetry2nix, instead of bundling its own default overrides. That would avoid duplicate work.