Mach-nix: Create python environments quickly and easily

Why? In the case of non-universal wheels there will indeed be many more artifacts, and thus your dataset size would increase, but you could merge the results with markers, as is done in the source. Tricky, and it relies on the assumption that all combinations of platform/version are provided/covered. Given that the vast majority of packages are pure Python and thus ship universal wheels, I doubt it makes much difference.
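For concreteness, a minimal sketch of what such marker-guarded dependencies look like when evaluated, assuming the packaging library (my choice for illustration, not something named in this thread):

```python
# Minimal sketch, assuming the `packaging` library: a dependency guarded by
# a PEP 508 environment marker only applies where the marker evaluates true.
from packaging.requirements import Requirement

req = Requirement('dataclasses; python_version < "3.7"')

# Hypothetical environments corresponding to two non-universal wheels:
print(req.marker.evaluate({"python_version": "3.6"}))  # True  -> dependency applies
print(req.marker.evaluate({"python_version": "3.8"}))  # False -> dependency dropped
```

Merging per-wheel dependency sets back into one entry would amount to attaching such markers to each platform/version-specific dependency.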

That’s fine.

In the sdist/source, markers are often used to describe dependencies across platforms/versions. In non-universal wheels, the platform/version may be more restricted, and thus the dependencies are only correct for those platforms/versions. Also, setuptools (specifically the pkg_resources module) is often an undeclared dependency. Like poetry2nix, you could choose to always include it.

Correct. Using the source will cover the entire platform/version range; however, the method for extracting those dependencies is backend-dependent.
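As a sketch of what “backend-dependent” means in practice: PEP 517 lets you ask the project’s own build backend for metadata instead of hard-coding setuptools. A hedged example using the build package (version >= 0.7 assumed; the source directory is hypothetical):

```python
# Hedged sketch: ask the project's own PEP 517 build backend for metadata,
# assuming the `build` package (>= 0.7); the source directory is hypothetical.
from build.util import project_wheel_metadata

meta = project_wheel_metadata("./some-unpacked-sdist")
for dep in meta.get_all("Requires-Dist") or []:
    print(dep)  # markers covering the full platform/version range are preserved
```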

Thanks again for the very helpful reply!
Merging dependency information from sdist and wheels seems unnecessarily complex and error-prone.
Therefore I’m going to treat wheels as separate nodes in the dependency graph.
Today I added all URLs for wheels to the nix-pypi-fetcher index, and we also optimized the format to reduce file size.
As a result, adding wheels only increased the uncompressed size from 330 MB to 435 MB, while the compressed size even decreased from 136 MB to 120 MB.

One more question arises. Having more than one wheel per package candidate in the index will increase complexity in a lot of places. It would be simpler to store only one wheel per candidate.
Therefore I’d like to know whether there are any benefits at all for the nix community in having multiple wheels available.

  • The Windows wheels are useless for sure.
  • What about macosx wheels, for example tensorflow-2.1.0-cp37-cp37m-macosx_10_11_x86_64.whl? Are they useful for the darwin platform under nix? (See the tag-parsing sketch after this list.)
  • Then there is i686. Looking at Phasing out i686, it doesn’t look like it’s going to be needed by anyone, right?
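To make the tag question concrete, here is a small sketch, assuming the packaging library (>= 20.9; again my choice for illustration), that parses the tags out of a wheel filename, which is how one could filter out e.g. the Windows wheels:

```python
# Sketch, assuming packaging >= 20.9: parse interpreter/ABI/platform tags
# out of a wheel filename to decide whether the wheel can matter under nix.
from packaging.utils import parse_wheel_filename

name, version, build, tags = parse_wheel_filename(
    "tensorflow-2.1.0-cp37-cp37m-macosx_10_11_x86_64.whl"
)
for tag in tags:
    useless = tag.platform.startswith("win")  # Windows wheels never run under nix
    print(tag.interpreter, tag.abi, tag.platform, "useless" if useless else "maybe useful")
```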

It’s needed if you want to use tensorflow with python 3.7. Can’t say whether it would actually work.

Multiple files are needed in the case of extension modules; there is no way around that, unless one compiles from source. In the end it depends on what you want to support.
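To illustrate what an extension module is (a question that comes up below): it is compiled native code exposed as an importable Python module, which is why its wheels are tied to a specific interpreter, ABI, and platform. A made-up minimal setup.py:

```python
# Made-up minimal example of a package with an extension module: the C code
# is compiled at build time, so the resulting wheel is specific to one
# python version/ABI/platform instead of being universal.
from setuptools import setup, Extension

setup(
    name="example",
    version="0.1.0",
    ext_modules=[Extension("example._speedups", sources=["src/speedups.c"])],
)
```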

Right.


Can I expect manylinux wheels + autoPatchelfHook to work under nix + darwin? If so, then there would be no need to include macosx wheels.

What exactly are extension modules? Could you provide an example of a package on pypi that has more than one .whl file relevant under nix (excluding the macosx ones)?

No, autoPatchelfHook only works on platforms using ELF binaries, so Linux, *BSD, Solaris. For dependencies on system libraries, patching might not be required, but skipping it would make packages impure. In general, the equivalent of patchelf on macOS is install_name_tool. In theory it should be possible to extend autoPatchelfHook for that, but nobody has done so yet.

Have a look at zstd for compression.

╭─[~/pypi-deps-db]─[master]─joerg@eve──[zstd]
╰─ % zstd data/*.json -o data.zstd

╭─[~/pypi-deps-db]─[master]─joerg@eve──[zstd]
╰─ % du -sh data.zstd
19M     data.zstd
19M     total

19M vs the 120 MB you reported. It might even be worth storing it in that format locally. Otherwise there is also RocksDB, which supports zstd as a compression algorithm.
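For illustration, the same idea from Python, assuming the zstandard bindings (my choice; the thread itself only uses the zstd CLI):

```python
# Sketch, assuming the `zstandard` bindings: compress the JSON index at a
# high level; zstd decompression stays fast regardless of the level chosen.
import json
import zstandard

raw = json.dumps({"example": "stand-in for the real index"}).encode()
compressed = zstandard.ZstdCompressor(level=19).compress(raw)
assert zstandard.ZstdDecompressor().decompress(compressed) == raw
print(f"{len(raw)} -> {len(compressed)} bytes")
```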

This looks very promising. Can’t wait to try it.

I forgot to consider that for manylinux releases there is one file per python version. Therefore we obviously need to maintain multiple wheel files.

Yeah, zstd is great. I’m using it with btrfs. Most important feature when dealing with raw data :slight_smile:
But for this project I’m not really sure how to implement better compression than GitHub’s .tar.gz without losing the following benefits:

  • Human readability: Currently I’m pushing updates every 12h to GitHub. People can see what changed with every commit and easily search through all data. If I compress the data, then commits will become ugly and most likely also inefficient (not sure about the last one).
  • GitHub’s tarball feature: If I used anything other than GitHub, it would probably be much more difficult to provide a tarball for every single version of the database.

Please let me know if I’m overlooking something and there are better ways to distribute this data.

EDIT: One idea just came up. To save disk space for users of mach-nix, one could fetch the index and dependency-graph tar.gz files from GitHub and run them directly through a builder which unpacks the contained json files and repacks them on the fly with something efficient like lz4 or zstd. Instead of .json files one would then store .json.lz4 files in /nix/store.
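A rough sketch of that repacking step (archive and output paths are hypothetical; this is not mach-nix code), using the same zstandard bindings as above:

```python
# Hypothetical repacking step: unpack the GitHub tarball and write each
# .json member back zstd-compressed, so only .json.zst files hit the store.
import pathlib
import tarfile
import zstandard

cctx = zstandard.ZstdCompressor(level=19)
with tarfile.open("pypi-deps-db.tar.gz", "r:gz") as tar:  # hypothetical name
    for member in tar:
        if member.isfile() and member.name.endswith(".json"):
            payload = tar.extractfile(member).read()
            out = pathlib.Path("out", member.name + ".zst")
            out.parent.mkdir(parents=True, exist_ok=True)
            out.write_bytes(cctx.compress(payload))
```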


We can host this on our nix-community infrastructure if you want: GitHub - Mic92/git-serve-zstd: Provide a repository export server for serving zstd tarballs via http
It basically takes a git repository as input and uses git-archive to produce zstd archives. With some caching up front, this solution would make mach-nix a lot more CI-friendly.

Thanks, that sounds great! This will decrease download times a lot. Still, the data will be unpacked and stored uncompressed on the client side, which is currently 1GB+ including the dependency graph. We could either solve this on the client itself by repacking the json files, or modify the git-serve server to deliver an archive that already contains .json.zstd files. Not sure which would be better.

If I have a project with a pyproject.toml, are there any advantages to using your project over GitHub - nix-community/poetry2nix: Convert poetry projects to nix automagically [maintainer=@adisbladis]?

You should not need to build when locking because dependencies have already been extracted from “builds”.

pyproject.toml is nothing poetry-specific, as you may know. It doesn’t lock dependencies either.
Neither poetry2nix nor mach-nix can handle a project with only a pyproject.toml.

poetry2nix requires pyproject.toml + poetry.lock, while mach-nix only reads requirements.txt.

If the project contains all 3 files, you could choose between the tools, but I would go for poetry2nix, as it uses the lock file and produces an environment closer to what the author specified.

If you only have a pyproject.toml and nothing else, you might, if you are lucky, be able to create the poetry.lock with poetry and then use poetry2nix.

I think pyproject.toml integration for mach-nix would be a great thing to avoid that extra step. Also you might prefer the dependency resolution of mach-nix over the one from poetry, because it allows you to build from source instead of wheels.

EDIT: Another thing you might consider while choosing between poetry2nix and mach-nix is how extra (non-python) dependencies are handled. Mach-nix takes that information from nixpkgs, while poetry2nix provides its own set of overrides. Since I have never used poetry2nix so far, I cannot make a statement about which works better.

Also you might prefer the dependency resolution of mach-nix over the one from poetry, because it allows you to build from source instead of wheels.

This is wrong. Poetry2nix builds from sources (sdist) by default but allows you to opt in to preferring wheels, either on a granular per-package basis or for the whole build (the latter is on poetry2nix master but not in a release).

Mach-nix takes that information from nixpkgs, while poetry2nix provides it’s own set of overrides.

It’s pretty much impossible to solve an environment correctly and pick packages from nixpkgs at the same time.
I don’t think you can make a python2nix solution without making your own override overlay (or re-using the one from poetry2nix).

Poetry2nix falls back on nixpkgs in some very limited cases, mainly around pyproject.toml build-system but also in some cases where a dependency specification from upstream was incomplete and we supplement the graph via an override.

I should have taken a closer look at your tool before making such a statement. Thanks for correcting :pray:

Then I either solved the impossible or my tool is horribly broken :wink:. Why do you think it doesn’t work?

I don’t really ‘pick’ python packages from nixpkgs without modifying them, btw. Mach-nix generates an overlay over the existing python packages in nixpkgs which replaces sources to upgrade versions, adds missing python build inputs, or generates completely new python packages in case a package doesn’t exist at all. I must mention that it needed some extra tweaks to prevent package collisions, which are very likely otherwise.
Of course, if you require a very recent version of some py-package which needs non-python build inputs that are not defined in nixpkgs for that package, then mach-nix will fail. But those are rare cases, I think. Still, they might need to be fixed in the future to gain full compatibility.

I would love to hear your opinion about it. I’m sure I can benefit from your experiences.


We really need to improve the Python builder to either require separate arguments for Python and non-Python packages, or perform the splitting in the builder (this can be done but seems backward) and offer attributes in the passthru for each.

Given the poetry2nix overrides are basically taken as-is from Nixpkgs, I disagree. You could override the Nixpkgs expressions and just extend dependencies to be on the safe side, taking deps from both Nixpkgs and your tool. It’s ugly, causing you at times to have more deps than strictly needed, but it is possible.

The question is of course what to do with other custom things, like patches. But this you will never get correct, because it’s so version-dependent. Of course tests should be disabled as well. Makes me wonder if that would be a good idea for poetry2nix, instead of bundling its own default overrides. That would avoid duplicate work.

I’ve been playing around with both Mach-nix and poetry2nix for the past two days. I’m super impressed by both projects and think they have complementary strengths. It’s awesome to see the effort and creativity the authors are putting into improving Nix’s python dev story!

If I may present a lay analysis (please correct me where I’m wrong!)

Mach-nix does a great job of choosing overrides based on nixpkgs, and deduplicating work of transferring all the knowledge already embedded in nixpkgs to a new build system. As I understand things, it can even choose between multiple overrides based on the specific version.

poetry2nix uses an industry-hardened resolver and makes it easy to collaborate with non-nix users. It does a great job of building wheels, too.

Nixpkgs has far fewer supported python packages, as each is manually added, and in general only one version of each, but the packages that are there have the highest working rate, e.g. nixpkgs PyTorch works (I know there’s an issue tracking this for both Mach-nix & poetry2nix).

I would love to see efforts coalesce between these three approaches! Imagine the coverage and ease of use that could be achieved. I honestly believe it could rival and surpass Anaconda if everyone works together.

I am a bit concerned about our already-small nix/python community fragmenting across these three projects, based on what works where.

IMHO, the ultimate approach might:

  • have no requirement to manually write derivations
  • support poetry.lock for collaborating with non-nix users
  • support per-package override versioning
  • leverage existing work in nixpkgs where possible rather than recreating
  • ultimately be mainlined into NixOS/nixpkgs

Thoughts??


Thanks @tbenst for the nice analysis.

I’d like to mention that the resolver used by mach-nix is currently being integrated into pip and is already available in pip 20.1 via --unstable-feature=resolver. Therefore this is going to be the most industry-hardened resolver soon.

I think both poetry2nix and mach-nix have the goal of simplifying collaboration with non-nix users; the tools just take different approaches to achieving it. Poetry2nix provides an adapter for Poetry which works on other platforms, while mach-nix provides an adapter to use nix without the necessity of understanding nix.

While poetry2nix follows poetry, mach-nix tries to provide an alternative to poetry itself.

Currently both solutions can have advantages and disadvantages depending on the situation. I agree that we should collaborate wherever possible, but the tools might not become one and the same.

I think it would be good to find a common place to maintain python packages, and I think nixpkgs would hypothetically be the best place for it. I say hypothetically because right now it is quite difficult to build on top of nixpkgs for a python package management tool. Therefore I can fully understand if people have chosen not to do it. I guess nixpkgs could be modified to make it much simpler. I’ll just list some issues which I remember I had, plus some ideas of mine:

  1. There is no clear mapping from pypi package names to nix attribute names or the other way round. I currently match names by stripping non-alphanumeric chars and lowercasing the rest (see the sketch after this list). That works for most packages but isn’t a guarantee (I guess?).

  2. The same goes for the version attribute of each package. I don’t remember any specific problems with versions, but I guess there is no strict rule enforced in nixpkgs.

  3. In case there are multiple versions of a package in nixpkgs, there is no clear rule for how to find them. Most of the time they share a common prefix and end with underscore-separated version numbers like django_1_11 and django_2_2. But sometimes the underscores are left out, or there are different uppercase vs. lowercase writings of the same name.

  4. Sometimes multiple attributes point to the exact same package. If I recall correctly, overriding both of them leads to an infinite recursion in nixpkgs in this line. How to detect this? Currently I consider attributes duplicates if their version attribute is the same.

  5. Some python packages in nixpkgs are built with python dependencies they don’t really need (probably by mistake). If you don’t take care to override those dependencies as well, you might end up with package conflicts because you didn’t expect those packages in your closure. One could run the dependency graph which I maintain for mach-nix against nixpkgs to find and correct those mistakes.

  6. Is there a clear guideline on when a package should be included in nixpkgs and when it should not? To use nixpkgs as a base, we need to rely on the possibility of putting all distinct hard-to-build python packages into nixpkgs.

  7. For creating advanced overrides, it might be beneficial to be able to inspect which buildInputs/propagatedBuildInputs/checkInputs/… are already defined for a given python package in nixpkgs. Is there a good way to do that without building the package or regexing the nix code?
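Regarding point 1, here is the matching heuristic as described, next to PEP 503 normalization for comparison (both function names are invented for this sketch):

```python
# Point 1 illustrated: the matching heuristic described above vs. PEP 503
# name normalization; both function names are made up for this sketch.
import re

def strip_and_lower(name: str) -> str:      # the heuristic from point 1
    return re.sub(r"[^A-Za-z0-9]", "", name).lower()

def pep503_normalize(name: str) -> str:     # what "normalized names" means below
    return re.sub(r"[-_.]+", "-", name).lower()

print(strip_and_lower("Flask-SQLAlchemy"))    # flasksqlalchemy
print(pep503_normalize("Flask-SQLAlchemy"))   # flask-sqlalchemy
```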

I totally share your vision to rival and surpass anaconda and similar things. I believe nix itself provides a much better base for these workflows. We just need to make using nix simple enough, which is one of my main goals with mach-nix.

  1. New attribute names are supposed to be normalized names. Older packages may not follow this rule. Additionally, packages that start with a number cannot follow this, so they are prefixed with something (I think py). The latter can be solved by having the attributes be strings, if I recall correctly. What can be difficult are bindings that we include that are not on PyPI but may shadow a PyPI package. Those should be fixed. Renaming old packages to follow the normalized-names rule would ideally also be done, but would break backwards compatibility unless aliases are added.
  2. Versions should be fine I think. You are right we do not enforce a rule, but I highly doubt this is an issue with Python packages.
  3. There is indeed a convention for multiple versions, but I can imagine it being hard to use in an automated way. In principle we should not have multiple versions of packages in the Python package set.
  4. That’s quite rare and primarily exists because of 3). The other cause could be variants (see e.g. h5py-mpi). I think those should be removed and a parameter should be added to the original expression.
  5. Cleanup is indeed needed here, and ideally this is a part that would be automated further. Separating expressions in automatically generated and manually overridden is still an idea, but contrary to other package sets in Nixpkgs we have quite a lot of manually added code, especially for testing.
  6. No, there is not. I wish there was. There are some different views on this as well. Hopefully when/if we get flakes we can have a good discussion around that.
  7. You could evaluate: nix eval -f . python3.pkgs.numpy.version

It’s quite cool that one could theoretically have all python packages from nixpkgs in one environment, something that may not be possible with pip/poetry/conda. But practically, as a developer, I want to use the exact versions specified by the authors. SciPy 1.19 and 1.23 have breaking changes between them, for example, and this happens all over the place. Not all packages follow semantic versioning, and an override written for one package version may not work for another version. I strongly feel that we must support users installing the precise version that they desire. Using multiple virtualenvs is industry standard and required by most developers.

I understand from the nixpkgs perspective that we have limited build resources, etc., and I certainly wouldn’t expect everything to be cached / go through CI. But does it really matter if the scipy folder in nixpkgs has 10 or 20 or 100 nix files in it?

I have a feeling that both poetry2nix and Mach-nix will have high overlap in failure cases: the packages that need buildInputs. I understand that it may not be possible to combine them, but it would be great to identify a concrete way in which the two projects could avoid duplicating work.

Claim: I think it’s in everyone’s interest if we can have one set of overrides that are used by both projects. Each project would benefit, by having more people contribute to these overrides, and maintainers benefit by not having to duplicate work. If this doesn’t happen, my concern is that we’ll (or at least I’ll…) be jumping between the two approaches when a particular package isn’t supported by one or the other.

I would like to see python packages that can be automatically generated by poetry2nix, pypi2nix, Mach-nix, and/or python-package-init NOT included in nixpkgs. Only python packages requiring manual attention should go in, and ideally that work should be used by the respective projects.

If the needs of current nixpkgs python users and those of more automated tooling are too far apart for now, then perhaps Nix-Community could host a shared overlay/overrides. Maybe @zimbatm has some thoughts.

Edit: or at least a subset of overrides shared by both projects where possible…