Scaling the Python package set in Nixpkgs

Whether Nixpkgs scales is a hot topic, and I want to discuss it further with respect to the Python package set.

Earlier I posted a graph showing a steady increase in the size of the package set. In four years it grew from ~800 expressions to ~2500. Maintaining the package set takes a lot of effort because the expressions are (semi-)manually maintained. For why they’re not automated, see an RFC in development.

Leaf packages are not much of an issue, as they are easy to review and cannot cause much breakage. However, as we get more of these, the size of the core set of packages grows as well. Updating any of these core packages typically causes breakage elsewhere, resulting in chains of needed updates and fixes. For these reasons, even if ofborg says <100 rebuilds are needed, one can be pretty sure the change is going to break reverse dependencies.

This adds significantly to the maintenance burden, and the question now is how to manage it. It has been suggested to limit the scope of the package set, but to what exactly? If we limit the scope, how would users then get their expressions? Clearly, if we go in this direction, we need a way to extend the package set and compose it.

Even though such a way is not really available yet, I am very much in favour of not accepting any more new Python libraries unless they’re needed for a core package. This would give time to work on more important changes, like setting up a workflow that allows us to more easily test and integrate changes to the Python libraries.

Personally, I do not want to keep spending as much time on integrating updates and new packages; it has grown into a big time sink. Maybe performing a batch upgrade in a joint effort once a quarter would be a solution. What are your views on this?

8 Likes

Couldn’t we instead distribute the update and maintenance burden better? Like we do for the rest of nixpkgs, updates could be done by maintainers or drive-by contributors. They would be responsible for testing at least a subset of the reverse dependencies and either fix them themselves or ping their respective maintainers. If nobody ever updates a package, that probably means that nobody cares for it enough and that’s fine.

1 Like

My 2c: development of those needs to move to a kernel-style “forking” model.

With good CI infrastructure, you’d receive pull requests from those who maintain, say, the scientific packages. That way you only merge once that part has been tested and is known to work.

What’s important is to tell the community what extra nixpkgs has to offer. I think a Known Good Set of Python packages would be gold. Since Conda brings in millions of dollars, I’m pretty sure this could even be a sustainable OSS business model (maybe with just donations).

5 Likes

I think there are different levels of questions:

  1. What Python packages should have expressions in nixpkgs?
  2. Of those in #1, which should be maintained so they build and function?
  3. Is the cost of creating and maintaining a minimal coherent set of Python packages that supports the core nixpkgs packages much less than the efficiency and disk space gains?

For me, the answers are:

  1. Every version of every Python package, done in some very automated way
  2. The ones people care about, done using an overlay on top of the #1 expressions
  3. I’m not sure, but I strongly suspect this is the hardest part, and could instead be done by each core package having a “lockfile” of python package versions it needs.

Having every version sounds daunting, but if it is done in an automated way, I think it solves a lot of the manual labor associated with maintaining the package set. When someone wants to get some Python package to work (that is, #2 in my list), they don’t have to focus on figuring out how to get the source files and do the boilerplate setup; they’d only have to focus on adding overlays to fix up versions that don’t work by default.
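
A minimal sketch of what such an overlay could look like, assuming the auto-generated expressions only need touch-ups for things like native build inputs; the package and dependency names here are purely illustrative:

```nix
# Overlay: fix up an auto-generated Python package that needs extra
# native build inputs (a typical extension-module problem).
# Package names are examples only.
self: super: {
  python3 = super.python3.override {
    packageOverrides = pyself: pysuper: {
      pillow = pysuper.pillow.overridePythonAttrs (old: {
        buildInputs = (old.buildInputs or []) ++ [ self.zlib self.libjpeg ];
      });
    };
  };
}
```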

@FRidh Do you feel like most of the current work is in my #3, what @domenkozar calls the “Known Good Set of Python Packages”?

2 Likes

I’m very interested to see and learn how flakes could work for extending the nixpkgs Python package set, and how package sets from different flakes could be merged into new Python environments so that their dependencies could be solved (in an automated or manual manner).

I’m quite skeptical that the ideal of “every version of every Python package, done in some very automated way” could ever be achieved for Python. I would prefer the policy/goal to be realistic.

Hi, even though I rely on many libraries from the Python package set and really appreciate having them available in nixpkgs, I agree this is not easily sustainable in the long run as it is.

My current position would be

  1. libraries required for applications in nixpkgs (even if I mainly use libraries on their own)
  2. everything in #1
  3. I am not sure :sweat_smile:

but I have to admit that moving to what @ryantm suggests seems adequate and probably desirable in the long run.

1 Like

So one counterargument against trying to do all versions of all packages would be composability. Unfortunately Python uses sys.path (which is seeded from PYTHONPATH, which we use) to look up packages. This means you will only be able to reference a single version of a package (whichever appears first in sys.path). An example would be the jupyter package, where the latest version requires jsonschema>=3,<4 while other packages you may include require jsonschema>=2,<3. There’s currently no way to have two separate packages reference different versions of a dependency in the same Python session.

Compiled binaries have the “luxury” of being able to patchelf or wrap their runtime dependencies to potentially incompatible versions of a given library (e.g. gtk2 vs gtk3); Python does not :(.

We already implement (to some degree) a curated list for python. It’s just that it heavily relies on nix-review or other tooling to ensure an update isn’t breaking reverse dependencies.

However, it would be nice to have some mechanism other than just “latest” to support more package dependencies. But something like this would also greatly scale up the complexity of maintaining the packages.

One compromise might be to start pinning major versions of packages, and allowing people to override them when needed. However, this seems like a huge maintenance burden as well.

1 Like

My personal opinion is that we need more automation; right now it is still too manual, and this is the pain that maintainers are feeling.

Right now I want to develop automation tools for nixpkgs (mostly Python), but there is a major thing blocking me: formatting code, or more specifically transforming code. There are many people working on formatting code, but all I want is the concrete syntax tree to work from. I’ll leave formatting code to the tools that do it well. But these tools will never be good at transforming the code, since there are just too many transformations that someone could possibly want to do. Also, as I see it, speed is really not a concern here.

I worked on a toy lex/yacc parser about 3 months ago, but I don’t want to have yet another formatter. Here is an example of rewriting all of the accessible http URLs to https: https://github.com/costrouc/nix-transform/tree/master/examples/transformations.

I like that this project exposes the syntax tree: GitHub - nix-community/nixpkgs-fmt: Nix code formatter for nixpkgs [maintainer=@zimbatm]

1 Like

@FRidh the issue is that you are serving two masters:

  • nixpkgs applications
  • developers who want to run python.withPackages

Both have different requirements.

For nixpkgs applications, the best option is to bundle the dependencies specified by upstream with the application. We do that for Ruby, NodeJS, Rust, … You’ll have a much better time talking with upstream, as they won’t have to ask you why you are using an untested version of a library. And resolving the right package set becomes a problem for upstream only.

For developers, what they want is a pre-built set of packages that have been tested and work together. They also want finer-grained increments, where each library is in its own derivation, because they spend a lot of time iterating on the build.
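
For concreteness, that developer workflow is the `python.withPackages` one, where an environment is composed out of the individual per-library derivations; the package selection below is just an example:

```nix
# shell.nix: a development environment built from the curated package set.
with import <nixpkgs> {};

mkShell {
  buildInputs = [
    (python3.withPackages (ps: with ps; [ numpy requests pytest ]))
  ];
}
```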

Since nixpkgs is primarily for applications, I would drop the package set, or move it out of nixpkgs and make it a separate project.

2 Likes

We should. The difficulty is that, unlike most other parts of Nixpkgs, Python packages are so tightly coupled.

Where would you draw the line with testing? I’ve seen some contributors using nix-review to test all reverse dependencies, which is great. If more changes were needed they would (attempt to) make those as well. But PRs tend to grow a lot because of these chains. That makes it hard for contributors to continue and difficult to review.

That’s how I would also prefer to see it, and not just for the Python packages. The question is how to do it, really. It typically means that those offering subsets will need resources to test reverse dependencies and be willing to fix many parts.

Changing core packages will break reverse dependencies. If we can agree on that being acceptable, then fine. Working with different subsets (or flakes) would make that part easier I guess.

The core packages are those that require manual effort, because a) they may include extension modules and b) test coverage is important. Ensuring that set works while minimizing breakage towards the leaf packages is the hard part, and that is also what makes it so difficult to accept good-looking pull requests for core packages; it’s really hard to assess the impact without testing all reverse dependencies.

Quite the contrary. That becomes a combinatoric problem. Generating basic expressions is relatively easy (though still not straightforward). It’s just that a lot of extra work is needed in the case of extension modules or for check phases. Isn’t this the reason we aim at having only one version of any package in Nixpkgs: to reduce the maintenance burden?

Me too! That’s why I share Nicolas’s skepticism, as package sets have not been considered in the Flakes design.

It exactly is a curated set. I am not sure whether in Anaconda all of its packages can exist in a single environment. I know for sure that as soon as you use conda-forge that is not the case. Maybe @costrouc knows?

I think for the base expressions JSON is sufficient; no need to have those in Nix. Additional manual changes could be done in Nix.
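
A rough sketch of that split, assuming a generator dumps per-package metadata into a JSON file; the file layout and the `buildFromJson` helper are hypothetical, only the JSON reading uses standard Nix builtins:

```nix
# default.nix: read auto-generated metadata and turn it into derivations.
{ pkgs ? import <nixpkgs> {} }:

let
  generated = builtins.fromJSON (builtins.readFile ./packages.json);
  # e.g. { "requests": { "version": "2.22.0", "sha256": "..." }, ... }

  buildFromJson = name: meta:
    pkgs.python3.pkgs.buildPythonPackage {
      pname = name;
      inherit (meta) version;
      src = pkgs.python3.pkgs.fetchPypi {
        pname = name;
        inherit (meta) version sha256;
      };
      doCheck = false;  # check phases are where the manual work starts
    };
in
  pkgs.lib.mapAttrs buildFromJson generated
```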

Yes, they do, although aside from a handful of applications (home-assistant, buildbot, flexget) this seems to be working fairly well.

That works in case upstream defines dependencies and such in a good way, which is not really the case in Python. We could, in the case of applications, maybe accept not having any checkPhase; however, there is no escaping packages with extension modules. Overrides are thus going to be necessary, and you do not want to have to specify those for every application again and again. For that reason, I think the approach with a shared core set is a good design decision; we just need better tooling to build other sets (e.g. for applications) on top of it.
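
Some of that tooling already exists in the form of `packageOverrides`: an application can get its own view of the shared core set, overriding only the libraries it is picky about. A sketch, with an invented version pin as the example and a placeholder hash:

```nix
# Build a per-application Python with one pinned dependency; everything
# else is inherited from the shared core set.
{ pkgs ? import <nixpkgs> {} }:

let
  appPython = pkgs.python3.override {
    packageOverrides = self: super: {
      jsonschema = super.jsonschema.overridePythonAttrs (old: rec {
        version = "2.6.0";
        src = super.fetchPypi {
          pname = "jsonschema";
          inherit version;
          sha256 = pkgs.lib.fakeSha256;  # replace with the real hash
        };
      });
    };
  };
in
  appPython.withPackages (ps: [ ps.jsonschema ])
```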

That would discard in one go the whole scientific computing field that uses Python. One could argue it could be split off, though.

That’s the curated package set. One should be able to trust it more than any set you put together using pip considering we aim at having tests enabled for all packages.

I don’t follow you here. Increments and iterations of what, their project or their dependencies?

1 Like

This may all fold into “tested and work together”, but it struck me that the concept of developers-using-Python might be hiding multiple use cases, so I decided to try and outline the ways (that I see as somewhat distinct) I’ve been using Nix+Python over the past ~21 months.

I’m not sure it adds much, but I fleshed it out to share just in case. Alongside each I’ve tried to list what I value in a given case, and any anxieties it causes.

as a user

  1. Coincidentally-Python user apps/tools (such as yadm) in my profile. In pre-nix life I might have cared that these ran Python under the hood (or, more precisely I cared that they didn’t require a whole new language stack unless they were essential and there wasn’t an alternative). I don’t think about this much anymore. Maybe if I’m on slow wifi somewhere. Values: works > resource-efficiency > quick install.

as a developer-user

  1. Rarely-used coincidentally-Python tools (such as diffuse) that get run in a nix-shell once in a blue moon. In pre-nix life I would prefer installing these in a misc VM if they were command-line tools. Values: works > gets garbage-collected > installs quickly

  2. Editor toolchain. I hate it, but my profile includes tools to lint, format, test, check coverage, etc. to make sure sublime text plugins can readily find them. They’re also inevitably duplicated in the derivations of any projects that actively use/require them. In pre-nix life these still required global packages or IDE/plugin/project config/overrides. It just stands out a lot more because nix makes it easy to sweep most crap like this under a rug somewhere.

    Somewhere down my todo list is an item about making a global default Python environment that has these tools bundled in, such that (a rough sketch follows below):

    • they aren’t defined in my top-level user profile directly
    • are trivially overridable in an individual project
    • it’s still trivial to get a “bare” python environment when needed

    Values are harder to pin down here; shooting from the hip: simple to get a working global version > don’t have to update 10 places to bump versions > possible to override if I have to work with some project that just can’t use my global version > simple to override in such a circumstance > still work when hacking on a project that doesn’t care.

    Anxiety: It’s “ideal” for these to be able to come from individual projects if I find myself needing to work on a project with opinions. I’ve been spared having to think through how to make this work (I haven’t needed conflicting versions of these). For now, I prefer having extra crap in my profile over losing an unknown slice of my life chasing the ideal config.
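
A rough sketch of the kind of global editor-tool environment described above; the tool selection is just an example and the file name is arbitrary:

```nix
# editor-python.nix: one profile-level environment holding the editor
# toolchain, instead of listing each tool in the top-level user profile.
{ pkgs ? import <nixpkgs> {} }:

pkgs.python3.withPackages (ps: with ps; [
  black      # formatter
  flake8     # linter
  pytest     # test runner
  coverage   # coverage reporting
])
```

It could be installed with, e.g., `nix-env -if ./editor-python.nix`; a project that needs different versions can still bring its own `withPackages` environment in its shell.nix, and plain `python3` remains available as a “bare” interpreter.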

actual development work

In pre-nix life, I ran almost all projects of substance in vaguely-reproducible VMs managed with Vagrant (though I sometimes used the same VM for multiple projects). I guess a lot of people use virtualenv, and I started there, but I found it a lukewarm choice (especially if you have any non-Python dependency whatsoever). Still, this wasn’t amazing. It was always a coin-flip that I’d lose a few hours hunting down some new compatibility/config/dependency issue any time I updated macOS, VirtualBox, Vagrant, Ruby, the VM’s Ubuntu version, or had to replace the base image (find an issue with it, the old one is inactive/removed, etc.)

  1. Directly developing something that is (or is intended to be) distributed as a module for others to use. Values: simple to test/develop against latest dependency updates > easily reproducing conditions from issue reports > not having to maintain the same damn list of dependencies in multiple formats.

  2. Directly developing and deploying a Python web app that runs on Heroku, where it installs dependencies via pip/requirements.txt. This one has a pretty big stack of direct and indirect dependencies. I still had to put a fair amount of work into getting successful builds beyond pypi2nix.

    Values: not breaking production > avoiding full-VM resource use on a laptop > not having to maintain multiple dependency specs

    Anxiety: tension between regularly re-generating pip->nix for versions/additions/deletions, and all of the work it took to get it building initially. FWIW, the goal is not the kind of reproducibility nix is shooting for; the goal is reproducing our runtime Heroku environment well enough to avoid most surprises. Re-running pip with the same inputs on a new Heroku instance (or after making and reverting a change to our requirements…) may well yield different results with a sufficiently-large set of dependencies that aren’t all version-pinned. I sometimes need to let these drift and see where they land.

  3. Getting set up to hack/bugfix/test some established OSS project. I’ve noticed I have a smidge more trepidation around setting up for small contributions like this now (I realize this may be unfounded). Values: shedding overhead so I make more simple fixes > making the damn change > not getting embarrassed by tests that pass locally and fail on their CI > dropping the directory and GCing the dependencies ASAP

  4. Directly developing a script to do some little thing. Values: it works when I write it > it keeps working unless I break it. Anxiety: losing a few hours of my life just to tread water after I go to use it and find it doesn’t run, or check a log and find it’s been producing the wrong output for months.

My own thoughts/takeaways

  1. It didn’t strike me this way before, but after writing all of this out: the use of single-rolling-version packages from nixpkgs (or any system-level package manager) in all of the development cases I listed smells a lot like a misfeature or antipattern:

    • I don’t want (i.e., it’s not in any of my projects’ long-term interest…) to develop against an old version of a dependency just because it hasn’t been bumped in the system packages yet.
    • It’s not in any ecosystem’s (nix, python, oss, linux, macos, unix…) interest for developers to all individually spend time writing overrides for nixpkgs in order to develop against the latest versions of their dependencies.
    • I don’t want my development projects to quietly shift after channel updates. It’s tolerable if I’m made tersely aware, in a place I have any hope of noticing it, that shows both the previous version and the new version. Ideally, it’s elective. (Yes, I can pin, but I might want to update my channel as a user. Yes, I can pull a special channel/nixpkgs or write an overlay, but all of these raise significant hurdles in the form of nix-specific knowledge someone needs to use nix and write code.)
    • For any project that isn’t nixpkgs, the developers would probably prefer not to have to triage issue reports from people using nix at different channel commits with subtly different dependency versions. Especially if they don’t know what nix is. Doubly so if they already published very specific dependency version requirements (in any of the ever-proliferating Python dependency spec formats…) that the nixpkgs derivation doesn’t use.
  2. Maybe the orientation is wrong (perhaps nixpkgs is too much of a center-of-gravity here). There’s a lot of focus on converting -to-nix, but I wonder if nix can add more value to other-language ecosystems if nix is the canonical/authoritative dependency spec, with toolchain for generating other language-idiomatic specs from there.

  3. I’ve been on the fence about some of the push-packages-out-of-nixpkgs discussions lately. I’m conceptually on board now, but I worry about sentencing a lot of people to reinvent wheels (both by making it nontrivial for anyone who doesn’t already have a full nixpkgs clone and know how to search deep git history to recover old package derivations to start from, and by limiting shared maintenance of common dependencies that aren’t common “application” dependencies). A secondary worry is making it harder for people to get traction. I would feel a lot better about this if I had some sense that there was a clear stewardship/discoverability model for where these go and how new nix users will find them.

  4. I’ve skimmed the threads/issues discussing reproducibility with pip/pypi and realize there’s a lot of detail beyond my knowledge here, but I wonder about the possibility, if nix were the canonical spec, of overlaying/proxying pypi to add semantics aimed at some of the reproducibility challenges it causes (i.e., the maintainer manually uploads multiple different binaries) in a way that at least makes those issues obvious?

4 Likes

I think most of these points are mitigated in some fashion, and express a kind of trade-off with what you want to gain.

I don’t want (i.e., it’s not in any of my projects’ long-term interest…) to develop against an old version of a dependency just because it hasn’t been bumped in the system packages yet.

If you’re fine with developing off of master, I find that most packages (especially the commonly used ones) are usually quite up to date. If nix had a larger contributor base, I’m sure this would be less of a problem. With how easy it is to upstream override/overlay changes, just take the changes and put them back into master; then everyone benefits.

It’s not in any ecosystem’s (nix, python, oss, linux, macos, unix…) interest for developers to all individually spend time writing overrides for nixpkgs in order to develop against the latest versions of their dependencies.

Agreed; however, as long as updates are just a simple version bump, r-ryantm handles that use case pretty well. For Python, there are ways to create a virtualenv so you can use a normal pip workflow.
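
One way to get that pip workflow, as a sketch: a nix-shell provides the interpreter (and any native libraries), while pip manages the Python-level dependencies inside an ordinary virtualenv. This is one pattern among several, not the only approach:

```nix
# shell.nix: Nix supplies python and native dependencies; pip handles
# the Python-level packages inside a local virtualenv.
with import <nixpkgs> {};

mkShell {
  buildInputs = [ python3 ];

  shellHook = ''
    test -d .venv || python -m venv .venv
    source .venv/bin/activate
    # then: pip install -r requirements.txt
  '';
}
```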

I don’t want my development projects to quietly shift after channel updates. It’s tolerable if I’m made tersely aware, in a place I have any hope of noticing it, that shows both the previous version and the new version. Ideally, it’s elective. (Yes, I can pin, but I might want to update my channel as a user. Yes, I can pull a special channel/nixpkgs or write an overlay, but all of these raise significant hurdles in the form of nix-specific knowledge someone needs to use nix and write code.)

Even if you’re following a channel like 19.03, you’ll still receive security fixes and updates to major packages. I think they are valuing stability and security over preventing staleness. If your priority is having the latest packages for everything, then you should really be following nixpkgs-unstable (yes, some things will break occasionally, but it’s a trade-off).

For any project that isn’t nixpkgs, the developers would probably prefer not to have to triage issue reports from people using nix at different channel commits with subtly different dependency versions. Especially if they don’t know what nix is. Doubly so if they already published very specific dependency version requirements (in any of the ever-proliferating Python dependency spec formats…) that the nixpkgs derivation doesn’t use.

Most of the time when I encounter this, it’s because something isn’t using the latest version of some dependency. At least for myself, if I can patch the version bounds and their tests still pass, I’ll open a PR on the upstream repo with the version bump. If the tests don’t pass, then I’ll open a ticket that the latest version of XXX isn’t compatible (e.g. TopicModel broken with scikit-learn 0.21 · Issue #260 · chartbeat-labs/textacy · GitHub). For nix contributors, this really comes down to how much time and effort you want to spend on upstreaming fixes; for users this would be a bad experience. In the textacy example, I had to mark the package as broken until they add scikit-learn 0.21 support. (However, the same conflict would show up in a pip install where one package needs scikit-learn >=0.21 and another needs scikit-learn <0.21. Given how common scikit-learn is in the ML space, I think this would be very common.)

Maybe the orientation is wrong (perhaps nixpkgs is too much of a center-of-gravity here). There’s a lot of focus on converting -to-nix, but I wonder if nix can add more value to other-language ecosystems if nix is the canonical/authoritative dependency spec, with toolchain for generating other language-idiomatic specs from there.

I frequently upstream version information; see above.
Could there be a better way for someone to leverage nix infrastructure to automate compatibility? Probably; nix-review comes to mind.
For nixpkgs to be an authoritative source, nixpkgs would need to grow significantly in terms of contributors and usage (to the point that people care to put effort and time into remedying issues). However, I think the barrier to entry for nix/nixpkgs is too high for mainstream adoption, and for the short/medium term it will remain solely for power users.

1 Like

Just wanted to make a little clarification here while I chew on the rest.

I’m not sure what I mean (nor that it leads anywhere interesting), but I mean something roughly the opposite of this. I’m just wondering aloud:

  1. Has nixpkgs–as a collection of packages, the primary focus of the community, and a damn-good hammer–made the specific case of developer-centric language-specific packaging look more like a nail than it actually is? (leading to many convert-language-idiomatic-package-format-to-nix projects)
  2. If the focus (orientation) isn’t “pulling” language-specific authoritative package/dependency specs towards nix-lang/nixpkgs, are there better ways the nix language/toolchain/ecosystem could address the use-case(s)/needs/problems of developers working in those ecosystems?

1 Like

Although I suspect this will be ill-received, the solution to this is to transition Python in Nix to the Ruby model of package management.

This would look like incorporating pypi2nix expressions into nixpkgs directly, with something like the ruby default-config set for distro-specific overrides. The up-to-date-ness of packages becomes an issue for the application expression maintainer to support. The task of maintaining python is scoped to its default-config.

My understanding of the chief objection to this is that the Python packages in Nix would lose their direct curation, which has impacts, for instance, on security. This is also an issue for Ruby, and I’d think that the solution there is better tooling: a component of OfBorg, or something separate, might scan nixpkgs and compare against known CVEs, much the same way existing tooling (including that provided directly by GitHub) scans package dependency files today.

2 Likes

My crazy idea to “fix”, or provide a hack for, the Python single-version problem: Python only allows one version of each package at a time (whatever is found in the Python path first). What if a tool were written to rewrite import statements for a given package? For example, import numpy → import ab12abc45_numpy_1_16. This would allow many versions of numpy to be used in the same environment. Is this a crazy idea? Would this be of any benefit to nixpkgs? This could be used only for the package overrides within a given package, and it would also prevent downstream packages from depending on the package versions provided by their dependencies.

3 Likes

I’ve thought about something similar. It may be possible to do something like that with GitHub - google/pasta: Library to refactor python code through AST manipulation, ideally writing a nixpkgs wrapper that allows us to easily rewrite imports on the fly.

1 Like

That’s neat, I’ll have a look at that! I wrote a POC tonight that uses rope. It is able to rename variables from the import (obeying locals vs. globals). I tested it with a fairly large project, airflow. It preserves all comments etc. in the code. It’s not the fastest (1-2 seconds per rename) for a large project. I’m going to create a more polished working example in the next few days.

EDIT: I’ve published a tool to do these import rewrites GitHub - nix-community/nixpkgs-pytools: Tools for removing the tedious nature of creating nixpkgs derivations [maintainer=@costrouc]. Now I am going to see what this looks like when building python packages in nixpkgs.

4 Likes

Okay so this PR pythonPackages.apache-airflow: init at 1.10.3 by costrouc · Pull Request #65026 · NixOS/nixpkgs · GitHub is a working example of rewriting imports with GitHub - nix-community/nixpkgs-pytools: Tools for removing the tedious nature of creating nixpkgs derivations [maintainer=@costrouc]. It allows multiple versions of Python packages to exist within the same PYTHONPATH.

How does it work? It uses rope (GitHub - python-rope/rope: a python refactoring library) to refactor certain imports. Rope makes this approach quite robust. Here is an example of how it correctly refactors the numpy import: https://github.com/nix-community/nixpkgs-pytools/blob/b24384090a348118c4ebb7d9bf5749964734380a/tests/test_import_rewrite.py.

I have left the “tricks” done in rewriting the imports as explicit as possible in this PR, in the hope that, if others think this is worth moving forward with, we can create a wrapper function that will automatically apply these changes in a more user-friendly way. This approach should have minimal impact on libraries that depend on it. Personally I think this is a feasible approach to having different versions of Python packages coexist without affecting other packages. Would love to hear others’ thoughts.

In this PR I demonstrate that you can have packages in the propagated build inputs (pendulum, flask-appbuilder) that would normally break airflow. Because of the import rewrites, airflow does not use these packages, and downstream packages can depend on the latest version.

3 Likes

This is very cool, although I think perhaps it is orthogonal to the problems at hand.

I believe the core issues here are:

  1. Python packages are typically only going to be tested and maintained upstream against “the set of dependencies at the latest available versions that meet the specified version constraints” (AKA “whatever pip gives us”), rather than whatever versions we may have in nixpkgs.
  2. We want a set of curated Python applications that are tested working, and that users can install and run without necessarily having any knowledge that they’re Python programs.
  3. We also want to make it easy for developers and admins to create Python environments using Nix for arbitrary projects.
  4. This should take a minimum of ongoing manual work to maintain.

I’ve read the rest of the thread, and done some thinking based on my own fairly extensive usage of Nix for Python stuff, as well as what I’ve seen others doing. Here’s what I would like to suggest as a path forward:

  1. Generate baseline expressions for every version of every Python package, including dependency version constraint information in passthru.

  2. Pick the set of Python-using applications we have in nixpkgs. Work out the closure of all their dependencies, within the version constraints they specify. Which will include multiple versions of some dependencies. Which is fine.

    This then forms the list of “core” packages that actually get testing. Maybe take a leaf from one of the suggestions in RFC 0046 and use an explicit supportLevel attribute on Python packages.

  3. Implement a function that builds a Python environment for a specified list of packages, performing version-constraint-respecting dependency resolution. This is then used to compose the environments for applications in the “core” Python list, and also used to provide python.withPackages.

    Of course, it is possible to ask for a set of Python packages that has unsatisfiable dependencies. Same as with pip. Ideally if you’re in such a situation and can’t use two separate Python environments, you should patch one of the projects’ dependency version requirements, and upstream that. But @costrouc’s tool offers another option.

  4. Implement an override mechanism that allows an override function to be applied to a range of package versions; overrides like adding non-Python dependencies are likely to be consistent for many/all versions of a given package, so this simplifies maintaining multiple versions of the same package (a rough sketch follows after this list).

  5. Use the override mechanism from #4 to get all the applications from #2 to work, using their preferred dependency versions.

  6. Accept drive-by PRs that fix random libraries etc.; they only need stringent testing if they affect “core” packages.
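
A rough sketch of the version-range override helper from #4; `overrideVersions` and the attrset-of-versions shape are hypothetical, not existing nixpkgs functions:

```nix
# Hypothetical helper: apply one override function to every generated
# derivation of a package whose version falls within a given range.
{ lib }:

{
  overrideVersions = { versions, from, to, override }:
    # `versions` is assumed to be an attrset mapping version strings to
    # derivations, as an automated generator might produce.
    lib.mapAttrs
      (version: drv:
        if lib.versionAtLeast version from && lib.versionOlder version to
        then drv.overridePythonAttrs override
        else drv)
      versions;
}
```

The same override (say, adding a native library to buildInputs) could then be applied once to every 1.x version of a package instead of being repeated per version.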

#1, #2, and #4 should reduce the amount of manual maintenance required through a combination of automation and restricting the scope of tested packages.

#5 nicely satisfies the core nixpkgs goal of “provide a set of working, tested applications”. Additionally, they’ll even be using the dependency versions non-Nix users are likely on, which should make it easier when filing bugs to upstream.

#3 nicely satisfies the additional goal of “make it easy to build Python dev/app environments with Nix”. Of course, there’s no guarantee that libraries will work, but hopefully we get people contributing to #6 to make that less of an issue :).

Ultimately, I think that attempting to stick to “single known good” versions of libraries and apps is a futile endeavour. Minimising the effort required to maintain “many known good” versions of at least the libraries seems more fruitful.

3 Likes

“That can’t be efficient,” says common sense. Some numbers:

see also six million sha256 hashes for 300K packages - a use case · Issue #2969 · dolthub/dolt · GitHub

I have analyzed some download statistics of PyPI, and (as expected) the download counts follow an exponential curve.

Y axis: downloads per month; X axis: the 150K most popular releases = 2.5% of all releases (6M).

So this means only a few releases are popular, and (let’s say) 90% of PyPI is unused junk data. For example, some packages are nightly packages with one release per day, and many packages have zero downloads (spam).

PyPI has “only” 400K packages… NPM has 2.3M packages (see https://libraries.io/). I have no exact number for NPM releases, but assuming an average of 15 releases per package, that would be 35M releases. The sha256 hashes alone (stored as binary) would take 1120 MByte…

So, as expected, scraping the whole PyPI database is a waste of disk space.

1 Like