New RFC draft: Standardize updater scripts (successor of RFC 109)

Hey, I wanted to improve updater scripts in nixpkgs, so I wrote this RFC. For easier feedback, you can leave comments on Notion.

I am not a native speaker, so I will gladly take any grammar nitpicks!

Though here is a copy in Markdown:

Summary

Nixpkgs contains a non-trivial amount of generated, rather than handwritten, code. There is an update script (usually in Python) for each such ecosystem. We want to start systematizing these scripts to make them easier to maintain and use. Plenty of future work could build upon this, but we stop here for now to avoid needing to change any tools (Nix, Hydra, etc.).

This RFC standardizes only updater scripts; it barely touches any Nix code and leaves out many things the previous RFC 109 tried to do. In other words, this RFC tries to create a single library for all updaters, mainly to improve the maintainability of hundreds of disparate scripts.

Terms

  • the library — the package described by this RFC. It is named “library” because update scripts will use it as a library implementing most of the boring logic; only ecosystem-specific details are implemented by the updater script.
  • An updater script — what users run to add/update plugins in a specific ecosystem.
  • An input — a list of references to the outputs. For Vim plugins, for example, this is a CSV table with GitHub URLs and other data.
  • An output — what is generated at the end, for example a plugin or an extension.
  • An entry — a single unit that the updater updates. For Vim plugins this is a plugin; for GNOME extensions, an extension.

Motivation

Why generate packages?

Many ecosystems have central package repositories (e.g. pypi.org, npmjs.com, luarocks.org) that are parsable and from which we can easily get everything we need to know about every existing package. This allows us to automatically generate a list of all packages for each ecosystem.

Or, if we don’t want to fill nixpkgs with unused packages, we can store a list of package names in nixpkgs and generate .nix code only for those. This is what update scripts currently do: they take a list of entries and prefetch all the info required to build the outputs.

This is much more desirable and maintainable than packaging every single package manually, especially if the packaging process is almost identical for each entry.

Why create this RFC?

The current approach is to create an updater script, which takes a list of entry names and produces (usually) a .nix file with all packages updated to their latest versions. As you can imagine, having hundreds of independent ad-hoc scripts is not a good solution at all. Some of these scripts are really dated and poorly written, and they may have both obvious and obscure bugs.

As an example, take pluginupdate.py. While working on https://github.com/NixOS/nixpkgs/pull/336137, I dug deep into the code and found a dozen bugs, a large amount of suboptimal code, and a few major issues. This doesn’t mean the original version was poorly written; it probably started as a small and straightforward script to which features were added one at a time over the years. As usual in software, the result is an ugly Frankenstein, which is no one’s fault. A more specific description of the problems in pluginupdate.py is in the “Prior art” section.

Some of the pros of creating a standardized ecosystem for updaters are:

  • Code reuse: there are a ton of update scripts in nixpkgs currently, and almost every one duplicates the logic of the others, each in an entirely different way.
  • Easier adoption for new updaters: I expect that implementing a few methods in a class would result in a fully working update script for an entirely new ecosystem! A more detailed explanation of the implementation is in the “Detailed design” section.
  • Adding a feature to one updater automatically adds it to every other updater: for example, I added a feature for updating individual plugins. Any updater script that properly implements the interface gains this feature too (note: this is just an example; updating individual entries is a critical feature (see the “Goals” section) and will be shipped from day 1).

Detailed design

I am choosing Python as the language for the update library simply because it is what I am most familiar with, and Python is really great for writing scripts of any kind. I also choose OOP and inheritance as the main focus of the library, as I think they suit this use case best (I know it is weird to use a super-dynamic language, OOP, and inheritance in the community of a 100% pure and functional language, but I believe it is the best choice if we prioritize extensibility and developer experience).

Also, everything in the library must be asynchronous. This makes it easier to parallelize network requests, which I believe is what the script will spend most of its time doing. The threading library should be used only for loop-blocking operations (e.g. running some shell command).
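A minimal sketch of what that could look like, assuming nothing about the final API (fetch_entry and fetch_all are purely illustrative stand-ins for an updater’s network calls):

```python
import asyncio

async def fetch_entry(name: str) -> dict:
    # Placeholder for a real aiohttp request to an upstream API.
    await asyncio.sleep(0)
    return {"name": name, "version": "1.0"}

async def fetch_all(names: list[str]) -> list[dict]:
    # Bound concurrency so we don't hammer the upstream API.
    sem = asyncio.Semaphore(10)

    async def bounded(name: str) -> dict:
        async with sem:
            return await fetch_entry(name)

    # gather() preserves input order, so results line up with names.
    return await asyncio.gather(*(bounded(n) for n in names))

results = asyncio.run(fetch_all(["plugin-a", "plugin-b"]))
```

The semaphore is just one possible rate-limiting strategy; the point is that concurrency lives in the library once, instead of being reimplemented per script.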

Main design idea

The library provides a class, which is then inherited by the updater script. The library implements the boring methods, such as:

  • Asynchronously fetching info about entries
  • Adding a new entry to the input file and updating the output
  • Abstract implementations of dataclasses for
    • EntryInfo (not prefetched entry; only data that is stored in the input file)
    • Entry (result info that is written to the output)

The library should also implement other functionality that is common between updaters (for example prefetching scripts, running commands, running nix-env, etc.).

Updater scripts implement only ecosystem-specific functions:

  • Fetching info about entries
  • Implementations for different data models (EntryInfo, Entry)
  • Getting all possible entries (where possible; e.g. the GNOME Extensions updater scrapes info about all existing extensions and adds them to the output file). There must be either an implemented function for scraping the list of all entries OR an input file.
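To make the split concrete, here is a hypothetical sketch of that interface. All names (Updater, EntryInfo, Entry, fetch_entry, update_all) are illustrative, not a proposed final API:

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class EntryInfo:
    """What is stored in the input file (not yet prefetched)."""
    name: str

@dataclass
class Entry:
    """The fully resolved record written to the output file."""
    name: str
    version: str
    sha256: str

class Updater(ABC):
    @abstractmethod
    async def fetch_entry(self, info: EntryInfo) -> Entry:
        """Ecosystem-specific: resolve one EntryInfo into an Entry."""

    async def update_all(self, infos: list[EntryInfo]) -> list[Entry]:
        # Provided by the library: concurrency (and eventually caching,
        # retries, single-entry updates, ...).
        return await asyncio.gather(*(self.fetch_entry(i) for i in infos))

# An updater script then implements only the ecosystem-specific part:
class VimPluginUpdater(Updater):
    async def fetch_entry(self, info: EntryInfo) -> Entry:
        # Real code would query the GitHub API / run a prefetcher here.
        return Entry(name=info.name, version="0.0.1", sha256="...")
```

The idea is that everything in the base class comes for free, so a new ecosystem only writes fetch_entry and its data models.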

Having generated code inside its repository

(NOTE: this is just an idea, needs to be discussed)

As K900 wrote on Matrix, it would make nixpkgs not evaluatable offline, which is something people do in fact care about. Any solution?

I like this approach, as it de-bloats nixpkgs in a somewhat complicated but painless way. It lets us update the generated code extremely often without bloating nixpkgs with huge diffs and huge files in the git tree.

I imagine we would have a separate repository that doesn’t depend on nixpkgs and contains only JSON files. CI would then run an update for each ecosystem once in a while.

Then, in nixpkgs, we could provide data about generated packages via something like generatedPkgs.vimPlugins.plugin-name, which would contain:

{
  name = "plugin-name";
  version = "...";
  src = ...;
  # etc
}

So this is basically: (builtins.fromJSON (builtins.readFile "${generatedPkgs.src}/${ECOSYSTEM_NAME}/generated.json")).${PKG_NAME}. This data is used the same way as if that generated.json file were inside nixpkgs. This lets us avoid bootstrapping and similar issues while still getting the benefits of a slim nixpkgs. generatedPkgs.src is just a fetchgit that is updated once in a while.

To clarify, all updaters and related code live in nixpkgs, not in that separate repository. The separate repository is not fetched until a user actually uses it.

Examples and Interactions

Goals

Abstract goals are:

  • Create a technical-first solution, without trying to do multiple things at once.
  • Create a centralized library for managing updaters, with extensibility and developer experience in mind.

The goals are not:

  • Create a universal autogen script.

    This RFC strives to create a better system for updaters, not a better system for autogenerating code. All it does is create a better version of all the update.py scripts in nixpkgs; it takes no position on how an individual ecosystem packages its entries, i.e. the library doesn’t provide any Nix code, only Python for now.

  • Minimalism

    As we prioritize developer experience, I think it is fine if the library has dependencies. Those would probably be: click/typer (CLI), aiohttp (async HTTP), attrs (better dataclasses, see also “Unresolved questions”).

    It would also be great if the library implemented some functionality that is used by many updater scripts but is not required (e.g. a fetcher for GitHub repositories).

Technical features that must be supported by the first version of the library:

  • Updating an individual output.

    This is crucial for adding a new entry: without it, adding one entry would require updating every single entry just to add the generated values to the output file.

  • The output file must be JSON.

    Mainly because of the previous feature. JSON makes it easy to load the already-existing outputs; if we generate a .nix file instead, it becomes quite hard to get this information back.

    Why load information about already-existing outputs? If we add only one entry, we have to load the entire list of outputs into memory, append the new output, and write the file out again. We cannot easily know where the new output should be written in a .nix file, and JSON just makes this easier. It also doesn’t hurt performance, which is great news.
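A sketch of what updating a single entry in a JSON output file could look like (update_single_entry is a hypothetical helper name, not a proposed API):

```python
import json
from pathlib import Path

def update_single_entry(output_file: Path, name: str, entry: dict) -> None:
    """Load the existing JSON output, replace or add one entry, write it back.

    With a generated .nix file this would require parsing Nix; with JSON
    it is just a dictionary update.
    """
    data = json.loads(output_file.read_text()) if output_file.exists() else {}
    data[name] = entry
    # Sorted keys keep the file order stable, so the diff only touches
    # the changed entry.
    output_file.write_text(json.dumps(data, indent=2, sort_keys=True) + "\n")
```

All other entries round-trip untouched, which is exactly what makes single-entry updates cheap.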

  • Unit tests.

    Tests are really important from the first steps, as they will be really hard to add later. Adding tests should be as easy as possible, so contributors won’t be lazy about writing them.

Technical features that would be nice to have, but are not required:

  • TODO: I remember I had some ideas, but I forgot them

Cool! When can I use it?

After the first few feedback loops, I will start implementing the library described here and open a draft PR to nixpkgs.

Drawbacks

What are the disadvantages of doing this?

TODO: I am unsure if there are any, but be so kind as to provide your feedback.

Alternatives

What other designs have been considered? What is the impact of not doing this? For each design decision made, discuss possible alternatives and compare them to the chosen solution. The reader should be convinced that this is indeed the best possible solution for the problem at hand.

TODO: Provide your ideas!

Prior art

pluginupdate.py

This is a file used to share some updater logic between the Vim plugins updater, the Luarocks updater, and the Kakoune plugins updater. Because of its bad design, it is extremely hard to add new features, and there are a ton of bugs. As I can see from the code structure, it was designed as a simple 100-line script: load a CSV table, fetch the last commit and hash, write it to generated.nix. But as time passed, the number of Vim plugins in nixpkgs grew to thousands and the script got extremely slow.

Thus, someone improved it, adding parallelization and probably some other features, and as a result we now have a Frankenstein. To avoid repeating this, the library is designed with extensibility in mind and learns from the mistakes made in pluginupdate.py.

Here is a list of the main issues with pluginupdate.py (some are fixed in https://github.com/NixOS/nixpkgs/pull/336137; some are impossible to fix without a complete rewrite):

  • Inflexible structure: there is a single class with a few methods, but the main work is done in one method. pluginupdate.py only implements parallelization of fetching and fetchers for Git & GitHub; you cannot properly modify individual steps of the execution without rewriting a lot of the core logic.
  • Independent Plugin and PluginDesc: the latter contains the info we got from the input file, the former contains the already-fetched information. You cannot get a PluginDesc from a Plugin or vice versa without making network calls or assuming some values are the same (the fetched and provided names may differ).
  • Calls to nix-prefetch-... are repeated if you want to resolve multiple different attributes. This makes adding a new feature for automatically resolving nvimRequireTest painfully hard: I would need to completely rewrite the cache system or create a second cache specifically for those prefetch calls. Even if you already ran nix-prefetch-... and the result is in /nix/store, the command takes ~1s to complete.
  • Sorting is completely broken, because empty values in the CSV are returned as an empty string, not None. So it tries to sort by the alias, which is usually an empty string (the alias is set only for the small number of plugins where using the repository name would cause a conflict).
  • Etc etc etc…

See also https://github.com/NixOS/nixpkgs/pull/336137.

Lessons from RFC 109

This RFC is based on the rejected RFC 109, which proposed a very similar idea but lacked specifics on the technical implementation and tried to do many things at once. I created https://github.com/NixOS/nixpkgs/pull/336137, so I hope it can count as a v0 for this RFC. To improve on RFC 109, I approach things from a technical perspective first and only then think about the other problems (e.g. I started designing the updater library even before thinking about all the problems described in RFC 109). This RFC tries to create a conservative, working-for-now solution and doesn’t focus on bigger things, like using from2nix tools inside nixpkgs.

RFC 109 also wanted to split update scripts into two parts: one impure, fetching the absolute minimum required for updating the outputs, and one a pure derivation that fetches all the remaining information, so that the generated output file is reproducible. This is unreasonable from my perspective, as fetching the absolute minimum usually means fetching almost the same amount of information as the second part would. Update scripts are impure by design because they fetch info that cannot be fetched reproducibly (e.g. the latest commit or version); otherwise, we would not have update scripts at all.

Unresolved questions

  • How to name the library?
  • Do dataclasses work fine with inheritance? I don’t think so. Use attrs?
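For context on that last question: the standard library’s dataclasses do have a well-known inheritance caveat. A subclass cannot add a required field after an inherited field with a default, because the generated __init__ would place a non-default parameter after a default one:

```python
from dataclasses import dataclass

@dataclass
class PluginBase:
    name: str
    alias: str = ""  # inherited field with a default

# Adding a required field in a subclass then fails at class-creation
# time with "non-default argument 'version' follows default argument":
try:
    @dataclass
    class Plugin(PluginBase):
        version: str  # no default -> TypeError
    inheritance_failed = False
except TypeError:
    inheritance_failed = True
```

On Python >= 3.10 this can be worked around with @dataclass(kw_only=True), and attrs supports kw_only fields as well, so this is exactly the kind of detail the “Use attrs?” question needs to settle.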

Future work

  • Analyze all update scripts in nixpkgs and create a list of their inputs and outputs.

    We should support different use cases, from simple Git clones of GitHub repos to complicated Python builds.


For more careful scoping — this should cover specifically updater scripts in Python, right? Because for some ecosystems the internal tools provide functionality useful for the updaters, so unifying on a Python codebase would be a loss.

I don’t want to force anyone to switch; if someone wants to keep things as they are, I don’t see any problem. Instead, switching to the library should bring enough benefits that switching makes sense on its own.

Also, updater scripts can call bash underneath, but they must produce JSON as the resulting output file (life is just so much easier that way). So during prefetching they can simply call their internal tool (this is actually what is done for GitHub repositories too: the script just calls nix-prefetch-github).
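As a sketch, the library could wrap the subprocess plumbing once so that any updater can shell out to its ecosystem’s tool (run_tool is a hypothetical helper name, not a proposed API):

```python
import asyncio

async def run_tool(*argv: str) -> str:
    """Run an external tool (e.g. nix-prefetch-github) and return its stdout.

    The point is that the library does the subprocess handling once,
    and updater scripts just call whatever CLI their ecosystem has.
    """
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    out, err = await proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(f"{argv[0]} failed: {err.decode().strip()}")
    return out.decode()
```

Because it is async, many such tool invocations can run concurrently alongside the HTTP prefetching.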

Honestly, I don’t see the need for an RFC on this.

Using standardised, centralised updater infra written in a sensible language should be extremely uncontroversial. People don’t create bespoke scripts because they like to, they do it because they need something that looks like an updater and quickly hacked something together that works well enough.

The only thing I see that would require any form of consensus would be the language and frameworks it’s written in. Though, while different people will almost certainly prefer other languages to python, I don’t think anyone would really object to it either.

I think the core of the issue is that we need the code to exist and people to know that it exists and use it. Neither of those issues require an RFC.

If you showed people an updater framework where they can simply configure it in an abstract manner and have it just work without needing to think about the details, I don’t think they’d need a done-and-decided RFC in order to be convinced to throw away their garbage bash script that only works half of the time.
If someone were to really like their crufty perl script or whatever, I also wouldn’t see a need to force them to switch either.

While you can (and should!) ask for comments on the design, I don’t think it’s something that we as the greater community need to decide. Just… do it.

If you really want to write an RFC on this, all it’d need is 3 paragraphs:

  1. Short background; why we should have standardised updater scripts (mostly a formality TBH)
  2. Decision to use python for it
  3. Rough outline of how the interface for package maintainers should look

If anyone cares about the implementation details, they should just get involved in the implementation themselves or stay silent about it IMHO.


OK, I am against using Python to wrangle language-specific ecosystem-metadata if someone has already written an update script using language-native libraries to talk to the package collection. Especially if the language-native libraries used are first-party w.r.t. the package collection.

Any good enough library will see quite a lot of adoption, but 100% adoption is an example of misguided uniformisation.

While you can (and should!) ask for comments on the design, I don’t think it’s something that we as the greater community need to decide. Just… do it.

Thank you for the feedback! You are right; I will probably just open a PR. I don’t even remember why I thought I needed to create an RFC for this.

I am against using Python to wrangle language-specific ecosystem-metadata if someone has already written an update script using language-native libraries to talk to the package collection.

I agree; I am not forcing users to use only Python. As I said, during prefetching they can call bash or really do anything else.

The Luarocks updater, for example, calls the luarocks CLI to generate an entry: nixpkgs/pkgs/by-name/lu/luarocks-packages-updater/updater.py at d7a38a56893692ae64ca565e9b03bc6c5cae47cb. The only downside is that the luarocks CLI is extremely slow (and works only half of the time because of weird timeouts). The library would be needed only to implement the boring parts: concurrency, updating only one entry, etc.