Future of npm packages in nixpkgs?

Every once in a while someone tries to add packages with a generated node-packages.nix,
which adds on average 10,000 lines of code to nixpkgs.
This is obviously not a great solution, but adding those dependencies to our nodePackages package set has also become unbearable because the set got very big and slow: regenerating it takes almost half an hour, even with a fast gigabit connection.
Several other alternatives have been proposed:

  • npmlock2nix: read the package-lock.json → this does not make the amount of code any smaller; the package-lock.json is at least as big as the code generated by node2nix.
  • fetchNodeModules:
    • Advantages: only needs a single checksum, small footprint
    • Disadvantages: like buildGoModule or buildRustPackage it may break if npm decides to generate different output, and the sharing between different packages is not that great → this could be addressed by using recursive Nix
  • nodePackages with node2nix:
    • Advantages: node packages get shared between different packages
    • Disadvantages: does not scale, packages cannot be updated in isolation, and the huge code changes are impossible to review
  • Ban all node packages from nixpkgs and ask authors to create third-party repos
7 Likes

Every once in a while someone tries to add packages with a generated node-packages.nix,
which adds on average 10,000 lines of code to nixpkgs.
This is obviously not a great solution, but adding those dependencies to our nodePackages package set has also become unbearable because the set got very big and slow: regenerating it takes almost half an hour, even with a fast gigabit connection.

If your concerns are only about the combined set of node-packages: we
could stop doing that entirely, which would even make reviewing changes
feasible again. Whenever someone proposes to add something to the
“global” list of node packages it can’t realistically be reviewed, as the
diff is just too big to judge, or even to display on GitHub.

If we are concerned about repository size none of these approaches are
really great as node packages traditionally have tons of dependencies.

Having one node-packages.nix per package in nixpkgs is not great for
keeping eval performance up, but it would probably not incur such large
amounts of repository growth. Keep in mind that each modification of any
file always adds another copy of that file to the git object
storage. Updating one large file is therefore likely less efficient in
terms of keeping the repo size down than a set of seldom-updated packages.

Several other alternatives have been proposed:

  • npmlock2nix: read the package-lock.json → this does not make the amount of code any smaller; the package-lock.json is at least as big as the code generated by node2nix.

I am biased here (I started the project): while I believe it is the best
approach for packaging anything with just the native package manager, it
currently fails in many situations, as not every project has started
using lockfiles yet. Asking users to create a lockfile just to package a
program is not a great user experience and, even worse, a manual step.

It would work well for all other cases where there is a lockfile as it
doesn’t require code generation and for the most part just passes the
lockfile through to npm.
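For projects that do ship a lockfile, the packaging side would stay very small. A rough sketch of what usage might look like (the import URL and the `build`/`buildCommands`/`installPhase` attribute names are assumptions about the npmlock2nix API, not confirmed against the project):

```nix
# Rough sketch only -- the real npmlock2nix API may use different
# attribute names; treat everything below as an assumption.
{ pkgs ? import <nixpkgs> {} }:

let
  npmlock2nix = import (builtins.fetchTarball
    "https://github.com/nix-community/npmlock2nix/archive/master.tar.gz")
    { inherit pkgs; };
in
npmlock2nix.build {
  src = ./.;  # must contain both package.json and package-lock.json
  buildCommands = [ "npm run build" ];
  installPhase = "cp -r dist $out";
}
```

The point is that no generated code lands in nixpkgs at all; the lockfile already in the upstream repository is the single source of truth.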

  • fetchNodeModules:
    • Advantages: only needs a single checksum, small footprint
    • Disadvantages: like buildGoModule or buildRustPackage it may break if npm decides to generate different output, and the sharing between different packages is not that great → this could be addressed by using recursive Nix

It might seem like an attractive option, as there isn’t much to be done
by the package maintainer except for providing a few hashes. The risk
here is that the fixed outputs will yet again depend on the platform (or
system) that the initial fixed-output hash was determined on. One of the
situations where it will likely fail is npm install across different
architectures / operating systems (think x86_64-linux vs
aarch64-linux vs aarch64-darwin…). Node packages are famous for
downloading “random” binaries as part of the package installation step.
Those dependencies will not be discovered unless we restrict FODs to
specific platforms. Introducing multiple output hashes (one per
platform) will kill maintainability. I doubt many people have access to
both x86_64 darwin and linux machines and are willing to do that dance
whenever a package has to be bumped.
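To make the failure mode concrete, a fetchNodeModules-style fetcher would essentially be a fixed-output derivation along these lines (a hypothetical sketch, not the code from the actual PR):

```nix
# Hypothetical sketch of a fixed-output node_modules fetcher.
{ stdenv, nodejs }:

{ src, sha256 }:

stdenv.mkDerivation {
  name = "node-modules";
  inherit src;
  nativeBuildInputs = [ nodejs ];

  buildPhase = ''
    export HOME=$TMPDIR
    # --ignore-scripts avoids running install hooks that download binaries,
    # but whatever npm still writes must be bit-identical everywhere.
    npm ci --ignore-scripts --no-audit
  '';

  installPhase = ''
    cp -r node_modules $out
  '';

  # Fixed-output derivation: network access is allowed during the build,
  # but the result must hash to the same value on every platform --
  # x86_64-linux and aarch64-darwin alike.
  outputHashMode = "recursive";
  outputHashAlgo = "sha256";
  outputHash = sha256;
}
```

The single `outputHash` is exactly where the platform dependence bites: if npm lays out node_modules differently per platform, the hash recorded on one machine silently fails on another.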

We’ve had these issues with the go packaging as well as the rust
packaging and they are wasting a lot of time when things go wrong. It
isn’t exactly straightforward to debug these - especially if Hydra has
already cached a hash that wouldn’t be reproduced on your platform.

  • nodePackages with node2nix:
    • Advantages: node packages get shared between different packages
    • Disadvantages: does not scale, packages cannot be updated in isolation, and the huge code changes are impossible to review

We can have one node packages set per package as outlined above. It doesn’t
sound great but I believe it is not worse than this approach.

  • Ban all node packages from nixpkgs and ask authors to create third-party repos

If we accept subpar packaging, and/or repository size is our primary
objective (it shouldn’t be), then that is the right way, as many of these
packages have a questionable cost/benefit relationship compared to many
of our core packages.

1 Like

In the long term I think what we need in the Nixpkgs repo is essentially a store with versions and hashes. You define for a certain language which applications you want, and it adds to the store all the versions with hashes it needs for those applications. This way, applications sharing the same dependencies will use fewer lines of code, they can be updated individually, and there is the potential to force versions for multiple applications, e.g. when patch updates are available. Having said this, I have no idea how realistic this is with node.

Some related issues:

2 Likes

Not super relevant here, but the issue with go is only case normalization in filesystems, and its output is rather stable and standardized compared to the things that nodejs does.

Agreed, I have experienced that as well

This is also a concern I have: both as a contributor and as a reviewer I can’t realistically test or ‘properly’ review such a diff. Also, the likelihood of conflicts is approaching ‘1’…

I still like the idea of ‘sharing what can be shared’, though: building some node modules can be heavy and being able to fetch them from the cache is nice.

I think my ‘dream approach’ would be to keep the ‘shared’ nodeModules set, but have a node2nix feature where you add or update one node package and its dependencies, leaving the rest of the set untouched.

  • Disadvantages:
    • This would make the set bigger, since when an update to dependency ‘X’ is released, not all modules will update to that newer version of ‘X’ at the same time
    • This would increase the number of PRs
  • Advantages:
    • Compared to ‘current’ node2nix: each individual PR is smaller in scope, so less likely to break unrelated things, easier to test, easier to review, and (somewhat) less likely to conflict
    • Compared to other approaches: the modules can be shared and cached

Now I’m well-aware that “talk is cheap”, and adding such a feature would be a major effort - just wanted to throw it out there :wink: . There has been some discussion around this in Partial regenerations · Issue #192 · svanderburg/node2nix · GitHub

1 Like

The objects are compressed, so the actual difference is not that huge, but git performance with very large files is not that great.

fails in many situations as not every project has started
using lockfiles yet.

or wants to. What do we do when upstream does not want to maintain one?

Node packages are famous for
downloading “random” binaries as part of the package installation step.

puppeteer, something needs to be done about that. We can’t have more chromiums in the store.

This is also a concern I have: both as a contributor and as a reviewer I can’t realistically test or ‘properly’ review such a diff. Also, the likelihood of conflicts is approaching ‘1’…

I personally take a basic look over it, but if something breaks, like netlify-cli a while ago, I don’t really care about that. There are also cases where hashes for other things change, which is not handled automatically.

I wonder if a file per node package would be the better long-term approach. Have a default.nix that uses self = lib.fold (a: b: a // b) {} (map (name: import (./. + "/${name}") { inherit self; /* FIXME add more args or callPackages */ }) (builtins.attrNames (builtins.readDir ./.))) or something similar to consume the entire directory of files that contain one version of each dependency that we require. This has the advantage that adding a single node dependency only introduces additional files to the repo and those are simpler to audit on PRs. It also means that they are unlikely to ever change (-> no additional overhead when upgrading individual packages).

The downside is that we will quickly figure out how fast Nix can import files from a directory listing and eval might be hit even harder…
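Spelled out as a file, the directory-consuming default.nix sketched above might look roughly like this (a sketch; how extra arguments get passed to each file, e.g. via callPackage, is left open just as in the original idea):

```nix
# Sketch: one file per node package, merged into a single set.
# Each file in this directory is assumed to be a function of the form
#   { self }: { "some-dep-1.2.3" = <derivation>; }
{ lib }:

lib.fix (self:
  lib.foldl' (acc: name:
    acc // import (./. + "/${name}") { inherit self; }
  ) {}
    # readDir returns { "foo.nix" = "regular"; ... }; keep only the
    # package files, and skip this default.nix itself
    (builtins.filter
      (name: name != "default.nix" && lib.hasSuffix ".nix" name)
      (builtins.attrNames (builtins.readDir ./.))))
```

The `lib.fix`/`self` knot is what lets each per-package file refer to dependencies defined in sibling files without importing them directly.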

1 Like

My (basically outsider) impression is that the buildGoModule and buildRustPackage functions revolutionized packaging for those ecosystems. I went from spending hours learning about Go paths and directories to package something to being able to package something in 5 minutes with buildGoModule. I’m strongly in favor of fetchNodeModules from a packager ease-of-use perspective.

If the npm output format changes, I expect we will be able to fix it in a somewhat automated way.

@andir, is the issue you mentioned with fetchNodeModules and multiple architectures not a problem with the other options?

3 Likes

Doesn’t the fetchNodeModules method keep the dependencies out of the nixpkgs git repo?

Hello,

I’ve noticed that there’s a new discussion about NPM packages in Nixpkgs. I’d like to also give my point of view about it and provide some background information. Maybe this helps to decide in which direction we want to go.

Deploying NPM packages with Nix is a complicated problem – NPM is both a build and dependency manager, and obviously the latter aspect conflicts with Nix.

Bypassing NPM’s dependency management features is not trivial – it has odd behaviour, certain dependency specifiers are more “loose” than Nix’s dependency addressing mechanism (with hashes), we must bypass the local cache (which in turn may cause downloads to be triggered), and we must deal with undeclared native dependencies. Furthermore, because NPM is also a build manager, we must still make sure that it can run the build scripts.

Furthermore, every major NPM release may introduce new features that could break the integration approach.

How the current NPM packaging approach came to be was mostly just an accident :slight_smile: . Long-time Nixpkgs project members probably already know that before we started using node2nix, there was npm2nix – for a while it worked well enough. Sometimes we ran into breakages when new major versions of NPM were released, but for a while problems seemed to be easily fixable.

At some point, packages with circular dependencies were introduced (I don’t think this is a feature intentionally supported by the developers of NPM, but they were there all of a sudden!). npm2nix couldn’t cope with these kinds of dependencies, and I got stuck with the same problem as well. Because Shea Levy (the npm2nix author) no longer had any time to work on npm2nix, I’ve decided to investigate.

After lots of experiments, I realized that there was a fundamental problem with npm2nix: just symlinking dependencies into the node_modules/ subfolder of each dependency no longer sufficed to cope with circular dependencies. Instead, we must remember earlier-deployed dependencies and copy them. When a dependency has been encountered previously, it should be skipped. This basically required a fundamental rewrite of the entire tool.

In the beginning, my implementation was somewhat controversial. The first attempt relied on building derivations in the evaluation phase. Later I did a major revision that computed the entire dependency graph before the evaluation stage, solving this problem.

Still, my implementation was rejected by some people in the community (mostly because it looked too complicated, which was IMO “necessary evil” to cope with NPM’s peculiar behaviour).

Some of my findings were also integrated into npm2nix, making it possible to still deploy some packages with circular dependencies. Because this solution worked “well enough” for most people, npm2nix was kept and I decided not to push my implementation forward anymore. Nonetheless, I kept using it for my own projects and decided to call it node2nix. For quite some time (more than a year) npm2nix and node2nix co-existed.

Then, roughly a year later, a controversial new feature was introduced in a new major release of NPM: dependency de-duplication. Basically, dependencies of a package are no longer deployed in the node_modules subfolder of each dependency, but “moved up” in the dependency tree until a conflict is encountered.

This major NPM change (again!) broke npm2nix – now all of a sudden npm2nix could no longer be used to deploy any package. As a result, the entire NPM package set in Nixpkgs was completely broken.

For a while, Nixpkgs appeared to be unfixable, until I gave node2nix a try. It seemed to work fine and I basically proposed to use it for Nixpkgs instead of npm2nix. This basically explains how node2nix was introduced into Nixpkgs, and why it was named node2nix, and not npm2nix. Basically, node2nix came about because there was not a better alternative :slight_smile:

Although node2nix mostly works for a large category of packages, its design also had to be overhauled several times. When package-lock.json files and offline installations were introduced, I had to do another major rewrite to bypass the local cache.

Sadly, NPM is still subject to evolution and certain changes in newer NPM versions have introduced new kinds of problems to node2nix. Furthermore, there are also new kinds of use cases for which node2nix was initially not “designed”.

From a functional perspective, we have the following NPM deployment use cases with Nix:

  • Deploying end-user packages (e.g. from the NPM registry)
  • Deploying local development projects
  • Deploying remote development projects (this is a typical Nix use case; NPM does not have an equivalent)
  • Deploying NPM dependencies in a non-NPM project

node2nix was only designed for the first two use cases. The latter two can only be achieved by creatively integrating the generated code in certain ways, which isn’t trivial at all.

Furthermore, there are other drawbacks as well:

  • For deploying end-user packages, you always have to regenerate the Nix expression for the generated package set as a whole. The advantage is that common dependencies are reused in the generated Nix expression (so that the amount of code churn is smaller), but the drawback is that regeneration is very time consuming, and typically does not map well to the Nixpkgs development workflow (in which each commit refers to a single, or well identified group of packages)
  • The node-env.nix has evolved into a very complicated beast. This is mostly caused by coping with lock files and bypassing the local cache – as a result, in its current state it is very hard to rewrite it in such a way that we can use it as a deployer for NPM dependencies in non-NPM projects
  • Another result of a very complicated node-env.nix is that it runs out of memory when the nesting of dependencies is too deep.
  • We also cannot easily implement support for deploying remote development projects.
  • The NPM dependency resolution algorithm is wrong. In newer versions, NPM makes a distinction between the origins of packages. For example, async 0.2.2 from the NPM registry is considered a “different” dependency than async 0.2.2 deployed as a source tarball from an HTTPS URL. In older implementations, they were considered the same. Supporting these differences requires a substantial rewrite of the dependency resolution algorithm in node2nix.
  • node2nix does not use any newer features of Nix. For example, newer versions of Nix also support SRI hashes, but node2nix doesn’t use them.
  • Some newer NPM features are not yet supported, e.g. workspaces
  • The build process isn’t tweakable and its overriding capabilities are somewhat limited. Again, the fact that node-env.nix is a mess is a major impediment.

As you can probably see, to implement these new use cases and address some of the fundamental flaws of node-env.nix, it is not possible to just implement a “quick fix”. Instead, a fundamental revision/rewrite of the base layer, node-env.nix, is required.

I have been working on a new base layer for a while (which I only have as a private PoC), but sadly it’s progressing very, very slowly. This can be attributed to the fact that I can only do the development in my spare time.

Basically, the idea behind the new base layer is that instead of a Nix expression (node-env.nix), you run a tool that “tricks” NPM into believing all dependencies are present by providing the Nix store paths to them. It already works fine locally, but the remainder of the integrations is not done yet, and deploying from local sub directories is still unsupported.
A companion tool can take care of obtaining the dependencies (either via a lock file, or via another companion tool that performs dependency resolution on an ordinary package.json file).

However, what I also realized is that with node2nix I have always been aiming for accuracy, and maybe this isn’t what everyone wants/needs. For Nixpkgs’ use cases, we may also be able to live with a less accurate, and more hackable approach.

From what I know from talking to people in the Nixpkgs community, I know that an incremental package deployment approach would be desirable.

I have also been thinking about a completely different integration approach:

  • Instead of making/maintaining our own implementation of NPM’s dependency resolution algorithm (which node2nix uses to deploy end-user packages), we can generate a package.json file with the package we intend to deploy as its only dependency. By running ‘npm install’ in an isolated environment, we can generate a package-lock.json file that contains the resolved dependency tree. Then the resulting package-lock.json file can be consumed by a Nix expression to deploy the dependencies. (For Git dependencies, we must still somehow compute the output hash, because these aren’t known by NPM).
  • We don’t run ‘npm install’ in the derivation anymore. If everything works out properly, the dependencies should already be there. However, the build steps (if any) must still be performed by other means
  • To support incremental deployments: we generate a data file in JSON format that can be easily read (e.g. with builtins.fromJSON) and updated, rather than generating Nix expressions. The disadvantage of using Nix expressions for incremental updates is that they are difficult to read and need to be evaluated.
  • Alternatively, if we can live with the large amount of code churn, we can also save the package-lock.json files for each package that we want to deploy in the Nixpkgs repository. This prevents us from having to implement some kind of de-duplication method. Furthermore, this also maps better to our Nixpkgs development workflow, in which each commit refers to a package.
  • We must make sure that the derivation can be easily “tweaked” with hooks and/or overrides, to correct potential deployment discrepancies
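The first step above, generating a lockfile for a single end-user package in isolation, could be sketched roughly as follows (a hypothetical helper; registry access, git dependencies, and npm version details are glossed over):

```nix
# Hypothetical sketch: resolve a single registry package to a lockfile.
# In practice this needs network access, so it would either be a
# fixed-output derivation or be run impurely by a companion tool.
{ runCommand, nodejs, writeText }:

pkgName: version:

let
  # A wrapper package.json whose only dependency is the target package.
  packageJson = writeText "package.json" (builtins.toJSON {
    name = "nixpkgs-wrapper";
    version = "0.0.0";
    dependencies = { ${pkgName} = version; };
  });
in
runCommand "${pkgName}-package-lock.json"
  { nativeBuildInputs = [ nodejs ]; } ''
    export HOME=$TMPDIR
    cp ${packageJson} package.json
    # Resolve the full dependency tree without installing anything
    # and without running any install scripts.
    npm install --package-lock-only --ignore-scripts
    cp package-lock.json $out
  ''
```

The resulting package-lock.json is then the artifact a Nix expression consumes, exactly as described in the first bullet.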

The above solution obviously has the disadvantage that it doesn’t deploy as quickly as node2nix (but this can be alleviated with incremental updates), does not support older NPM versions, and does not support build management facilities out of the box (but this can still be fixed with overrides).

Maybe a different solution that is fully lock-driven, incremental, and tweakable would be a better fit for Nixpkgs.

Anyway, regardless of the approach that we think is best, some major changes are required to make NPM deployments more future proof.

What do you think?

P.S. If you want even more context, I have written a number of blog posts about this subject. They have a ‘node2nix’ tag, and can be found here: Sander van der Burg's blog: node2nix. These will give you even more details on what I have done in the past.

15 Likes

(I created the fetchNodeModules PR)

Yes, that’s precisely the reason why scripts are disabled by default with --no-scripts in the PR.

2 Likes

I’m going to be doing some bits and pieces with node soon… so any improvements to this ecosystem are much appreciated.

What’s this about node downloading random binaries… I thought it was javascript and c/c++ stuff?

Some packages think it’s a good idea to download compiled (usually C/C++, but not always) libraries and binaries at build time so that their packages can link against them/use them in their build steps.

Instead of asking users to install them before using the package, it’s common practice to run a curl as part of the script hooks of your npm build. Usually these are custom builds hosted somewhere, so it’s not even trivial to check versions. Some are patched downstream.

I think this is a side effect of how there are always far too many dependencies, that probably means people never read installation instructions of dependencies deep in the tree, so packages that are more reasonable see little use.

You can work around this somewhat in node2nix, thankfully.

1 Like

Yeah there’s been a couple packages I’ve wanted to add but when I went to regenerate the list it took forever so I just gave up :upside_down_face:

2 Likes

NPM mixes package download, build, and install, plus side effects from the scripts.* entries in package.json. For caching to work properly, state and side effects generated between dependency installs have to be eliminated/snapshotted. Reproducibility patches will have to be done manually.

I propose we forget NPM and start manually packaging JS packages applying the patches for reproducibility. Objections?

Edit: I plan to do some prototyping to see if I’m into some dumbshit.

1 Like

I propose we forget NPM and start manually packaging JS packages applying the patches for reproducibility. Objections?

This sounds absolutely insane and is already a big enough problem for Python, an ecosystem with far smaller dependency graphs.

I would suggest adopting something like GitHub - nix-community/npmlock2nix instead and letting leaf packages deal with the package graphs.

1 Like

Not all packages are used. Popular dependencies are fewer. But I agree I should start this as an independent experiment. It needs to be prototyped. I’d love to see the JS packages being built in parallel using Nix.

Thanks for input @adisbladis .

Just in the interest of discussion, I made a yarn fetcher that handles both yarn.lock and package-lock.json.

It’s probably not 100% ready, but if we can get something to help with the js transition sooner (while waiting for npmlock2nix), then I think it’s all the better.
Note that I haven’t had the time to test on different platforms, so in essence I don’t know if the sha256 is going to be dependent on the platform.

1 Like

:wave: Hello everyone,

My name is Oleg, and I am relatively new to Nix. Allow me to add my 2 cents :slight_smile:

I think we can rely on the popular Node.js package managers here and get all the checksums from package-lock.json or yarn.lock (or similar) files, rather than downloading the whole world to infer checksums, and provide a mechanism to specify missing checksums if any (for some edge cases yarn doesn’t provide checksums, like a direct github-hosted dependency) so we don’t infer anything behind the user’s back.

Hence, we don’t make any assumptions about whether a package has been changed without a version bump (though I am not sure if that is even possible under the current npm registry policy). Anyway, the majority of packages come with checksums, so a bit of extra manual work for a newly introduced Node.js-based package should not be a problem, and the approach can solve the slowness issue described here.
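A minimal sketch of that idea, assuming npm’s v2/v3 package-lock.json layout where each entry under "packages" carries resolved and integrity fields (the field names come from that layout; the helper itself is hypothetical):

```nix
# Sketch: turn lockfile entries into fixed-output fetches using the
# integrity (SRI) hashes npm already recorded.
{ lib, fetchurl }:

let
  lock = builtins.fromJSON (builtins.readFile ./package-lock.json);
in
lib.mapAttrs (path: entry:
  fetchurl {
    url = entry.resolved;    # registry tarball URL from the lockfile
    hash = entry.integrity;  # SRI hash, usable directly by fetchurl
  })
  # "packages" maps node_modules paths to metadata in lockfile v2+;
  # entries without resolved/integrity (e.g. the root project, or the
  # edge cases mentioned above) are skipped and would need manual hashes
  (lib.filterAttrs (_: e: e ? resolved && e ? integrity) lock.packages)
```

Nothing is inferred behind the user’s back: every hash comes straight from the lockfile, and the filtered-out entries are exactly the ones where a checksum would have to be supplied by hand.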

If we still want to have Node.js-based packages within nixpkgs in a way that they can be reused/shared between other Node.js-based packages and granularly cached, I think we still have to keep such packages’ declarations (a lock file) somewhere. The same applies to any other Node.js project, not just nixpkgs. However, a reasonable question could be: do we really need all the packages we have, i.e. do we need to have them publicly cached? The same question should be asked when adding a non-Node.js package as well. That is, should we introduce a policy about how we add new Node.js packages to nixpkgs?

These are both great projects and I have been playing around with them. The node2nix project is pretty useful when you need a binary that is provided within a Node.js project; that works quite well for particular cases. However, I see a fundamental problem with this approach: they do not provide packages as individual Nix packages so that they would be available for reuse, tweaking, and overriding, i.e. no granularity is provided. They do share/cache the tarballs, but this is just a part of the solution. If we had a mechanism that turns a Node.js project dependency tree into a Nix dependency tree, that would be highly beneficial. This is roughly what @sander is saying.

There is a project that aims to implement this approach - GitHub - Profpatsch/yarn2nix: Build and deploy node packages with nix from yarn.lock files. Have you had a chance to check it out?