Status of lang2nix approaches

volth · August 10, 2021, 10:25am

I tried to summarize all the approaches to packaging applications that, during their normal build process, interact with online repos (Maven, NPM, Cargo, …) to compute the list of dependencies.

To package them with Nix, many utilities have been created (yarn2nix, crate2nix, …; referred to as lang2nix hereinafter) which generate either Nix code or a bundle of dependencies, using the source trees of those apps as input.

It seems that each of the known approaches receives a good portion of criticism, and we still have no simple and convenient idea of how a deterministic builder like Nix should work with mutable online repositories.

(The text below is supposed to be an initial version of a wiki page, but I’d like to receive some feedback, both from the authors of the proposals and from people who may use other approaches in their private forks.)

A. All deps are downloaded into a huge bundle, whose output is secured with additional hash cargoSha256, mavenSha256, … (widely used)

[-] inconvenient (and easy to forget) to update the hash, especially when src points to a local dir, uses fetchgitLocal, or peeks at the HEAD of a development branch
[-] the bundles are typically big (hundreds of megabytes)
[-] difficult to replace some of the downloaded deps with locally built ones (important for jars with executables or .so files inside, which do not run on NixOS)
[-] hashes drift over time, because Maven/Cargo/etc are mutable
[-] @edolstra had more arguments against “abusing fixed-output derivation” (Restrict fixed-output derivations · Issue #2270 · NixOS/nix · GitHub)
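
For illustration, approach A for a Rust application might look like the minimal sketch below. The package name, owner and version are hypothetical, and lib.fakeSha256 is only a placeholder that has to be replaced with the hash Nix reports on the first build. Newer nixpkgs revisions prefer cargoHash over cargoSha256.

```nix
{ lib, rustPlatform, fetchFromGitHub }:

rustPlatform.buildRustPackage rec {
  pname = "example-app"; # hypothetical package
  version = "0.1.0";

  src = fetchFromGitHub {
    owner = "example"; # hypothetical owner
    repo = pname;
    rev = "v${version}";
    sha256 = lib.fakeSha256; # placeholder
  };

  # Hash of the whole vendored dependency bundle (a fixed-output derivation).
  # It has to be updated by hand whenever Cargo.lock changes.
  cargoSha256 = lib.fakeSha256; # placeholder
}
```

The single cargoSha256 covers every crate at once, which is why one changed dependency (or a change in how the bundle is produced) invalidates the whole bundle.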

B. Keeping lang2nix-generated .nix-file in <nixpkgs> next to the derivation code (widely used)

[+] the huge bundle is split into smaller artifacts, which can be shared between different projects and substituted with NixOS-aware versions
[-] inconvenient (and easy to forget) to manually run lang2nix, especially when src points to a local directory, uses fetchgitLocal, or peeks at the HEAD of a development branch
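
As a sketch of approach B with crate2nix: assuming a Cargo.nix was generated by running `crate2nix generate` in the source tree and committed next to the derivation code, consuming it could look roughly like this. The file location is an assumption for the example.

```nix
{ callPackage }:

let
  # Cargo.nix is the crate2nix-generated file; it must be regenerated by hand
  # whenever Cargo.lock changes.
  cargoNix = callPackage ./Cargo.nix { };
in
# One derivation per crate is built behind the scenes; this is the root crate.
cargoNix.rootCrate.build
```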

C. Import from derivation, … (in use by HerculesCI and seen in some people’s public repositories)

stdenv.mkDerivation rec {
  src = ...;
  buildInputs = import
    (stdenv.mkDerivation {
      name = "deps.nix";
      inherit src;
      # produces a Nix file whose content looks like `[ (fetchurl ...) (fetchgit ...) ]`
      buildCommand = "lang2nix $src > $out";
    });
}

[+] runs lang2nix automatically when src (or lang2nix itself) is changed
[-] the inner derivation (with lang2nix) must have network access, at least to Maven/Cargo/etc (or be fixed-output, but then it is case A), and allowing network access results in non-determinism, because Maven/Cargo/etc are mutable and we can get a different result on the next run
[-] import from derivation has some problems with distributed builds (AFAIK, solved in HerculesCI)

D. recursive nix (nix-build in nix-build)

stdenv.mkDerivation {
  src = ...;
  preBuild = ''
    lang2nix $src > deps.nix
    DEPS=$(nix-build deps.nix)
    # ... then try to adopt $DEPS
  '';
}

[+] replaces IFD with a more distributed build-friendly approach
[-] still, lang2nix must have network access, at least to Maven/Cargo/etc

E. @Ericson2314’s variant: nix-instantiate in nix-build ([RFC 0092] Computed derivations by Ericson2314 · Pull Request #92 · NixOS/rfcs · GitHub)

Will it perform better than D?
At first glance, no, but I might be missing something; I have not grokked CA yet.

F. C, D and E would be purer if we maintained (or convinced upstream to maintain) immutable snapshots of online package repositories and allowed network access only to particular immutable snapshots from within non-fixed-output derivations.

There is already a project snapshotting the metadata of Python repositories on GitHub every 12 hours.

H. Simply trying to minimize the drawbacks of the above methods, we can get something like lang2nix-generated file next to derivation code, but generated automatically (have not seen yet, just an idea)

# pseudo-code, not tested
let
  src = ...;
  depNix = stdenv.mkDerivation {
    name = "deps.nix";
    inherit src;
    # produces a Nix file whose content looks like `[ (fetchurl ...) (fetchgit ...) ]`
    buildCommand = "lang2nix $src > $out";
  };

  depFile =
    let
      # `drvHash` is a placeholder for some hash identifying `depNix`
      # or <nixpkgs/.cache/> + "/deps-${depNix.drvHash}.nix"
      localFile = ./. + "/deps-${depNix.drvHash}.nix";
    in
      if builtins.pathExists localFile then
        # if the generated file exists and matches `src`, it is exactly `B`
        # there is no IFD nor `builtins.exec`
        localFile
      else
        # if the file does not exist (or `src` changed)
        # `lang2nix` will be run once at eval time under the eval user
        # and added to `git`, just like the manual run of `lang2nix` in `B`
        # (`builtins.exec` takes a list of program and arguments and needs
        # `allow-unsafe-native-code-during-evaluation`)
        builtins.exec [ "sh" "-c" ''
          src=${depNix.src}
          out=${toString localFile}
          ${depNix.buildCommand}

          git add ${toString localFile}
          echo '"${toString localFile}"'
        '' ];
in
  stdenv.mkDerivation {
    inherit src;
    buildInputs = import depFile;
  }

andir · August 10, 2021, 10:40am

You are missing out on the tools that consume lockfiles and don’t have to produce any Nix files (with or without IFD/recursive Nix). Examples of such tools are poetry2nix and npmlock2nix. Both read the language-native lockfiles within (pure) eval and produce derivations without requiring Nix-specific changes (in most cases).
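
To make the idea concrete, here is a minimal sketch of the same technique applied to a Cargo.lock. It is not how poetry2nix or npmlock2nix are implemented. It assumes a lockfile format that records a `checksum` per registry package, that all registry packages come from crates.io, and that the lockfile sits next to the expression.

```nix
{ fetchurl }:

let
  # Parse the language-native lockfile during pure evaluation (no IFD as long
  # as the file is a local path).
  lock = builtins.fromTOML (builtins.readFile ./Cargo.lock);

  # Path and git dependencies, and the project's own crates, carry no checksum.
  registryPackages = builtins.filter (p: p ? checksum) lock.package;

  # One small fixed-output fetch per crate instead of one huge bundle.
  fetchLocked = p: fetchurl {
    name = "${p.name}-${p.version}.crate";
    url = "https://crates.io/api/v1/crates/${p.name}/${p.version}/download";
    sha256 = p.checksum;
  };
in
map fetchLocked registryPackages
```

The hashes come straight from the lockfile, so nothing Nix-specific has to be generated or committed.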

volth · August 10, 2021, 10:49am

Yes, it is worth its own category.
As long as the source directory is local, it is cute, but calling builtins.readFile on a file from $src (for example, one fetched with fetchFromGitHub) is IFD too, with all its problems back.

jtojnar · August 10, 2021, 11:21am

See also https://www.nmattia.com/posts/2019-11-12-language-support-overview-nixcon.html

matklad · August 11, 2021, 9:48am

(drive by comments of someone relatively familiar with Cargo, and relatively unfamiliar with Nix)

hashes drift over time, because Maven/Cargo/etc are mutable

Cargo (crates.io) packages are immutable – once a particular version of a package is published, it is frozen and can’t be changed/deleted. Only the overall set of packages changes over time (when folks publish new versions of packages)

One thing that feels conspicuously missing from the description is dependency resolution. My understanding is that nixpkgs and Cargo are fundamentally different in this respect. In Nix, every package specifies concrete dependencies. In Cargo, a package specifies constraints on the packages (version ranges), and it’s up to Cargo to run a version resolution algorithm and to select concrete packages given the current state of the registry (the state of all packages published so far).

Naively and ignoring prior art (which is easy for me – I don’t know a thing about prior art!) I would expect Rust packaging for Nix to work as a two-phase process.

In the first phase, Nix just takes a subset of the crates.io registry to create an alternative registry for software packaged with Nix. This subset will generally include only the latest versions of libraries.

In the second phase, actual Rust applications (ripgrep, hyperfine, etc) are packaged, using this nix-specific registry to resolve dependencies to specific versions.

Cargo’s side of this functionality is realized in local registry sources and alternative registries.
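
A rough sketch of how the second phase could be wired up from the Nix side, using Cargo's source replacement: `nixCrateRegistry` is a hypothetical derivation containing a Cargo local registry (an index plus .crate files) produced by the first phase, and the source name `nix-subset` is made up for the example.

```nix
{ writeText, nixCrateRegistry }:

# Generates a Cargo configuration file that makes Cargo resolve and fetch
# everything it would take from crates.io from the Nix-provided subset instead.
writeText "cargo-config.toml" ''
  [source.crates-io]
  replace-with = "nix-subset"

  [source.nix-subset]
  local-registry = "${nixCrateRegistry}"
''
```

A build would then place this file where Cargo looks for its configuration, so that version resolution only sees the crates present in the subset.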

volth · August 11, 2021, 12:18pm

Yes, this is mainly what I mean by mutability, not changes in released binaries: packages refer to version ranges, and releasing a new version of a dependency affects the result of lang2nix, making it unstable. Also, at least in Maven, old versions can disappear from the repository, forcing users to upgrade.

andir · August 11, 2021, 12:23pm

It depends on the context in which you think about these. For nixpkgs we could do the same as we do for some poetry2nix packages and keep a copy of the lockfile in the repo. In general you are obviously right, but I see those solutions as best in class for packages that are not within nixpkgs, for all the benefits of not having to care much about them. It is the easiest to integrate with “native” workflows and doesn’t require “a Nix person” on the team to constantly look after it.

volth · August 11, 2021, 12:38pm

What to do with languages which do not use lock files (we would have to invent our own lock format for Maven/Ant/Gradle/SBT/…)?
With multi-language projects?
With projects which use a sort of lang2nix to generate a .nix file with the list of dependencies, but have no lang at all (such as LibreOffice, Firefox-bin, TeX, Chromium-git, …)?

bobvanderlinden · August 11, 2021, 12:42pm

I was dabbling with the idea of introducing a recording/playback HTTP(S) proxy.

When creating a lock file, the proxy is set to record. All HTTP calls that a package manager does to the outside world will go through the proxy. The proxy will proxy the request and record the URL and hash the response body. The result is a lock file with URLs that Nix can resolve and check.

Next, within Nix, the HTTP proxy is used in playback mode. The lock file URLs and hashes can be resolved by Nix using fetchurl. The HTTP proxy will use the URLs and the results of fetchurl to mimic the traffic that happened during recording.

This could potentially support many different package managers; without having a massive build result with a single hash.

I have been dabbling with the idea, but have not yet committed to building it. Would this be a viable way to go?
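
The playback side could be sketched in Nix roughly as follows. The lock format (a JSON list of objects with `url` and `sha256` fields) and the file name are assumptions, since the proxy does not exist yet.

```nix
{ fetchurl, writeText }:

let
  # Hypothetical lock file written by the proxy in record mode:
  # [ { "url": "https://...", "sha256": "..." }, ... ]
  lock = builtins.fromJSON (builtins.readFile ./http-lock.json);

  # Each recorded response becomes its own small fixed-output fetch.
  # Note: fetchurl derives the store name from the URL, so unusual URLs
  # may additionally need an explicit `name`.
  entry = e: {
    inherit (e) url;
    path = "${fetchurl { inherit (e) url sha256; }}";
  };
in
# Manifest mapping every recorded URL to a store path; the proxy in playback
# mode would serve the file at `path` whenever `url` is requested.
writeText "playback-manifest.json" (builtins.toJSON (map entry lock))
```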

volth · August 11, 2021, 12:51pm

I’d say this is an implementation detail of lang2nix: currently they capture $HOME/.cache or $HOME/.m2; the same can be done with a proxy.

The problems of

“who will run the recording phase of lang2nix (the user manually | some script under the user account before the build | nix-build making fixed-output | nix-build in relaxed sandbox allowing network | …)” and
“where lang2nix’s result is to be stored/cached (in nixpkgs under git | in nix store | …)”

are still there.

andir · August 11, 2021, 1:06pm

Those will need code generation, no doubt about that. Perhaps the Maven ecosystem should invest in reproducible builds where the dependencies are also properly recorded (including a source hash). We can’t force other ecosystems, and thus we will always have some level of code generation. I am just saying that if you have the choice (for local/private/non-nixpkgs packages), the preference should be towards an approach that doesn’t require code generation. Multi-language projects can still be built with Nix even if those tools are used. You will then need some sort of Nix code plumbing. I don’t see a way around that. For those multi-language projects large FODs could be used, but that requires the language/build tool to provide a separation of build and fetch phases.

Ericson2314 · August 11, 2021, 4:18pm

This is one of the reasons I am so keen on pushing SWHIDs from https://www.softwareheritage.org/ as the universal standard. If everyone uses the same content addresses it will remove a bunch of friction.

Ericson2314 · August 11, 2021, 4:22pm

In fairness, I think the perf benefit is indirect. I fear being able to imperatively nix-build within derivations will lead people to write bad code, and my thing will lead them to write good code. I think that bad code will be less performant. (If the nix-build was always a tail-call it might be fine, but that’s not going to happen.)

volth · August 12, 2021, 5:50am

What we need from Maven/Crate/NPM/… snapshots is not the preservation of tarballs (they are normally found on backup locations: archive.apache.org, web.archive.org, tarballs.nixos.org, …) but rather the directory structure without new versions released after a certain timestamp (plus metadata .xml-files without those new versions).
It is not about a petabyte archive requiring corporate sponsors; it is about a few gigabytes which could fit in a single GitHub repo.

volth · August 12, 2021, 5:52am

Sorry, my bad wording. I did not mean benchmark performance but rather advantages (in features, conciseness, flexibility, solving the drawbacks of other approaches) over recursive Nix.

Ericson2314 · August 12, 2021, 6:10am

Note I said SWHIDs: I want everyone to content-address the data and to do it in the same few ways, so we can reuse each other’s content addresses. That would, for example, help crate2nix avoid needing network access.

Quite to the contrary of everyone depending on a central single point of failure archive, the standardization of content addresses should allow people to “pin” their own dependency graphs in a uniform way and therefore be more self-reliant.

Also, the software heritage foundation is interested in decentralized redundancy of this sort, BTW.

Ericson2314 · August 12, 2021, 6:24am

This is easier to answer. In general, Nix derives much of its power from having such a static build plan. We do relatively little eval, and then we have orders of magnitude more CPU hours of actual building meticulously planned out.

The dynamism needed for lang2nix is in opposition to that, and should therefore be kept as minimal as possible. But nix-build in derivations makes it oh-so-tempting to fall back on sequential/imperative building, and thus far more dynamism than is needed. My drvs-that-build-drvs is designed to be very powerful/efficient but also not so easy, in order to incentivize people to try to get as much static plan out of as little dynamic computation as possible.

volth · August 13, 2021, 6:44am

Yes, SWHIDs look really cool and complement the Nix model well, thank you for pointing them out.

However, the task of daily/weekly crawling of Maven/Cargo/NPM, making snapshots with a SWHID for a whole repo, and running a fake webserver for the immutable Internet (which builders of non-fixed-output derivations would be allowed to access) is still here.

But that would allow us to get rid of many lang2nix cases, as the “recording phase” is already done.

UPD: According to https://www.tweag.io/blog/2020-06-18-software-heritage/, SWH unpacks tarballs and stores individual files, so the tarballs’ hashes are lost. I wonder how it works with .jars, which are essentially .zip archives (also, Maven has some .tar.gz too). The hashes had better be kept: Maven does use .sha1 files next to .jars and .poms. SWH might need to develop a new data schema (different from source code archiving) for preserving language repositories. Anyway, this direction looks promising.

danieldk · August 14, 2021, 8:35am

Agreed. However, the main issue is vendoring changes. E.g. relatively recently there was a regression in cargo vendor that changed permissions when unpacking sources. Many derivations that were updated during that window had wrong cargoSha256s once the regression was fixed. I recently checked the cargoSha256/cargoHash of all buildRustPackage derivations and more than 300 had incorrect hashes. We have had to fix cargoSha256s several times over the last couple of years.

But nixpkgs is pretty much only packaging binary crates, which have dependencies locked through Cargo.lock. So, for that use case Cargo and Nix both specify concrete dependencies.

To me, the primary issues are:

Cargo.toml/Cargo.lock do not provide enough metadata to make derivations for all dependencies at eval time (e.g. we do not know what dependencies are activated through features). We need to download and unpack all crates and read their Cargo.tomls, e.g. through cargo metadata, to get the necessary metadata. This means that either the derivations should be generated through a separate step (e.g. running crate2nix) or by using IFD.
Generating derivations for a substantial part of crates.io would probably add a lot of eval time to nixpkgs. This could have been acceptable with lazy eval, but as Rust trickles through the FLOSS ecosystem, it would add a lot of transitive dependencies (e.g. building GHC now depends on Rust/Cargo/various crates). Now such dependencies are cheap, because in the Rust build hooks we defer all dependency handling to Cargo. Though I guess flakes will make this more feasible, because evaluations can be cached.

stephank · August 17, 2021, 11:45am

I maintain two projects that integrate with the language package manager as a plugin, which is a slight variation on B or H, looks like. The idea is that this automates keeping the generated file up-to-date as the developer makes changes.

The downside is that it may not be as useful from a Nixpkgs perspective, because that wasn’t really a goal. But if it were, it’d probably require upstream to actually use this Nix-specific plugin, and then I’m also not yet sure how Nixpkgs would consume the output.