Status of lang2nix approaches

In fairness, I think the perf benefit is indirect. I fear being able to imperatively nix-build within derivations will lead people to write bad code, and my thing will lead them to write good code. I think that bad code will be less performant. (If the nix-build were always a tail call it might be fine, but that’s not going to happen.)

What we need from Maven/Crate/NPM/… snapshots is not the preservation of tarballs (they are normally found on backup locations: archive.apache.org, web.archive.org, tarballs.nixos.org, …) but rather the directory structure without new versions released after a certain timestamp (plus metadata .xml-files without those new versions).
It is not about a petabyte archive requiring corporate sponsors; it is about a few gigabytes which could fit in a single GitHub repo.

Sorry, my bad wording. I did not mean benchmark performance, but rather advantages (in features, conciseness, flexibility, solving the drawbacks of other approaches) over recursive Nix.

Note I said SWHIDs — I want everyone to content-address the data and do it in the same way so we can reuse each other’s content addresses. That would, for example, help crate2nix avoid needing network access.

Quite to the contrary of everyone depending on a central single point of failure archive, the standardization of content addresses should allow people to “pin” their own dependency graphs in a uniform way and therefore be more self-reliant.
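As a sketch of what such uniform pinning could look like (the SWHID syntax itself is real and specified by Software Heritage; the `fetchSWH` fetcher and its arguments are entirely hypothetical):

```nix
# Hypothetical sketch only: `fetchSWH` does not exist in nixpkgs.
# The point is that the identifier is intrinsic to the content, so any
# mirror serving the same bytes can satisfy the fetch.
src = fetchSWH {
  # A v1 directory identifier, as specified by Software Heritage.
  swhid = "swh:1:dir:0000000000000000000000000000000000000000"; # placeholder
  # Any host that serves the content works; no single point of failure.
  mirrors = [
    "https://archive.softwareheritage.org"
    "https://my-own-mirror.example.org"   # hypothetical
  ];
};
```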

Also, the Software Heritage foundation is interested in decentralized redundancy of this sort, BTW.

2 Likes

This is easier to answer. In general, Nix derives much of its power from having such a static build plan. We do relatively little eval, and then we have orders of magnitude more CPU-hours of actual building meticulously planned out.

The dynamism needed for lang2nix is in opposition to that, and should therefore be kept as minimal as possible. But nix-build in derivations makes it oh-so-tempting to fall back on sequential/imperative building, and thus far more dynamism than is needed. My drvs-that-build-drvs is designed to be very powerful/efficient but also not so easy, in order to incentivize people to get as much static plan out of as little dynamic computation as possible.

2 Likes

Yes, SWHIDs look really cool and complement the Nix model well, thank you for pointing them out.

Still, the task remains of daily/weekly crawling of Maven/Cargo/NPM/…, making snapshots with a SWHID for the whole repository, and running a fake webserver for this immutable Internet (which builders of non-fixed-output derivations would be allowed to access).

But that would let us get rid of many lang2nix cases, as the “recording phase” would already be done.

UPD: According to https://www.tweag.io/blog/2020-06-18-software-heritage/, SWH unpacks tarballs and stores individual files, so the tarballs’ hashes are lost. I wonder how that works with .jars, which are essentially .zip archives (Maven has some .tar.gz too). The hashes had better be kept: Maven does use .sha1 files next to its .jars and .poms. SWH might need to develop a new data schema (different from source code archiving) for preserving language repositories. Anyway, this direction looks promising.

1 Like

Agreed. However, the main issue is vendoring changes. E.g. relatively recently there was a regression in cargo vendor that changed permissions when unpacking sources. Many derivations that were updated during that window had wrong cargoSha256s once the regression was fixed. I recently checked the cargoSha256/cargoHash of all buildRustPackage derivations and more than 300 had incorrect hashes. We have had to fix cargoSha256s several times over the last couple of years.

But nixpkgs is pretty much only packaging binary crates, which have dependencies locked through Cargo.lock. So, for that use case Cargo and Nix both specify concrete dependencies.

To me, the primary issues are:

  1. Cargo.toml/Cargo.lock do not provide enough metadata to make derivations for all dependencies at eval time (e.g. we do not know what dependencies are activated through features). We need to download and unpack all crates and read their Cargo.tomls, e.g. through cargo metadata, to get the necessary metadata. This means that either the derivations should be generated through a separate step (e.g. running crate2nix) or by using IFD.
  2. Generating derivations for a substantial part of crates.io would probably add a lot of eval time to nixpkgs. This could have been acceptable with lazy eval, but as Rust trickles through the FLOSS ecosystem, it would add a lot of transitive dependencies (e.g. building GHC now depends on Rust/Cargo/various crates). Now such dependencies are cheap, because in the Rust build hooks we defer all dependency handling to Cargo. Though I guess flakes will make this more feasible, because evaluations can be cached.
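For illustration, point 1 with IFD might look roughly like this (a sketch only, not working code: `buildCrate` is a made-up function, and in practice `cargo metadata` needs network access or a pre-vendored registry inside the sandbox):

```nix
{ pkgs }:
let
  # Run `cargo metadata` inside a derivation to obtain the
  # feature-resolved dependency graph as JSON.
  metadataDrv = pkgs.runCommand "cargo-metadata.json" {
    nativeBuildInputs = [ pkgs.cargo ];
  } ''
    cp -r ${./.} src && cd src
    cargo metadata --format-version 1 --offline > $out
  '';

  # Importing the output forces the build at eval time: this is the IFD.
  metadata = builtins.fromJSON (builtins.readFile metadataDrv);
in
  # `buildCrate` is hypothetical: one derivation per resolved crate.
  map (p: buildCrate { inherit (p) name version; }) metadata.packages
```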
4 Likes

I maintain two projects that integrate with the language package manager as a plugin, which looks like a slight variation on B or H. The idea is that this automates keeping the generated file up-to-date as the developer makes changes.

Downside is that it may not be as useful from a Nixpkgs perspective, because that wasn’t really a goal. But imagining it were, it’d probably require upstream to actually use this Nix-specific plugin, and I’m also not yet sure how Nixpkgs would consume the output.

1 Like

I missed buildBazelPackage, whose fetchAttrs.sha256 is the hash of a directory of vendored deps, similar to cargoSha256.

It is not a lang2nix tool, but it suffers from the same problems: its hash drifts too, as something keeps changing on the servers it downloads from. For example, python3.pkgs.tensorflow_2's fetchAttrs.sha256 in nixpkgs's master is not valid:

hash mismatch in fixed-output derivation '/nix/store/wmnf16gin2pcqgawjk26rggprnxfbdb8-tensorflow-gpu-2.4.2-deps'
  wanted: sha256:10m6qj3kchgxfgb6qh59vc51knm9r9pkng8bf90h00dnggvv8234
  got:    sha256:1xjmfp743vmr6f36d15dlmkgiin89g31f68hhfzgk3sm1xpk1mj2

BTW, snapshotting of PyPi and Conda on a 12-hour basis already exists:

The H variant - self-modification of Nix code from builtins.exec when the lock file is outdated - can be extended to managing the sha256 of fetch*-functions, removing this burden from people too:

If the url passed to a fetch* function exists in <nixpkgs/fetch.lock>, it is used; otherwise it is calculated and added.
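A hypothetical <nixpkgs/fetch.lock> could be as simple as a flat attrset from URL to the memoized facts; everything below is made up to illustrate the shape:

```nix
# Hypothetical fetch.lock: one entry per URL that nixpkgs ever downloads.
{
  "https://github.com/google/shaderc/archive/v2021.2.tar.gz" = {
    sha256  = "0000000000000000000000000000000000000000000000000000"; # placeholder
    narSize = 1234567;                                                # placeholder
  };
  # ...
}
```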

Also, fetch.lock would be the single source of truth for everything that nixpkgs downloads, enabling offline installation (which is in demand: Offline build "source closure", Using NixOS in an isolated environment, …)

Deleting fetch.lock would initiate a mass test for dead and changed links.

A related case: NixOS’s system.requiredKernelConfig, which is currently broken (it does not enable or check the kernel options).
Enabling a kernel config option affects other options, a resolution process reminiscent of Maven’s.
Implementing system.requiredKernelConfig properly puts us in front of a choice:

  1. always recompile the kernel, adding the requested options, whenever a relevant NixOS config setting (zram.enable, swapDevices, adding "amdgpu-pro" to services.xserver.videoDrivers, …) is changed, even if the requested option is already enabled implicitly in the default kernel.
  2. use IFD to read the final kernel options
  3. memorize somewhere the result of the resolution process (requested options → final options), similar to (requested artifacts → final artifacts); if the lang2nix problem can be solved in a general way, its solution could also serve system.requiredKernelConfig
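Option 3 could reuse the same memoization machinery as the lang2nix case. A hypothetical sketch (none of these helpers exist today; the memo-file idea mirrors the .lock-file approach described above):

```nix
# Hypothetical: memoize the kernel's option-resolution step, mapping
# (kernel version, requested options) -> final .config.
finalConfig = memoize {
  memoFile  = ./.kernel-memo.nix;                           # made-up helper
  memoKeys  = [ "kernel=${kernel.version} CONFIG_ZRAM=y" ];
  # When the key is missing, resolve the options in a sandbox, e.g. by
  # running `make olddefconfig` and reading back the resulting .config.
  calcValues = resolveKernelConfig {                        # made-up helper
    inherit kernel;
    requested = [ "CONFIG_ZRAM" ];
  };
};
```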
2 Likes

BTW, if we are heading toward adopting SWHIDs, we should retire fetchFromGitHub (and its tarball-downloading friends for other Git hostings) in favor of fetchgit.

Tarballs from https://github.com/$owner/$repo/archive/$revision.tar.gz miss .gitignore files, and their contents can be preprocessed by git-archive.

Example: https://github.com/cryfs/cryfs/blob/3f66c7ceda4f934a78a6a83d0d735f911aaaecf8/src/gitversion/_version.py#L21-L27 - the files in the tarball and in the repository differ:

index 207dc691b..1c24d6422 100644
--- a/nix/store/k8vzbf7isybzyqh48fhsbf5hnn8rzdlc-fetchgit-rcf3023406969b14610df03a043fca8a078c9c195-2019-06-08/src/gitversion/_version.py
+++ b/nix/store/6m0ain6g8pwrp63676ymd62xgxw92n95-source/src/gitversion/_version.py
@@ -23,8 +23,8 @@ def get_keywords():
     # setup.py/versioneer.py will grep for the variable names, so they must
     # each be defined on a line of their own. _version.py will just call
     # get_keywords().
-    git_refnames = "$Format:%d$"
-    git_full = "$Format:%H$"
+    git_refnames = " (tag: 0.10.2)"
+    git_full = "cf3023406969b14610df03a043fca8a078c9c195"
     keywords = {"refnames": git_refnames, "full": git_full}
     return keywords

If we fall back to downloading sources from softwareheritage.org, where these macros are not expanded, the build will fail. So buildPhase should not rely on the sources having been preprocessed with “git-archive”.

Also:

  1. those $Format macros could even expand to the current time (Git keyword expansion, Git Memo v1.1 documentation), making tarball content volatile. That is, not only could the tarball itself vary with tar/gzip upgrades or command-line switches on the Git hosting’s cloud, but the files inside the tarball could change in the next instance of the tarball, with the same commit-id and tree-id.
  2. It seems that the choice of which files to exclude from the tarball, and whether to preprocess them, is an ad hoc decision of the Git hosting, and could change. Moving to stable ids like SWHIDs means we can no longer download tarballs and rely on the hostings’ preprocessing.
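The migration itself is mechanical; for example (both fetchers are real nixpkgs functions, the hashes are placeholders):

```nix
# Before: depends on the hosting's tarball generation (git archive plus
# gzip on GitHub's servers), including export-subst/export-ignore
# preprocessing that can change under our feet.
src = fetchFromGitHub {
  owner  = "cryfs";
  repo   = "cryfs";
  rev    = "cf3023406969b14610df03a043fca8a078c9c195";
  sha256 = lib.fakeSha256; # placeholder
};

# After: a plain git checkout; the hash covers only the tree itself, so
# the same sources can come from any mirror, including Software Heritage.
src = fetchgit {
  url    = "https://github.com/cryfs/cryfs";
  rev    = "cf3023406969b14610df03a043fca8a078c9c195";
  sha256 = lib.fakeSha256; # placeholder
};
```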
1 Like

I’m really excited about this idea.

Below is a small demo. The demo is not about lang2nix (first, to avoid language-specific objections - “rust has lock files”, “there is an sbt plugin for that”, …; and second, I have not yet implemented it for language frameworks, and by the time that happens I will probably have gotten rid of bash and will have trouble showing a working demo in a Nix we can all read).

An example about shaderc, which has its vendored dependencies published in a separate branch https://github.com/google/shaderc/tree/known-good shortly before or after each release, and maintained in nixpkgs manually: https://github.com/NixOS/nixpkgs/blob/67c4132368dd7612d5226a99ec8a2e3c1af68b76/pkgs/development/compilers/shaderc/default.nix

The setting is very similar to lang2nix, isn’t it?

{ lib, stdenv, fetchgit, cmake, python3, pkgsCurrent }:

let
  version = "2021.2";
  #         -^^^^^^- to upgrade, just change this
  #                  (and even that can be automated)

  src = fetchgit {
    url      = "https://github.com/google/shaderc";
    rev      = "v${version}";
    memoFile = ./.memo.nix; # memoFile is optional; there is a global default
    #
    # Look, ma, no `sha256`.
    #
    # There is a magic inside `fetchgit` which is explained below on example
    # of `known-good`.
    #
    # `fetchgit` is a bit more complex, there are 2 memoization steps:
    #   1. `rev`     -> `fullRev`
    #   2. `fullRev` -> (`sha256`, `commitTime`, `narsize`)
    #
    # Shortly, if `sha256` is in `import ./.memo.nix`, it is just used, without
    # any IFD. Otherwise, we pause here, run `git` in sandbox and mutate
    # `./.memo.nix`
    #
    # It lacks parallelism of @Ericson2314's .drv.drv, but has the advantage
    # that the memo files are local, and can (should) be placed under version
    # control, similar to the ubiquitous .lock files.
    #
    # Actually, `./.memo.nix`'s attrset is maintained in memory and
    # flushed to disk once on Nix's exit, so this is just another
    # obstacle to parallelism.
    #
  };

  # Tolerate "known-good" branch updated within a day after the release.
  # `builtins.timeToString` and `builtins.timeFromString` are guests
  # from the future.  Nothing magical here: just pure functions which
  # could be implemented in pure Nix
  commitTime-nextday =
    builtins.timeToString (builtins.timeFromString src.commitTime + 86400);
  # But look, ma, there is not only auto-maintained `src.sha256`,
  # but also `src.commitTime`.
  # and `src.fullRev`
  # and could be auto-maintained `src.swhid`, `src.ipfs`, `src.magnet`, ....

  known-good =
    builtins.head (lib.memoize {
      # `memoFile` is optional. the global default is usually a good choice
      memoFile   = ./.memo.nix;
      # `memoFile` is a Nix file with attrs set inside.
      # Here we define the keys of that attrset we are interested in
      # There could be more than 1 key (e.g. `fetchurl`'ing from multiple urls)
      memoKeys   = [ "version=${version} fullRev=${src.fullRev}" ];
      # Either Nix function (on top of functions like `builtins.fetchGit`) or
      # the code to run in sandbox when `memoFile` has no requested `memoKeys`
      # (`lib.memoize` also has `mode` which could be "all" or "any"  to tell
      # if we need values for all the `memoKeys` or for any one) or `memoKeys`
      # are obsolete (there is also `memoRevision`  to tell if it is desirable
      # to try to calculate the value again; useful for `pkgs.geoip-database`)
      calcValues =
        # `pkgsCurrent` is defined next to `pkgsi686Linux`
        # overriding `system=builtins.currentSystem`
        # this is usually `x86_64-linux` even if we build for/on something else.
        pkgsCurrent.stdenvNoCC.mkDerivation {
          # it should be actually not an IFD-derivation,
          # but `builtins.sandboxedExec`, which is not yet implemented.
          # Creation of a derivation in the Nix store is needless, and
          # reusing existing results from the Nix store is undesirable
          name = "known-good-${toString builtins.currentTime}.nix";
          # The derivation is not FOD, so let's allow networking explicitly
          # `__allowNetworking` - another guest from the future -
          # works only in IFD-derivations.  again: it is actually
          # `builtins.sandboxedExec` simulated via an IFD-derivation
          __allowNetworking = true;
          GIT_SSL_CAINFO = "${pkgsCurrent.cacert}/etc/ssl/certs/ca-bundle.crt";
          buildInputs = [ pkgsCurrent.gitMinimal ];
          # Get the newest https://github.com/google/shaderc/tree/known-good
          # but not newer than (`src.commitTime`+1day) and then
          # store `known_good.json`'s content to `./.memo.nix`'s attrset
          # under key `memoKeys`
          buildCommand = ''
            git init
            git remote add origin ${lib.escapeShellArg src.url}
            git fetch origin known-good
            git checkout $(git rev-list -n1 --before=${commitTime-nextday} \
                           origin/known-good)

            # emit a list of the same size as `memoKeys`
            # each value corresponds to a key
            # (with memoMode="any", it is possible to return `null` for some)
            echo "[ { json = '''$(cat known_good.json)'''; # FIX: proper escape
                    } ]"  > $out
          '';
        };
    });

# and the rest is trivial...

in stdenv.mkDerivation rec {
  pname = "shaderc";
  inherit version src;

  outputs = [ "out" "lib" "bin" "dev" "static" ];

  patchPhase =
  let
    # parse JSON of
    # https://github.com/google/shaderc/blob/ee00a6bc9388acbc332b1ef2290ff6481b78b2cf/known_good.json
    p             = lib.listToAttrs (
                      map (args: lib.nameValuePair args.name args)
                          (builtins.fromJSON known-good.json).commits
                    );
    glslang       = fetchgit {
                      memoFile = ./.memo.nix;
                      url = "https://github.com/${p.glslang      .subrepo}";
                      rev = p.glslang      .commit;
                    };
    spirv-tools   = fetchgit {
                      memoFile = ./.memo.nix;
                      url = "https://github.com/${p.spirv-tools  .subrepo}";
                      rev = p.spirv-tools  .commit;
                    };
    spirv-headers = fetchgit {
                      memoFile = ./.memo.nix;
                      url = "https://github.com/${p.spirv-headers.subrepo}";
                      rev = p.spirv-headers.commit;
                    };
  in ''
    mkdir -p ${p.glslang      .subdir}
    mkdir -p ${p.spirv-tools  .subdir}
    mkdir -p ${p.spirv-headers.subdir}
    #
    # `fetchgit` by default produces tarballs, so `tar xf` instead of `cp`
    #
    tar xf ${glslang      } --strip-components=1 -C ${p.glslang      .subdir}
    tar xf ${spirv-tools  } --strip-components=1 -C ${p.spirv-tools  .subdir}
    tar xf ${spirv-headers} --strip-components=1 -C ${p.spirv-headers.subdir}
  '';

  nativeBuildInputs = [ cmake python3 ];

  postInstall = ''
    moveToOutput "lib/*.a" ${placeholder "static"}
  '';

  cmakeFlags = [ "-DSHADERC_SKIP_TESTS=ON" ];
}
4 Likes

I hope [RFC 109] Allow "import from derivation" in Nixpkgs, simply-stupidly, and safely by Ericson2314 · Pull Request #109 · NixOS/rfcs · GitHub can stimulate the development and adoption of lang2nix work.

2 Likes

During Summer of Nix, @DavHau started work on dream2nix, which is a framework for wrapping up the various lang2nix tools in an easy-to-use and easy-to-implement manner: GitHub - DavHau/dream2nix: A generic framework for 2nix tools

This is still in the early phases, but I am hopeful this will simplify the lang2nix ecosystem and make it much easier for new tools to be created. Along with the RFCs from @Ericson2314 (like [RFC 0092] Computed derivations by Ericson2314 · Pull Request #92 · NixOS/rfcs · GitHub and [RFC 109] Allow "import from derivation" in Nixpkgs, simply-stupidly, and safely by Ericson2314 · Pull Request #109 · NixOS/rfcs · GitHub), this could be a nice improvement for Nixpkgs and the wider Nix ecosystem.

5 Likes

The badness of IFD is not (only) the incompatibility with CI.

IFD is basically an eval-time computation cached in the Nix store. Thus, it can be GC’ed at any moment, forcing a re-evaluation on the next eval: generating fresh Nix code to import which no one reviews, whose re-evaluation cycles no one controls, and with no way to go back to the old code.

@voltth I think the proposal covers that: allImportedDerivations is manual rooting of all imported derivations, so nothing need be GC’d, and everything can and should be reviewed.

and there is no way to undo to the old code

I don’t get this? The imported derivations ought to be deterministic, as we always strive for.

Don’t you plan to allow networking for IFD computations?

Only fixed-output ones – just like normal.