Status of lang2nix approaches

Yes, SWHIDs look really cool and complement the Nix model well; thank you for pointing them out.

However, the task of crawling Maven/Cargo/NPM daily or weekly, making SWHID snapshots of a whole repository, and running a fake web server for the immutable Internet (which builders of non-fixed-output derivations would be allowed to access) still remains.

But that would let us get rid of many lang2nix cases, as the “recording phase” is already done.

UPD: According to Long-term reproducibility with Nix and Software Heritage, SWH unpacks tarballs and stores individual files, so the tarballs’ hashes are lost. I wonder how that works with .jars, which are essentially .zip archives (Maven also has some .tar.gz files). The hashes had better be kept: Maven does use .sha1 files next to .jars and .poms. SWH might need to develop a new data schema (different from source code archiving) for preserving language repositories. Anyway, this direction looks promising.

Agreed. However, the main issue is vendoring changes. E.g. relatively recently there was a regression in cargo vendor that changed permissions when unpacking sources. Many derivations that were updated during that window had wrong cargoSha256s once the regression was fixed. I recently checked the cargoSha256/cargoHash of all buildRustPackage derivations and more than 300 had incorrect hashes. We have had to fix cargoSha256s several times over the last couple of years.

But nixpkgs is pretty much only packaging binary crates, which have dependencies locked through Cargo.lock. So, for that use case Cargo and Nix both specify concrete dependencies.

To me, the primary issues are:

  1. Cargo.toml/Cargo.lock do not provide enough metadata to make derivations for all dependencies at eval time (e.g. we do not know what dependencies are activated through features). We need to download and unpack all crates and read their Cargo.tomls, e.g. through cargo metadata, to get the necessary metadata. This means that either the derivations should be generated through a separate step (e.g. running crate2nix) or by using IFD.
  2. Generating derivations for a substantial part of crates.io would probably add a lot of eval time to nixpkgs. This could have been acceptable with lazy eval, but as Rust trickles through the FLOSS ecosystem, it would add a lot of transitive dependencies (e.g. building GHC now depends on Rust/Cargo/various crates). Now such dependencies are cheap, because in the Rust build hooks we defer all dependency handling to Cargo. Though I guess flakes will make this more feasible, because evaluations can be cached.
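
To make point 1 concrete, here is a minimal sketch of what eval-time Nix can read from a lock file alone. The inline lock text, the version number, and the all-zero checksum are made up for illustration; a real setup would read an actual Cargo.lock instead. The point is only that each entry carries a name, version, source, checksum, and dependency names, but nothing about which features are activated.

```nix
# Sketch only: evaluate with `nix-instantiate --eval --strict <file>`.
let
  # Illustrative lock content (assumption); in practice this would be
  # builtins.readFile ./Cargo.lock.
  lockText = ''
    [[package]]
    name = "serde_json"
    version = "1.0.64"
    source = "registry+https://github.com/rust-lang/crates.io-index"
    checksum = "0000000000000000000000000000000000000000000000000000000000000000"
    dependencies = [
     "itoa",
     "serde",
    ]
  '';

  lock = builtins.fromTOML lockText;
in
  map (p: {
    inherit (p) name version;
    # Dependency names are present, but without their resolved features.
    deps = p.dependencies or [ ];
    # Stays false for this entry: the lock text above carries no feature data.
    hasFeatures = p ? features;
  }) lock.package
```

Anything beyond this (activated features, build scripts, targets) has to come from the crates' own Cargo.tomls, which is why a separate generation step or IFD is needed.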

I maintain two projects that integrate with the language package manager as a plugin, which looks like a slight variation on B or H. The idea is that this automates keeping the generated file up-to-date as the developer makes changes.

The downside is that it may not be as useful from a Nixpkgs perspective, because that wasn’t really a goal. If it were, it would probably require upstream to actually use this Nix-specific plugin, and I’m also not yet sure how Nixpkgs would consume the output.

I missed buildBazelPackage, whose fetchAttrs.sha256 is the hash of the directory of vendored deps, similar to cargoSha256.

It is not a lang2nix, but it suffers from the same problem: the hash drifts too, because something changes on the servers it downloads from. For example, python3.pkgs.tensorflow_2’s fetchAttrs.sha256 in nixpkgs master is not valid:

hash mismatch in fixed-output derivation '/nix/store/wmnf16gin2pcqgawjk26rggprnxfbdb8-tensorflow-gpu-2.4.2-deps'
  wanted: sha256:10m6qj3kchgxfgb6qh59vc51knm9r9pkng8bf90h00dnggvv8234
  got:    sha256:1xjmfp743vmr6f36d15dlmkgiin89g31f68hhfzgk3sm1xpk1mj2

BTW, snapshotting of PyPI and Conda on a 12-hour basis is already here:

The H variant - self-modification of Nix-code from builtins.exec in case of outdated lock-file - can be extended to managing sha256 of fetch*-functions, removing this burden from people too:

If the URL passed to a fetch* function exists in <nixpkgs/fetch.lock>, the recorded hash will be used; otherwise the hash will be calculated and added.

Also, fetch.lock would be the single source of truth for everything that nixpkgs downloads, which helps offline installation (there is demand for it: Offline build "source closure", Using NixOS in an isolated environment, …).

Deleting fetch.lock would trigger a mass test for dead and changed links.
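
A minimal sketch of the read-only half of this idea, in plain Nix as it exists today. The file location ./fetch.lock, its JSON shape (a flat object mapping URL to sha256), and the name lockedFetchurl are all assumptions made for illustration; the "calculate and add" half needs something like builtins.exec and is not shown.

```nix
# Sketch only: look up the hash for a URL in a lock file, fail if it is missing.
let
  # Hypothetical lock file: { "<url>": "<sha256>", ... } encoded as JSON.
  lockPath = ./fetch.lock;

  lock =
    if builtins.pathExists lockPath
    then builtins.fromJSON (builtins.readFile lockPath)
    else { };

  # Hypothetical helper name; wraps the existing builtins.fetchurl.
  lockedFetchurl = url:
    if builtins.hasAttr url lock
    then builtins.fetchurl {
      inherit url;
      sha256 = builtins.getAttr url lock;
    }
    # The proposal would compute the hash and append it here instead of failing.
    else throw "fetch.lock has no entry for ${url}";
in
  lockedFetchurl
```

With this shape, the list of attribute names of the lock is exactly the set of URLs an offline mirror would have to provide.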

A related case: NixOS’s system.requiredKernelConfig, which is currently broken (it neither enables nor checks the kernel options).
Enabling a kernel config option affects other options, a process which has much in common with Maven dependency resolution.
Implementing system.requiredKernelConfig confronts us with a choice:

  1. always recompile the kernel with the requested options added whenever a relevant NixOS config setting (zram.enable or swapDevices or adding "amdgpu-pro" to services.xserver.videoDrivers, …) is changed, even if the requested option is already implicitly enabled in the default kernel.
  2. use IFD to read the final kernel options
  3. memorize somewhere the result of resolving process (requested options → final options), similar to (requested artifacts → final artifacts); if the lang2nix-problem can be solved in a general way, its solution can also serve system.requiredKernelConfig

BTW, if we are heading to adopt SWHIDs, we should retire fetchFromGitHub (and its tarball-downloading friends for other git-hostings) in favor of fetchgit.

Tarballs from https://github.com/$owner/$repo/archive/$revision.tar.gz miss .gitignore files, and they can be patched by git-archive

Example https://github.com/cryfs/cryfs/blob/3f66c7ceda4f934a78a6a83d0d735f911aaaecf8/src/gitversion/_version.py#L21-L27 - the files in tarball and repository are different:

index 207dc691b..1c24d6422 100644
--- a/nix/store/k8vzbf7isybzyqh48fhsbf5hnn8rzdlc-fetchgit-rcf3023406969b14610df03a043fca8a078c9c195-2019-06-08/src/gitversion/_version.py
+++ b/nix/store/6m0ain6g8pwrp63676ymd62xgxw92n95-source/src/gitversion/_version.py
@@ -23,8 +23,8 @@ def get_keywords():
     # setup.py/versioneer.py will grep for the variable names, so they must
     # each be defined on a line of their own. _version.py will just call
     # get_keywords().
-    git_refnames = "$Format:%d$"
-    git_full = "$Format:%H$"
+    git_refnames = " (tag: 0.10.2)"
+    git_full = "cf3023406969b14610df03a043fca8a078c9c195"
     keywords = {"refnames": git_refnames, "full": git_full}
     return keywords

If we fall back to downloading sources from softwareheritage.org, where these macros are not expanded, the build will fail. So buildPhase should not rely on the sources being pre-processed with “git-archive”.
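
For reference, a sketch of the two fetchers side by side for the commit from the diff above, using the stock nixpkgs fetchers. The hashes are placeholders (lib.fakeSha256), so Nix will report the real ones on the first build; I have not filled them in here.

```nix
# Sketch only: use via `pkgs.callPackage ./this-file.nix { }`.
{ lib, fetchgit, fetchFromGitHub }:

let
  rev = "cf3023406969b14610df03a043fca8a078c9c195"; # tag 0.10.2, per the diff
in {
  # Tarball-based: the archive went through `git archive` on the hosting side,
  # so `$Format:...$` placeholders arrive already expanded.
  viaTarball = fetchFromGitHub {
    owner = "cryfs";
    repo = "cryfs";
    inherit rev;
    sha256 = lib.fakeSha256; # placeholder, not the real hash
  };

  # Checkout-based: files are taken as stored in the repository,
  # so the placeholders stay as `$Format:...$`.
  viaGit = fetchgit {
    url = "https://github.com/cryfs/cryfs";
    inherit rev;
    sha256 = lib.fakeSha256; # placeholder, not the real hash
  };
}
```

The two outputs necessarily get different hashes, which is another reason the switch cannot be a mechanical rename.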

Also:

  1. Those $Format macros could even expand to the current time (see "Git keyword expansion" in the Git Memo v1.1 documentation), making the tarball content volatile. So not only could the tarball itself vary with tar/gzip upgrades or command-line switches on the git hosting’s cloud, but the files inside the tarball could also change in the next instance of the tarball, with the same commit-id and tree-id.
  2. It seems that the choice of which files to exclude from the tarball, and whether to preprocess them, is an ad hoc decision of the git hosting and could change. Moving to stable ids like SWHIDs means we can no longer download tarballs and rely on preprocessing done by git hostings.

I’m really excited about this idea.

Below is a small demo. The demo is not about lang2nix: first, to avoid language-specific objections (“rust has lock files”, “there is a sbt plugin for that”, …), and second, because I have not yet implemented it for language frameworks; by the time that happens, I will probably have gotten rid of bash, and I would have trouble showing a working demo in the Nix we can all read.

The example is shaderc, whose vendored dependencies are published in a separate branch (google/shaderc at known-good) shortly before or after the release, and are maintained in nixpkgs manually: https://github.com/NixOS/nixpkgs/blob/67c4132368dd7612d5226a99ec8a2e3c1af68b76/pkgs/development/compilers/shaderc/default.nix

The setting is very similar to lang2nix, isn’t it?

{ lib, stdenv, fetchgit, cmake, python3, pkgsCurrent }:

let
  version = "2021.2";
  #         -^^^^^^- to upgrade, just change this
  #                  (and even that can be automated)

  src = fetchgit {
    url      = "https://github.com/google/shaderc";
    rev      = "v${version}";
    memoFile = ./.memo.nix; # memoFile is optional; there is a global default
    #
    # Look, ma, no `sha256`.
    #
    # There is a magic inside `fetchgit` which is explained below on example
    # of `known-good`.
    #
    # `fetchgit` is a bit more complex, there are 2 memoization steps:
    #   1. `rev`     -> `fullRev`
    #   2. `fullRev` -> (`sha256`, `commitTime`, `narsize`)
    #
    # In short, if `sha256` is in `import ./.memo.nix`, it is just used, without
    # any IFD. Otherwise, we pause here, run `git` in sandbox and mutate
    # `./.memo.nix`
    #
    # It lacks parallelism of @Ericson2314's .drv.drv, but has the advantage
    # that the memo files are local, and can (should) be placed under version
    # control, similar to the ubiquitous .lock files.
    #
    # Actually, `./.memo.nix`'s attrset is maintained in memory and
    # flushed to disk once on Nix's exit, so this is just another
    # obstacle to parallelism.
    #
  };

  # Tolerate "known-good" branch updated within a day after the release.
  # `builtins.timeToString` and `builtins.timeFromString` are guests
  # from the future.  Nothing magical here: just pure functions which
  # could be implemented in pure Nix
  commitTime-nextday =
    builtins.timeToString (builtins.timeFromString src.commitTime + 86400);
  # But look, ma, there is not only auto-maintained `src.sha256`,
  # but also `src.commitTime`.
  # and `src.fullRev`
  # and could be auto-maintained `src.swhid`, `src.ipfs`, `src.magnet`, ....

  known-good =
    builtins.head (lib.memoize {
      # `memoFile` is optional. the global default is usually a good choice
      memoFile   = ./.memo.nix;
      # `memoFile` is a Nix file with attrs set inside.
      # Here we define the keys of that attrset we are interested in
      # There could be more than 1 key (e.g. `fetchurl`'ing from multiple urls)
      memoKeys   = [ "version=${version} fullRev=${src.fullRev}" ];
      # Either Nix function (on top of functions like `builtins.fetchGit`) or
      # the code to run in sandbox when `memoFile` has no requested `memoKeys`
      # (`lib.memoize` also has `mode` which could be "all" or "any"  to tell
      # if we need values for all the `memoKeys` or for any one) or `memoKeys`
      # are obsolete (there is also `memoRevision`  to tell if it is desirable
      # to try to calculate the value again; useful for `pkgs.geoip-database`)
      calcValues =
        # `pkgsCurrent` is defined next to `pkgsi686Linux`
        # overriding `system=builtins.currentSystem`
        # this is usually `x86_64-linux` even if we build for/on something else.
        pkgsCurrent.stdenvNoCC.mkDerivation {
          # it should be actually not an IFD-derivation,
          # but `builtins.sandboxedExec`, which is not yet implemented.
          # Creating a derivation in the Nix Store is needless, and
          # reusing existing results from the Nix Store is undesirable
          name = "known-good-${toString builtins.currentTime}.nix";
          # The derivation is not FOD, so let's allow networking explicitly
          # `__allowNetworking` - another guest from the future -
          # works only in IFD-derivations.  again: it is actually
          # `builtins.sandboxedExec` simulated via an IFD-derivation
          __allowNetworking = true;
          GIT_SSL_CAINFO = "${pkgsCurrent.cacert}/etc/ssl/certs/ca-bundle.crt";
          buildInputs = [ pkgsCurrent.gitMinimal ];
          # Get the newest https://github.com/google/shaderc/tree/known-good
          # but not newer than (`src.commitTime`+1day) and then
          # store `known_good.json`'s content to `./.memo.nix`'s attrset
          # under key `memoKeys`
          buildCommand = ''
            git init
            git remote add origin ${lib.escapeShellArg src.url}
            git fetch origin known-good
            git checkout $(git rev-list -n1 --before=${commitTime-nextday} \
                           origin/known-good)

            # emit a list of the same size as `memoKeys`
            # each value corresponds to a key
            # (with memoMode="any", it is possible to return `null` for some)
            echo "[ { json = '''$(cat known_good.json)'''; # FIX: proper escape
                    } ]"  > $out
          '';
        };
    });

# and the rest is trivial...

in stdenv.mkDerivation rec {
  pname = "shaderc";
  inherit version src;

  outputs = [ "out" "lib" "bin" "dev" "static" ];

  patchPhase =
  let
    # parse JSON of
    # https://github.com/google/shaderc/blob/ee00a6bc9388acbc332b1ef2290ff6481b78b2cf/known_good.json
    p             = lib.listToAttrs (
                      map (args: lib.nameValuePair args.name args)
                          (builtins.fromJSON known-good.json).commits
                    );
    glslang       = fetchgit {
                      memoFile = ./.memo.nix;
                      url = "https://github.com/${p.glslang      .subrepo}";
                      rev = p.glslang      .commit;
                    };
    spirv-tools   = fetchgit {
                      memoFile = ./.memo.nix;
                      url = "https://github.com/${p.spirv-tools  .subrepo}";
                      rev = p.spirv-tools  .commit;
                    };
    spirv-headers = fetchgit {
                      memoFile = ./.memo.nix;
                      url = "https://github.com/${p.spirv-headers.subrepo}";
                      rev = p.spirv-headers.commit;
                    };
  in ''
    mkdir -p ${p.glslang      .subdir}
    mkdir -p ${p.spirv-tools  .subdir}
    mkdir -p ${p.spirv-headers.subdir}
    #
    # `fetchgit` by default produces tarballs, so `tar xf` instead of `cp`
    #
    tar xf ${glslang      } --strip-components=1 -C ${p.glslang      .subdir}
    tar xf ${spirv-tools  } --strip-components=1 -C ${p.spirv-tools  .subdir}
    tar xf ${spirv-headers} --strip-components=1 -C ${p.spirv-headers.subdir}
  '';

  nativeBuildInputs = [ cmake python3 ];

  postInstall = ''
    moveToOutput "lib/*.a" ${placeholder "static"}
  '';

  cmakeFlags = [ "-DSHADERC_SKIP_TESTS=ON" ];
}

I hope [RFC 0109] Nixpkgs Generated Code Policy by Ericson2314 · Pull Request #109 · NixOS/rfcs · GitHub can stimulate the development and adoption of lang2nix work.

During Summer of Nix, @DavHau started work on dream2nix, which is a framework for wrapping up the various lang2nix tools in an easy-to-use and easy-to-implement manner: GitHub - nix-community/dream2nix: Simplified nix packaging for various programming language ecosystems [maintainer=@DavHau]

This is still in the early phases, but I am hopeful this will simplify the lang2nix ecosystem and make it much easier for new tools to be created. Along with the RFCs from @Ericson2314 (like [RFC 0092] Computed derivations by Ericson2314 · Pull Request #92 · NixOS/rfcs · GitHub and [RFC 0109] Nixpkgs Generated Code Policy by Ericson2314 · Pull Request #109 · NixOS/rfcs · GitHub), this could be a nice improvement for Nixpkgs and the wider Nix ecosystem.

The badness of IFD is not (only) the incompatibility with CI.

IFD is basically an eval-time computation cached in the Nix Store. Thus, it can be GC’ed at any moment, forcing a re-run on the next eval that generates fresh Nix code to import. No one reviews that code, no one controls its re-evaluation cycles, and there is no way to undo to the old code.

@voltth I think the proposal covers that: allImportedDerivations is manual rooting of all imported derivations, so nothing need be GC’d, and everything can and should be reviewed.

and there is no way to undo to the old code

I don’t get this? The imported derivations ought to be deterministic, as we always strive for.

Don’t you plan to allow networking for IFD computations?

Only fixed output ones – just like normal.

I’ve been following this thread, and I really want to get involved to help with this. I’m primarily a Rust and Typescript programmer. When I’m developing an application, I use a shell.nix to just bring the normal language toolchains into scope, but then eventually I have to face the lang2nix problem in order to make a NixOS-acceptable installer. The extreme difficulty of this right now is a major pet peeve of mine, and one of the reasons I don’t actually try to encourage others to use NixOS.

So, I’m not sure how I can contribute, but if it’s just as simple as wrangling discussions into a coherent piece of documentation or applying experimental tools to projects I already have on hand, I’ll help.

bump

Let’s get this project moving again.

Can anyone describe to me some next steps that I could take right now? Would it be helpful if I provided an essay describing the use cases I have for this?

Let’s say I want to stabilize Rust development by a) being able to identify exact compiler versions (i.e., oxalica) and b) being able to build everything from a Cargo.lock file. Is there work I can do to help that along, and has anyone thought about how to do the same thing with package-lock.json and other lock file formats?

How about multi-lingual projects that may have multiple distinct build steps?

IMHO, the silent consensus is that Nix and Nixpkgs are too much of a Jenga tower, built by adding tons of features while trying not to break 15-year-old compatibility, so adding even a small feature is a big challenge (just look at poetry2nix and mach-nix: the majority of their code is not the business logic but hooks and overlays to adapt to existing codebase, and that glue locks both the lang2nix projects and nixpkgs even harder).
So, there is no easy answer to how to move forward. Maybe build a Nix-Lite first, as a playground sandbox for experimental features?

I’d be interested in that. We could target certain languages as a proof-of-concept (say, Python, Go, Rust, Haskell) and then RFC it.

What would be the strategy? I don’t know what the consensus is from reading the discussion. Myself, I would be most inclined towards a strategy that reads the language’s lockfile directly into Nix and evaluates from there. But I’ve had better luck with tools like crate2nix. OTOH, I also see the comment that danielkd wrote about how Cargo.lock doesn’t actually contain all of the necessary information (I’m not actually sure that’s the case… I have a project that includes reqwest, with json, and I see serde_json included in the Cargo.lock).

This is the kind of thing that I would experiment with, but I need some idea at first of what direction we should go, and some idea that the community would be interested if I got started.

the majority of their code is not the business logic but hooks and overlays to adapt to existing codebase

I disagree with this statement.

Most of the poetry2nix code base is overrides which are mostly fixing the fact that Python dependencies are not aware of native dependencies. This is about half of the code base.

The rest of the code is split between business logic domains (rough estimates):

  • About 10% is API surface
  • Python environment markers, wheel parsing, external fetchers, etc. account for about 20-25%
  • Hooks are another 5% or so
  • Another few percent are shared across CI, small random utilities and so on.

As you can see that’s nowhere near “the majority”.
