Strong opinion: Library packaging

This turned out to be pretty long. Please read it from a kind of pugnacious Socratic perspective. I have strong opinions about how language library distribution works and ought to work, but I’ll admit I don’t have my arms around the whole problem (and am skeptical that anyone’s arms are long enough by themselves, honestly). With that foreword, “O Socrates…”

I just saw the suggestion that there be a rubyPackages set akin to the haskellPackages set. ([meta] The perfect ruby development platforms · Issue #13945 · NixOS/nixpkgs · GitHub) and it touches on a line of reasoning I’ve been trying to find a time and place to discuss.

My experience so far with nixpkgs is that the library packaging strategies follow one of two basic approaches. First is the one used for Haskell and Python: maintain a kind of “sub-distro” of libraries, and applications refer to members of e.g. haskellPackages to get their dependencies. The other approach is endemic to Ruby and Node: application dependencies are transformed into a local expression and added in an ad-hoc fashion to the repo, typically as a support expression (e.g. gemset.nix). I’ve been thinking of these approaches as “distro-scoped” and “app-scoped” dependencies.

As I understand it, the motivation for distro-scoping dependencies is that it reduces the sheer number of expressions in nixpkgs and the lines of code executed in running systems (since installed applications share their dependencies), and security updates are easier to make if there’s a single expression for a given package.

From my point of view, one of the chief virtues of Nix is that dependencies are scoped to the execution context. The code executed in a single process can be determined completely by the dependencies required by the executable. It’s a system-wide application of the technical advance that took us from gem install to bundle exec. (The language specificity of that example is ironic, I think: we’re all coming at this issue from the perspective of the languages we’re used to, and it’s surprised me more than once when I’ve had to tell the story of Rubygems and the glacial migration to Bundler.)

(There’s a whole other paean to Nix that is: why are we reinventing library packaging/development environments/etc for each language, when right here is the universal adapter, but I digress. (I mean, one of my first encounters with Nix was trying to address the problem of Erlang packaging…))

The result of choosing one approach over the other is to pick a position on a continuum: on the app-scoped end, it’s possible that we have a unique expression for each dependency for each consuming application. On the distro-scoped end, it’s possible that we squash every version of a dependency down to a single ideal version, and consumers use that.

To try to find a middle ground on that continuum, I spent a little time considering a kind of version satisfaction tool, where every Ruby program in nixpkgs would be considered a “root”, and we’d try to minimize the number of gem expressions that still satisfied all the gemspecs and Gemfiles of the applications. It was an entertaining half-hour, generalizing version SATs to multiple roots.

But the result of such a system would be Ruby programs that ship in Nix with dependency versions different from any they were ever tested with. And in truth, modern Ruby programs tend to ship with a Gemfile.lock that exactly specifies their dependencies (as opposed to providing ranges; the bundle tool satisfies the ranges specified in a Gemfile and the transitive dependencies in the resolved gemspecs).

To apply distro-scoping to Ruby, we’d need to either ignore their lock files (which is a recipe for wasted packaging time) or respect them, use bundix to produce gemset.nix files, and then union those files into a single ruby-modules.nix. I start to lose the thread, though, of why we’d do that.

Most crucially, if Nix packages a different dependency set, we can’t make real guarantees about the behavior of the software. We’re abrogating the prerogative of the application author to specify those dependencies, which I think is reasonable for them to demand in exchange for supporting the software.

For languages where there exists a distribution repository (e.g. rubygems.org, pypi, cabal (I think? I’m a Haskell neophyte)), it seems like the nixpkgs $lang-module.nix becomes a labor-intensive mirror of part of the index of those repositories. The “partial mirror” can’t be avoided, but the labor-intensive part certainly can, by following the model of tools like bundix or npm2nix. (For the record, these are opinions I was forming well before (jira-cli) init at 2.2 by nyarly · Pull Request #30833 · NixOS/nixpkgs · GitHub but that didn’t help :))

In short, on the one hand I do believe that there should be a single approach in Nixpkgs to packaging libraries and applications written in languages with their own package/distro/repo ecosystem. I just happen to believe that it should look more like Ruby (and Rust) (and maybe Node? But, like, with so many provisos…) than like haskell-packages.nix.


(Do tell if it isn’t relevant)

One thing I was thinking about just two hours ago is how it’s not possible to use a Ruby gem with Nix without going the whole Bundler way. I mean, AFAIUI it’s possible to write a Python script, use the proper -p pythonPackages.lib in a nix-shell, and start hacking, right? I would have loved to -p ruby_2_5.packages.nokogiri or something else.

Could it be possible to get a system that works both ways? Where NixOS handles Bundler dependencies with specific versions for software (using their generated gemset.nix file), but for each gem, provides the recent revisions in a packages set? This would (I hope) ease the admittedly low burden of spinning up a Bundler environment just to write a quick script.

Maybe provide a selection of gems (and dependencies) in there? Mirroring the whole repo (even in the other cases) may not be the best solution?

But in the end, I have the strong opinion that gemset.nix is the right approach: using dependencies as the author defined them is the way to go for packaged end-user software, and for libraries too.


Initially I just wanted to quickly write down some of my thoughts, but it turned into a pseudo-blogpost. I’ll probably clean this up and publish it later as a series; I just don’t have time to expand on everything properly atm.

After thinking more about it, I think a big disconnect we have is in the way people see their dependencies.

C and C++ projects lack a unified package manager for their language; this has traditionally been handled by distributions, given that they themselves consist largely of applications written in C. C++ piggy-backed along, being used mainly for the UI parts and larger applications.

That means that for people using C/C++ it’s natural to expect their distribution to have all libraries they might want to use, but I don’t think this extends far beyond that community.

I’m generally biased towards the approach of having a dependency list for each application in nixpkgs, and I don’t think having it as a general library package repository (other than maybe for C/C++ lacking alternatives) is very scalable in terms of human effort and repo size.

While it’s super convenient to be able to run a nix-shell with some version of the library you’d like to use without having to create any nix file for it, I don’t think this benefit is what the majority of users value about Nix. I might be wrong about this of course, but since none of the languages I use have libraries in nixpkgs, I can’t say I’ve missed it much.

One solution to that is having fully automated repositories outside of nixpkgs, maintained by their respective communities. Then instead you’d write e.g. nix-shell '<ruby-2.5>' -p nokogiri-1.8.2 (which in turn depends on <nixpkgs>.libxml2 etc), and the only difference is that you’d have to subscribe to a specific ruby channel or pass the full URL to that channel. I don’t think that’s an undue burden for such a feature, and it might actually boost our ecosystem by having development more distributed and promoted outside of nixpkgs.

Since I started using Nix, I’ve written many half-assed nix wrappers for a bunch of languages, and in all cases I favor using the dependency resolution of the package manager used by the application authors, simply because sometimes it’s very domain specific and hard to get right. Plus, nix doesn’t provide any dependency discovery and resolution mechanism other than the nix-prefetch-* tools, which aren’t exactly performant or ergonomic.

A third way

I think we’re missing a third option here. A tool that generates the needed derivation on the fly and loads it into a nix-shell or creates a default.nix with it on demand based on the existing language-specific lockfile.

Initially this would be slow, given that there are no caches right now, but I don’t think it’d be prohibitive given enough feedback to the user about progress.

What would be needed for that is an intermediary layer for each 3rd party package manager that does the dependency resolution and provides a solution to us.
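To make that concrete, here’s one purely illustrative shape such a layer’s output could take (all field names are my invention), so that every resolver, whatever the language, hands Nix the same kind of data:

{
  resolver = "bundler";   # which 3rd-party package manager produced this
  roots = [ "rails" ];    # what the user asked for
  solution = [            # the fully resolved, pinned dependency graph
    { name = "rails"; version = "5.2.0"; sha256 = "..."; deps = [ "rack" ]; }
    { name = "rack";  version = "2.0.5"; sha256 = "..."; deps = [ ]; }
  ];
}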

Here’s where things get tricky, because in many languages you don’t have easy access to this solution: you either let the package manager do its thing with full network and disk access, or it dies.

Now there’s a few things that could possibly make life easier:

The Lockfile

What I would propose is a common lockfile format for all languages, which ideally would be written in .nix format so we can use it on ofborg. From what I’ve seen they all address the same issues in slightly different ways, but all want the same result: a reproducible build with fixed version dependencies.

In some languages the files look something like this:

{
  foo = {
    type = "github";
    repo = "example/foo";
    commit = "fff...";
    sha256 = "fff...";
    dependencies = ["bar"];
  };
  bar = {
    type = "rubygems";
    version = "1.0";
    sha256 = "fff...";
  };
}

In Haskell they look like this:

  "Dust-crypto" = callPackage
    ({ ... }:
     mkDerivation {
       pname = "Dust-crypto";
       version = "0.1";
       sha256 = "112prydwsjd32aiv3kg8wsxwaj95p6x7jhxcf118fxgrrg202z9w";
       libraryHaskellDepends = [
         base binary bytestring cereal containers crypto-api cryptohash
         directory entropy ghc-prim network random random-extras random-fu
         random-source skein split threefish
       ];
       librarySystemDepends = [ openssl ];
       testHaskellDepends = [
         base bytestring cereal Dust ghc-prim HUnit QuickCheck
         test-framework test-framework-hunit test-framework-quickcheck2
         threefish
       ];
       description = "Cryptographic operations";
       license = "GPL";
       hydraPlatforms = stdenv.lib.platforms.none;
     }) {inherit (pkgs) openssl;};

As you might notice, that’s not a lockfile; that’s already a full derivation. Not just the source, but also all its dependencies, both Haskell and system, must be in the same scope.

I would argue that this, while quite impressive, isn’t something people want to look at. They want a simple list of dependencies, not the dependencies of their dependencies and so on.

The “issue” here is that each derivation depends on other library derivations directly, so they need to be actual derivations, and that requires this kind of complexity in the “lockfile”. On the other hand, your actual buildInputs can then simply use this as Dust-crypto and you’re done with it, which is quite nice.

We approached that in Ruby using a global default configuration called defaultGemConfig, which lives in parallel to the gemset.nix and captures most of the things people expect by default to happen so a gem is usable (but can still be modified when passing it to bundlerEnv).

Since we talked about nokogiri already, here’s the entry for it:

  nokogiri = attrs: {
    buildFlags = [
      "--use-system-libraries"
      "--with-zlib-dir=${zlib.dev}"
      "--with-xml2-lib=${libxml2.out}/lib"
      "--with-xml2-include=${libxml2.dev}/include/libxml2"
      "--with-xslt-lib=${libxslt.out}/lib"
      "--with-xslt-include=${libxslt.dev}/include"
      "--with-exslt-lib=${libxslt.out}/lib"
      "--with-exslt-include=${libxslt.dev}/include"
    ] ++ lib.optional stdenv.isDarwin "--with-iconv-dir=${libiconv}";
};
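For context, here’s roughly how a project consumes that configuration: bundlerEnv falls back to defaultGemConfig, and its gemConfig argument lets you extend it (mygem is a made-up example; bundlerEnv, defaultGemConfig and openssl are assumed to come from pkgs):

bundlerEnv {
  name = "my-app-env";
  gemdir = ./.;  # directory with Gemfile, Gemfile.lock and gemset.nix
  gemConfig = defaultGemConfig // {
    # hypothetical gem needing an extra C library
    mygem = attrs: { buildInputs = [ openssl ]; };
  };
}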

Now, the thing is, Nokogiri probably won’t switch to a different way of configuring anytime soon; I know it hasn’t changed in the past decade or so, and I expect that to stay the same unless they rewrite it from scratch. And most popular libraries are like that: they have one configuration that works, and little is needed outside of that.

For the other library dependencies of the gem, we simply look up the string keys in their dependencies list, which is a bit slower, but means that the dependency specification and its derivation can live apart from each other, and the derivation can be generated dynamically.
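Purely as a sketch of that dynamic generation (buildGem is my invention; bundlerEnv does the real work in nixpkgs), assuming every gemset entry is rubygems-style with a version, sha256 and dependencies list:

{ pkgs ? import <nixpkgs> { } }:
let
  gemset = import ./gemset.nix;
  buildGem = name:
    let spec = gemset.${name}; in
    pkgs.stdenv.mkDerivation {
      pname = name;
      version = spec.version;
      src = pkgs.fetchurl {
        url = "https://rubygems.org/gems/${name}-${spec.version}.gem";
        inherit (spec) sha256;
      };
      # the string keys only become derivations here, so the gemset itself
      # stays plain data
      buildInputs = map buildGem (spec.dependencies or [ ]);
      dontBuild = true;
      installPhase = "mkdir -p $out && cp -r . $out";
    };
in
pkgs.lib.mapAttrs (name: _: buildGem name) gemset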

I’m not sure there’s a good super-general solution for all languages for the lockfile, but some convergence would be nice to be able to build better tools that are user-friendly and flexible.

I’m also not aware of any other language that has something like the LTS system in Haskell. Many languages rely more and more on decentralized dependency management, where you fetch them directly from their source, so a comprehensive list of them ranges from hard to impossible. And it still doesn’t cover things that are only privately available.

Distributed language libraries

So, the most pragmatic solution comes down to generating a lock for each application, and having it distributed alongside the derivation. I’ll go through some of the languages.

Javascript

Number of libraries: ~700,000

That brings us to good old Javascript, where applications tend to have thousands of dependencies, and that hinders their adoption into nixpkgs. So in theory, if we had a single source of npm packages, we could simply add the application itself, say what it depends on, and be done.

However, the large number of possible libraries plus their frequent updates means we’ll have a lot of churn if they get added to nixpkgs. I’m also not aware of an efficient way to get a list of all the packages of npm, and don’t think they provide a DB dump like rubygems does.

We’ve got 3 major projects for dealing with JS applications, node2nix, yarn2nix, and yarn2nix (yeah, god knows how that happened). They each use different approaches, are compatible with different codebases, and are configured differently.

I think there’s a lot of room to improve here, but it doesn’t help that JS has multiple package managers (like bower, jspm, component, duo, etc…) where in some cases it’s very hard to emulate them in Nix.

There is also no shared configuration for packages, especially ones that require native dependencies or come with precompiled binaries that have to be patchelf’d for use on NixOS.

Ruby

Number of libraries: ~143,000

I’ve experimented with creating derivations for every gem from their weekly DB dumps, but didn’t have enough time or reason to continue with it. The code should still be around somewhere, and in theory it might be useful to someone.

But the average application requires a handful of those, and even the biggest Rails applications I know of use maybe 200-400. That makes the overhead of fetching all package definitions quite large, and adds a lot of dead weight to nixpkgs that still has to be maintained.

We still have issues where people depend on the bundler gem in nixpkgs directly, without taking into account the version of bundler the application specifies, and that causes a lot of headaches. In hindsight I think it was a mistake exposing it directly like that.

Overall I think the bundix + bundlerEnv approach here has been a success so far, consolidation of effort definitely paid off and made the life of everyone easier.

Crystal

Number of libraries: ~3,240

Not a ton of libraries here, but also not many popular applications written in it. I’ve also written a wrapper for this in order to package the Mint language written in Crystal.

I think we could offer big benefits to the Crystal community by having a tool for them that makes static compilation trivial, because right now it’s tough without Nix.

Elm

Number of libraries: ~1,072

While I haven’t made a separate project for this yet, I’ve packaged a few Elm applications using their lockfile and some simple prefetching, just like bundix. From there it’s simply a matter of building a directory tree that matches what their packaging system does.

Addition to nixpkgs would be entirely reasonable were it not for the low demand.


I think we’re thinking along similar lines. The idea of a lingua franca for application dependencies is a really excellent one. @samueldr’s point about Python hacking (nix-shell -p ...) is cogent as well, I think.

So, here’s what I’d like to propose:

Two general types of expression. (I don’t know if they need to be formal types or just “the set accepted by this function”.) First, something like haskellPackages or python.modules - a set of sets, a la:

{
  libraryname = {
    "13.17.1" = {
      source = "github";
      sha256 = "...";
      #...
      buildInputs = [ libzip sqlite libxml2 ];
    };
    #...
  };
  otherLib = {
    "20180530" = {
      source = "langrepo";
      sha256 = "...";
    };
  };
  #...
}

Notably, languagePackages shouldn’t care about the library-to-library dependencies of libraries - the application lockfiles already carry the full transitive set.

Based on the experiences of Python packagers, each set of library modules should probably be a sub-tree of files, each imported into the top level packages set, as a hedge against merge collisions.
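For instance, the top-level set for a language might just stitch together one file per library (layout hypothetical):

{
  libraryname = import ./libraryname.nix;
  otherLib = import ./otherLib.nix;
  # one file per library, so parallel additions rarely collide in merges
}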

Second, a generalization of defaultGemConfig, to allow overrides of the derivations produced by the languagePackages. The nokogiri attribute from @manveru’s example is perfect.

From there, applications assemble their “lockfile” by doing something like:

stdenv.mkDerivation {
  #...
  buildInputs = [ postgres ] ++
    (langInputs language.modules [ "libraryname-13.17.1" ]);
}

The nice thing about all this is that I think it can all be generated automatically. We leave it to per-language tools to satisfy dependency files and produce lockfiles. The tools already produced and in use by the language communities should be used, and any gaps filled on our side.

In this scheme, a tool like bundix would examine Gemfile.lock, conceivably running bundle lock under the covers to generate it if needed (as it does when run as bundix -m). The result would be a list of strings (names of gems with versions, in this case) in a lockfile.nix suitable for import. In the same operation, bundix would consult the ruby.modules set and see which members of the lockfile are already provided, and produce a separate gemset.nix set of the missing items.
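In that scheme, the generated lockfile.nix could be as simple as a list of name-version strings (contents hypothetical):

[
  "libraryname-13.17.1"
  "otherLib-20180530"
]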

The idealistic result, then, would be that an in-development expression for a Ruby app would look like:

stdenv.mkDerivation {
  #...
  buildInputs = rubyInputs (ruby.modules // import ./extraset.nix) (import ./lockfile.nix);
}

and that the expression for a Haskell app would look like:

stdenv.mkDerivation {
  #...
  buildInputs = haskellInputs (haskell.modules // import ./extraset.nix) (import ./lockfile.nix);
}

I think there’s a lot of potential to something like this, including standardizing tooling and naming around it, reduced complexity for onboarding, and transferability of skill across languages within Nix.


Because I know they have a stake in this discussion, I’d like to hear from @zimbatm and @FRidh. I’m sure there are other parties with useful input here, but I don’t know who they are. :)

A standard sounds like a great idea. Usually we just want to create a single directory that includes many packages, but do not want to include all binaries. It would help to have a single format to convert .lock and package definitions to, so that we can reuse most of our tooling.

I just want to add that you might want to make a function of each of the packages, so that each package can have its own version of buildInputs, like so:

    "13.17.1": { libzip, sqlite, libxml2 }: {
      source = "github",
      sha256 = "...",
      #...,
      buildInputs = [libzip, sqlite, libxml2];
    };

Something like callPackage would resolve those automatically, but it should also be possible to override each package.

Bah, I missed the edit there.

The idea, from defaultGemConfig, would be to do something like:

{
  libraryname = {
    "13.17.1" = {
      source = "github";
      sha256 = "...";
      #...
    };
    #...
  };
}

and elsewhere (e.g. language.overrides):

{
  libraryname = attrs: {
    buildInputs = [ libzip sqlite libxml2 ];
  };
}

The langInputs function would be responsible for fetching from the former, setting up a derivation, and then passing it to the latter function to have e.g. buildInputs provided and merged in.
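A rough sketch of what langInputs could look like, with all names hypothetical, the version parsing deliberately naive, and each module entry assumed to carry a url and sha256 (a real implementation would dispatch on the source type):

{ pkgs, overrides ? { } }: modules: locked:
map
  (entry:
    let
      # "libraryname-13.17.1" -> name and version, split on the last dash
      parts = builtins.match "(.*)-([^-]+)" entry;
      name = builtins.elemAt parts 0;
      version = builtins.elemAt parts 1;
      spec = modules.${name}.${version};
      base = {
        pname = name;
        inherit version;
        src = pkgs.fetchurl { inherit (spec) url sha256; };
      };
      # merge in the human-maintained override, if there is one
      extra =
        if builtins.hasAttr name overrides
        then overrides.${name} base
        else { };
    in
    pkgs.stdenv.mkDerivation (base // extra))
  locked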

The motivation here is two-fold:

First, that kind of variation from a “simple” library package tends to hold true over the lifetime of the library. The example that @manveru gave of Nokogiri needing the same build configuration for more than 10 years… The attrs argument is sufficient, I think, to do one-offs or “only add these buildInputs after version X” or whatever behavior.

And second, those variations are generally not something that a tool can determine automatically. So the overrides.nix file would need to be maintained by humans anyway, but at a much lower rate of change.

Brain dump ahead:

On package sets

One observation I would like to expand on is that dependency resolution is generally dependent on the target packaging ecosystem. Npm allows multiple versions of the same package to be installed for the same program, so there is almost no force that pushes the developer to find stable interfaces. C doesn’t have a package manager, so it pushes developers to be more careful with API breakage. Given that, it’s much easier to find a common version of a C library that works for all depending programs. In the middle we have Bundler, Pip and Cabal, which have SAT solvers, and thus each program might use a different set of dependencies.

Related to that, there is also the compiled vs interpreted distinction. When bumping a Haskell library, if nox-review passes I have a much stronger guarantee that the other programs depending on the same library will likely work at runtime. Compare that to Python, where I gave up a couple of times on packaging an application because it would have been too much work to make sure that all the other Python programs would still work after the upgrade.

Given that I would be inclined to say that all non-C programs should have per-application dependencies generated. Haskell is probably OK but still introduces some complexity for resolving the common set when adding new programs.

Here I only talk about application distribution, which should be the role of nixpkgs. Providing development dependencies and a binary cache is not something nixpkgs should do. Most of the time, as @manveru noticed, the package set is much larger than any user would need. We would then be confronted with resolving broken packages and trying to figure out whether the issue is with nixpkgs or the developer has released a broken version, and this in perpetuity if we support all versions of each package. Providing a binary cache to developers is best solved by making it easy to set up per-project binary caches, where only the subset of packages that the developer uses is uploaded.

Package sets still have a role to play, like Stackage for Haskell, and Nix could be a great tool to help them build all the combinations of packages, but I think this needs to happen out of tree.

Potential future for nixpkgs

Okay, so imagine we don’t have package sets anymore, except for C libraries. Each non-C application bundles its own dependency set now. We also adopt the idea of common overrides that @manveru explained above to inject C dependencies. This creates some new problems:

  1. nixpkgs checkouts become much heavier if every application also needs a lock.nix file.
  2. the system becomes fatter because we install more dependencies for each program.
  3. developers can’t use ghc.withPackages or python.withPackages anymore.

(potential benefit: we don’t need recurseIntoAttrs anymore and can maintain a flat namespace?)

Those are the only issues I can think of right now; is there anything else?

To mitigate (1) we need two things: (a) a tool that can transform the language-specific lock file to nix in a pure fashion, with no network access, and (b) import-from-derivation. With that it’s possible to use the upstream lock file and transform it at evaluation time. As a fallback we can replace it with a fetcher, like with Rust where a cargoSha256 needs to be provided, but that’s less ideal. This would eliminate most of the lock files from nixpkgs.
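Purely as a sketch of how (a) and (b) could combine, with gemfile2nix standing in for the hypothetical pure converter:

{ pkgs, gemfile2nix, src }:
let
  # (a) transform the upstream lockfile to Nix with no network access...
  gemsetDrv = pkgs.runCommand "gemset.nix"
    { nativeBuildInputs = [ gemfile2nix ]; }
    ''
      gemfile2nix ${src}/Gemfile.lock > $out
    '';
in
# (b) ...then import-from-derivation builds gemsetDrv during evaluation and
# imports the generated expression, so no lock.nix needs to be checked in
import gemsetDrv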

To mitigate (2) we need to make sure that two packages depending on lib-1.2.3 would use the same derivation. If lib-1.2.3 is declared twice but with all the same build inputs, it should result in the same derivation output. This is possible to do with Ruby but difficult with Python and Node.
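A small illustration of why identical declarations collapse (fakeSha256 is just a placeholder): the two mkDerivation calls below produce the same .drv, hence one shared store path:

{ pkgs ? import <nixpkgs> { } }:
let
  src = pkgs.fetchurl {
    url = "https://example.org/lib-1.2.3.tar.gz";
    sha256 = pkgs.lib.fakeSha256;
  };
  a = pkgs.stdenv.mkDerivation { pname = "lib"; version = "1.2.3"; inherit src; };
  b = pkgs.stdenv.mkDerivation { pname = "lib"; version = "1.2.3"; inherit src; };
in
# independent declarations, one store path
assert a.outPath == b.outPath; true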

For (3) we would have the external package set overlays and the common overrides.

Letting upstream use Nix

As a side note, and related to point (1), it would be wonderful if the upstream developer could benefit from using Nix as well. Imagine the author of github.com/foo/bar already had a default.nix file defined; why is it necessary to re-declare the same thing inside of nixpkgs?

All we’d have to do in nixpkgs would be:

bar = callPackageTwice (fetchFromGitHub {...});

It’s the same argument as letting the upstream developer be responsible for controlling the dependency set (and security updates).

Now nixpkgs is just a stdlib + a big set of fetchers pointing directly to upstream (+ the usual overrides for patching). This would definitely make nixpkgs thinner while also simplifying the packaging process.

The next thing is that if the fetchers are just a bunch of metadata we can now easily write some scripts:

$ ./scripts/update-all
Looking at all package metadata and updating to the latest version
$ ./script/package-rubygems my-gem
Creating package for my-gem (last version: 1.2.3)

I don’t know why it should be any more complex than that for the common case.

Unifying the fetcher interface

The last thing related to all of this is that I believe the fetcher interface should be unified somehow. This is related to all the auto-updaters, and also to the observation that the package name and version actually come from the source that is being fetched.

Right now most packages are composed of two derivations: the main package installation description and a fetcher (and the dependencies). And we always do this dance with the version number.

Instead of:

{ stdenv, fetcher }:
stdenv.mkDerivation rec {
  name = "foo-${version}";
  version = "1.2.3";

  src = fetcher {
    pname = "foo";
    inherit version;
    sha256 = "...";
  };
}

We could split this up:

foo-src-1.2.3.nix

{ fetcher }:
fetcher {
  pname = "foo";
  version = "1.2.3";
  sha256 = "...";
}

mkFoo.nix

{ stdenv, foo-src }:
stdenv.mkDerivation {
  src = foo-src;
  # name is inherited from foo-src.name
  # meta is inherited from foo-src.meta, which includes the package version
}

Notice how the fetcher is now much closer to pure metadata that can be encoded in JSON. And also notice that it’s now easy to have multiple versions of foo built on that generic builder. Composition FTW.
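For instance, following the hypothetical file naming above (foo-src-2.0.0.nix being a second, made-up version file), a package set could carry:

# assumes fetcher is available in the callPackage scope
foo_1_2_3 = callPackage ./mkFoo.nix {
  foo-src = callPackage ./foo-src-1.2.3.nix { };
};
foo_2_0_0 = callPackage ./mkFoo.nix {
  foo-src = callPackage ./foo-src-2.0.0.nix { };
};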

Ok, that’s all for now :)


The work of making an all-encompassing package set could be reduced by packaging the latest version of everything along with the dependencies that make it work. The policy would be to only package an old version if it was necessary to support the latest version of another piece of software. This would be a lot less work, and it would reduce the concern of supporting versions forever.

Is the C situation really good enough that it should be treated specially under this plan? Are there other reasons to keep it as a package set?


One of the things that I like about the simple package set approach (i.e. that each library-version entry is simply “here’s where to fetch it from”; anything more complex comes from @manveru’s defaultGemConfig-style function sets) is that per-application sets can be merged with it very easily, and the lock.nix files throughout nixpkgs can be consulted to determine which versions are still being used.

In other words, “assemble the distro-scope set” and “prune unused versions” can both be made automated tasks. Maybe ofborg can be press-ganged into verifying that the package sets are precisely what’s required to satisfy all the lockfiles.
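Purely as a sketch, assuming the modules layout proposed earlier and a list of every lockfile.nix path gathered from the tree, the pruning half might be:

{ lib, modules, lockfiles }:
let
  # every "name-version" string mentioned by any application's lockfile
  used = lib.unique (lib.concatMap import lockfiles);
in
# keep only the module versions some lockfile still refers to
lib.mapAttrs
  (name: versions:
    lib.filterAttrs (version: _spec: lib.elem "${name}-${version}" used)
      versions)
  modules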

The consequence is that adding and maintaining language packages becomes really easy - like @zimbatm was suggesting.

My one concern here is that it seems to be at odds with the existing design of Nixpkgs - expressions refer to specific hashes of source code; that’s why we can’t just have expressions like “fetch this Ruby package and use its Gemfile.lock.”

What I suddenly realize, though, is that your fetcher metadata includes a sha256 - so the source and its expression are locked down in precisely the same way. That’s a really interesting idea that I think merits its own discussion.