Easy source filtering with file sets

Have you ever tried to filter a derivation source using functions like builtins.path or lib.cleanSourceWith? If so, you probably already wrote your own helper function to make it easier, because it’s really hard to get it right!

Sponsored by Antithesis :sparkles:, I’ve been developing a new approach to filter local sources, with the goal of making it easier, safer and more flexible to do so.

With this post I’m asking for feedback to figure out whether the interface of the current draft achieves those goals while satisfying the necessary use cases, or whether some changes are still necessary. To that end I’ll give a brief introduction here and show you how you can try it out.
This idea is implemented here: [WIP] File set combinators by infinisil · Pull Request #222981 · NixOS/nixpkgs · GitHub

Update: The file set library is now mostly implemented and usable, with minor changes compared to what this post describes! See File set library tracking issue and feature requests · Issue #266356 · NixOS/nixpkgs · GitHub for status, updates and feature requests.

Trying it out

The PR can be loaded into a nix repl as follows. We also set fs = lib.fileset for convenience.

nix repl -f https://github.com/tweag/nixpkgs/tarball/file-sets

Welcome to Nix 2.15.1. Type :? for help.

Loading installable ''...
Added 19162 variables.

nix-repl> fs = lib.fileset

Flakes

To temporarily override the default nixpkgs input in your flake.nix:

nix build --override-input nixpkgs github:tweag/nixpkgs/file-sets

I then recommend also defining fs = inputs.nixpkgs.lib.fileset for convenience.

Overview

The PR implements a file set abstraction, which, as you might expect, allows representing sets of files. Common set operations are supported, including:

  • Union: fs.union a b / fs.unions [ ... ]
  • Intersection: fs.intersect a b / fs.intersects [ ... ]
  • Difference: fs.difference a b
  • Filtering: fs.fileFilter predicate a

Examples:

let

  # The file ./Makefile and recursively all files in the ./src directory
  a = fs.union ./Makefile ./src;

  # Recursively all files in the ./. directory that are not in the ./tests directory
  b = fs.difference ./. ./tests;

  # Recursively all Nix files in the ./. directory
  c = fs.fileFilter (file: file.ext == "nix") ./.;

  # Recursively all Nix files in the ./src directory
  d = fs.intersect ./src c;

in null

To see which files are included in a file set, you can use fs.trace:

nix-repl> fs.trace {} (fs.union ./Makefile ./src) null
trace: /home/user/my/project
trace: - Makefile (regular)
trace: - src (recursive directory)
null

Notably none of these operations actually import these files into the Nix store!
Instead the only way to get the files to be imported, and therefore usable in derivations, is to use the toSource function:

# Can be used as the `src =` of a derivation
fs.toSource {
  root = ./.;
  fileset = fs.unions [
    ./Makefile
    (fs.fileFilter (file: file.ext == "c") ./src)
  ];
}
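
As a minimal sketch of how this might be wired into a package, the expression below passes the result of toSource as the src of a derivation. It is written against the lib.fileset interface as released in Nixpkgs rather than the draft branch, so the predicate matches on file.name instead of the draft's file.ext. The derivation name, the ./Makefile and ./src paths, and the assumption that the Makefile has an install target honoring PREFIX are all hypothetical.

```nix
{ pkgs ? import <nixpkgs> { } }:
let
  inherit (pkgs) lib;
  fs = lib.fileset;
in
pkgs.stdenv.mkDerivation {
  # Hypothetical package name, for illustration only
  name = "fileset-example";

  # Only ./Makefile and the C files under ./src are imported into the store
  src = fs.toSource {
    root = ./.;
    fileset = fs.unions [
      ./Makefile
      (fs.fileFilter (file: lib.hasSuffix ".c" file.name) ./src)
    ];
  };

  # Assumes the Makefile installs into $(PREFIX)
  makeFlags = [ "PREFIX=$(out)" ];
}
```

Because only the selected files are imported, editing any other file in the directory leaves src unchanged and should not cause a rebuild.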

These are some of the core functions, but more are available. The best way to explore them is to build the manual locally and open the lib.fileset reference section in your browser:

nix-build '<nixpkgs/doc>' -I nixpkgs=https://github.com/tweag/nixpkgs/tarball/file-sets

firefox result/share/doc/nixpkgs/manual.html#sec-functions-library-fileset

Goals, limitations and alternatives

The goal of this abstraction is to be able to precisely specify which files should have an effect on your derivation builds. Doing this should be straightforward, with obvious semantics, explanatory error messages and good performance.

In order to achieve this, some limitations are imposed:

  • Only local files at evaluation time are supported. Files in Nix store paths are not supported.
    Rationale: The path expression-based interface would be hard to use; might require IFD; without CA, original files would still be imported.
    Alternative: Use build-time tools to create a new derivation with the desired layout.

  • Empty directories cannot be represented.
    Rationale: It’s not obvious what the semantics should be if this were allowed, since it could no longer be explained in terms of set operations.
    Alternative: fs.toSource supports an extraExistingDirs argument which can be used to ensure certain directories exist in the resulting Nix store path.

File sets are intended as a replacement for builtins.path-based filtering and the lib.sources functions.
In contrast, file sets are not a replacement for functions like pkgs.nix-gitignore, gitignore.nix, Flakes’ tracked-by-git filtering or fetchGit.
However, file sets can serve as a more performant and composable foundation to implement such functions on top of.


If this is something you could benefit from, please give it a try and use this thread for any questions or feedback about the interface!
For the implementation, see the draft pull request.


Nice, it looks like a more featureful version of GitHub - numtide/nix-filter: a small self-container source filtering lib

In terms of UX, it might be worth adding a glob function. It could make those expressions a lot shorter for the common case. Eg:

fs.toSource {
  root = ./.;
  fileset = fs.globs [
    "Makefile"
    "**/*.nix"
    "!nix/source.nix" # negative filter
  ];
}

Note that docs are now published at Nixpkgs Manual.


With such a proposed function, cases where the result isn’t clear quickly arise. E.g. should this include foo/bar or not?

fs.globs [
  "!foo"
  "foo/bar"
]

And is foo in the local directory relative to the current Nix file or somewhere else?

We could have a very similar interface like this which avoids these problems:

fs.globDifference ./.
  [
    "Makefile"
    "**/*.nix"
  ]
  [ 
    "nix/source.nix"
  ]

However, this still wouldn’t be very composable, since it requires a path as the first argument and lists of strings as the second and third. Additionally, it requires learning the globbing string DSL.

Instead we can get the same result using simple combinators and a convenience function fs.fileExtFilter (not introduced in the PR yet):

fs.difference
  (fs.unions [
    ./Makefile
    # Currently the PR only supports this with
    # (fs.fileFilter (file: file.ext == "nix") ./.)
    (fs.fileExtFilter [ "nix" ] ./.)
  ])
  (fs.unions [
    ./nix/source.nix
  ])
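
For reference, here is a sketch of the same selection as a complete expression against the lib.fileset interface as released in Nixpkgs. The extension check uses lib.hasSuffix on file.name, which is one assumed way to express it rather than the fs.fileExtFilter helper mentioned above, and ./Makefile and ./nix/source.nix are the hypothetical paths from this example, so they need to exist for it to evaluate.

```nix
let
  lib = (import <nixpkgs> { }).lib;
  fs = lib.fileset;
in
fs.toSource {
  root = ./.;
  fileset = fs.difference
    # Everything we want: the Makefile plus all Nix files, recursively
    (fs.unions [
      ./Makefile
      (fs.fileFilter (file: lib.hasSuffix ".nix" file.name) ./.)
    ])
    # Minus the one file we do not want
    ./nix/source.nix;
}
```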

No, as traversal of foo is already forbidden.

Swapping items though would include foo/bar and then exclude everything else from foo.

This at least is how it works for most backup programs I have worked with.

While we can define semantics for !, it always leads to tricky cases. Here’s an excerpt from man gitignore, giving an example for ! semantics:

Example to exclude everything except a specific directory foo/bar (note the /* - without the slash, the wildcard would also exclude everything within foo/bar):

$ cat .gitignore
# exclude everything except directory foo/bar
/*
!/foo
/foo/*
!/foo/bar

I’d argue this is definitely not obvious, and just a simple mistake of forgetting the /* leads to the wrong result. In comparison, here’s the same with the proposed file set interface:

fs.difference ./. ./foo/bar

I’d already argue against adding more syntax on the basis of parsing opening up a huge surface for implementation errors, and therefore introducing another source of complexity. We have enough quirks and moving parts to deal with as it is.

While I understand that supporting gitignore’s globbing (or a subset or variant) may add complexity, it would be a massive usability boost, which Nix could certainly benefit from :slight_smile: When I introduce Nix to new coworkers, this is usually one of the things people frown upon the most: how complex it is to just get the right files into the derivation.

Actually, you can already use globbing to exclude files today using pkgs.nix-gitignore, and that can make sense when people are already familiar with globbing and its complexity, and composability is not needed.

File set combinators are good for when that’s not the case. But file set combinators are also well suited as a foundation for implementing functions like pkgs.nix-gitignore, with the added benefit of composability. Does that make sense?

Here’s an example of what I mean regarding composability: Say you’re writing the Nix build for a component under ./some/project of a larger project. There is a .gitignore at the repo root you want to use, but you also only want files from the component directory.

If pkgs.nix-gitignore used file sets underneath, this would be possible:

fs.intersect ./.
  (pkgs.nix-gitignore [] ../..)

Furthermore, if you now wanted to add one file back from some other path in your project, you could do that using

fs.unions [
  (fs.intersect ./.
    (pkgs.nix-gitignore [] ../..)
  )
  ../../some/file/anywhere.sh
]

Comparatively this would be tricky using globbing.


Update: After talking with @roberth, he agreed that I should just go ahead and start incrementally merging PRs implementing this. The first one is merged now, though it’s a very limited interface:

The second one is much more interesting, but I just opened it, reviews appreciated!


Having worked on this for the past months, the file set library is fairly usable now! See File set library tracking issue and feature requests · Issue #266356 · NixOS/nixpkgs · GitHub for status and updates, or if you have feature requests :slight_smile:


I’ve wanted something like fs.union(s) for so long… Blacklisting stuff is pain


This is an awesome library.

I’m wondering: Is support for renaming of files and/or folders planned?

I’m pondering a use-case in which collecting a set of files and folders would improve if I could apply renaming rules :thinking:

@don.dfh We’re limited by what the underlying builtins.path/builtins.filterSource primitive can do, which doesn’t support renaming files.

However, I’d also argue it shouldn’t be in scope, because the job of lib.fileset (or the builtins) is to pick exactly the files that you want to be able to influence derivations. Past that, you can use derivations to further transform them in any way necessary, including renaming, changing and adding files. None of these operations change the selected files.

We do lack a nice general function to do that, but it would be fairly easy to add: Check out Function for transforming store path contents · Issue #264541 · NixOS/nixpkgs · GitHub, where I’m proposing a pkgs.transformStorePath.
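
Until such a function exists, a rough sketch of the "select first, transform in a derivation" approach could look like the following. It uses pkgs.runCommand, not the proposed pkgs.transformStorePath; the derivation name, the ./Makefile and ./src paths, and the rename of src to lib are hypothetical choices for illustration.

```nix
{ pkgs ? import <nixpkgs> { } }:
let
  fs = pkgs.lib.fileset;

  # Step 1: pick exactly the files that should influence the build
  selected = fs.toSource {
    root = ./.;
    fileset = fs.unions [ ./Makefile ./src ];
  };
in
# Step 2: rename at build time, in a separate derivation
pkgs.runCommand "renamed-source" { } ''
  cp -r ${selected} $out
  # Files copied from the store are read-only, so make them writable first
  chmod -R u+w $out
  mv $out/src $out/lib
''
```

The renaming happens inside the build, so the set of files that can trigger a rebuild is still determined only by the file set.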


Is there something I could use to get a file set from a glob? I understand how to use the filters, but it would be really convenient to be able to pass */**/*.{ml,mli} or load up a ./.ignore file from the project root which uses the globbing syntax.

One of the goals of the fileset library is for functions to have obvious semantics, and I think globs are a bit out there regarding that.

Globs are a separate syntax, so they would require a parser implemented in Nix, require people to understand that additional language, and come with edge cases like how files with * or { in their names should be handled. Should ?, ! or other features be implemented too? How would one debug those (lib.fileset.trace wouldn’t work)? Etc.

So while I don’t think it’s a good idea for the lib.fileset library to implement that, it’s a really good foundation for other libraries to be built upon, as it takes care of all the obscurity of builtins.path underneath and exposes an easy and safe interface on top.

Notably there’s already gitignore.nix and pkgs.nix-gitignore that take care of gitignore-style glob filtering. These are still based on lib.sources for now, but could be adapted to return filesets instead for improved composability. In the meantime, you can use lib.fileset.fromSource to convert any lib.sources-based value to a fileset :slight_smile:

Oh and if you want to get Git-tracked files instead, you can use lib.fileset.gitTracked too.
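
As a hedged sketch of how these two functions might be combined, the expression below intersects the Git-tracked files with a lib.sources-based value. It assumes a recent Nixpkgs, that ./. is the root of a Git working tree, and that lib.sources.cleanSource stands in for whatever lib.sources-based filter you already have.

```nix
let
  lib = (import <nixpkgs> { }).lib;
  fs = lib.fileset;

  # Files tracked by Git in the working tree rooted at ./.
  tracked = fs.gitTracked ./.;

  # An existing lib.sources-based value, converted to a file set
  cleaned = fs.fromSource (lib.sources.cleanSource ./.);
in
fs.toSource {
  root = ./.;
  # Keep only files that are both tracked and pass the cleanSource filter
  fileset = fs.intersection tracked cleaned;
}
```

Note that the released library names this operation intersection, whereas the draft in the original post called it intersect.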


Is there a reason that lib.fileset functions aren’t aliased to provide access from the top-level of lib? For most sub-level functions, like lib.lists.singleton, they’re also accessible via lib.singleton. Maybe I’m missing a paradigm here.

Yeah that’s intentional, because without fileset it would either be confusing what these functions do, or it could be assumed they do something else. E.g. I might think that lib.union should be a union of lists, removing duplicates. lib.{from,to}Source wouldn’t make any sense (to/from what is the value being converted?). lib.trace could be confused with builtins.trace (though arguably there should be lib.trace that can trace arbitrary values). lib.difference might be interpreted as difference between integers. Etc.
