Filtering Source Trees with Nix and Nixpkgs

I’m preparing some articles for my coworkers who are learning Nix for the first time, and after wrapping this one up I thought it might be useful to share. For context, the code-base we work in is a monstrous mono-repo where you will get atrocious cache performance if you aren’t really explicit with filters. Caching aside, just copying sub-trees of the workspace unnecessarily can become really costly, making it especially important to filter “purely in Nix” up-front using prim-ops to avoid spurious file copies in an intermediate derivation.

I plan to extend this soon with examples of dead simple source tree construction with runCommandNoCC + cp ${./foo} . style cherry picking, and likely a proper tree-walker.

I hope y’all find it helpful, and feel free to share any critiques/additions.

Source Filtering

Many Nix expressions, particularly those which produce “derivations”
( build recipes ), carry an attribute named src which is used to create your
build area.

In most cases src, and all of its contents, affect the hash of your derivation
such that any file modification in that tree will trigger a cache miss ( rebuild ).

With that in mind, filtering a source tree down to the minimal list of files that
really affect a build’s outputs can make a night and day difference in caching
performance; this is especially true in large source trees such as a mono-repo.

Nix carries a simple builtin function builtins.filterSource which allows us to
perform rudimentary filtering, and Nixpkgs carries a collection of more robust
filters which you will want to become familiar with.

Source Filters Overview

  • builtins.filterSource : simple builtin filter routine.

    • Nice for removing a few files from a shallow source tree.
    • Unlike Nixpkgs’ lib.cleanSource, the builtin function cannot be
      composed with other filters, and cannot be used with the debug tracing
      function lib.sources.trace.
    • Significantly faster than the Nixpkgs filters; “closer to the metal”.
      It is implemented in C++ and takes advantage of platform specific
      filesystem operations directly.
  • lib.cleanSource : sane default filter which removes temporary and version control files.

    • This is your bread and butter filter to apply to a Git project’s top level.
    • Deletes the .git/ directory, which is essential when working with locally
      cloned repositories.
    • Deletes common temporary files such as *~ and .*.sw* produced by many
      editors, build artifacts such as *.o and *.so, any result symlinks
      produced by Nix, and oddball filesystem nodes like sockets.
    • May be composed with other filters, commonly acting as the base filter to
      be applied first.
  • lib.cleanSourceWith : applies custom filters and composes them.

    • Used to run any user defined filter, and build sequences of filters.
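    • Its filter takes the exact same arguments ( path and type ) as the
      predicate given to builtins.filterSource; cleanSourceWith itself takes
      an attribute set { filter, src } and is built on top of Nix’s builtin
      path filtering.
    • Composing filters with lib.cleanSourceWith is significantly more efficient
      than filtering a tree, and passing the result to another filter.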
      Composition merges filter predicates and applies them to a source tree
      once without making multiple copies of the source tree between filters.
    • Composed filters are applied sequentially as:
      f1: f2: ( path: type: ( f1 path type ) && ( f2 path type ) )
  • lib.sources.trace : traces source filters for debugging.

    • Generally used in a REPL when writing a filter.
    • Do not check in derivations that use trace, it’s a debugging function
      intended for interactive use.
    • Used as lib.sources.trace ( lib.cleanSource ./. )
    • I use this literally every time I’m writing a filter, I strongly suggest
      taking advantage of it.
  • lib.sources.sourceByRegex : filters using a list of regular expressions.

    • Given a list of patterns, keeps a file/directory if its path ( relative
      to src ) matches any of them.
    • Great for recursively pulling files by name like .*/package\.json$.
      Directories are tested too, so include patterns that match the parent
      directories of anything you want to keep.
    • Uses regular expressions ( POSIX extended syntax, via builtins.match ),
      not shell globbing.
    • Internally this uses cleanSourceWith, so it may be composed with other
      lib.sources.* cleaners.
    • WARNING: since Nix 2.4 Darwin and Linux platforms can produce
      different results for some patterns, such as the “non-greedy”
      [^a]*? or [^a]{0,}.
      This platform dependent behavior comes from the C++ standard library’s
      regex engine that Nix relies on: LLVM’s libc++ on Darwin and GNU’s
      libstdc++ on Linux do not agree on these cases.
      This issue is not limited to Nix, it applies to any software which uses
      LLVM’s regex implementation - keep your eyes peeled for this elsewhere.
  • lib.sources.sourceFilesBySuffices : keeps only files matching a list of suffixes.

    • Suffixes are interpreted literally - don’t use regex.
    • Almost always used to match file extensions.
    • This largely exists to account for the LLVM debacle mentioned above.
    • Internally this uses cleanSourceWith, so it may be composed with other
      lib.sources.* cleaners.
  • nix-gitignore.gitignoreSource[Pure] : filters using .gitignore syntax.

    • Accepts a list of patterns which are applied following .gitignore style
      filtering rules.
    • gitignoreSource supplements the provided patterns with any .gitignore
      files found in the source tree.
      Reading .gitignore files incurs a considerable performance cost.
      WARNING: if src is not the git repository root, this function will not
      be able to read .gitignore files in parent directories.
      With this in mind I do not suggest using this function in a large mono-repo.
    • gitignoreSourcePure ignores any .gitignore files, and is
      purely declarative.
      This has significantly improved performance and caching, I strongly
      recommend using this function over the impure form in large repositories.
    • In practice I’ve found that lib.sources.sourceByRegex is often much
      easier to reason about; gitignore syntax has wonky rules.
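
The composition rule listed under lib.cleanSourceWith can be tried on its own, without Nixpkgs. This is only a sketch: composeFilters, noGitDir and noObjects are hypothetical names made up for the illustration, and the paths are plain strings rather than real files.

```nix
let
  # Hypothetical helper mirroring the composition rule: keep a path only
  # if both predicates keep it.
  composeFilters = f1: f2: path: type: ( f1 path type ) && ( f2 path type );

  # Hypothetical predicates, written only for this sketch.
  noGitDir = path: type:
    ! ( ( type == "directory" ) && ( ( baseNameOf path ) == ".git" ) );
  noObjects = path: type:
    ! ( ( type == "regular" ) && ( ( builtins.match ".*\\.o" path ) != null ) );

  both = composeFilters noGitDir noObjects;
in {
  mainC  = both "/proj/main.c" "regular";    # true  ( kept )
  gitDir = both "/proj/.git"   "directory";  # false ( removed by noGitDir )
  mainO  = both "/proj/main.o" "regular";    # false ( removed by noObjects )
}
```

Because the merged predicate is a single function, passing it to builtins.filterSource walks the tree once instead of once per filter.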

Basic Filtering Examples

Filtering a local project directory

{ pkgs   ? import <nixpkgs> {}
, stdenv ? pkgs.stdenv
, lib    ? pkgs.lib
}:
stdenv.mkDerivation {
  pname = "my-project";
  version = "0.0.1";
  src = lib.cleanSource ./.;
}

Nesting Filters

Discouraged, but fine for small codebases.

{ pkgs          ? import <nixpkgs> {}
, stdenv        ? pkgs.stdenv
, lib           ? pkgs.lib
, nix-gitignore ? pkgs.nix-gitignore
}:
stdenv.mkDerivation {
  pname = "my-project";
  version = "0.0.1";
  # A bit more efficient than `lib.sources.sourceByRegex' because
  # these directories are pruned immediately, and aren't traversed.
  # `gitignoreSourcePure' returns a source; `gitignoreFilterPure' only
  # builds a predicate.
  src = nix-gitignore.gitignoreSourcePure [
    "node_modules/"
    ".yarn/cache/"
  ] ( lib.cleanSource ./. );
}

Keep only Nix files ( Composed )

{ pkgs          ? import <nixpkgs> {}
, stdenv        ? pkgs.stdenv
, lib           ? pkgs.lib
}:
stdenv.mkDerivation {
  pname = "my-project";
  version = "0.0.1";
  # `sourceFilesBySuffices' uses `cleanSourceWith' internally, so these will
  # compose properly. It takes the source first, then the suffixes to keep.
  src = lib.sources.sourceFilesBySuffices ( lib.cleanSource ./. ) [".nix"];
}

Composing filters

{ pkgs          ? import <nixpkgs> {}
, stdenv        ? pkgs.stdenv
, lib           ? pkgs.lib
}:
let
  inherit (lib) cleanSource cleanSourceWith;
  # Remove `*/.yarn/cache' directories.
  cleanYarnCacheFilter = name: type:
    ! ( ( type == "directory" )                       &&
        ( ( baseNameOf name ) == "cache" )            &&
        ( ( baseNameOf ( dirOf name ) ) == ".yarn" ) # `dirOf' is like `dirname'
      );

  # Remove `*/node_modules' directories
  cleanNodeModulesFilter = name: type:
    ! ( ( type == "directory" ) && ( ( baseNameOf name ) == "node_modules" ) );

  # Clean cached Node.js artifacts, and apply Nix's default clean routine.
  cleanNodeSource = src:
    cleanSourceWith {
      filter = cleanYarnCacheFilter;
      src = cleanSourceWith {
        filter = cleanNodeModulesFilter;
        src = cleanSource src;
      };
    };
in stdenv.mkDerivation {
  pname = "my-project";
  version = "0.0.1";
  src = cleanNodeSource ./.;
}
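
To see that composition merges predicates rather than copying the tree, the value returned by cleanSourceWith can be inspected directly. This is a sketch only: origSrc and filter are internal attributes of lib.sources values and may change between Nixpkgs releases, noNodeModules is a hypothetical predicate, and the three paths do not need to exist.

```nix
let
  pkgs = import <nixpkgs> {};
  inherit (pkgs.lib) cleanSource cleanSourceWith;

  # Hypothetical predicate for this sketch: drop `node_modules' directories.
  noNodeModules = name: type:
    ! ( ( type == "directory" ) && ( ( baseNameOf name ) == "node_modules" ) );

  composed = cleanSourceWith { filter = noNodeModules; src = cleanSource ./.; };
in {
  # Still the original, unfiltered path: nothing was copied between filters.
  inherit (composed) origSrc;
  # The merged predicate rejects whatever either filter rejects.
  nodeModules = composed.filter ( toString ./node_modules ) "directory"; # false
  gitDir      = composed.filter ( toString ./.git ) "directory";         # false
  readme      = composed.filter ( toString ./README.md ) "regular";      # true
}
```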

Writing filters

Both builtins.filterSource and lib.sources.cleanSourceWith accept functions with
the prototype filter :: Name (Path/File Name) -> Type (Filetype) -> Boolean
where Type ::= "directory" | "regular" | "symlink" | "unknown", “unknown”
being Sockets and other oddball filesystem node types.
A filter function is expected to return true if a file should be preserved,
and false if a file should be removed.

An example filter which unconditionally preserves all files would be:

let myFilter = name: type: true;
in builtins.filterSource myFilter ./.
# Or equivalently ( since we never reference `name' or `type' ) :
builtins.filterSource ( _: _: true ) ./.

The filter above could also be used with Nixpkgs’ lib.cleanSourceWith:

lib.cleanSourceWith { filter = _: _: true; src = ./.; }
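
As a slightly less trivial illustration of the type argument, here is a sketch of a filter that decides purely on the node type. The name plainFilesOnly is hypothetical.

```nix
let
  # Hypothetical predicate: keep directories and regular files, but drop
  # symlinks and "unknown" nodes such as sockets.
  plainFilesOnly = name: type:
    builtins.elem type [ "directory" "regular" ];
in builtins.filterSource plainFilesOnly ./.
```

Keeping "directory" in the list matters: a directory that is rejected is removed together with everything inside it.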

A more useful filter: Optimized node2nix ignores.

node2nix, a common Nix utility, uses nix-gitignore.gitignoreSourcePure
to pluck just the package.json and package-lock.json files from a
source tree.

In this case, nix-gitignore.gitignoreSourcePure is somewhat unnecessary,
since the underlying predicate “only keep these two files” is dead simple.
When building a single package you should really just use
builtins.pathExists with srcs = [...]; but because node2nix is
designed to possibly build groups of packages in a tree, it needs to
preserve package[-lock].json files recursively.
However, we can still improve on the existing implementation by writing a
filter directly, which will save us from evaluating an unnecessarily
expensive routine.

In summary, our goal is simply: filter a source tree, preserving only
package.json and package-lock.json files.

  1. Rough Draft

    A naive implementation of this routine would resemble the example above for
    “Composing filters”, but that routine would fail to preserve files
    in subdirectories.
    The example below illustrates the issue.
    The output we are hoping for is:
    [ "foo/" "package-lock.json" "package.json" ]

    mkdir -p /tmp/naive-nix-filter/foo;
    cd /tmp/naive-nix-filter >/dev/null;
    touch {,foo/}package{,-lock}.json;
    cat <<'EOF' > naive.nix
    let
      inherit (builtins) attrValues mapAttrs readDir filterSource;
    
      # A useful idiom for the REPL.
      ls = dir:
        let f2str = n: t: if t == "directory" then n + "/" else n;
        in attrValues ( mapAttrs f2str ( readDir dir ) );
    
      naiveFilter = name: type: let bname = baseNameOf name; in
        ( type == "regular" ) &&
        ( ( bname == "package.json" ) || ( bname == "package-lock.json" ) );
    
      src = filterSource naiveFilter ./.;
    in ls src
    EOF
    nix eval -f ./naive.nix;
    cd - >/dev/null;
    rm -rf /tmp/naive-nix-filter;
    
    # ==> [ "package-lock.json" "package.json" ]
    

    The behavior above illustrates that Nix applies filters top-down: a directory
    is tested before its contents, and a rejected directory is never entered. This
    makes sense considering we don’t want to traverse directories that are going to
    be pruned.
    But in this case, we actually don’t want to delete foo/, so let’s try preserving
    all directories this time, and for kicks we’ll add a directory with a file that we
    DO want to be deleted.

  2. A Working Filter

    This round we will opt for cleanSourceWith so we can trace into subdirs;
    while we could go down the rabbit hole of extending ls, I’ll leave that as an
    exercise for the reader.
    Our example does a bit of post processing on the trace output to make it more
    readable for our purposes.

    mkdir -p /tmp/naive-nix-filter/foo;
    cd /tmp/naive-nix-filter >/dev/null;
    touch {,foo/}package{,-lock}.json;
    mkdir bar;
    touch bar/baz;
    cat <<'EOF' > naive.nix
    let
      inherit (builtins) attrValues mapAttrs readDir filterSource;
      pkgs = import <nixpkgs> {};
      inherit (pkgs.lib.sources) trace cleanSourceWith;
    
      naiveFilter = name: type: let bname = baseNameOf name; in
        ( type == "regular" ) &&
        ( ( bname == "package.json" ) || ( bname == "package-lock.json" ) );
    
      betterFilter = name: type:
        ( type == "directory" ) || ( naiveFilter name type );
    
    in trace ( cleanSourceWith { filter = betterFilter; src = ./.; } )
    EOF
    # `trace' prints to `stderr' and we don't care about the result so redirect.
    nix eval -f ./naive.nix 3>&1 1>/dev/null 2>&3-       \
      |grep ' = true$'|cut -d' ' -f3|sed "s,.*$PWD,.,";
    cd - >/dev/null;
    rm -rf /tmp/naive-nix-filter;
    
    # ==>
    # ./bar
    # ./foo
    # ./foo/package-lock.json
    # ./foo/package.json
    # ./package-lock.json
    # ./package.json
    

    This is looking much better: we preserved our package[-lock].json files
    recursively as intended, and we deleted the garbage file bar/baz as well.
    The only thing we might want to improve here would be to prune the empty
    directory bar/, since deleting bar/baz makes it pretty useless.
    I’ll note that while this detail seems somewhat pedantic, the existence of
    this directory would produce a different hash than if it were deleted.
    Since our goal was truly “preserve only package[-lock].json files”, let’s
    see if we can delete it.

  3. Diving Off the Deep End

    Let me preface this example by saying, “yes this is incredibly extra, and I
    don’t actually expect anyone to know how to leverage cleanSource internals
    in this way in a regular derivation”.

    mkdir -p /tmp/naive-nix-filter/foo;
    cd /tmp/naive-nix-filter >/dev/null;
    touch {,foo/}package{,-lock}.json;
    mkdir bar;
    touch bar/baz;
    cat <<'EOF' > naive.nix
    let
      inherit (builtins) attrValues mapAttrs readDir filterSource;
      pkgs = import <nixpkgs> {};
      inherit (pkgs.lib) hasPrefix;
      inherit (pkgs.lib.sources) trace cleanSourceWith;
    
      naiveFilter = name: type: let bname = baseNameOf name; in
        ( type == "regular" ) &&
        ( ( bname == "package.json" ) || ( bname == "package-lock.json" ) );
    
      # If your eyes glaze over while reading this, that's totally fine.
      # The overview is: apply our original filter to regular files,
      # but for directories perform a dry run of filtering recursively,
      # and remove any directories which would end up being empty
      # after cleaning.
      # This is less efficient than a tree walker ( by a lot ), but
      # it's still better than the `gitignoreSourcePure' routine.
      filterSourceRmdir = filt: src:
        let
          pstr = p: if ( p ? origSrc ) then p.origSrc else ( toString p );
          absPath = p:
            let op = pstr p;
            in if ( hasPrefix "/" op ) then op else ( ( pstr src ) + "/${p}" );
          fr = n: t:
            ( filt n t ) ||
            ( ( t == "directory" ) &&
              ( ( readDir ( filterSourceRmdir filt ( absPath n ) ).outPath )
                != {} )
            );
        in cleanSourceWith { filter = fr; inherit src; };
    
      bestFilter = filterSourceRmdir naiveFilter;
    
    in trace ( bestFilter ./. )
    EOF
    # `trace' prints to `stderr' and we don't care about the result so redirect.
    nix eval -f ./naive.nix 3>&1 1>/dev/null 2>&3-       \
      |grep ' = true$'|cut -d' ' -f3|sed "s,.*$PWD,.,";
    cd - >/dev/null;
    rm -rf /tmp/naive-nix-filter;
    
    # ==>
    # ./foo
    # ./foo/package-lock.json
    # ./foo/package.json
    # ./package-lock.json
    # ./package.json
    

    VICTORY!

    … Well alright I’ll admit, while we got the expected output, we’ve created an
    abomination - the likes of which was probably not worth the effort and is still
    incredibly wasteful because nested directories at depth N are redundantly
    processed N - 1 times.

    A far better implementation is a simple tree walker that collects matching files,
    but this was nonetheless a good way to explore some advanced filtering.
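
    For comparison, a tree walker along those lines could look roughly like the
    sketch below. It is only an illustration: walk and wanted are hypothetical
    names, the walker visits every directory exactly once, and it returns a list
    of paths rather than a filtered source tree.

    ```nix
    let
      inherit (builtins) readDir attrNames concatMap elem;

      wanted = [ "package.json" "package-lock.json" ];

      # Hypothetical walker: visit each directory once and return the paths
      # of every wanted regular file beneath it.
      walk = dir:
        let entries = readDir dir;
        in concatMap ( name:
          let
            type = entries.${name};
            path = dir + "/${name}";
          in if type == "directory" then walk path
             else if ( type == "regular" ) && ( elem name wanted ) then [ path ]
             else []
        ) ( attrNames entries );
    in walk ./.
    ```

    Directories that contain no matching files never show up in the result, so
    there is nothing to clean up afterwards. Note that this sketch does not skip
    large directories such as node_modules; a real walker would want to prune
    those by name before recursing.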

  4. Recommended Solution

    A low effort, but highly effective approach that I actually suggest using in the
    field is to use betterFilter, then use stdenvNoCC.mkDerivation with find
    and rmdir.

    In this example I’ll do our workspace setup as a derivation, and simply use readDir
    to confirm that the directory bar is deleted.
    We could just as well use the same pattern from the previous examples where we create
    the workspace in sh; I’m just using this as an opportunity to sneak in an example
    using runCommandNoCC.

    This means we will use an intermediate builder rather than producing a single derivation,
    so we take a small hit on space in the Nix store, and on time spent spinning up a builder
    ( stdenvNoCC.mkDerivation has a relatively small footprint ); but we would
    have saved ~1 1/2 hours of reading, and would still get the hash we care about
    for our “real build”.

    let
      inherit (builtins) attrValues mapAttrs readDir filterSource;
      pkgs = import <nixpkgs> {};
      inherit (pkgs) stdenvNoCC runCommandNoCC;
      inherit (pkgs.lib.sources) trace cleanSourceWith;
    
      example-src = runCommandNoCC "source" {} ''
        mkdir -p $out/foo $out/bar;
        touch $out/{,foo/}package{,-lock}.json $out/bar/baz;
      '';
    
      naiveFilter = name: type: let bname = baseNameOf name; in
        ( type == "regular" ) &&
        ( ( bname == "package.json" ) || ( bname == "package-lock.json" ) );
    
      betterFilter = name: type:
        ( type == "directory" ) || ( naiveFilter name type );
    
      mostlyClean = cleanSourceWith { filter = betterFilter; src = example-src; };
    
      fullyClean = stdenvNoCC.mkDerivation {
        inherit (mostlyClean) name;
        src = mostlyClean;
        phases = ["unpackPhase" "buildPhase" "installPhase"];
        # `-depth' visits children first, so parents that become empty are removed too.
        buildPhase = "find . -depth -type d -empty -exec rmdir {} \\;";
        installPhase = "cp -pr --reflink=auto -- . $out";
        # This is such a quick operation that queuing a remote builder isn't worth
        # the effort; `preferLocalBuild' will cause this to be run locally.
        preferLocalBuild = true;
      };
    in readDir fullyClean.outPath
    
    # ==> { foo = "directory"; "package-lock.json" = "regular"; "package.json" = "regular"; }
    

    In practice, we really don’t mind producing a large number of small derivations as
    long as we have valid reasons to do so. Unlike docker containers, a builder runs
    with very low overhead: for a trivial routine like this we really just make a
    temporary directory, add some junk to PATH, copy our inputs, and delete the
    temporary directory afterwards, without any virtualization or other nonsense.

    In this case our “inputs” are just a few text files, so this intermediate isn’t
    expensive to spin up - you wouldn’t want to do this on a large unfiltered mono-repo!

    Using technically unnecessary intermediate derivations is a common practice when
    working on projects which are under active development, and may fail - adding
    intermediate derivations with well defined inputs allows us to effectively
    create cacheable “check-points” that we can revert to in the event of a failure.

I’m struggling to understand how fullyClean will actually have a hash that does not depend on the empty directories of mostlyClean, since I don’t see anything that makes it not-source-addressed. (I’m trying to prevent recompilations that may be caused by adding and removing directories in an otherwise largely filtered tree)

Oh you might be interested in checking out lib.fileset, which didn’t exist when the original post was published.

Thanks! I was just looking at it but I still couldn’t find a way in there to do the regular directory traversal that filterSource does, but only creating a directory once we’re about to put an actual (not filtered-out) file inside, in order to not update hashes and cause recompilations based on unrelated directories.
It does look like something that I’m very likely not the first to need and should consequently already exist in Nix…
Is there such a thing?

EDIT: Ah this post seems to suggest that maybe lib.fileset by default does filter empty directories? Does this mean then that #8820 (that you opened!) is out of date?

So I’m trying to use fileset.fileFilter and the documentation specifies that type may not be directory, unlike with cleanSourceWith.
Does that mean that fileset will always traverse the entire file tree recursively?
(I would like to avoid e.g. target and node_modules because those tend to be huge so slow to traverse and I know nothing interesting is in there)

EDIT: looks like combining it with lib.fileset.fromSource may be a fix…
EDIT2: Yeah doing fromSource of a filtered source then toSource again works.
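
A minimal sketch of that round trip, assuming a Nixpkgs recent enough to ship lib.fileset.fromSource. The directory names are the ones mentioned above, and pruned is a hypothetical binding.

```nix
let
  pkgs = import <nixpkgs> {};
  inherit (pkgs) lib;

  # Prune big directories by name before `lib.fileset' ever sees them.
  pruned = lib.cleanSourceWith {
    src = ./.;
    filter = path: type:
      ! ( ( type == "directory" ) &&
          ( builtins.elem ( baseNameOf path ) [ "target" "node_modules" ] ) );
  };
in lib.fileset.toSource {
  root = ./.;
  fileset = lib.fileset.fromSource pruned;
}
```

Since a file set only contains files, directories that end up empty are not part of the resulting store path.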

Indeed, lib.fileset cannot express the notion of “empty directories” at all, because it (conceptually) has no notion of directories (it’s just a set of files)! That’s also why the issue there is still up-to-date: there might be use cases where you do need empty directories, so when you convert to a store path, you’d need some non-fileset argument to specify that. The issue there however is just for efficiency reasons. Having that feature would allow lib.fileset to be faster.

fileset.fileFilter does recurse all the way if you use it as-is, but because the combinators are lazy, you can prevent that from happening:

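# All rust files in the current directory that are not in ./target
# This prevents recursing into ./target!
# `difference' and `fileFilter' both come from `lib.fileset'.
let inherit (( import <nixpkgs> {} ).lib.fileset) difference fileFilter;
in difference
  (fileFilter (file: file.hasExt "rs") ./.)
  ./target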

As you discovered, you can also convert from lib.sources-based values, though I’d avoid that since you lose some laziness properties that way.

Note that for the use-case of filtering out directories if you just know their name, but not their path, there’s a recent discussion in NixOS/nixpkgs issue #271307, “`lib.fileset` should have a way to filter directories”.