I’m preparing some articles for my coworkers who are learning Nix for the first time, and after wrapping this one up I thought it might be useful to share. For context, the code-base we work in is a monstrous mono-repo where you will get atrocious cache performance if you aren’t really explicit with filters. Caching aside, just copying sub-trees of the workspace unnecessarily can become really costly, making it especially important to filter “purely in Nix” up-front using prim-ops to avoid spurious file copies in an intermediate derivation.
I plan to extend this soon with examples of dead simple source tree construction with runCommandNoCC + cp ${./foo} . style cherry picking, and likely a proper tree-walker.
I hope y’all find it helpful, and feel free to share any critiques/additions.
Source Filtering
Many Nix expressions, particularly those which produce “derivations”
( build recipes ), carry an attribute named src which is used to create your
build area.
In most cases src, and all of its contents, affect the hash of your derivation
such that any file modification in that tree will trigger a cache miss ( rebuild ).
With that in mind, filtering a source tree down to the minimal list of files that
really affect a build’s outputs can make a night and day difference in caching
performance; this is especially true in large source trees such as a mono-repo.
Nix carries a simple builtin function, builtins.filterSource, which allows us to
perform rudimentary filtering, and Nixpkgs carries a collection of more robust
filters which you will want to become familiar with.
Source Filters Overview
- builtins.filterSource: simple builtin filter routine.
  - Nice for removing a few files from a shallow source tree.
  - Unlike Nixpkgs’ lib.cleanSource, the builtin function cannot be
    composed with other filters, and cannot be used with the debug tracing
    function lib.sources.trace.
  - Significantly faster than the Nixpkgs filters; “closer to the metal”.
    It is implemented in C++ and takes advantage of platform specific
    filesystem operations directly.
- lib.cleanSource: sane default filter which removes temporary and VC files.
  - This is your bread and butter filter to apply to a Git project’s top level.
  - Removes the .git/ directory, which is essential when working with locally
    cloned repositories.
  - Removes common temporary files such as *~ and .*.sw* produced by many
    editors, build artifacts such as *.o and *.so, any result symlinks
    produced by Nix, and oddball filesystem nodes like sockets.
  - May be composed with other filters, commonly acting as the base filter to
    be applied first.
- lib.cleanSourceWith: applies custom filters and composes them.
  - Used to run any user defined filter, and build sequences of filters.
  - Takes the same kind of filter predicate seen in builtins.filterSource
    ( passed in an attribute set as { filter, src } ), and is built on top of
    the builtin.
  - Composing filters with lib.cleanSourceWith is significantly more efficient
    than filtering a tree and passing the result to another filter.
    Composition merges filter predicates and applies them to a source tree
    once without making multiple copies of the source tree between filters.
  - Composed filters are applied sequentially as:
    f1: f2: ( path: type: ( f1 path type ) && ( f2 path type ) )
- lib.sources.trace: traces source filters for debugging.
  - Generally used in a REPL when writing a filter.
  - Do not check in derivations that use trace; it’s a debugging function
    intended for interactive use.
  - Used as lib.sources.trace ( lib.cleanSource ./. )
  - I use this literally every time I’m writing a filter, and I strongly suggest
    taking advantage of it.
- lib.sources.sourceByRegex: produces a filter from regular expressions.
  - Given a list of patterns, returns true if a file/directory matches any.
  - Great for recursively pulling files by name like .*/package\.json$ .
  - Uses regular expressions ( the POSIX extended syntax accepted by
    builtins.match ), not shell globbing.
  - Internally this uses cleanSourceWith, so it may be composed with other
    lib.sources.* cleaners.
  - WARNING: since Nix 2.4, Darwin and Linux platforms will produce
    different results when attempting “non-greedy” matches such as
    [^a]*? or [^a]{0,} .
    This platform dependent behavior comes from differences between the C++
    standard library regex implementations Nix is built against ( LLVM’s
    libc++ on Darwin, GNU’s libstdc++ on Linux ), not from Nix itself.
    This issue is not limited to Nix; it applies to any software which uses
    LLVM’s regex implementation, so keep your eyes peeled for this elsewhere.
- lib.sources.sourceFilesBySuffices: produces a filter from a list of suffixes.
  - Suffixes are interpreted literally - don’t use regex.
  - Almost always used to match file extensions.
  - This largely exists to account for the platform dependent regex behavior
    mentioned above.
  - Internally this uses cleanSourceWith, so it may be composed with other
    lib.sources.* cleaners.
- nix-gitignore.gitignoreSource[Pure]: filters using .gitignore syntax.
  - Accepts a list of patterns which are applied following .gitignore style
    filtering rules.
  - gitignoreSource supplements the provided patterns with any .gitignore
    files found in the source tree.
    Reading .gitignore files incurs a considerable performance cost.
    WARNING: if src is not the git repository root, this function will not
    be able to read .gitignore files in parent directories.
    With this in mind I do not suggest using this function in a large mono-repo.
  - gitignoreSourcePure ignores any .gitignore files, and is
    purely declarative.
    This has significantly improved performance and caching; I strongly
    recommend using this function over the impure form in large repositories.
  - In practice I’ve found that lib.sources.sourceByRegex is often much
    easier to reason about; gitignore syntax has wonky rules.
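The composition rule for lib.cleanSourceWith can be checked without touching the filesystem, since a filter is just a function of a path string and a type. Here is a minimal sketch of that idea; noGit, noNodeModules and both are hypothetical names made up for illustration, not Nixpkgs functions, and the paths are dummy strings.

```nix
let
  # Hypothetical predicates for illustration; neither is part of Nixpkgs.
  noGit = path: type:
    ! ( type == "directory" && baseNameOf path == ".git" );
  noNodeModules = path: type:
    ! ( type == "directory" && baseNameOf path == "node_modules" );

  # Merge two predicates into one: an entry is kept only if both keep it.
  both = f1: f2: path: type: ( f1 path type ) && ( f2 path type );

  keep = both noGit noNodeModules;
in {
  src         = keep "/repo/src" "directory";
  git         = keep "/repo/.git" "directory";
  nodeModules = keep "/repo/node_modules" "directory";
  readme      = keep "/repo/README.md" "regular";
}
# ==> { git = false; nodeModules = false; readme = true; src = true; }
```

Evaluating this with nix-instantiate --eval --strict shows that the merged predicate rejects an entry as soon as either filter rejects it, which is why a composed filter only needs one pass over the tree.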
Basic Filtering Examples
Filtering a local project directory
{ pkgs ? import <nixpkgs> {}
, stdenv ? pkgs.stdenv
, lib ? pkgs.lib
}:
stdenv.mkDerivation {
  pname = "my-project";
  version = "0.0.1";
  src = lib.cleanSource ./.;
}
Nesting Filters
Discouraged, but fine for small codebases.
{ pkgs ? import <nixpkgs> {}
, stdenv ? pkgs.stdenv
, lib ? pkgs.lib
, nix-gitignore ? pkgs.nix-gitignore
}:
stdenv.mkDerivation {
  pname = "my-project";
  version = "0.0.1";
  # A bit more efficient than `lib.sources.sourceByRegex' because
  # these directories are pruned immediately, and aren't traversed.
  # Note: `gitignoreSourcePure' returns a source; `gitignoreFilterPure'
  # only builds a predicate and cannot be assigned to `src' directly.
  src = nix-gitignore.gitignoreSourcePure [
    "node_modules/"
    ".yarn/cache/"
  ] ( lib.cleanSource ./. );
}
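Keep only Nix files ( Composed )
{ pkgs ? import <nixpkgs> {}
, stdenv ? pkgs.stdenv
, lib ? pkgs.lib
}:
stdenv.mkDerivation {
  pname = "my-project";
  version = "0.0.1";
  # `sourceFilesBySuffices' keeps directories plus any file ending in one of
  # the listed suffixes, and drops everything else.
  # It uses `cleanSourceWith' internally, so these will compose properly.
  # The source is the first argument, the suffix list is the second.
  src = lib.sources.sourceFilesBySuffices ( lib.cleanSource ./. ) [ ".nix" ];
}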
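For comparison, lib.sources.sourceByRegex composes the same way. This is only a sketch, assuming a hypothetical layout with a Makefile at the top level and C sources directly under src/. As far as I can tell from the Nixpkgs source, each pattern has to match the whole path relative to the source root, and a directory has to match a pattern itself or it is pruned before its contents are tested.

```nix
{ pkgs   ? import <nixpkgs> {}
, stdenv ? pkgs.stdenv
, lib    ? pkgs.lib
}:
stdenv.mkDerivation {
  pname   = "my-project";
  version = "0.0.1";
  # Hypothetical layout: ./Makefile and ./src/*.c
  src = lib.sources.sourceByRegex ( lib.cleanSource ./. ) [
    "Makefile"
    "src"          # keep the directory itself so that it is traversed
    "src/.*\\.c"   # `\\.' in a Nix string is the regex `\.'
  ];
}
```

Wrapping this in lib.sources.trace from a REPL is a quick way to confirm which paths a given set of patterns really keeps.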
Composing filters
{ pkgs ? import <nixpkgs> {}
, stdenv ? pkgs.stdenv
, lib ? pkgs.lib
}:
let
  inherit (lib) cleanSource cleanSourceWith;
  # Remove `*/.yarn/cache' directories.
  cleanYarnCacheFilter = name: type:
    ! ( ( type == "directory" ) &&
        ( ( baseNameOf name ) == "cache" ) &&
        ( ( baseNameOf ( dirOf name ) ) == ".yarn" ) # `dirOf' is like `dirname'
      );
  # Remove `*/node_modules' directories
  cleanNodeModulesFilter = name: type:
    ! ( ( type == "directory" ) && ( ( baseNameOf name ) == "node_modules" ) );
  # Clean cached Node.js artifacts, and apply Nix's default clean routine.
  cleanNodeSource = src:
    cleanSourceWith {
      filter = cleanYarnCacheFilter;
      src = cleanSourceWith {
        filter = cleanNodeModulesFilter;
        src = cleanSource src;
      };
    };
in stdenv.mkDerivation {
  pname = "my-project";
  version = "0.0.1";
  src = cleanNodeSource ./.;
}
Writing filters
Both builtins.filterSource and lib.sources.cleanSourceWith accept functions with
the prototype filter :: Name (Path/File Name) -> Type (Filetype) -> Boolean
where Type ::= "directory" | "regular" | "symlink" | "unknown", “unknown”
being sockets and other oddball filesystem node types.
A filter function is expected to return true if a file should be preserved,
and false if a file should be removed.
An example filter which unconditionally preserves all files would be:
let myFilter = name: type: true;
in builtins.filterSource myFilter ./.
# Or equivalently ( since we never reference `name' or `type' ) :
builtins.filterSource ( _: _: true ) ./.
The filter above could also be used with Nixpkgs’ lib.cleanSourceWith:
lib.cleanSourceWith { filter = _: _: true; src = ./.; }
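A filter that actually inspects its arguments can be exercised the same way, by calling it on sample inputs before pointing it at a real tree. The sketch below uses a hypothetical predicate, noMarkdown, which drops regular files ending in .md and keeps everything else; the paths are dummy strings used only for illustration.

```nix
let
  # Hypothetical filter: drop Markdown files, keep everything else.
  # `builtins.match' must match the whole string, so `.*\.md' means
  # "ends in .md"; `\\.' in a Nix string is the regex `\.'.
  noMarkdown = name: type:
    ! ( type == "regular" &&
        builtins.match ".*\\.md" ( baseNameOf name ) != null );
in map ( args: noMarkdown args.name args.type ) [
  { name = "/proj/README.md"; type = "regular"; }
  { name = "/proj/docs.md";   type = "directory"; }
  { name = "/proj/main.c";    type = "regular"; }
]
# ==> [ false true true ]
```

The directory named docs.md survives because the predicate checks type as well as name; the same function can then be handed to builtins.filterSource noMarkdown ./. or to lib.cleanSourceWith.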
A more useful filter: Optimized node2nix ignores.
A common Nix utility, node2nix, uses nix-gitignore.gitignoreSourcePure
to pluck just the package.json and package-lock.json files from a
source tree.
In this case, nix-gitignore.gitignoreSourcePure is somewhat unnecessary,
since the underlying predicate “only keep these two files” is dead simple.
When building a single package, you should really use
builtins.pathExists with srcs = [...]; but because node2nix is
designed to possibly build groups of packages in a tree, it needs to
preserve package[-lock].json files recursively.
However, we can still improve the existing implementation by writing a
filter directly, which will save us from evaluating an unnecessarily
expensive routine.
In summary, our goal is simply: filter a source tree, preserving only
package.json and package-lock.json files.
Rough Draft
A naive implementation of this routine would resemble the example above for
“Composing filters”, but that routine would fail to preserve files
in subdirectories.
The example below illustrates the issue.
The expected output is:
[ "foo/" "package-lock.json" "package.json" ]
mkdir -p /tmp/naive-nix-filter/foo;
cd /tmp/naive-nix-filter >/dev/null;
touch {,foo/}package{,-lock}.json;
cat <<'EOF' > naive.nix
let
  inherit (builtins) attrValues mapAttrs readDir filterSource;
  # A useful idiom for the REPL.
  ls = dir: let
    f2str = n: t: if t == "directory" then n + "/" else n;
  in attrValues ( mapAttrs f2str ( readDir dir ) );
  naiveFilter = name: type: let
    bname = baseNameOf name;
  in ( type == "regular" ) &&
     ( ( bname == "package.json" ) || ( bname == "package-lock.json" ) );
  src = filterSource naiveFilter ./.;
in ls src
EOF
nix eval -f ./naive.nix;
cd - >/dev/null;
rm -rf /tmp/naive-nix-filter;
# ==> [ "package-lock.json" "package.json" ]
The behavior above illustrates that Nix applies filters top-down: a directory is
tested before its contents, and a rejected directory is pruned without being
traversed, which makes sense considering we don’t want to walk directories that
are going to be removed.
But in this case, we actually don’t want to delete foo/, so let’s try preserving
all directories this time, and for kicks we’ll add a directory with a file that we
DO want to be deleted.
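The pruning behavior can be modeled in pure Nix, which makes it easier to see why the naive filter loses foo/. The sketch below is a toy model rather than how Nix implements filterSource: directories are attribute sets, regular files are strings, and the hypothetical kept function tests each entry before descending into it.

```nix
let
  inherit (builtins) attrNames concatMap isAttrs;

  # Toy tree: attribute sets are directories, strings are regular files.
  tree = {
    "package.json" = "file";
    foo = { "package.json" = "file"; };
  };

  # Model of top-down filtering: test an entry first, and only recurse
  # into a directory if the filter kept it.
  kept = filt: prefix: dir: concatMap ( name:
    let
      node = dir.${name};
      type = if isAttrs node then "directory" else "regular";
      path = "${prefix}/${name}";
    in if ! ( filt path type ) then [ ]
       else [ path ] ++
            ( if type == "directory" then kept filt path node else [ ] )
  ) ( attrNames dir );

  naiveFilter = name: type:
    type == "regular" && baseNameOf name == "package.json";
  keepDirsFilter = name: type:
    type == "directory" || naiveFilter name type;
in {
  naive    = kept naiveFilter "." tree;
  keepDirs = kept keepDirsFilter "." tree;
}
# ==> { keepDirs = [ "./foo" "./foo/package.json" "./package.json" ]; naive = [ "./package.json" ]; }
```

With the naive predicate, ./foo is rejected before ./foo/package.json is ever tested; letting directories through is what allows the nested file to be reached.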
A Working Filter
This round we will opt for cleanSourceWith so we can trace into subdirs;
while we could go down the rabbit hole of extending ls, I’ll leave that as an
exercise to the reader.
Our example does a bit of post processing on the trace output to make it more
readable for our purposes.
mkdir -p /tmp/naive-nix-filter/foo;
cd /tmp/naive-nix-filter >/dev/null;
touch {,foo/}package{,-lock}.json;
mkdir bar;
touch bar/baz;
cat <<'EOF' > naive.nix
let
  inherit (builtins) attrValues mapAttrs readDir filterSource;
  pkgs = import <nixpkgs> {};
  inherit (pkgs.lib.sources) trace cleanSourceWith;
  naiveFilter = name: type: let
    bname = baseNameOf name;
  in ( type == "regular" ) &&
     ( ( bname == "package.json" ) || ( bname == "package-lock.json" ) );
  betterFilter = name: type:
    ( type == "directory" ) || ( naiveFilter name type );
in trace ( cleanSourceWith { filter = betterFilter; src = ./.; } )
EOF
# `trace' prints to `stderr' and we don't care about the result so redirect.
nix eval -f ./naive.nix 3>&1 1>/dev/null 2>&3- \
  |grep ' = true$'|cut -d' ' -f3|sed "s,.*$PWD,.,";
cd - >/dev/null;
rm -rf /tmp/naive-nix-filter;
# ==>
# ./bar
# ./foo
# ./foo/package-lock.json
# ./foo/package.json
# ./package-lock.json
# ./package.json
This is looking much better: we preserved our package[-lock].json files
recursively as intended, and we deleted the garbage file bar/baz as well.
The only thing we might want to improve here would be to prune the empty
directory bar/, since deleting bar/baz makes it pretty useless.
I’ll note that while this detail seems somewhat pedantic, the existence of
this directory would produce a different hash than if it were deleted.
Since our goal was truly “preserve only package[-lock].json files”, let’s
see if we can delete it.
Diving Off the Deep End
Let me preface this example by saying, “yes this is incredibly extra, and I
don’t actually expect anyone to know how to leverage cleanSource internals
in this way in a regular derivation”.
mkdir -p /tmp/naive-nix-filter/foo;
cd /tmp/naive-nix-filter >/dev/null;
touch {,foo/}package{,-lock}.json;
mkdir bar;
touch bar/baz;
cat <<'EOF' > naive.nix
let
  inherit (builtins) attrValues mapAttrs readDir filterSource;
  pkgs = import <nixpkgs> {};
  inherit (pkgs.lib) hasPrefix;
  inherit (pkgs.lib.sources) trace cleanSourceWith;
  naiveFilter = name: type: let
    bname = baseNameOf name;
  in ( type == "regular" ) &&
     ( ( bname == "package.json" ) || ( bname == "package-lock.json" ) );
  # If your eyes glaze over while reading this, that's totally fine.
  # The overview: apply our original filter to regular files,
  # but for directories perform a dry run of filtering recursively,
  # and remove any directories which would end up being empty
  # after cleaning.
  # This is less efficient than a tree walker ( by a lot ), but
  # it's still better than the `gitignoreSourcePure' routine.
  filterSourceRmdir = filt: src: let
    pstr = p: if ( p ? origSrc ) then p.origSrc else ( toString p );
    absPath = p: let op = pstr p; in
      if ( hasPrefix "/" op ) then op else ( ( pstr src ) + "/${p}" );
    fr = n: t:
      ( filt n t ) ||
      ( ( t == "directory" ) &&
        ( ( readDir ( filterSourceRmdir filt ( absPath n ) ).outPath ) != {} ) );
  in cleanSourceWith { filter = fr; inherit src; };
  bestFilter = filterSourceRmdir naiveFilter;
in trace ( bestFilter ./. )
EOF
# `trace' prints to `stderr' and we don't care about the result so redirect.
nix eval -f ./naive.nix 3>&1 1>/dev/null 2>&3- \
  |grep ' = true$'|cut -d' ' -f3|sed "s,.*$PWD,.,";
cd - >/dev/null;
rm -rf /tmp/naive-nix-filter;
# ==>
# ./foo
# ./foo/package-lock.json
# ./foo/package.json
# ./package-lock.json
# ./package.json
VICTORY!
… Well alright I’ll admit, while we got the expected output, we’ve created an
abomination - the likes of which was probably not worth the effort and is still
incredibly wasteful because nested directories at depth N are redundantly
processed N - 1 times.
A far better implementation is a simple tree walker that collects matching files,
but this was nonetheless a good way to explore some advanced filtering.
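For what it’s worth, a tree walker along those lines might look something like the sketch below. It is an illustration under my own assumptions, not a tested replacement for the filters above: walk, hasPrefix and walkedFilter are hypothetical helpers defined here using only builtins, and the walker reads each directory once up front so the filter itself never has to recurse.

```nix
let
  inherit (builtins) readDir attrNames concatMap elem any
                     substring stringLength filterSource;

  # Hypothetical helper: collect the paths ( relative to the starting
  # directory ) of regular files whose base name is listed in `wanted'.
  walk = wanted: dir: rel: let
    entries = readDir dir;
  in concatMap ( name: let
      type    = entries.${name};
      relPath = if rel == "" then name else "${rel}/${name}";
    in if type == "directory" then walk wanted ( dir + "/${name}" ) relPath
       else if type == "regular" && elem name wanted then [ relPath ]
       else [ ]
  ) ( attrNames entries );

  root    = ./.;
  matches = walk [ "package.json" "package-lock.json" ] root "";
  # Absolute path strings, which is what a filter receives as `name'.
  keep    = map ( rel: "${toString root}/${rel}" ) matches;

  # Local stand-in for `lib.hasPrefix' so the sketch needs no Nixpkgs.
  hasPrefix = pre: str: substring 0 ( stringLength pre ) str == pre;

  # Keep matched files, and only directories that contain a match.
  walkedFilter = path: type:
    if type == "directory"
    then any ( k: hasPrefix "${toString path}/" k ) keep
    else elem ( toString path ) keep;
in filterSource walkedFilter root
```

Because a directory is only kept when some matched file lives beneath it, an empty leftover like bar/ never makes it into the result, and no intermediate derivation is needed. The trade-off is that the whole tree is read with readDir at evaluation time, so I would still run lib.cleanSource style pruning first on a large repository.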
Recommended Solution
A low effort, but highly effective approach that I actually suggest using in the
field is to use betterFilter, then use stdenvNoCC.mkDerivation with find
and rmdir.
In this example I’ll do our workspace setup as a derivation, and simply use
readDir to confirm that the directory bar is deleted.
We could just as well use the same pattern from the previous examples where we create
the workspace in sh; I’m just using this as an opportunity to sneak in an example
using runCommandNoCC.
This means we will use an intermediate builder rather than producing a single derivation,
so we take a small hit on space in the Nix store, and for spinning up a builder
( stdenvNoCC.mkDerivation has a relatively small footprint ); but we save
~1 1/2 hours of reading, and still get the hash we care about for our “real build”.
let
  inherit (builtins) attrValues mapAttrs readDir filterSource;
  pkgs = import <nixpkgs> {};
  inherit (pkgs) stdenvNoCC runCommandNoCC;
  inherit (pkgs.lib.sources) trace cleanSourceWith;
  example-src = runCommandNoCC "source" {} ''
    mkdir -p $out/foo $out/bar;
    touch $out/{,foo/}package{,-lock}.json $out/bar/baz;
  '';
  naiveFilter = name: type: let
    bname = baseNameOf name;
  in ( type == "regular" ) &&
     ( ( bname == "package.json" ) || ( bname == "package-lock.json" ) );
  betterFilter = name: type:
    ( type == "directory" ) || ( naiveFilter name type );
  mostlyClean = cleanSourceWith { filter = betterFilter; src = example-src; };
  fullyClean = stdenvNoCC.mkDerivation {
    inherit (mostlyClean) name;
    src = mostlyClean;
    phases = ["unpackPhase" "buildPhase" "installPhase"];
    buildPhase = "find . -type d -empty -exec rmdir {} \+";
    installPhase = "cp -pr --reflink=auto -- . $out";
    # This is such a quick operation that queueing a remote builder isn't worth
    # the effort; `preferLocalBuild' will cause this to be run locally.
    preferLocalBuild = true;
  };
in readDir fullyClean.outPath
# ==> { foo = "directory"; "package-lock.json" = "regular"; "package.json" = "regular"; }
In practice, we really don’t mind producing a large number of small derivations as
long as we have valid reasons to do so; unlike Docker containers, a builder runs
with very low overhead. For a trivial routine like this we really just make a
temporary directory, add some junk to PATH, copy our inputs, and delete the
temporary directory afterwards, without any virtualization or other nonsense.
In this case our “inputs” are just a few text files, so this intermediate isn’t
expensive to spin up - you wouldn’t want to do this on a large unfiltered mono-repo!
Using technically unnecessary intermediate derivations is a common practice when
working on projects which are under active development, and may fail - adding
intermediate derivations with well defined inputs allows us to effectively
create cacheable “check-points” that we can revert to in the event of a failure.