At my workplace, we use nix pretty extensively, and also as a “proper”
per-file build system, with a large pool of builders (~80 VMs currently) and
a global cache backed by Google Storage. This has caused us to run into
various limitations of nix remote build support, so I’ve been considering whether
there’s a way to fix them all in one go, by effectively replacing
nix’s remote build feature entirely.
The recent post
was interesting because it shows that other people are thinking of the same
thing, though presumably with different conditions in mind. Still, it made me
think I should put this out to the community, both to see if there are better
ideas out there, and also to see if there’s interest in possibly open-sourcing
this work.
The current situation is that I have a working proof of concept, though it’s
not being used at scale yet. I’ve done some tests with a few hundred
simultaneous builds, but it hasn’t withstood the full fury of the CI pipeline
and merge queue. I’m currently deprioritized from working on it, but will
hopefully be able to get back to it soonish. From the tests, it seems to be
good at using resources efficiently. The one obvious bottleneck so far is
that uploading all the drvs to the global cache is very slow, though that might
be because the storage backend (Google Storage) is not well suited to many
small files.
Anyway, for motivation, I’ll list out some of the problems we’ve had with the
existing nix remote build. Much of this should be familiar, but I’ll repeat it
so you can see where I’m coming from; it’s also more or less a list of
what this new system fixes. It’s skewed towards our particular use case,
which is more on the “big build farm” side of things. Perhaps in the nix
world, hydra is the closest analogue.
Below, “director” is the machine that initiated the build, while “builders” are
the machines that do the actual work:
- Nix downloads lots of stuff unnecessarily, because it doesn’t assume you have a global cache. So the director will download all deps, then the builder will download them again (either from the cache or by having the director send them), then the director will stream them back, only to copy them to some other builder (see scheduling problems).
- The nix scheduler doesn’t understand affinity; it’ll just go more or less round robin on builders, leading to more copying of data.
- Nix scheduling is local, not global. So settings like `max-jobs` are only a max per nix process, not a global max. This means you can’t actually control how much runs on a builder, which is pretty serious if there actually are hardware limitations. Lack of global knowledge also means that, even with a global cache, builds can get duplicated if one starts before the other finishes uploading.
- The nix scheduler has no notion of job weight, in CPU or memory, or of the size of builders (other than some ad-hoc and manual `requiredSystemFeatures` stuff). We actually know that stuff from previous builds, and could use it for scheduling or even just “time left” estimates. Nix has some basic hardcoded load notion, which seems to be the number of build locks taken. It’s non-extensible and pretty much undocumented.
- A non-extensible scheduler means we also can’t choose the build order. We may want to do clever things like prioritize builds we think are more likely to fail.
- Speaking of priorities, we would also want distinctions between interactive, essential, and background jobs.
- The nix scheduler is static: you pick a set of builders at the beginning and they’ll be the same for the whole build. There’s no way to notice if a builder went down, or got overloaded, or whatever. This makes the prospect of draining and shutting down builders finicky, and on the other side, we can’t wait for new builders to start up.
- Along with the above, nix has no notion of transient failures that are OK to retry. Speaking of that, there’s no notion of an aborted build, whether due to ^C or dependencies having failed. Internally nix knows, but it’s hard to figure out from the outside. See the “structured logging” issues below.
- Nix remote build is actually quite slow at starting up new jobs. Perhaps this is the need to establish a new ssh connection for each one, or just some internal locks and things.
- Build log streaming is buggy, both in that a single line which is too long will break something inside nix and lose the line, and, even worse, in that logs tend to get truncated around 8k. Practically speaking this is really bad, because the most important part of a build failure is usually at the end!
- We wind up using nix’s structured log output to demultiplex job status. However, this is an internal undocumented format, with too many details in some places (e.g. download progress) and not enough in others (e.g. we have to parse text for the “will download, will build” sets). In addition, either we’re using it wrong or it’s buggy with respect to job start and end times: we get hundreds of started builds which stack up until they all complete at once.
- Nix is inconsistent about propagating build flags to builders. What’s worse, it’s differently inconsistent depending on whether you use `ssh://` or `ssh-ng://`. But some flags, like timeouts, are apparently never propagated.
- There’s almost no logging on the builder side, pretty much just the nix-daemon saying it accepted a connection. Actually figuring out what is building where, or which PIDs correspond to which builds, requires a whole bunch of extra scaffolding and hackery.
- In line with the above, nix can’t propagate metadata like a build ID to tie stuff together, because remote build only uses the drv, which can’t be changed without causing a rebuild.
- There’s no way to bracket builds with special actions, like logging or HW reset or something. We do have `post-build-hook` (again, not propagated by remote build, so it must be in the local nix.conf), but `pre-build-hook` is not what its name sounds like.
Anyway, those are some of the issues. Here’s the basic idea of how the
new system works:
- The director does a `nix-instantiate` and creates a list of drvs to build. This may involve downloading and building things, due to IFD and the like. Unfortunately, since this is all hardcoded into the nix interpreter, we pretty much have to keep using standard nix remote build for this, so we need to keep that functionality alive. However, we already need to minimize this due to well-known problems (builds during instantiation are single threaded, no structured logs from instantiate, etc.). So this part is all the same.
- Parse the drvs and explore them to create a DAG for building. This is what nix already does internally, and I get it out with `nix-store -q --graph`.
- Do a mass query against the cache and remove the drvs that don’t need building (see the first sketch after this list). If we are able to use `ca-derivations`, then nix verify or some other tool will need to be extended to work with ca-derivations. I’m not using ca-derivations yet though, due to some other problems with it.
- So far this is recreating internal nix logic, but the next step is to drop the DAG onto a DB table as a bunch of individual jobs. Each job has links to its parents. I do some clustering to e.g. merge together linear sequences of drvs, and we also have some internal annotations for known-cheap builds. Each job has timestamps for start, end, and upload, as well as `buildId` and `nixOptions`, and a `status` enum (‘pending’, ‘running’, ‘succeed’, ‘fail’, ‘timeout’, ‘abort’, ‘unexpected’, ‘transient’), which are filled out when appropriate. Since I don’t delete jobs when they complete, this then becomes metrics in addition to current state. A unique `buildId` is assigned (see the table sketch after this list).
- The coordinator also starts up a subscription to the `buildId` on a messaging service, for notifications and for streaming build logs if requested. I’m using GCP pubsub, but I understand there are many systems that provide this kind of feature. The notification is optional, since the coordinator can also poll the DB to get status, but it’s nice to get streaming updates. If there is a failed build, it can notice that, and download and print the log.
- Builders poll the work table and atomically claim jobs that they want (the last sketch after this list shows roughly what that claim could look like). They are a bit clever, in that they know that if dependencies were built on the same host they don’t need to wait for the upload. In fact they can short-circuit the query entirely in that case, though they still have to claim the job.
- Once a drv is claimed, the builder does a local `nix-store -r` on it. Structured build phases and logs are published to the pubsub mechanism. The DB is updated appropriately.
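
To make the cache-query step above concrete, here’s a minimal sketch of how a mass “is it already built?” check could work against an HTTP binary cache: a store path is present iff `<cache>/<hash-part>.narinfo` exists, so a HEAD request per output path is enough. The cache URL and function names here are just illustrative, not the actual code.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch only: check whether a store path is already in a binary cache by
-- probing its .narinfo object over HTTP(S).
import Network.HTTP.Client (Manager, httpNoBody, method, parseRequest, responseStatus)
import Network.HTTP.Client.TLS (newTlsManager)
import Network.HTTP.Types.Status (statusCode)
import System.FilePath (takeFileName)

inCache :: Manager -> String -> FilePath -> IO Bool
inCache mgr cacheUrl storePath = do
  -- the narinfo is keyed by the hash part of the store path basename
  let hashPart = takeWhile (/= '-') (takeFileName storePath)
  req <- parseRequest (cacheUrl ++ "/" ++ hashPart ++ ".narinfo")
  resp <- httpNoBody req { method = "HEAD" } mgr
  pure (statusCode (responseStatus resp) == 200)

main :: IO ()
main = do
  mgr <- newTlsManager
  ok  <- inCache mgr "https://cache.example.com"
                     "/nix/store/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa-hello-2.12"
  print ok
```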
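And here’s roughly what the jobs table could look like. This is a hedged sketch, not our actual schema; the column names are illustrative, and it uses postgresql-simple, with Postgres arrays for the clustered drvs and parent links.

```haskell
{-# LANGUAGE QuasiQuotes #-}
-- Sketch of a possible jobs table along the lines described above.
import Database.PostgreSQL.Simple (Connection, execute_)
import Database.PostgreSQL.Simple.SqlQQ (sql)

createJobsTable :: Connection -> IO ()
createJobsTable conn = do
  _ <- execute_ conn [sql|
    CREATE TABLE IF NOT EXISTS jobs (
      id           bigserial PRIMARY KEY,
      build_id     text      NOT NULL,
      drvs         text[]    NOT NULL,      -- a clustered sequence of drv paths
      parents      bigint[]  NOT NULL DEFAULT '{}',
      nix_options  jsonb     NOT NULL DEFAULT '{}',
      status       text      NOT NULL DEFAULT 'pending',
        -- 'pending' | 'running' | 'succeed' | 'fail' | 'timeout'
        -- | 'abort' | 'unexpected' | 'transient'
      builder      text,                    -- which builder claimed it
      heartbeat_at timestamptz,             -- see the heartbeat refinement below
      started_at   timestamptz,
      finished_at  timestamptz,
      uploaded_at  timestamptz
    )
  |]
  pure ()
```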
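Finally, a sketch of the atomic claim a builder could run when it polls. `FOR UPDATE SKIP LOCKED` keeps two builders from grabbing the same row, and the builder comparison implements the “built on the same host, so don’t wait for the upload” shortcut. Again, this is illustrative and matches the hypothetical table above rather than the real implementation.

```haskell
{-# LANGUAGE QuasiQuotes #-}
-- Sketch: claim one runnable pending job, or return Nothing if there is none.
import Data.Int (Int64)
import Data.Text (Text)
import Database.PostgreSQL.Simple (Connection, Only (..), query)
import Database.PostgreSQL.Simple.SqlQQ (sql)

claimJob :: Connection -> Text -> IO (Maybe Int64)
claimJob conn me = do
  rows <- query conn [sql|
    UPDATE jobs
    SET status = 'running', builder = ?, started_at = now(), heartbeat_at = now()
    WHERE id = (
      SELECT j.id FROM jobs j
      WHERE j.status = 'pending'
        AND NOT EXISTS (
          SELECT 1 FROM jobs p
          WHERE p.id = ANY (j.parents)
            AND (p.status <> 'succeed'
                 OR (p.uploaded_at IS NULL AND p.builder IS DISTINCT FROM ?)))
      ORDER BY j.id
      LIMIT 1
      FOR UPDATE SKIP LOCKED)
    RETURNING id
  |] (me, me)
  pure $ case rows of
    [Only jobId] -> Just jobId
    _            -> Nothing
```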
For cluster management, the coordinator can trigger the creation of new VMs
if it notices that the number of pending builds is greater than the number of
available build slots (VMs × parallelism). The builders themselves shut down
if they haven’t gotten a job for a certain time interval.
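As a rough illustration of that scale-up rule (purely a sketch; `startBuilderVM` is a placeholder for whatever actually creates a VM, and the burst cap is arbitrary):

```haskell
import Control.Monad (replicateM_, when)

-- Sketch: if pending jobs exceed total build slots, start enough new builders
-- to cover the deficit (rounding up), capped at some burst limit.
scaleUp :: IO () -> Int -> Int -> Int -> IO ()
scaleUp startBuilderVM pending builders slotsPerBuilder =
  when (deficit > 0) $ replicateM_ toStart startBuilderVM
  where
    deficit = pending - builders * slotsPerBuilder
    toStart = min 10 ((deficit + slotsPerBuilder - 1) `div` slotsPerBuilder)
```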
That’s the basic structure. As it is, I think it solves, or mostly solves, all
the problems I mentioned. Since this is a polling approach, it naturally behaves
well with cluster changes. However, it also requires an efficient poll, which
gets worse the more builders we have and the more we want to reduce poll
latency. I think it’ll be fine since builders will number in the hundreds but not
thousands, but if it’s not fine then we can experiment with refinements, or
a more push-oriented style where the coordinator assigns builders.
There are some refinements, like:
- Abort is supported via cancelling jobs that haven’t started, and setting an `abortRequest` flag on ones that are ongoing.
- I use heartbeats on both builders and individual builds to detect if someone hard-crashed (a sketch of the crash sweep is after this list).
- I record build time and memory use (and in the future, CPUs used over time). I can then use this to do some fancier scheduling, since I’ll know the weight of each job too. So I don’t have to hardcode a parallelism and `NIX_BUILD_CORES`, but can allocate what each job seems to need, and enforce that with cgroups. On the same note, we can have heterogeneous builders, and allocate based on required RAM.
- Build clustering is good because it means less overhead in the DB, and nix can quickly sequence the builds with purely local information, but of course if siblings get clustered then we lose parallelism, and if a long sequence is clustered then we lose scheduling granularity. So there’s a tradeoff in there. I don’t have anything sophisticated implemented yet, but in theory we could do something clever based on the graph structure and node weights (which we know due to history). We already have known-cheap builds marked with `preferLocalBuild`; those are considered zero cost if you already have the dependencies, so they are probably faster to rebuild locally than to even check the cache.
- Since we have metrics for upload and download times, there’s also a lot of room for cleverness in builder affinity. Actually, I guess this would fit into the same graph problem as build clustering: we would want to minimize build time, taking predicted download time into account.
- We possibly no longer need `nix-daemon` on builders, since we are taking over nix’s job of allocating and running jobs. The advantage is that the nix version can then also be part of the build request, so we can test new nix versions without a whole new build cluster.
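
For the heartbeat refinement, the crash sweep could be as simple as the sketch below, run periodically by the coordinator. It assumes the hypothetical `heartbeat_at` column from the earlier table sketch and an arbitrary two-minute threshold; flipping the job to ‘transient’ makes it eligible for retry.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Int (Int64)
import Database.PostgreSQL.Simple (Connection, execute_)

-- Sketch: anything still 'running' whose builder has not heartbeat recently
-- is assumed to have hard-crashed, and is marked 'transient' for retry.
reapStaleJobs :: Connection -> IO Int64
reapStaleJobs conn = execute_ conn
  "UPDATE jobs SET status = 'transient', builder = NULL \
  \ WHERE status = 'running' AND heartbeat_at < now() - interval '2 minutes'"
```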
Since I’m doing my own scheduling, I don’t need `requiredSystemFeatures` to be
built into nix. That actually brings up another point, which is that it
would be nice to make nix a bit less monolithic, and I think this work does that.
Things that used to have to be built directly into core nix can now be
implemented separately.
As for communication from the nix language to the outside, we already use the
`__json` parameter passing technique exclusively, and have standard attributes
that we parse out of the drv and can use to affect scheduling or builds. For
instance, we can easily support per-drv timeouts by extracting the timeout from
the drv itself, whereas supporting that in “standard” nix would require another
special derivation field, and nix remote build to understand it and to
propagate that timeout option, hopefully for both `ssh://` and `ssh-ng://`
transports (currently neither does). Just as a general point for the
evolution of the nix system: is there interest in moving in a more modular
direction?
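To show what that kind of extraction could look like, here’s a hedged sketch, not our code: it assumes aeson ≥ 2, assumes structured attrs put a `__json` key in the drv’s env, reads the `nix show-derivation` output format, and the `buildTimeout` attribute name is made up for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch: pull a per-drv scheduling attribute out of the drv itself,
-- via the JSON that structured attrs serialize into the env.
import qualified Data.Aeson as A
import qualified Data.Aeson.KeyMap as KM
import qualified Data.ByteString.Lazy.Char8 as BL
import Data.String (fromString)
import qualified Data.Text.Encoding as T
import System.Process (readProcess)

drvBuildTimeout :: FilePath -> IO (Maybe A.Value)
drvBuildTimeout drvPath = do
  -- `nix show-derivation` prints {"<drvPath>": {"env": {...}, ...}}
  out <- readProcess "nix" ["show-derivation", drvPath] ""
  pure $ do
    A.Object top   <- A.decode (BL.pack out)
    A.Object drv   <- KM.lookup (fromString drvPath) top
    A.Object env   <- KM.lookup "env" drv
    A.String raw   <- KM.lookup "__json" env
    A.Object attrs <- A.decodeStrict (T.encodeUtf8 raw)
    KM.lookup "buildTimeout" attrs   -- hypothetical attribute name
```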
Actually, that in turn brings up another thing, which is that it’s unfortunate
that the entire drv is hashed unconditionally, because there’s no place to put
metadata… changing a timeout or tags shouldn’t cause a rebuild! The
workaround I considered was to use `nix-instantiate` in such a way that it
returns pairs of (drv, metadata), and associate the metadata on the outside.
So, currently this is an internal-only thing which is still in development.
As an open-sourceable thing, I imagine it would be purely a build backend, in
that you just give it a list of drvs and it takes it from there. I don’t
really know if that’s practical to integrate into other people’s workflows,
because I don’t know what those are. There was a suggestion that if it was
able to speak the `nix-daemon` protocol, it could be a drop-in replacement for
the nix-daemon, at which point it would work for build-while-instantiate, and
also have a convenient way to integrate into pretty much any nix workflow.
I know the `nix-daemon` protocol does a bunch of stuff other than “build this
list of drvs”, so I’d have to look into what exactly that entails.
Since I pretty much need to do this anyway for our internal needs, I’m just
fishing around for interest and ideas. And of course if there is interest,
then I’m more motivated to do the extra work to export it from our repo. It
would be a Haskell program that requires a DB connection and acts kind of like
`nix-build`, except it takes drvs. Or it acts like a nix-daemon, if we wind up
doing that.
From the other thread, I guess my equivalent to a build protocol like REAPI
or Syndicate is the schema of the DB and the Haskell module that writes and
reads it. I’m not sure how that stacks up, but at least for REAPI (which
I guess is the Bazel build protocol), it looks like it all just turns into
the nix store protocol for me. A better nix store protocol would be great,
but that’s out of scope for what I’m doing.
Thanks for reading this far!