Developing a system that replaces nix remote build

At my workplace, we use nix pretty extensively, and also as a “proper”
per-file build system, with a large pool of builders (~80 VMs currently) and
a global cache backed by Google Storage. This has caused us to run into
various limitations of nix remote build support, so I’ve been considering
whether there’s a way to fix them all in one go, by effectively replacing
nix’s remote build feature entirely.

The recent post was interesting because it shows that other people are
thinking about the same
thing, though presumably with different conditions in mind. Still, it made me
think I should put this out to the community, both to see if there are better
ideas out there, and also to see if there’s interest in possibly open-sourcing
this work.

The current situation is that I have a working proof of concept, though it’s
not being used at scale yet. I’ve done some tests for a few hundred
simultaneous builds, but it hasn’t withstood the full fury of the CI pipeline
and merge queue. Work on it is currently deprioritized, but I’ll hopefully
be able to get back to it soonish. From the tests, it seems to be
good at efficiently using resources. The one obvious bottleneck so far is
that uploading all the drvs to the global cache is very slow, but it might be
because the storage backend (Google Storage) is not suitable for many small
files.

Anyway, for motivation, I’ll list out some of the problems we’ve had with the
existing nix remote build. Much of this should be familiar, but I’ll repeat it
so you can see where I’m coming from. It’s also more or less a list of
what this new system fixes. It’s skewed towards our particular use case,
which is more on the “big build farm” side of things. Perhaps in the nix
world, hydra is the closest analogue.

“director” is the machine that initiated the build, while “builders” are the
machines that will do the actual work:

  • Nix downloads lots of stuff unnecessarily, because it doesn’t assume you
    have a global cache. So the director will download all deps, then the
    builder will download again (either from the cache or have the director
    send them), then the director will stream them back, only to copy them to
    some other builder (see scheduling problems).

  • The nix scheduler doesn’t understand affinity; it’ll just go more or
    less round robin on builders, leading to more copying of data.

  • Nix scheduling is local, not global. So settings like max-jobs are only max
    per nix process, not global max. This means you can’t actually control how
    much runs on a builder, which is pretty serious if there actually are hardware
    limitations. Lack of global knowledge means even with a global cache, builds
    can get duplicated if one starts before the other finishes uploading.

  • The nix scheduler has no notion of job weight, in CPU or memory, or size of
    builders (other than some ad-hoc and manual requiredFeatures stuff). We
    actually know that stuff from previous builds, and could use that for
    scheduling or even just “time left” estimates. Nix has some basic hardcoded
    load notion, which seems to be the number of build locks taken. It’s
    non-extensible and pretty much undocumented.

  • A non-extensible scheduler means we also can’t choose the build order. We
    may want to do clever things like prioritize things we think are more likely
    to fail.

  • Speaking of priorities, we’d also like to distinguish between interactive,
    essential, and background jobs.

  • The nix scheduler is static: you pick a set of builders at the
    beginning and they stay the same for the whole build. There’s no way to
    notice if a builder went down, or got overloaded, or whatever. This makes
    the prospect of draining and shutting down builders finicky, and on the other
    side, we can’t wait for new builders to start up.

  • Along with the above, nix has no notion of transient failures, or that it’s
    ok to retry them. Relatedly, there’s no notion of an aborted build,
    whether due to ^C or dependencies having failed. Internally nix knows, but
    it’s hard to figure out from the outside. See the “structured logging” issues
    below.

  • Nix remote build is actually quite slow at starting up new jobs. Perhaps
    this is due to the need to establish a new ssh connection for each one, or just some
    internal locks and things.

  • Build log streaming is buggy, both in that a single line which is too long
    will break something inside nix and lose the line, and, even worse, logs tend to
    get truncated around 8k. Practically speaking this is really bad because the
    most important part of a build failure is usually at the end!

  • We wind up using nix’s structured log output to demultiplex job status.
    However, this is an internal undocumented format, and has too many details in
    some places (e.g. download progress) and not enough in others (e.g. have to
    parse text for the “will download, will build” sets). In addition, either
    we’re using it wrong or it’s buggy with respect to job start and end times:
    we get hundreds of started builds which stack up until they all complete at once.

  • Nix is inconsistent about propagating build flags to builders. What’s
    worse, it’s differently inconsistent if you use ssh:// vs ssh-ng://. But
    some, like timeouts, are apparently never propagated.

  • There’s almost no logging on the builder side, pretty much just the
    nix-daemon saying it accepted a connection. Actually figuring out what is
    building where, or what PIDs correspond to what builds, requires a whole bunch
    of extra scaffolding and hackery.

  • In line with the above, nix can’t propagate metadata like a build ID to tie
    stuff together, because remote build only uses the drv, which can’t be changed
    without causing a rebuild.

  • No way to bracket builds with special actions, like logging or HW reset or
    something. We do have post-build-hook (again, not propagated by remote
    build though, so must be in local nix.conf), but pre-build-hook is not what
    its name sounds like.

Anyway, those are some of the issues. Here’s the basic idea of how this
new system works:

  • director does a nix-instantiate, and creates a list of drvs to build. This
    may involve downloading and building things, due to IFD and the like.
    Unfortunately, since this is all hardcoded into the nix interpreter, we pretty
    much have to keep using standard nix remote build for this, so we need to keep
    that functionality alive. However, we already need to minimize this due
    to well-known problems (builds during instantiate are single threaded, no structured
    logs from instantiate, etc.). So this part is all the same.

  • Parse the drvs and explore them to create a DAG for building. This is what
    nix already does internally, and I get it out with nix-store -q --graph
    (there’s a sketch of this after the list).

  • Do a mass query against the cache and remove the ones that don’t need
    building. If we are able to use ca-derivations, then nix verify or some
    other tool will need to be extended to work with ca-derivations.
    I’m not using ca-derivations yet though, due to some other problems with it.

  • So far this is recreating internal nix logic, but the next step is to drop
    the DAG onto a DB table as a bunch of individual jobs. Each job has links to
    its parents. I do some clustering to e.g. merge together linear sequences
    of drvs, also we have some internal annotation for known-cheap builds. Each
    job has timestamps for start, end, uploaded, as well as buildId and
    nixOptions, and a status enum (‘pending’, ‘running’, ‘succeed’, ‘fail’,
    ‘timeout’, ‘abort’, ‘unexpected’, ‘transient’), which will be filled out when
    appropriate. Since I don’t delete them when complete, this then becomes
    metrics in addition to current state. A unique buildId is assigned. (There’s
    a rough schema sketch after this list.)

  • The coordinator also starts up a subscription to the buildId on
    a messaging service, for notifications and for streaming build logs if requested.
    I’m using GCP pubsub, but I understand there are many systems that provide
    this kind of feature. The notification is optional, since the coordinator can
    also poll the DB to get status, but it’s nice to get streaming updates. If
    there is a failed build, it can notice that, and download and print the
    log.

  • Builders poll the work table, and atomically claim jobs that they want (the
    claim query is part of the schema sketch after this list). They are a bit
    clever in that they know that if dependencies were built on the same host,
    they don’t need to wait for upload. In fact they can short-circuit the query
    entirely in that case, though they still have to claim the job.

  • Once a drv is claimed, the builder does a local nix-store -r on it.
    Structured build phases and logs are published to the pubsub mechanism.
    The DB is updated appropriately.
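
To make the DAG extraction step concrete, here’s a minimal sketch of pulling
edges out of nix-store -q --graph (which prints graphviz). The names are
illustrative, whether the printed edge direction means “depends on” or “is
depended on by” should be checked against your nix version, and the real code
has to handle node attributes and quoting more carefully:

    import Data.List (isInfixOf)
    import Data.Maybe (mapMaybe)
    import System.Process (readProcess)

    -- An edge between two drvs, in the direction nix-store prints it.
    type Edge = (FilePath, FilePath)

    -- Shell out to `nix-store -q --graph` and keep only drv -> drv edges.
    -- Graphviz lines look roughly like: "a" -> "b" [color = ...];
    drvGraph :: FilePath -> IO [Edge]
    drvGraph drv = do
      out <- readProcess "nix-store" ["-q", "--graph", drv] ""
      pure (mapMaybe parseEdge (lines out))
      where
        parseEdge l = case words l of
          (a : "->" : b : _)
            | ".drv" `isInfixOf` a && ".drv" `isInfixOf` b
              -> Just (unquote a, unquote b)
          _ -> Nothing
        unquote = filter (\c -> c /= '"' && c /= ';')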
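
And here’s a rough sketch of the jobs table and a builder’s atomic claim, as
referenced from the list. It’s postgres-flavored (FOR UPDATE SKIP LOCKED) and
uses postgresql-simple; every table and column name is made up for
illustration and the real schema differs, but it shows the shape of the thing:

    {-# LANGUAGE QuasiQuotes #-}
    import Database.PostgreSQL.Simple
    import Database.PostgreSQL.Simple.SqlQQ (sql)

    -- Illustrative schema: one row per (possibly clustered) drv job.
    createJobsTable :: Connection -> IO ()
    createJobsTable conn = do
      _ <- execute_ conn [sql|
        CREATE TABLE IF NOT EXISTS jobs (
          drv         text PRIMARY KEY,
          build_id    text NOT NULL,
          parents     text[] NOT NULL DEFAULT '{}', -- drvs this job depends on
          nix_options text NOT NULL DEFAULT '',
          -- pending | running | succeed | fail | timeout | abort | unexpected | transient
          status      text NOT NULL DEFAULT 'pending',
          claimed_by  text,
          started_at  timestamptz,
          ended_at    timestamptz,
          uploaded_at timestamptz
        ) |]
      pure ()

    -- A builder atomically claims one runnable job: every parent has succeeded
    -- and is either uploaded or was built on this very host. SKIP LOCKED keeps
    -- concurrent builders from grabbing the same row.
    claimJob :: Connection -> String -> IO (Maybe String)
    claimJob conn host = do
      rows <- query conn [sql|
        UPDATE jobs SET status = 'running', claimed_by = ?, started_at = now()
        WHERE drv = (
          SELECT j.drv FROM jobs j
          WHERE j.status = 'pending'
            AND NOT EXISTS (
              SELECT 1 FROM jobs p
              WHERE p.drv = ANY (j.parents)
                AND (p.status <> 'succeed'
                     OR (p.uploaded_at IS NULL AND p.claimed_by <> ?)))
          LIMIT 1
          FOR UPDATE SKIP LOCKED)
        RETURNING drv |] (host, host)
      pure $ case rows of
        [Only drv] -> Just drv
        _          -> Nothing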

For cluster management, the coordinator can trigger the creation of new VMs
if it notices that pending builds are greater than the number of available
build slots (VMs * parallelism). The builders themselves shut down if they
haven’t gotten a job for a certain time interval.
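
The scale-up rule is nothing fancier than this bit of arithmetic (a sketch;
the real policy also has to rate-limit VM creation and account for VMs that
are still booting):

    -- How many VMs to add, given pending jobs and current capacity.
    vmsToAdd :: Int -> Int -> Int -> Int
    vmsToAdd pendingJobs currentVMs perVMParallelism =
      max 0 (neededVMs - currentVMs)
      where
        -- ceiling division: enough slots to cover every pending job
        neededVMs = (pendingJobs + perVMParallelism - 1) `div` perVMParallelism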

That’s the basic structure. As it is, I think it mostly solves all the
problems I mentioned. Since this is a polling approach, it naturally behaves
well with cluster changes. However, it also requires an efficient poll which
will get worse the more builders we have and the more we want to reduce poll
latency. I think it’ll be fine since builders will be in the hundreds but not
thousands, but if it’s not fine then we can experiment with refinements or
a more push-oriented style where the coordinator assigns builders.

There are some refinements, like:

  • Abort is supported via cancelling jobs that haven’t started, and setting
    an abortRequest flag on ones that are ongoing (see the sketch after this list).

  • I use heartbeats on both builders and individual builds to detect if someone
    hard-crashed.

  • I record build time, and memory use (and in the future, CPUs used over
    time). I can then use this to do some fancier scheduling, since I’ll know
    the weight of each job too. So I don’t have to hardcode a parallelism and
    NIX_BUILD_CORES, but can allocate how much it seems to need, and enforce that
    with cgroups. On the same note, we can have heterogeneous builders, and
    allocate based on required RAM.

  • Build clustering is good because it’s less overhead in the DB, and nix can
    quickly sequence the builds with purely local information, but of course if
    siblings get clustered then we lose parallelism, and if a long sequence is
    clustered then we lose scheduling granularity. So there’s a tradeoff in
    there. I don’t have anything sophisticated implemented yet, but in theory could
    do something clever based on the graph structure and node weights (which we
    know due to history). We already have known-cheap builds marked with
    preferLocalBuild, those are considered 0 cost if you already have the
    dependencies, so those are probably faster to rebuild locally than to even check
    the cache.

  • Since we have metrics for upload and download times, there’s also a lot of
    room for cleverness in builder affinity. Actually I guess this would fit into
    the same graph problem as build clustering. We would want to minimize build
    time, taking into account predicted download time.

  • We possibly no longer need nix-daemon on builders, since we are taking
    over nix’s job of allocating and running jobs. The advantage is that now
    the nix version can also be part of the build request, so we can test new
    nix versions without a whole new build cluster.
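
For the abort and heartbeat refinements above, a rough sketch against the same
illustrative jobs table (the abort_requested and heartbeat_at columns, and the
five-minute threshold, are made up):

    {-# LANGUAGE OverloadedStrings, QuasiQuotes #-}
    import Database.PostgreSQL.Simple
    import Database.PostgreSQL.Simple.SqlQQ (sql)

    -- Abort a build: cancel jobs that haven't started, and flag running ones so
    -- the builder that owns them can kill its nix process on the next check.
    abortBuild :: Connection -> String -> IO ()
    abortBuild conn buildId = do
      _ <- execute conn
        "UPDATE jobs SET status = 'abort' WHERE build_id = ? AND status = 'pending'"
        (Only buildId)
      _ <- execute conn
        "UPDATE jobs SET abort_requested = true WHERE build_id = ? AND status = 'running'"
        (Only buildId)
      pure ()

    -- Reap jobs whose builder stopped heartbeating, marking them transient so
    -- the scheduler can retry them elsewhere.
    reapStaleJobs :: Connection -> IO Int
    reapStaleJobs conn = fmap fromIntegral $ execute_ conn [sql|
      UPDATE jobs
      SET status = 'transient', claimed_by = NULL
      WHERE status = 'running'
        AND heartbeat_at < now() - interval '5 minutes' |]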

Since I’m doing my own scheduling, I don’t need requiredSystemFeatures to be
built-in to nix. That actually brings up another point, which is that it
would be nice to make nix a bit less monolithic, and I think this work does that.
Things that used to have to be built directly into core nix can now be
implemented separately.

As for communication from nix language to the outside, we already use the
__json parameter passing technique exclusively, and have standard attributes
that we parse out of the drv and can use to affect scheduling or builds. For
instance we can easily support per-drv timeouts by extracting them from the
drv itself, where supporting that in “standard” nix would require another
special derivation field, and nix remote build to understand it, and to
propagate that timeout option, hopefully for both ssh:// and ssh-ng://
transports (currently neither does). Just as a general point for the
evolution of the nix system, is there interest in moving in a more modular
direction?
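
To illustrate, extracting such an attribute looks roughly like this. It’s
only a sketch: timeoutSeconds and the Meta type are made-up names, and I’m
using nix show-derivation (newer nix spells it nix derivation show, and it
needs the nix-command experimental feature) just to avoid parsing the drv
format by hand:

    {-# LANGUAGE DeriveGeneric, OverloadedStrings #-}
    import Data.Aeson (FromJSON, decode, decodeStrict)
    import qualified Data.ByteString.Lazy.Char8 as BL
    import qualified Data.Map.Strict as M
    import Data.Text (Text)
    import qualified Data.Text.Encoding as TE
    import GHC.Generics (Generic)
    import System.Process (readProcess)

    -- The only part of `nix show-derivation` output we care about: the env.
    newtype DrvInfo = DrvInfo { env :: M.Map Text Text } deriving Generic
    instance FromJSON DrvInfo

    -- Our (illustrative) metadata convention inside the __json env attribute.
    newtype Meta = Meta { timeoutSeconds :: Maybe Int } deriving Generic
    instance FromJSON Meta

    -- Read a per-drv timeout without touching the drv hash: it lives in the
    -- env's __json value, which is itself a JSON document.
    drvTimeout :: FilePath -> IO (Maybe Int)
    drvTimeout drv = do
      out <- readProcess "nix" ["show-derivation", drv] ""
      pure $ do
        drvs      <- decode (BL.pack out) :: Maybe (M.Map Text DrvInfo)
        DrvInfo e <- snd <$> M.lookupMin drvs  -- output is keyed by drv path
        js        <- M.lookup "__json" e
        Meta t    <- decodeStrict (TE.encodeUtf8 js)
        t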

Actually that in turn brings up another thing, which is that it’s unfortunate
that the entire drv is hashed unconditionally, because there’s no place to put
metadata… changing timeout or tags shouldn’t cause a rebuild! The
workaround I considered was to use nix-instantiate in such a way as to return
pairs of (drv, metadata), and associate the metadata on the outside.

So, currently this is an internal-only thing which is still in
development. As an open-sourceable thing, I imagine it would be a purely
build backend, in that you just give it a list of drvs and it takes it from
there. I don’t really know if that’s practical to integrate into other
people’s workflows, because I don’t know what those are. There was
a suggestion that if it was able to speak the nix-daemon protocol, it
could be a drop-in replacement for the nix-daemon, at which point it
would work for build-while-instantiate, and also have a convenient way to
integrate into pretty much any nix workflow. I know the nix-daemon
protocol does a bunch of stuff other than “build this list of drvs”, so
I’d have to look into what exactly that entails.

Since I need to do this anyway for our internal needs, I’m pretty much just
fishing around for interest and ideas. And of course if there is
interest, then I’m more motivated to do the extra work to export it from our
repo. It would be a haskell program that requires a DB connection and acts
kind of like nix-build except it takes drvs. Or, acts like a nix-daemon, if
we wind up doing that.

From the other thread, I guess my equivalent to a build protocol like REAPI
or Syndicate is the schema of the DB, and the haskell module that writes and
reads it. I’m not sure how that stacks up, but at least for REAPI (which
I guess is the bazel build protocol), it looks like it all just turns into
the nix store protocol for me. A better nix store protocol would be great,
but out of scope for what I’m doing.

Thanks for reading this far!

This is a tangential issue, but I’d also like to see structured logging. I know of some places in Nixpkgs where build and test commands are run serially rather than in parallel so the build log stays readable. I don’t have to suggest we make our own init system for build jobs, because I think we’ve already done that with stdenv :wink:

Nix actually does have structured logging; we use it to demultiplex. It’s inefficient though, because it’s so noisy. That’s why I’m moving away from it. Of course another option would be to improve it, which would also be nice, so we don’t have to do ad-hoc parsing for things like the download list and sizes.

If by “init system” you’re talking about pre- and post-build hooks, they are used for things which don’t go in the sandbox, like uploading outputs or taking hardware locks. Also for things that run on failures, like log analysis. Otherwise I don’t know what that is…

I didn’t know structured logging was available. I mention init systems because as soon as I thought about structured logging I thought that to do that we would end up with an init system for builders, and we basically already have that to run hooks and phases.

Oh, well structured logging is just a way to get the status info out of nix-store -r just like the usual stdout, except with more details and in a machine-parseable format (json). It doesn’t have anything to do with builds or builders or anything.
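
Concretely, consuming it looks something like the sketch below; the flag
spelling and the @nix prefix are what current nix emits, but since it’s an
internal format none of it is guaranteed to stay stable:

    {-# LANGUAGE OverloadedStrings #-}
    -- Sketch of demultiplexing nix's json log stream, e.g. the stderr of
    --   nix-store -r /nix/store/thing.drv --log-format internal-json
    -- Each structured line is "@nix " followed by a JSON object whose "action"
    -- field is "start", "stop", "result", "msg", etc.
    import Data.Aeson (Value, decodeStrict)
    import qualified Data.ByteString.Char8 as BS
    import System.IO (Handle, hIsEOF)

    readLogEvents :: Handle -> IO [Value]
    readLogEvents h = go []
      where
        go acc = do
          eof <- hIsEOF h
          if eof
            then pure (reverse acc)
            else do
              line <- BS.hGetLine h
              case BS.stripPrefix "@nix " line of
                Just json | Just v <- decodeStrict json -> go (v : acc)
                _ -> go acc  -- ordinary build output; pass it through elsewhere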

It’s also not really designed for external consumption, basically it’s just a raw dump of the internal format nix-build uses. It’s also kind of buggy (e.g. inaccurate times) but probably no one notices if nix-build doesn’t hit those bugs, or maybe it does but no one minds.

Anyway, it could be improved and I may do that at some point just to no longer have to parse things out of the human-readable output, but now that I no longer need it for demuxing build logs I see it as a pretty low order bit, at least until I discover something important that can’t be parsed out.

Interesting! We at our company also extensively use nix as our CI build infrastructure and have found remote builds, in particular, somewhat problematic in all the ways you’ve listed. I was working on something humbler - a drop-in replacement of the build-hook. It was just using postgres as a message-queue. That would have the advantage of having a more global view of machine availability and job scheduling. If you are curious, I can probably share my draft and compare notes.

As we try to abstract out various portions of the nix code base and provide dependable APIs, the scheduler system is a clear candidate.

@thufschmitt has also recently worked on a minimal daemon implementation to allow for “nearly rootless” operation. This might have some overlap with your idea.

@jsoo1 that would be great, ideally if you could fit that in a GitHub issue the Nix team can track directly. We identified remote builds as a high value target the other day at NixCon. (FYI @Ericson2314)

Here it is. It’s not complete, but the pieces are there. I also haven’t licensed it yet but I was considering lgpl2.

Todo: broadcasting build logs, cancellation, copying outputs, cas.

Edit: licensed lgpl 2.1

I’m not sure where I would open an issue yet, tbh. I am not sure how much more time I’ll be able to put into it, but I am happy to see my ideas out there.

Hey there, this is pretty interesting and also pretty similar to my thinking. I actually started by thinking I should add a scheduling hook, so nix would run a script which would tell it where to put things. Looking into that I discovered build-hook and thought I could just replace that. Then I looked into it, not very deeply, but decided against it because it seemed like more work… I’d have to figure out that protocol for one, and I’d still be left with possible bugs in how nix is interacting with the subprocess. It just seemed risky to gamble on an undocumented protocol that presumably wasn’t intended to be reimplemented or extended (for example, from your readme, that its “stop job” protocol is SIGKILL seems awkward). That said, it has its attractions: it integrates better, works with eval, and you don’t have to reimplement the build graph. So in the end I’m not sure which way to go!

But maybe it could actually go both ways, my system takes drvs to build on the command line, but presumably could also get an adaptor to act like a remote build hook. It’s similar to the idea of making it talk the nix-daemon protocol, but I honestly don’t know enough about either of those protocols to know what exactly that means, and where such a thing should plug in. Do you have any notes on the protocol? I found the parser in enqueue but I run into the same problem as reading nix, which is that I’m not very good at reading modern c++.

Unfortunately both the c++ and SQL are above my level, so it’s hard for me to get a feel for the architecture.

Aside from that, I discovered that postgres can do async notifications, which mostly removes my use of pubsub, so I’m thinking I should port it to postgres. Then pubsub would only be used for streaming build logs, which are not even really needed. I think the only reason to want streaming logs is to find out where something got stuck, and for that all I need is any centralized place that can receive data incrementally. Unfortunately that counts out the nix store, which I’m currently using. Maybe upload to cloud storage, if it accepts incremental writes. Even NFS could do this.
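
For reference, the postgres feature I mean is LISTEN/NOTIFY; through
postgresql-simple it looks roughly like this (the job_events channel name is
made up, and here I just ping and re-query rather than putting a payload in
the notification):

    {-# LANGUAGE OverloadedStrings #-}
    import Control.Monad (forever)
    import qualified Data.ByteString.Char8 as BS
    import Database.PostgreSQL.Simple
    import Database.PostgreSQL.Simple.Notification (getNotification, notificationChannel)

    -- Builder side: nudge any listeners after updating a job row.
    pingJobEvents :: Connection -> IO ()
    pingJobEvents conn = do
      _ <- execute_ conn "NOTIFY job_events"
      pure ()

    -- Coordinator side: block on notifications instead of polling the table.
    watchJobs :: Connection -> IO ()
    watchJobs conn = do
      _ <- execute_ conn "LISTEN job_events"
      forever $ do
        n <- getNotification conn            -- blocks until someone NOTIFYs
        BS.putStrLn (notificationChannel n)  -- then go re-query the jobs table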

I referenced this in Deduplicate code between Hydra and Nix · Issue #1164 · NixOS/hydra · GitHub. Deduplicating the Nix and Hydra scheduling code should allow us to improve code quality and testing coverage. It should also “force into being” useful helper functions as part of the deduplication (for we would proceed with baby-step refactors rather than a big rewrite) that make experimentation of the sort described here easier. Finally, @roberth has mentioned that it would be good to maintain bindings to Nix, so that such experimentation using the generalized and broken-up code doesn’t need to happen in C++.

Thanks for this impressive run-down of the issues with remote building, as well as the work you’ve put into solving it on your end. I am running into many of these same things, even with a single remote builder, where I have one specific set of jobs (running CI on a crypto library configured to use a 16 MB precomputed lookup table) that causes make to use a ton of RAM. Then when sending 100 or so of those jobs to the remote the gccs use >1TB of RAM and frequently the nix-daemon gets OOM-killed and I have to manually restart it.

I would definitely be interested in contributing to an open-source solution to this. Reading through threads on this topic it seems that there are a few competing ideas, and to properly assess them I’d need to understand the Actor model and some variants of it, have experience with other build systems, etc., none of which I have. I’d like it if somebody just had an RFC that had some sort of consensus/enthusiasm around it, and some reasonably self-contained tasks that I could wrap my head around and jump into.

Does anyone have some thoughts about what direction I should look in?

From my experience, for issues with that level of depth the best one can do is actually sit down, compile the proposals and their trade-offs into an RFC, and see what else people have to say. Only once an obvious direction emerges can one break out smaller tasks. Such large problems always need someone to take responsibility and stick around until the end; otherwise nothing will happen.

Since this is a high value target, and I’ve discussed this with many people already, it would definitely make sense to make it a funded project. Lots of commercial users would benefit from an upstream solution that just works.

Since I posted the first message, I’ve gotten the new system up and going. The nice parts are that it’s scaled up to at least 64 (VMs) * 8 (CPUs each) parallelism and, with some notable exceptions, is much faster and more efficient than nix remote build. It does global drv deduplication, can detect and retry transient failures, can scale the build cluster up and down dynamically, and various other things. It’s really nice to have a global view of all jobs, where they are in the build graph, their in-progress build log, etc. So far the limitations on scaling seem to be the efficiency of postgresql queries, and I’m no DB expert. I’ll need to get up to at least 128x8.

We’ve been using it at work for non-essential builds and will be running the main CI job in parallel next week. The main job is typically around 5k top level drvs, which comes to maybe 20k drvs including all the inner ones, and will take 3-5 hours for standard nix remote build, and maybe around 1 hour (over 64x8 VMs) with the new system. So that’s all the encouraging parts.

The known problems are that since I make my own build graph, I have to query the global cache to trim off already-built things, and using nix store verify is very slow (say 15-20m for 5k drvs). I also upload drvs to the global cache so individual builders can download them, and that’s even slower (say 1 hour for 5k). The setup overhead can easily exceed the build time. This upload slowness may well be down to our choice of backend (google storage behind an http proxy), and maybe I just need to find the http-connections setting and turn it up to 64. But it’s also badly affected by the “coroutine ended” bug, which makes it constantly crash and need to be restarted.

However, I’ve started reverse engineering the nix-daemon protocol and how nix’s native remote build works, and it looks like queryMissing is much faster than nix store verify (it seems to actually use the narinfo cache). Also, remote build uses buildDerivation, which takes a drv as serialized bytes, not a file existing in /nix/store. This is key, because building with nix-store -r /nix/store/thing.drv means that drv has to be present in the store, and for it to be present in the store, nix insists that all its transitive build dependencies are also present. I guess it doesn’t know about any global cache, and conservatively assumes that you are going to have to build the entire world, not just the one drv you downloaded. So, provided buildDerivation works as it seems to, I can avoid /nix/store for drvs entirely, and put them in a DB or basically anywhere that can efficiently store many small files.

So that’s the plan. I’ll be trying this out starting next week.

Regarding buildDerivation, it was a surprise to me to learn that this is how nix’s native remote build works. I’ll describe how I thought it would work, and if someone is around who knows the internals or the history, maybe they can explain why it wasn’t done that way, or wouldn’t work that way. I assumed that since .drv files can go in /nix/store, and nix-store -r expects one in there, that is the main intended way to build, and when remote build was implemented, it would have used that approach. For that to work, copying a .drv would not bring along its dependencies, but only lazily on demand if they happened to be missing. Instead, it seems like remote build implemented a new way to build things, where the drv is not in the store, drvs don’t have inputDrvs but only inputSrcs, and it is the job of the caller to resolve inputDrvs and stash them in inputSrcs. [ It actually resolves them transitively and then directly copies them to the remote builder, but I’m hoping that only the first level is needed and the rest will come naturally when the first level is pulled from the remote cache. ] So remote build implemented a new kind of drv subtly different from on-disk ones, and plugged into the build at a lower level, and was then forced to re-implement dependency resolution.

Wouldn’t it have been simpler to allow drvs to be copied without their dependencies? Then remote build would have been: nix copy --to remote /nix/store/thing.drv, then tell remote to build thing.drv? Then all the resolution from nix-store -r is reused, it can pull from any substituter it’s configured with, which will likely contain the original requesting machine, which would result in the same peer-to-peer behavior except pulling from the coordinator instead of pushing. You may still wish to implement it as a single duplex protocol to avoid multiple ssh connection overhead, but fundamentally the actual remote build would then have a direct correspondence to CLI commands. You could implement a prototype with shell scripts and ssh. It wouldn’t require any not-quite-drvs and manual resolution and all the other complicated stuff the build-hook does.
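
In other words, the prototype would boil down to something like the sketch
below, where step 1 is the hypothetical part (today nix copy drags the drv’s
input closure along, which is exactly what I’d want to avoid):

    import System.Process (callProcess, readProcess)

    -- Hypothetical remote build as plain CLI steps: copy the drv, realise it
    -- remotely, pull the output back.
    remoteBuild :: String -> FilePath -> IO FilePath
    remoteBuild host drv = do
      -- 1. ship just the drv (hypothetically without its input closure)
      callProcess "nix" ["copy", "--to", "ssh://" <> host, drv]
      -- 2. let the remote resolve dependencies itself, from whatever
      --    substituters it's configured with (which can include us)
      out <- readProcess "ssh" [host, "nix-store", "-r", drv] ""
      let outPath = takeWhile (/= '\n') out
      -- 3. fetch the output back (or just let the global cache pick it up)
      callProcess "nix" ["copy", "--from", "ssh://" <> host, outPath]
      pure outPath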

Anyway, assuming the queryMissing and buildDerivation stuff works as I hope it will, my major caveats will be resolved, and I’ll see how far we can scale it internally. At that point, it’ll still be “merely” a backend that takes already-generated .drvs, and build-while-eval will still have to use the nix remote build protocol, since it’s hardcoded into the evaluator. But having gotten my feet wet with the nix protocol, I’ll be in a better position to then implement enough of the daemon protocol to point nix-instantiate at it, and then it should truly be a general purpose alternate build backend, and easier to plug into other nix workflows. That would be a good time to look into how to release the source. It’s not trivial because we have a monorepo and a global build system (the front end, to which this is the back end), so it would need translation to and from this internal structure and the Cabal setup used in the Haskell open source world. I still don’t know if anyone would actually use it, but at least it would be possible.

I (and garnix.io in general) am very interested in seeing any open-sourceable version of this. If there’s anything I can help with to make that happen, let me know!

Sounds like I should try to do the open source work as soon as it’s stable at scale for our internal CI, and not wait to try to implement the full daemon protocol to make it work for instantiate too. That part is the biggest handwave since I haven’t yet thought much about how that might work.

queryMissing seems to be working well though, with one detail so far, which is that it (intentionally) doesn’t work (claims a need to rebuild) for allowSubstitutes='' drvs. They are rare though, and can be worked around with a separate query.

To respond to @apoelstra , if you’re running out of memory, a better remote build system won’t solve that problem. Unless you mean that there are multiple uncoordinated nixes trying to remote build and ignoring the max-jobs parameter and overloading it. If that’s what’s going on, then there are workarounds. The one we are using, until this new system stabilizes, is that the builds all take a global semaphore on the building machine via filesystem lock. If you’re using sandbox=relaxed, you can make it visible via extra-sandbox-paths. Ugly and messes up metrics and timeouts, but works.

Alternately, a single nix process is capable of enforcing max-jobs. It’s actually at the OS level since it also uses filesystem locks. So you could somehow ship all build tasks to a single server and dispatch from there. Of course if you can do that, you may as well dispatch them to the building machine, and then you’re already on the way to inventing your own system to replace nix remote build! But if you just have one builder, that is a simpler task: you can just pull drvs off a queue.

@elaforge Nix actually has two different ways of performing remote builds. The “standard” way handles all dependency resolution locally and then sends the content of the derivation using the buildDerivation operation.

Then it also has the “remote store” way where it performs evaluation locally and then uploads the actual .drv files to the remote builder and then asks it to build the target .drv using the buildPaths operation. The remote builder then has to perform all dependency resolution and fetching using its local store and substitutes, which sounds similar to your suggestion.

I’ve documented these two different ways of distributing builds here: Running Remote Builds - nixbuild.net documentation; maybe you’ll find it useful in your work. nixbuild.net itself implements both these approaches just like an ordinary Nix machine available over ssh, and then handles the distribution of the builds behind the scenes using its own implementation of builders.

Cool, this sounds like it will solve my problem! Yes, it is a hack, but it’s one that will localize nicely to the specific set of derivations that cause OOM. (And yes, the issue is that multiple are running in an uncoordinated way – individually they take a lot of memory but only for sane values of “a lot”. :))