I’m excited to announce a project I’ve been working on for the better part of 2019: nixbuild.net!
nixbuild.net is a highly scalable service for running Nix builds on a pay-per-use basis. It allows for flexible selection of compute resources, yet it is very easy to set up. To the end user, it works just like an ordinary SSH-based Nix remote builder.
This project grew out of years of frustration with the difficulties of setting up Nix build clusters, especially cost-effective clusters that can handle the highly fluctuating load a development team typically imposes on build servers.
What nixbuild.net does is act just like an ordinary remote Nix builder. It handles all the required commands: uploading and downloading nar files, querying paths, building derivations and so on. Internally, it maintains its own store path database and store path file storage, segregated by user account. When it needs to run a Nix build, it allocates compute resources, runs the Nix builder, and proxies the results back to the user.
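To give an idea of how little setup is involved, a client-side configuration for an SSH-based remote builder typically looks something like this (the hostname, user and key path below are placeholders, not the actual nixbuild.net endpoint):

```
# /etc/nix/machines (or the `builders` option in nix.conf):
# URI SYSTEM SSH-KEY MAX-JOBS SPEED-FACTOR SUPPORTED-FEATURES
ssh://builder@nixbuild.example x86_64-linux /home/user/.ssh/nixbuild_key 8 1 big-parallel
```

With an entry like that in place, Nix transparently dispatches eligible derivations to the remote side.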
The Nix builds run inside a KVM-based sandbox that was implemented specifically for this purpose. The Nix builder is completely isolated from the world and can only access the input paths it needs, giving builds stronger sandboxing guarantees than the ordinary Nix sandbox. The sandbox also opens up unique possibilities like detailed build analysis and other nice things.
The nixbuild.net service is in a “closed alpha” phase. It is functional and all core parts are in place, but it needs more work on robustness, benchmarking, optimization and peripheral functionality. I’m announcing it today hoping to get feedback and gauge interest that can guide further efforts. It is also possible to set up evaluation runs.
I’m happy to answer any questions about this project, either in this thread or at rickard@nixbuild.net!
Nice to hear you’re working on this! I’ve been wanting such a service for quite a while now. Often I need a bit more capacity for a short while. While it’s possible to make a deployment with, say, NixOps, that threshold is still relatively high, especially compared to just adding a remote builder.
Happy to hear you’ve been thinking along the same lines! Nix has wonderful support for remote builds; I’ve just been annoyed with the missing parts, like job scheduling and auto-scaling.
So far, I’ve built and tested this exclusively for x86_64-linux, but there is nothing in the design that stops nixbuild.net from taking the build architecture into account too. There’s already flexible support for selecting build machine types, CPU counts etc., so selecting based on architecture wouldn’t be any extra work. I think most of the work is in provisioning the desired build machines in a cost-effective way. So in short, it depends on user demand.
It is configurable: the build can keep going, or be aborted. Right now, it is aborted by default. I need to think a bit more about the UX in this case. If the build keeps running, and the client then restarts the build, do we want to just wait for the build to finish, build in parallel (probably not), or maybe attach to the running build? This is also (somewhat) related to repeated builds (for determinism checks). We can support most scenarios, but I don’t want to make the UX confusing.
Just to be clear, nixbuild.net is not using the nix-daemon protocol but the nix-store --serve protocol (ssh:// builders, not ssh-ng://), since that is the documented protocol for remote Nix builders. But I guess that protocol is also latency-sensitive. Anyway, yes, the plan is to have SSH endpoints in a range of regions, to minimize end-user latency. How many and which regions depends on user demand, of course.
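To make the distinction concrete, an ssh:// builder essentially amounts to the client driving the following over an SSH connection (a simplified sketch; the hostname is a placeholder):

```
# ssh:// builders speak the `nix-store --serve` protocol over
# stdin/stdout of a remote process, roughly equivalent to:
ssh builder@nixbuild.example -- nix-store --serve --write

# ssh-ng:// builders would instead talk the daemon protocol,
# via `nix-daemon --stdio` on the remote side.
```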
What do you do with the build results, are they cached anywhere in your architecture?
What happens if my build is very big (a big jobset with a lot of derivations)? Does the build get scheduled across many VMs, or is everything currently routed to a single VM?
Is the build client a trusted-user? This would be necessary if I want to use sandbox = relaxed or a different set of substituters.
If the build keeps running, and the client then restarts the build, do we want to just wait for the build to finish, build in parallel (probably not), or maybe attach to the running build?
This, and also keeping a build running in the situation:
A->X , B->X
nix-build A & nix-build B
B is waiting on X
A is cancelled
currently, X has to be restarted for B
would also be extremely useful as a tool deployable locally, without the need for a remote service…
Keep the questions coming! They’re very valuable to me.
All build outputs (the nar files we get from the Nix builder) are stored in a distributed filesystem. They are reused when subsequent builds need them, to avoid having to upload nar files for every build. Naturally, the nar files should be garbage collected according to some policy (possibly user-defined, but I’m leaning more towards a global policy based on LRU), but that is not implemented yet. It is mostly a question of UX.
Firstly, the build has already been split into individual derivations before it reaches us; the Nix remote builder protocol takes care of that. We schedule each derivation individually, and every derivation that can be built in parallel will be built in parallel (barring any account limitations). So if a build is big in terms of the number of derivations, we will simply scale out.
However, if a build is big in terms of CPU or memory requirements, the end user has the possibility to select how large the builders should be (number of CPUs and amount of memory). This is implemented and works fine, but more thought is needed on how to map individual Nix builds to resource requirements, since Nix has no concept of that. It does have the concept of system features, though, and that could maybe be used for this (for example: a user could map big-parallel jobs to 16 vCPUs, but run other jobs with 2 vCPUs).
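To illustrate, a derivation can already declare the system features it requires, which a service like this could then map to builder sizes (the 16 vCPU mapping is just the hypothetical policy from the paragraph above):

```nix
# Minimal sketch: a derivation declaring that it needs a
# "big-parallel" builder. A scheduler could map this feature
# to a larger machine (e.g. 16 vCPUs), while derivations
# without it run on a smaller default builder.
with import <nixpkgs> { };

stdenv.mkDerivation {
  name = "big-build";
  requiredSystemFeatures = [ "big-parallel" ];
  buildCommand = ''
    echo "built on a big-parallel machine" > $out
  '';
}
```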
The Nix sandbox is not used at all inside the builders; in fact, the nix-daemon is not used at all. We have our own sandbox implementation. Each and every build gets its own synthesized Nix database and synthesized Nix store inside an isolated KVM sandbox, and we run nix-store --serve as a user that has write access to that database and store. However, a build can never change any of its input paths, only its output paths.
This, and also keeping a build running in the situation:
A->X , B->X
nix-build A & nix-build B
B is waiting on X
A is cancelled
currently, X has to be restarted for B
Yes, that makes sense, thanks for pointing this out.
would also be extremely useful as a tool deployable locally, without the need for a remote service…
Yes, it would not be particularly difficult to deploy nixbuild.net in a self-hosted fashion, and it would be very useful for companies that want to keep everything local, with the lowest possible latency. It is not my focus right now, but anything is of course possible. I assume that people want to see this work in practice first, though.
I should also say that security and build isolation are top concerns. Once I have nixbuild.net up and running, I plan to work on encryption of nar files with user-provided keys.
I often do mass rebuilds for static-haskell-nix and for testing nixpkgs PRs with nix-review, and would appreciate being able to spawn a 500-CPU cluster with one --builders flag!
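For reference, that invocation could look roughly like this (the hostname and key path are placeholders, and whether the max-jobs field would translate into 500 CPUs on the service side is an assumption):

```
# Hypothetical: let Nix dispatch up to 500 parallel jobs to the
# service, which would fan them out across its own machines.
nix-build --builders 'ssh://builder@nixbuild.example x86_64-linux /home/user/.ssh/key 500'
```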