nix-scheduler-hook is very easy to use. After building it, a binary named `nsh` is available. Simply set `build-hook` to the path to this binary, and create a `nsh.conf` file in a Nix configuration directory containing your Slurm JWT token (`slurm-jwt-token = token`) and the API host of the endpoint running slurmrestd (`slurm-api-host = host`). Full usage details are available in the README.
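For reference, the two pieces of configuration look roughly like this (the paths and values below are placeholders; the README is authoritative):

```
# nix.conf
build-hook = /path/to/nix-scheduler-hook/bin/nsh

# nsh.conf (e.g. in /etc/nix/, alongside your nix.conf) -- placeholder values
slurm-jwt-token = <your-slurm-jwt>
slurm-api-host = slurm-api.example.org
```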
Because scheduling is delegated to the job scheduler, the Nix `builders` setting / your `machines` file is ignored when `nsh` is used as your build hook. This means the cluster nodes must be reachable by hostname through your SSH configuration file, not just listed in your machines file. I hope to eventually add a feature where builds are forwarded to the regular build hook when `nsh` declines them (for example on a system mismatch) rather than simply being declined.
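For example, each node needs an entry along these lines in your SSH config (hostnames, user, and key below are made up):

```
# ~/.ssh/config -- illustrative entries only
Host node01 node02
    HostName %h.cluster.example.org
    User nixbuilder
    IdentityFile ~/.ssh/id_ed25519
```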
It is currently in a very early stage of development, and thus there are a number of unresolved issues and missing features. For instance, `requiredSystemFeatures` is not yet supported, and the GC roots of the build results are not properly automatically cleaned up. However, it is in a basically usable state. Bug reports and enhancements are welcome!
This makes SLURM basically function like a custom remote builder, right? So it picks which node to run it on? (I assume it’s a single node and you can’t use the SLURM distributed features that dispatch the same job to multiple nodes to run as parallel processes?)
Yes, slurm decides what node to run each job on. In this case, I’m not sure what having multiple nodes per job would mean, since you normally can’t distribute a nix derivation build across multiple machines. In the case where there are multiple derivations to be built at the same time, multiple jobs will be submitted in parallel, like normal remote building.
I’ve been wanting to make something like this, but without assuming the Slurm cluster has Nix or a Nix daemon available. For this I planned to use bwrap or apptainer/singularity.
My working idea is to somehow set up an environment with singularity/apptainer/bwrap bind mounts and overlays such that each Slurm job can do its build in a tmpfs-backed local-overlay-store, then use a post-build-hook to upload the result back to the underlying store in a way that is concurrency-safe.
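A very rough sketch of the per-job store setup, assuming a network-mounted read-only store at /shared/nix/store (all paths are made up, and a real implementation would also have to deal with the Nix database):

```bash
# Illustrative only: lay a writable tmpfs layer over a shared read-only store so
# build outputs stay node-local until the post-build-hook uploads them.
# Assumes /nix/store exists as a mount target; in practice bwrap/apptainer would
# create it inside the container and set up these mounts for us.
unshare --mount --map-root-user bash -eu <<'EOF'
mkdir -p /tmp/overlay
mount -t tmpfs tmpfs /tmp/overlay
mkdir -p /tmp/overlay/upper /tmp/overlay/work
mount -t overlay overlay \
  -o lowerdir=/shared/nix/store,upperdir=/tmp/overlay/upper,workdir=/tmp/overlay/work \
  /nix/store
# ... run the derivation build against the overlaid /nix/store ...
EOF
```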
For this to be efficient, it makes sense to invoke this tool as a remote `--store`, and have the custom remote store first fetch all paths already available in binary caches before queuing Slurm jobs.
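As an illustration of that pre-check (the cache URL and `$out_path` are placeholders for whatever the shim iterates over):

```bash
# Illustrative only: skip queuing a Slurm job for outputs a binary cache already has.
# May require the nix-command experimental feature to be enabled.
if nix path-info --store https://cache.nixos.org "$out_path" >/dev/null 2>&1; then
  echo "substitutable, fetch instead of building: $out_path"
else
  echo "needs a build job on the cluster: $out_path"
fi
```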
Since the remote `--store` shim will have access to the full build graph, it also becomes possible to queue the jobs with `--dependency` on each other, allowing the later jobs to increase in priority while waiting for long-running preceding builds.
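For reference, the Slurm side of that chaining looks roughly like this (the job scripts are placeholders):

```bash
# Illustrative only: make the final job wait for its input builds to succeed.
dep_a=$(sbatch --parsable build-input-a.sh)
dep_b=$(sbatch --parsable build-input-b.sh)
sbatch --dependency=afterok:${dep_a}:${dep_b} build-final.sh
```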
What you’re proposing sounds like it would require a modification to how Nix dispatches builds, since normally one hook invocation = one leaf dependency build, so it doesn’t have access to the entire build graph. At that point you might instead want your own nix build entrypoint that can queue up the jobs with --dependency on each other. How would the dependent jobs get the newly built dependencies from jobs that just completed on different machines?
The bwrap idea sounds awesome, and would definitely make the tool more accessible. If you ever build it, feel free to build off of what I’ve already done and I’d be happy to integrate it.
With `--eval-store local --store ssh-ng://login1.hpc.uni.edu` the remote becomes responsible for planning the full build graph, afaiu.
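i.e. something along these lines (the flake ref is just an example):

```bash
# Evaluate locally, but hand the resulting derivations to the remote store on the
# cluster login node, which then plans and performs the whole build.
nix build nixpkgs#hello --eval-store local --store ssh-ng://login1.hpc.uni.edu
```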
This is what the post-build-hook is for: it uploads the resulting store paths back to the underlying Nix store in the user’s home directory, which then becomes the lower layer of the overlayfs in the next dependent Slurm job. (I assume the user has a network-mounted folder shared between all the worker machines; I have not used Slurm in any other setting.)
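A minimal sketch of that hook, assuming the shared store is rooted somewhere under the user’s home directory (the store URL is a placeholder):

```bash
#!/bin/sh
# Illustrative post-build-hook: copy the paths Nix just built back to the shared
# underlying store. Nix passes the built outputs in $OUT_PATHS (space-separated,
# so it is deliberately left unquoted).
set -eu
exec nix copy --to "local?root=$HOME/nix-root" $OUT_PATHS
```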
Thanks for the reference - based on feedback I got from the bionix people, I’ve added the ability to specify additional job submission parameters at the derivation level. The README has an example of how to do this.
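Conceptually it looks something like this; the attribute name below is made up, so check the README for the real interface:

```nix
# Hypothetical attribute name and values -- see the README for the actual syntax.
pkgs.hello.overrideAttrs (old: {
  nshSlurmOptions = "--partition=bigmem --cpus-per-task=16";
})
```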