We managed to install NixOS on the compute nodes of a small HPC cluster, and now we face the problem of how to run parallel jobs via SLURM on multiple nodes that read the same dependencies from the nix store.
As far as I know there are only two options:
- The nix store is local to each node and the closures are distributed to the nodes before the execution begins. This seems to be the approach that @markuskowa was following.
- Share the nix store among the nodes, so they can directly access the paths.
Until now we have been following the local approach, but it is cumbersome, and I would like to switch to a shared store to avoid copying the paths around.
For that, I have experimented with a read-only NFS nix store exported from the head node and mounted on the compute nodes. The compute nodes use an overlay FS (not related to nix overlays), backed by the local disk, to allow writes to the store too, similarly to the netboot config. We boot the nodes from the local disk and, after stage 2 finishes, we mount the overlay over /nix/store during the systemd startup, so the nix-daemon already sees the nix store with the overlay in place. Configuring the mount in the proper order at boot is complicated, but we have the config here in case someone is interested. I also had to patch nix-daemon.socket, as it required /nix/store to start and that created a circular dependency (it is not needed anyway, only the socket path is).
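To give an idea of the shape of this setup, here is a minimal sketch of the overlay as plain mount commands; the NFS mount point and the /nix/.rw-store paths are illustrative, not our exact config:

```
# Mount the head node's store read-only over NFS (export path is an example)
mount -t nfs -o ro head:/nix/store /mnt/nfs-store

# Writable layer backed by the node's local disk, as in the netboot config
mkdir -p /nix/.rw-store/store /nix/.rw-store/work

# Overlay: NFS store below, local disk on top, mounted before nix-daemon starts
mount -t overlay overlay \
  -o lowerdir=/mnt/nfs-store,upperdir=/nix/.rw-store/store,workdir=/nix/.rw-store/work \
  /nix/store
```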
This approach seems to work, but we have problems with the coherence between the compute node stores and the head node store. In particular, when a compute node reads a path that doesn't exist from the overlayed nix store, and that path is later made available from the head node, the compute node still thinks that the path doesn't exist. This seems to be a caching problem, and I'm not sure how to solve it; the overlay FS is supposed to support this use case, so I don't know what is wrong. From the NFS mount point on the compute node, the path is readable. The garbage collector is also a problem: it will try to remove all the paths from other nodes that are not present in the gcroots, so they will be marked as deleted and won't be available to SLURM jobs.
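To make the coherence problem concrete, this is roughly what we observe (the store path and mount points are illustrative):

```
# Compute node looks up a path that has not been built anywhere yet
ls /nix/store/<hash>-foo       # "No such file or directory", as expected

# ...the head node then builds or copies <hash>-foo into its store...

ls /mnt/nfs-store/<hash>-foo   # visible through the plain NFS mount
ls /nix/store/<hash>-foo       # still "No such file or directory" via the overlay
```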
Another approach that I’m experimenting with is keeping the whole nix store in a shared Ceph filesystem, which has nice POSIX properties. The idea is that the head node is the only one that can modify the store, through its nix daemon, which operates on the shared FS. On the compute nodes, the nix store is mounted read-only (the whole /nix directory, not only /nix/store) and all modifications are made through the nix daemon of the head node via the daemon socket, exported over TCP with socat. Notice that --store ssh-ng://... doesn’t allow nix shell to work, so that is a no-go.
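For reference, the socket forwarding I’m testing looks roughly like this; the port and the local socket path are arbitrary, and on the compute node the socket has to live outside the read-only /nix:

```
# Head node: expose the nix-daemon socket over TCP
socat TCP-LISTEN:5000,fork,reuseaddr UNIX-CONNECT:/nix/var/nix/daemon-socket/socket

# Compute node: forward a local socket to the head node's daemon...
socat UNIX-LISTEN:/run/nix-daemon-remote.sock,fork TCP:head:5000 &

# ...and point the nix clients at it (applies to nix shell / nix develop too)
NIX_REMOTE=unix:///run/nix-daemon-remote.sock nix build nixpkgs#hello
```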
Using Ceph also provides some redundancy at the filesystem level, so even if one of the storage nodes fails we are still able to operate the whole cluster.
I did some experiments mounting the shared store manually over a NixOS system already booted from the local disk, and it seems to work so far. Compute nodes can build derivations and use nix shell or nix develop. The gcroots are properly placed in the global store, and as long as they point to a place that is reachable from both the head and the compute nodes they seem to be respected.
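As an example of what seems to be respected (the shared gcroots location is arbitrary):

```
# On a compute node: build with the out-link on the shared filesystem,
# which registers an indirect gcroot pointing at that symlink
nix build nixpkgs#hello --out-link /shared/gcroots/$(hostname)-hello

# On the head node: the root resolves, so the collector keeps the path alive
nix-store --gc --print-roots | grep /shared/gcroots
```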
But this approach also has some other problems: for example, the /nix/var/nix/profiles/system profile is no longer unique to each host, but shared across the entire cluster. I would like to keep the ability to roll back to another version of the system on a compute node, so I think I could prefix the profile with the hostname and patch the generation of the GRUB entries, so that each node keeps its own system profile.
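One direction I’m considering (untested in this setup, and the profile name is just an example) is reusing the named system profiles that nixos-rebuild already supports, which live under /nix/var/nix/profiles/system-profiles/ and, if I read the GRUB module right, already get their own submenu:

```
# Keep one system profile per node instead of the shared "system" profile
nixos-rebuild switch -p "$(hostname)"

# Roll back only this node's profile later
nixos-rebuild switch -p "$(hostname)" --rollback
```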
In any case, none of the options seems painless. I’m posting this here in case someone has played with a similar setup on NixOS nodes (not another distro + Nix) and can give some pointers.
Rodrigo.