NixOS with shared nix store among compute nodes

rodarima · September 20, 2023, 11:02am

We managed to install NixOS on the compute nodes of a small HPC cluster, and now we face the problem of how to run parallel SLURM jobs across multiple nodes that read the same dependencies from the nix store.

As far as I know there are only two options:

The nix store is local per node and the closures are distributed to the nodes before the execution begins. This seems to be the choice that @markuskowa was following.
Share the nix store among nodes, so they can directly access the paths.

Until now, we were following the local approach, but it is cumbersome and I would like to switch to a shared store to avoid copying the paths.
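For reference, the local approach amounts to something like the sketch below: push the closure of the job's store path to each node before the job starts. The helper name `copy_cmds` and the node names are hypothetical, and the function only prints the `nix-copy-closure` commands rather than running them.

```shell
#!/bin/sh
# Sketch: print the commands that would copy one closure to every node.
# copy_cmds is a hypothetical helper; it does not contact any node itself.
copy_cmds() {
  path="$1"   # store path whose closure should be distributed
  shift       # the remaining arguments are node names
  for node in "$@"; do
    echo "nix-copy-closure --to $node $path"
  done
}

# Example (commented out): copy the build result to two hypothetical nodes.
# copy_cmds "$(readlink -f ./result)" node01 node02 | sh
```

Inside a SLURM allocation, the node list could presumably be taken from `scontrol show hostnames "$SLURM_JOB_NODELIST"`.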

For that, I have experimented with a read-only NFS nix store exported from the head node and mounted on the compute nodes. The compute nodes use an overlay FS (unrelated to nix overlays), backed by the local disk, to allow writes to the store too (similar to the netboot config). We boot the nodes from disk and, after stage 2 finishes, mount the overlay over /nix/store while systemd is starting up, so the nix-daemon already sees the nix store with the overlay. Getting the mount order right at boot is complicated, but we have the config here in case someone is interested. I also had to patch nix-daemon.socket, as it required /nix/store, which creates a circular dependency (it is also unnecessary: only the socket path is needed).
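The layering described above can be sketched as a pair of mounts. This is only an illustration, not the actual config: the export name `head:/nix/store`, the mount point `/mnt/nix-store-ro` and the directories under `/var/lib/nix-overlay` are assumptions, and the commands are only printed unless `APPLY=1` is set.

```shell
#!/bin/sh
# Sketch: read-only NFS store as the lower layer, local disk as the upper layer.
# All names below are assumptions for illustration.
NFS_EXPORT="${NFS_EXPORT:-head:/nix/store}"
LOWER="${LOWER:-/mnt/nix-store-ro}"
UPPER="${UPPER:-/var/lib/nix-overlay/upper}"
WORK="${WORK:-/var/lib/nix-overlay/work}"

# Build the option string for the overlay mount.
overlay_opts() {
  printf 'lowerdir=%s,upperdir=%s,workdir=%s' "$1" "$2" "$3"
}

# Print a command; execute it only when APPLY=1 (needs root).
run() {
  echo "+ $*"
  if [ "${APPLY:-0}" = 1 ]; then "$@"; fi
}

mount_shared_store() {
  run mkdir -p "$LOWER" "$UPPER" "$WORK"
  run mount -t nfs -o ro "$NFS_EXPORT" "$LOWER"
  run mount -t overlay overlay \
    -o "$(overlay_opts "$LOWER" "$UPPER" "$WORK")" /nix/store
}

# Dry run by default: only shows what would be mounted.
mount_shared_store
```

The upper and work directories have to live on the same local filesystem, which is why both sit under one directory here.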

This approach seems to work, but we have problems with coherence between the store seen by the compute nodes and the one on the head node. In particular, when a compute node looks up a path that doesn’t exist in the overlaid nix store, and that path is later made available from the head node, the compute node still thinks that the path doesn’t exist. This looks like a caching problem, and I’m not sure how to solve it. The overlay FS seems to support this use case, so I’m not sure what is wrong. From the NFS mount point on the compute node, the path is readable. The garbage collector will also try to remove all the paths from other nodes that are not referenced by the local gcroots. They will be marked as deleted and won’t be available to SLURM jobs, so this is also a problem.

Another approach that I’m experimenting with is keeping the whole nix store in a shared Ceph filesystem, which has nice POSIX properties. The idea is that the head node is the only one that can modify the store, through its nix daemon, and the store lives on the shared FS. On the compute nodes, the whole /nix directory (not only /nix/store) is mounted read-only, and all modifications go through the head node’s nix daemon, whose socket is exported over TCP with socat. Note that --store ssh-ng://... doesn’t allow nix shell to work, so that is a no-go.
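A minimal sketch of the socat forwarding, under stated assumptions: the host name `head`, port `9999` and the local socket path `/run/nix-remote/socket` are made up for illustration, and the functions only print the commands. The local socket is placed outside /nix because /nix is read-only on the compute nodes. Plain TCP forwarding like this carries no authentication or encryption.

```shell
#!/bin/sh
# Sketch: forward the head node's nix-daemon socket over TCP with socat.
# HEAD_HOST, PORT and LOCAL_SOCK are assumptions for illustration.
HEAD_HOST="${HEAD_HOST:-head}"
PORT="${PORT:-9999}"
DAEMON_SOCK="/nix/var/nix/daemon-socket/socket"     # default daemon socket
LOCAL_SOCK="${LOCAL_SOCK:-/run/nix-remote/socket}"  # outside the read-only /nix

# Head node: accept TCP connections and relay them to the daemon socket.
head_cmd() {
  echo "socat TCP-LISTEN:$PORT,fork,reuseaddr UNIX-CONNECT:$DAEMON_SOCK"
}

# Compute node: expose a local Unix socket that relays to the head node.
node_cmd() {
  echo "socat UNIX-LISTEN:$LOCAL_SOCK,fork TCP:$HEAD_HOST:$PORT"
}

# Compute node: point Nix clients at the forwarded socket.
client_env() {
  echo "export NIX_REMOTE=unix://$LOCAL_SOCK"
}

head_cmd
node_cmd
client_env
```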

Using Ceph also provides some redundancy in the filesystem, so even if one of the storage nodes fails, we can still operate the whole cluster.

I did some experiments mounting the shared store manually over a NixOS system booted from disk, and it seems to work so far. Compute nodes can build derivations and use nix shell or develop. The gcroots are properly placed in the global store, and as long as they point to a place that is reachable from both the head and the compute nodes, they seem to be respected.

But this approach also has other problems. For example, the /nix/var/nix/profiles/system profile is no longer unique to the host but shared across the entire cluster. I would like to keep the ability to roll back a compute node to another version of the system, so I think I could add a hostname prefix and patch the generation of the GRUB entries so that each node keeps its own system profile.
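One possible starting point, untested in this setup: `nixos-rebuild` accepts `--profile-name`, which keeps generations under /nix/var/nix/profiles/system-profiles/<name> instead of the default system profile. Whether this plays well with a shared /nix and with the boot entries is an assumption; the helper names below are hypothetical and only print what would be used.

```shell
#!/bin/sh
# Sketch: derive a per-node system profile instead of the shared default.
# profile_path and rebuild_cmd are hypothetical helpers.

# Path where nixos-rebuild keeps a named system profile.
profile_path() {
  echo "/nix/var/nix/profiles/system-profiles/$1"
}

# Command that would build and activate the configuration under that profile.
rebuild_cmd() {
  echo "nixos-rebuild switch --profile-name $1"
}

node="$(uname -n)"   # use the host name as the profile name
profile_path "$node"
rebuild_cmd "$node"
```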

In any case, none of the options seems painless. I’m posting this here in case someone has played with a similar setup on NixOS nodes (not another distro + Nix) and can give some pointers.

Rodrigo.

wamserma · September 20, 2023, 2:22pm

There is RFC 0152, the `local-overlay` store, by Ericson2314 (Pull Request #152 in NixOS/rfcs), and some related discussion at Super Colliding Nix Stores.

ryantm · September 20, 2023, 4:21pm

It does sound like your use-case could benefit from the Local Overlay Store.

The Nix Local Overlay Store work that came out of Super Colliding Nix Stores is fully functional and supports many additional store features like garbage collection and store repair. The PR for it seems basically ready to go from my perspective. I hope Eelco or another Nix maintainer merges it soon.

The main caveat with it is that Linux overlay file systems don’t like their lower stores to change, so there could potentially be undefined behavior if you use an NFS mount as a lower store. However, we have not tested this much, and the problem might be only theoretical rather than practical.

If you have any questions about getting going with it, please let me know! It would be good to have more people playing with it.

augu5te · September 21, 2023, 6:34am

A variant for sharing the nix store is to have a single remote store and nix-daemon, accessible from the nodes through a socat tunnel; consequently, you don’t need the overlay. For more technical details, see:
https://gricad.github.io/calcul/nix/hpc/2017/05/15/nix-on-hpc-platforms.html

Note 1: Guix can do it without a socat tunnel
Note 2: We are developing a small dedicated tool to replace the socat tunnel, adding logging, TLS, and auth (for audit and traceability)

rodarima · September 21, 2023, 8:15am

I took a quick look at the local overlay store, and while it solves some of our problems, the main one is still there: changes in the lower directory are not immediately propagated to the merged one.

> The main caveat with it is that Linux overlay file systems don’t like their lower stores to change, so there could potentially be undefined behavior if you use an NFS mount as a lower store. However, we have not tested this much, and the problem might be only theoretical rather than practical.

I tested this situation with the following script, and I can reproduce the problem on every try. Once a path that is in neither the lower nor the upper directory has been accessed through the overlay, it stays missing in the merged directory even after it appears in one of them:

```sh
#!/bin/sh

# This test creates new files in the lower and upper directories of an overlay
# to see if the changes are propagated to the merged directory.

set -x

rm -rf lower upper work merged
mkdir -p lower upper work merged

sudo mount -t overlay overlay \
  -o "lowerdir=$PWD/lower" \
  -o "upperdir=$PWD/upper" \
  -o "workdir=$PWD/work" \
  "$PWD/merged"

ls merged/measured-lower
ls merged/measured-upper

touch upper/measured-lower
touch upper/measured-upper
touch upper/untouched

ls merged/measured-lower
ls merged/measured-upper
ls merged/untouched

sudo umount "$PWD/merged"
```

Here is the output:

```
hut% ./repro.sh
+ rm -rf lower upper work merged
+ mkdir -p lower upper work merged
+ sudo mount -t overlay overlay -o lowerdir=/tmp/lower -o upperdir=/tmp/upper -o workdir=/tmp/work /tmp/merged
+ ls merged/measured-lower
ls: cannot access 'merged/measured-lower': No such file or directory
+ ls merged/measured-upper
ls: cannot access 'merged/measured-upper': No such file or directory
+ touch upper/measured-lower
+ touch upper/measured-upper
+ touch upper/untouched
+ ls merged/measured-lower
ls: cannot access 'merged/measured-lower': No such file or directory
+ ls merged/measured-upper
ls: cannot access 'merged/measured-upper': No such file or directory
+ ls merged/untouched
merged/untouched
+ sudo umount /tmp/merged
```

So I don’t think we can rely on the overlay FS if we are modifying the lower directory over NFS and expect to see the changes in the merged directory of the compute nodes.
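To spot this kind of divergence on a node, a small check along these lines could compare the lower directory with the merged view. It is only a sketch: `stale_paths` is a hypothetical helper, it looks at top-level entries only, and the example mount points are assumptions.

```shell
#!/bin/sh
# Sketch: list entries present in the lower directory (e.g. the NFS mount)
# that are not visible through the merged overlay directory.
stale_paths() {
  lower="$1"
  merged="$2"
  for p in "$lower"/*; do
    # Skip the unexpanded glob when the lower directory is empty.
    [ -e "$p" ] || [ -L "$p" ] || continue
    name="$(basename "$p")"
    # Report the entry if it is neither a visible file nor a (dangling) symlink.
    if [ ! -e "$merged/$name" ] && [ ! -L "$merged/$name" ]; then
      echo "$name"
    fi
  done
}

# Example (commented out), assuming the NFS store is mounted at /mnt/nix-store-ro:
# stale_paths /mnt/nix-store-ro /nix/store
```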

> A variant for sharing the nix store is to have a single remote store and nix-daemon, accessible from the nodes through a socat tunnel; consequently, you don’t need the overlay.

We have been using this approach on machines without NixOS. The problem arises when the node itself also needs the nix store to boot, as in our case.

Yesterday I was playing with an intermediate solution: via systemd, we create a private mount in the slurm service that replaces the node’s local nix store with the one exported via NFS in read-only mode. The jobs then see the same store as the head node, while any other commands issued via ssh, or going through the nix daemon on that node, only work with the local store.
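One possible way to express such a private mount is a systemd drop-in using `BindReadOnlyPaths=`, which binds a path inside the service's own mount namespace. This is a sketch under assumptions, not necessarily the mechanism used here: the unit name `slurmd.service`, the drop-in file name and the NFS mount point are all made up, and on NixOS the same setting would more naturally go into the system configuration.

```shell
#!/bin/sh
# Sketch: write a systemd drop-in that gives the SLURM daemon a private view
# of /nix/store. Unit name and NFS mount point are assumptions.
write_dropin() {
  dropin_dir="$1"     # e.g. /run/systemd/system/slurmd.service.d
  shared_store="$2"   # e.g. /mnt/nix-store-ro (read-only NFS mount)
  mkdir -p "$dropin_dir"
  cat > "$dropin_dir/shared-store.conf" <<EOF
[Service]
# Bind the shared store over the local one, only inside this service.
BindReadOnlyPaths=$shared_store:/nix/store
EOF
}

# Example (commented out, needs root):
# write_dropin /run/systemd/system/slurmd.service.d /mnt/nix-store-ro
# systemctl daemon-reload && systemctl restart slurmd.service
```

Processes started by the service inherit its mount namespace, so jobs launched by the daemon would see the shared store while ssh sessions keep the local one.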

As we don’t use the overlay FS, changes on the head node are immediately propagated to the store seen by the jobs under SLURM. It has some other problems, such as the /run/current-system/sw/bin binaries being missing, so jobs have to provide a complete $PATH, but those can be solved. Changes to the node configuration are still managed as before, using nixos-rebuild, and we keep the nodes booting from disk.

wamserma · September 21, 2023, 9:36am

I also remember some discussion/PR on supporting multiple nix stores, which should solve the issues with overlay-filesystems.