Nix for tracking scientific simulation configurations

TLDR

Can Nix’s input-addressing be used to help keep track of the parameters/metadata associated with scientific datasets?

Details

Scientific studies often involve generation, transformation and combination of many datasets which depend on a multitude of parameters.

A simple, traditional … and completely intractable method of keeping track of such parameters is to encode them in absurdly long directory and file names with ill- (or, more commonly, completely un-) specified schema.

What makes things worse is that new parameters are continuously being discovered. Take the hypothetical dataset with the catchy filename

l2-h3-w5-wavelength589-NaCl-magiccoefficient12-shoesize43-foo3-bar4

(and that’s an overly simplified example) that we generated sometime last year. Now we want to investigate how the phase of the moon affects the results, so whatever the moon’s phase was on the day those data were taken has just become an implicit parameter which should be encoded in that filename, but isn’t.

But let’s leave this complication for another day. Even with a fixed set of parameters, the problem gets completely out of hand.

In general, datasets are nodes in a directed graph, maybe something like this

A <- B <- C <- D
^         ^
|          \
F <- G <--- E
^    ^
|    |\
H    I J

where the parameters pertaining to A are determined by the parameters of the leaf datasets D, E, H, I and J and the configuration parameters of the processes represented by the edges leading from these to A.

In order to generate a version of A with some set of parameters that I’m interested in, I have to configure and run the processes B->A and F->A with matching parameters, which requires finding or generating versions of F and B with matching parameters, and so on recursively all the way down to the leaves. There is an exponential explosion of files which is extremely difficult to manage.

Right now, instead of writing this post, I should be generating half a dozen versions of A, but every time I’m faced with such a task, I am paralysed by the terror of getting some of these parameters wrong: that would mean restarting a large tree of very costly computations, then tracking down and eliminating any incorrectly-labelled datasets produced in the process, all without accidentally eliminating very similarly-named but valid data.

The question

A fundamental part of the challenge of scientific (meta)data generation and tracking is the management of directed graphs with data at their nodes and processes at their edges. This is pretty much what Nix does.

Can Nix be used to make this task more tractable? Without scaring the scientists too much?

Yes, Nix and the Nix language are perfectly suitable for that kind of task. Essentially you have to define the computations you care about in terms of derivations (which, for all intents and purposes, are thunks: functions with their arguments baked in but not yet evaluated). Depending on your particular problem, you can control the granularity of caching by breaking the computations down into meaningful parts. In any case, the Nix store does the caching for you, transparently.
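
To make this concrete, here is a minimal, untested sketch of what “parameters baked into a thunk” could look like; runCommand and builtins.toJSON are real, while the simulation and its parameters are invented:

  { pkgs ? import <nixpkgs> {} }:

  let
    # One node of the graph: a function from parameters to a derivation.
    # The parameters are inputs of the derivation, so every distinct
    # parameter set hashes to a distinct store path.
    mkSimulation = { wavelength, salt, magicCoefficient }:
      pkgs.runCommand "sim-${salt}-w${toString wavelength}" {
        paramsJson = builtins.toJSON { inherit wavelength salt magicCoefficient; };
      } ''
        # stand-in for the real simulation; it also records its own metadata
        mkdir -p $out
        echo "$paramsJson" > $out/params.json
      '';
  in
    mkSimulation { wavelength = 589; salt = "NaCl"; magicCoefficient = 12; }

Whether a “node” is one derivation or several is exactly the caching-granularity knob mentioned above.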

And finally, as with any programming problem, the art of not scaring other people is defining sane interfaces and implementing them without trying to be clever. The Nix language has very few moving parts, and is easy to learn given a basic programming background. The perceived complexity of the ecosystem is entirely built on top of that and can be avoided by careful design of your project.

In order to generate a version of A with some set of parameters that I’m interested in, I have to configure and run the processes B->A and F->A with matching parameters, which requires finding or generating versions of F and B with matching parameters, and so on recursively all the way down to the leaves. There is an exponential explosion of files which is extremely difficult to manage.

Maybe I’m misunderstanding, but isn’t this something easily solved with a plain old makefile? I would encode the graph in terms of targets (simulations, transformed datasets) and prerequisites (other datasets), then run make A.

l2-h3-w5-wavelength589-NaCl-magiccoefficient12-shoesize43-foo3-bar4

To avoid this sort of thing I normally use unique ids and a database to keep all the metadata. It doesn’t have to be complicated; a JSON file will do if it’s not a huge library of data.

Basically Nix does the same as a plain old makefile in that regard, but avoids lots of trouble due to content hashing, immutability, and a substantially more expressive host language.

Basically Nix does the same as a plain old makefile in that regard

If you mean that it’s possible to use Nix to build something like this, sure, but it doesn’t do any of that out of the box, unlike make.

It would require writing some sort of DSL for expressing the dependency graph
and a framework for turning that into a set of derivations that Nix would then
manage: basically a stdenv but for scientific data.
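
To give an idea of the size of such a framework, here is an untested sketch of the kind of combinator it might provide; runCommand, passAsFile and builtins.toJSON are real Nix/nixpkgs features, while mkStage and the stages themselves are invented for illustration:

  { pkgs ? import <nixpkgs> {} }:

  let
    # "stage name + parameters + upstream datasets -> derivation"
    mkStage = { name, params ? {}, inputs ? {}, script }:
      pkgs.runCommand name ({
        passAsFile = [ "paramsJson" ];   # large parameter sets go via a file
        paramsJson = builtins.toJSON params;
      } // inputs)                       # each upstream dataset becomes $NAME=<store path>
      ''
        mkdir -p $out
        cp "$paramsJsonPath" $out/params.json   # keep the metadata next to the data
        ${script}
      '';

    # Two placeholder nodes from the graph in the original post.
    E = mkStage {
      name = "E";
      params = { wavelength = 589; };
      script = "touch $out/raw-data";
    };
    G = mkStage {
      name = "G";
      params = { foo = 3; bar = 4; };
      inputs = { inherit E; };
      script = "cp -r $E $out/from-E";
    };
  in G

That is only a starting point: before long it would want error handling, conventions for naming and inspecting outputs, and a schema for the parameters, which is why I’d call it “a stdenv for scientific data” rather than something Nix gives you out of the box.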

A while ago I wrote a blog post about using Nix for managing parameter sweeps in scientific settings: Data Science with Nix: Parameter Sweeps. Maybe there’s something you find useful in it.

A and all other nodes in the graph come in an effectively unlimited number of varieties, one corresponding to each possible combination of parameters. So your make A hides a lot of implicit detail: it has to be something logically equivalent to make A(foo=3 bar=4 baz=666 zot=42 etc. etc. etc. etc.) with each parameter being propagated through the graph to anywhere it is relevant. Crucially, building A should not replace any old version of A or any other node that was built with a different set of parameters.
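
What I’m hoping for, expressed as a rough (and possibly naive) Nix sketch with made-up builders, is one parameter set threaded through the whole graph, with every combination getting its own store path:

  { pkgs ? import <nixpkgs> {} }:

  let
    # Hypothetical nodes; each one only consumes the parameters relevant to it.
    mkB = ps: pkgs.runCommand "B" { json = builtins.toJSON { inherit (ps) foo; }; } ''echo "$json" > $out'';
    mkF = ps: pkgs.runCommand "F" { json = builtins.toJSON { inherit (ps) bar; }; } ''echo "$json" > $out'';
    mkA = ps: pkgs.runCommand "A" { b = mkB ps; f = mkF ps; } ''cat $b $f > $out'';
  in {
    # Two versions of A, side by side under different /nix/store/<hash>-A paths;
    # building one never replaces the other. B is identical in both and is
    # built only once.
    a1 = mkA { foo = 3; bar = 4; };
    a2 = mkA { foo = 3; bar = 5; };
  }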

Makefiles just don’t seem like a convenient way of expressing this, but then I always found makefiles annoying, even for their intended purpose, and avoid them like the plague, so maybe my prejudice and relative lack of familiarity is preventing me from seeing it.

Nice.

Parameter sweeps are absolutely not feasible in my case: the datasets are much too large and the computations take much too long. Instead, points in the parameter space have to be selected judiciously, depending on observations of existing results. And then it’s important that intermediate nodes with corresponding parameters be found and reused if they exist. Including allowing a human to find and inspect any existing intermediate dataset that corresponds to a desired parameter set.

There’s an added complication in that some of the datasets are so large that they must remain on the compute server, while some of the work nearer the tip of the graph can be done on individual local machines.

I was also wondering whether something could be built on top of IPFS, with some warning that you are about to download 3TB, so you might prefer to take this particular computation to the data, rather than bringing the data to the computation.

Note this was not at all clear from your opening post.

During evaluation, such inspection could be done to decide which dataset to build. You could create derivations that contain separate metadata, if that is worth it. The build is then offloaded to Nix. You probably can’t realistically express this part of the evaluation in Nix, though, so you would need to use import-from-derivation (which I don’t really like).

The requiredSystemFeatures mechanism is unfortunately not very powerful. You could think of features such as diskSpace1TB, but in practice this may not work well, since it won’t check for actual free disk space.
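
For illustration, a sketch of what using it would look like; the attribute is real, the feature name is just a label I made up:

  { pkgs ? import <nixpkgs> {} }:

  # Nix will only schedule this build on a machine whose nix.conf (or remote
  # builder entry) lists "diskSpace1TB" under system-features. The name is an
  # opaque tag: nothing actually measures free disk space.
  pkgs.runCommand "near-leaf-reduction" {
    requiredSystemFeatures = [ "diskSpace1TB" ];
  } ''
    # stand-in for the expensive reduction that must stay on the big server
    touch $out
  ''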

Judging from your requirements I would not use Nix for this part.

Crucially, building A should not replace any old version of A or any other node that was built with a different set of parameters.

Ok, no, this is not something make is meant to do. You could make it work, but you would have to encode all the parameters in the file name, which is what you’re trying to avoid.

And then it’s important that intermediate nodes with corresponding parameters be found and reused if they exist.

This comes for free if you turn each node into a derivation.
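
Concretely (an untested sketch in the same spirit as the ones above, with an invented node builder):

  { pkgs ? import <nixpkgs> {} }:

  let
    node = params:
      pkgs.runCommand "node" { json = builtins.toJSON params; } ''echo "$json" > $out'';
    wanted = { l = 2; h = 3; w = 5; };
  in
    # Requesting the same parameters again evaluates to the same derivation and
    # the same output path, so the second build is a lookup in /nix/store, not
    # a recomputation.
    assert (node wanted).outPath == (node wanted).outPath;
    node wanted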

Including allowing a human to find and inspect any existing intermediate dataset that corresponds to a desired parameter set.

Using Nix this would probably require inspecting /nix/store or automatically adding gc roots for each node ever computed, which is not great but could be done using an external script.
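
One shape such an external script could take, sketched with nixpkgs’ linkFarm (the dataset names and parameters are invented): a derivation whose output is a directory of human-readable symlinks into the store. Building it with nix-build -o datasets also registers a GC root, so every catalogued dataset stays alive.

  { pkgs ? import <nixpkgs> {} }:

  let
    # hypothetical parameterised node, as in the other sketches in this thread
    mkA = params:
      pkgs.runCommand "A" { json = builtins.toJSON params; } ''echo "$json" > $out'';
  in
  # A browsable index: ./datasets/A-wavelength589-NaCl -> /nix/store/<hash>-A
  pkgs.linkFarm "dataset-index" [
    { name = "A-wavelength589-NaCl"; path = mkA { wavelength = 589; salt = "NaCl"; }; }
    { name = "A-wavelength632-KCl";  path = mkA { wavelength = 632; salt = "KCl";  }; }
  ]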

some of the datasets are so large that they must remain on the compute server

Nix needs to copy everything needed to build the final result into the /nix/store for purity. So, this would require one of the following:

  • store all the datasets in /nix/store;
  • do an impure build to allow reading outside the build sandbox (a sketch follows below);
  • integrate IPFS or similar into Nix to allow reading the datasets in a deterministic way.
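
As an untested sketch of the impure-build option (the __noChroot attribute is real and needs sandbox = relaxed in nix.conf; the paths are invented), the terabyte-scale raw data can stay where it is, at the cost of reproducibility resting on that path never changing:

  { pkgs ? import <nixpkgs> {} }:

  pkgs.runCommand "reduce-raw-run-42" {
    __noChroot = true;               # opt this one build out of the sandbox
    rawData = "/data/raw/run-42";    # server-local path, deliberately NOT copied into the store
  } ''
    # stand-in for the parallel reduction: reads ~1 TB, writes a small summary
    du -sb "$rawData" > $out
  ''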

overlayfs mounts might be a solution as well

There are two qualitatively distinct levels of difficulty here. If the datasets and compute times are sufficiently small for everything to be done on a researcher’s machine, then the problem is relatively easy. A decent mostly-Nix-based solution can probably be found.

But throw in huge datasets and very expensive computations which must stay on the server …

[ aside: One of my leaf nodes occupies over a terabyte (per configuration), which can be read rapidly in parallel off the server’s SSD and processed in parallel on dozens of processors, reducing it to tens of GB at the next node in the graph, and then to dozens of megabytes at the next. Those near-the-leaf computations simply have to be done on the server. As we progress up the graph, it becomes not only more feasible but also more important to work on the researcher’s machine, to allow comfortable and rapid interactive exploration of parameter space. ]

… and the problem gets much more complicated. Automating everything, including the decision where any computation should be performed, is perhaps overly ambitious. I’ve had IPFS in the back of my mind for this purpose for over a year now (thanks for pointing out overlayfs), but it seems it would only make sense if you can always bring the data to the computation, and my situation requires bringing the computation to the data some of the time.

This needs to be broken down into smaller nuts to crack. Perhaps I should approach the problem from a different angle: making it easier for the human to track/find existing datasets corresponding to desired configurations. The huge problem here is that the schema is constantly evolving; as is the code, which must be continuously updated to add the ability to explore new ideas (parameters). The commit hashes of the codes used to produce any result are, effectively, parameters in the overall configuration (absolutely standard in Nix), BUT adding a feature which doesn’t affect some other part of the code shouldn’t disqualify datasets that were produced with an older version! It would be unreasonably wasteful to ignore a TB of data that took a CPU-month to produce, just because a comment was added to the code. While this sort of thing would be great for verification and reproducibility of a final result, it would be a disaster during the exploratory phase!

Dealing with these problems by hand on a daily basis is hugely expensive, but I don’t have the wherewithal [time, key competences, good ideas] to make much progress on improving the situation.

Note that this can be avoided using content-addressed (CA) derivations: it’s still an experimental feature but should work well already.
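
In case it helps, opting in looks roughly like this (experimental, needs experimental-features = ca-derivations in nix.conf; the build script is a placeholder):

  { pkgs ? import <nixpkgs> {} }:

  # The output path is derived from the output's contents rather than from the
  # inputs, so a rebuild that produces bit-identical files maps to the same
  # store path and downstream derivations are not invalidated.
  pkgs.runCommand "analysis-tools" {
    __contentAddressed = true;
    outputHashMode = "recursive";
    outputHashAlgo = "sha256";
  } ''
    mkdir -p $out/bin
    # placeholder for compiling the analysis code
    printf '#!/bin/sh\necho frobnicating\n' > $out/bin/frobnicate
    chmod +x $out/bin/frobnicate
  ''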

My comment example is almost a caricature, but there are also more subtle cases where content addressing cannot help: imagine that frobnicate() is completely independent from and orthogonal to transmogrify() but both get compiled into the same binary. Refactoring transmogrify() or fixing a bug in it, should not invalidate any results that depend on frobnicate() but not transmogrify(); content addressing will tell us that the frobnicate() results need to be recalculated.

A higher-level summary is that strict reproducibility requirements and very expensive computations produce opposing pressures: the former pushes us towards recomputing absolutely everything; the latter requires reusing results as much as possible. I doubt that there is a simple solution to the problem of finding a decent compromise between the two.

I think that can be addressed systematically, at least in principle. One example where this works is GHC, which since 9.4 can do incremental builds at the granularity of a module. It doesn’t have to rebuild dependent modules even if their dependencies changed, as long as the interface remains the same.

That is, one needs a proper way to express interfaces, and at the level of abstraction Nix operates on - files and processes - that probably needs some sort of gradual typing that is optional for dependents to consume. Then, builds that use interface declarations (whatever they may look like) can hash against the given dependency’s interface and can cut some corners, and those that don’t have to hash against the dependency’s contents and are left to be re-run on any dependency change.