TLDR
Can Nix’s input-addressing be used to help keep track of the parameters/metadata associated with scientific datasets?
Details
Scientific studies often involve the generation, transformation and combination of many datasets which depend on a multitude of parameters.
A simple, traditional … and completely intractable method of keeping track of such parameters is to encode them in absurdly long directory and file names with ill- (or, more commonly, completely un-) specified schema.
What makes things worse is that new parameters are continuously being discovered. Take the hypothetical dataset with the catchy filename
l2-h3-w5-wavelength589-NaCl-magiccoefficient12-shoesize43-foo3-bar4
(and that’s an overly simplified example) that we generated sometime last year. Now we want to investigate how the phase of the moon affects the results, so whatever the moon’s phase was on the day those data were taken has just become an implicit parameter which should be encoded in that filename, but isn’t.
But let’s leave this complication for another day. Even with a fixed set of parameters, the problem gets completely out of hand.
In general, datasets are nodes in a directed graph, maybe something like this:

```
A <- B <- C <- D
^    ^
|     \
F <--- G <--- E
^      ^
|      |\
H      I J
```
where the parameters pertaining to A are determined by the parameters of the leaf datasets D, E, H, I and J, and by the configuration parameters of the processes represented by the edges leading from these to A.
In order to generate a version of A with some set of parameters that I’m interested in, I have to configure and run the processes B -> A and F -> A with matching parameters, which requires finding or generating versions of F and B with matching parameters, and so on recursively all the way down to the leaves. There is an exponential explosion of files which is extremely difficult to manage.
Right now, instead of writing this post, I should be generating half a dozen versions of A. But every time I’m faced with such a task, I am paralysed by the terror of getting some of these parameters wrong: having to restart a large tree of very costly computations, and having to track down and eliminate incorrectly-labelled datasets that may have been produced in the process, while not accidentally eliminating very similarly-named but valid data.
The question
A fundamental part of the challenge of scientific (meta)data generation and tracking is the management of directed graphs with data at their nodes and processes at their edges. This is pretty much what Nix does.
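For what it’s worth, the property being asked for can be mimicked in a few lines. This is a minimal sketch of the idea (not Nix, and every name in it is hypothetical): cache every build under the hash of its parameters and inputs, so a costly process only ever runs when something it actually depends on has changed:

```python
import hashlib
import json

# A toy input-addressed "store", sketching (with heavy simplification) the
# property Nix provides for software builds: a process runs only if no
# artifact already exists under the hash of its parameters and inputs.
STORE = {}  # input-address -> built artifact

def build(name, params, inputs, step):
    """inputs: input-addresses of already-built dependencies;
    step: the (possibly very costly) process producing this dataset."""
    key = hashlib.sha256(
        json.dumps({"name": name, "params": params, "inputs": inputs},
                   sort_keys=True).encode()
    ).hexdigest()[:12]
    if key not in STORE:  # otherwise: cache hit, nothing is recomputed
        STORE[key] = step([STORE[dep] for dep in inputs])
    return key

# Requesting C twice with identical parameters runs each process only once.
runs = []
d = build("D", {"wavelength": 589}, [],
          lambda deps: runs.append("D") or "D-data")
c = build("C", {"bar": 4}, [d],
          lambda deps: runs.append("C") or "C-data")
c_again = build("C", {"bar": 4}, [d],
                lambda deps: runs.append("C") or "C-data")
```

Nix adds the parts that make this safe in practice (sandboxed builds, garbage collection of unreferenced store paths, binary caches), but the addressing scheme itself is this simple.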
Can Nix be used to make this task more tractable? Without scaring the scientists too much?