Nix-managed datalake?

At the company I work at we’re dealing with geographical and meteorological data as well as other sources of data. File sizes range anywhere from a few kB to 50+ GB, and the total size is about 10 TB and growing quite fast. Most of the data covered here does not change often, but we keep including more and more data.

I’ve been playing with the idea of having a declarative datalake. Typically we fetch some of these files and transform them into a format/structure we can work with. Then there are some manually crafted files (e.g. CSV), typically very small. State clearly becomes an issue here. We of course have backups in case of data loss, but there is still value in being able to rebuild your datalake from scratch in case you don’t know which backups you can trust. To me, this seems like an interesting use case for Nix: basically a tree of symbolic links to the files/paths you want to keep.
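As a rough sketch of what I mean (the dataset names are made up and the transform is elided):

```nix
# Rough sketch only: `era5-zarr` and `stations-csv` are hypothetical datasets.
{ pkgs ? import <nixpkgs> {} }:

let
  # A transformed dataset, produced by some (elided) transform job.
  era5-zarr = pkgs.runCommand "era5-zarr" {} ''
    mkdir -p $out
    # ...transform output would be written to $out here...
  '';
  # A small, manually crafted file kept next to the Nix code.
  stations-csv = ./manual/stations.csv;
in
# The "datalake" is just a tree of symlinks into the store.
pkgs.linkFarm "datalake" [
  { name = "climate/era5.zarr";      path = era5-zarr; }
  { name = "reference/stations.csv"; path = stations-csv; }
]
```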

Has anyone got any experience using Nix for this purpose? What were the issues you ran into? Large files, I imagine, can become a problem. For the transform jobs we could use remote builders with required features. A MountedSshStore seems very useful for this use case. Then there are the files that change a bit more often and need to be re-fetched, but that can easily be resolved with a cron job or GitHub Action that updates the hash using TOFU (trust on first use).
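Concretely, I imagine something like this (the URL, names and feature tag are made up; CI would rewrite the placeholder hash on first use):

```nix
# Sketch: a fixed-output fetch whose hash a cron job / GH Action keeps up to
# date (TOFU), plus a transform routed to capable builders via
# requiredSystemFeatures.
{ pkgs ? import <nixpkgs> {} }:

let
  raw = pkgs.fetchurl {
    url = "https://data.example.org/era5-2023.grib";   # made-up URL
    sha256 = pkgs.lib.fakeSha256;                      # CI replaces this with the real hash
  };
in
pkgs.runCommand "era5-2023-processed"
  { requiredSystemFeatures = [ "big-parallel" ]; }     # pick suitable remote builders
  ''
    mkdir -p $out
    # ...run the actual transform on ${raw}, writing into $out...
  ''
```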

7 Likes

sounds like a job for impure derivations… at a guess…

Interesting…

can you give a small example of input → processing → expected output?

maybe something that creates a 1 MB file or less?

1 Like

You’re probably thinking of Nix because it lets you define a DAG and computational edges easily. If so, defining your data pipeline in something like Apache Airflow might be a better fit. You would still be able to use Nix for any sub-component of your pipeline if you wanted to, and you wouldn’t be restricted to storing your data in a Nix store.

1 Like

I once solved this problem using Nix, crafting special “ingress” and “egress” derivations that were impure. At that time my only recourse was __noChroot = true with a relaxed sandbox, but these days we have better mechanisms. This was for processing very large geospatial images as well as videos. A few derivations mapAttrs’d over the entire dataset, plus Hydra and an autoscaling group of builders, gave me fast iteration on different processing methods, solved the caching problem, and allowed easy scaling of workloads, local experimentation, and access to any software I needed for processing. I was able to completely replace and outperform Airflow with this approach.
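The shape of it was roughly the following (illustrative only, with made-up paths and a made-up processing tool; __noChroot requires sandbox = relaxed):

```nix
# Illustrative only: each dataset entry becomes its own derivation, so Hydra
# and the remote builders can cache and schedule them independently.
{ pkgs ? import <nixpkgs> {} }:

let
  # Stand-in for the real dataset index.
  scenes = {
    scene-001 = "/data/raw/scene-001.tif";
    scene-002 = "/data/raw/scene-002.tif";
  };

  process = name: rawPath:
    pkgs.runCommand "processed-${name}"
      {
        __noChroot = true;                    # the "ingress" escape hatch
        nativeBuildInputs = [ pkgs.gdal ];
      } ''
        mkdir -p $out
        gdal_translate -of COG ${rawPath} $out/${name}.tif
      '';
in
builtins.mapAttrs process scenes
```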

I did run into a few problems with large files (one happened to be a large ML model), where I created 50 sub-derivations that used Range requests to fetch it in chunks, allowing for parallel fetching and decompression, and then reconstituted it in a streaming fashion when loading it. But I believe this has improved since then (~2020).
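Roughly like this (not the original code; URL, chunk size/count and hashes are placeholders):

```nix
# Sketch of the chunked-fetch idea; each chunk is its own fixed-output
# derivation, so chunks are fetched in parallel and cached individually.
{ pkgs ? import <nixpkgs> {} }:

let
  url = "https://models.example.org/big-model.bin";   # made-up URL
  chunkSize = 1024 * 1024 * 1024;                     # 1 GiB per chunk

  fetchChunk = i:
    pkgs.runCommand "model-chunk-${toString i}"
      {
        # Fixed-output derivation: gets network access, one TOFU hash per chunk.
        outputHashAlgo = "sha256";
        outputHash = pkgs.lib.fakeSha256;
        nativeBuildInputs = [ pkgs.curl pkgs.cacert ];
      } ''
        start=$(( ${toString i} * ${toString chunkSize} ))
        end=$(( start + ${toString chunkSize} - 1 ))
        curl -L --range "$start-$end" -o $out ${url}
      '';

  chunks = builtins.genList fetchChunk 50;
in
# Reassemble the chunks into the final file.
pkgs.runCommand "big-model.bin" {} ''
  cat ${toString chunks} > $out
''
```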

We did use required features to help shard workloads effectively; better support for that subsystem would be helpful.

Overall it was a pleasant experience and extremely powerful, but I’d only recommend it to a Nix expert. (I’m working on a project to make this easier; please let me know if this is of interest.)

3 Likes

Without being able to comment on Apache Airflow, I would argue similarly that having Nix manage your transformations could absolutely make sense, but leaving those 50 GB files out of the Nix store would be helpful.

You could store a string representation of the source data path instead.
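Something like this (paths made up): the path literal gets copied into the store, while the plain string is only ever a pointer.

```nix
# Minimal illustration of the difference:
let
  copiedIntoStore = ./data/huge-file.grib;           # path literal: hashed and copied into /nix/store
  referencedOnly  = "/srv/datalake/huge-file.grib";  # plain string: never enters the store
in
{ inherit copiedIntoStore referencedOnly; }
```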

1 Like

I’ve thought of using impure pipelines, but I actually want the data in the store, and at that point using impure derivations results in far too many redownloads/rebuilds, which is way too expensive.

As an example, we have lots of datasets from the Copernicus Climate Data Store which are transformed from GRIB to Zarr and chunked in such a way that we can quickly access what we need and present it. Zarr is essentially a folder with many tiny files, which makes it great for cloud storage (though that can also get very expensive very quickly).
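A sketch of what that transform could look like as a derivation (not our actual pipeline; it assumes a Python environment with xarray/cfgrib/zarr/dask and an already-fetched GRIB file):

```nix
# Sketch of a GRIB -> Zarr transform as a single derivation.
{ pkgs ? import <nixpkgs> {}
, rawGrib ? /data/raw/era5-2023.grib   # stand-in for the fetched GRIB file
}:

let
  python = pkgs.python3.withPackages (ps: [ ps.xarray ps.cfgrib ps.zarr ps.dask ]);
in
pkgs.runCommand "era5-2023.zarr" { nativeBuildInputs = [ python ]; } ''
  python3 - <<EOF
  import xarray as xr
  ds = xr.open_dataset("${rawGrib}", engine="cfgrib")
  # Rechunk so time slices are cheap to read, then write the Zarr store.
  ds.chunk({"time": 24}).to_zarr("$out", mode="w")
  EOF
''
```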

Yeah, I thought of Airflow too, though I was leaning more towards Prefect 2.

What was large at the time?

For many cases we could go for smaller files; it just means keeping track of more FOD hashes, but that should be doable. Sometimes a big file is inevitable, though.

For accessing the data there is also https://github.com/obsidiansystems/nix/commit/overlayfs-store.

Hopefully next week I have some time to test with large files. I wonder how Nix performs with those nowadays.

2 Likes

you outperformed Airflow… OOOH RAR!

I think you and @FRidh should get your heads together.

Hopefully next week I have some time to test with large files. I wonder how Nix performs with those nowadays.

I would be really interested in hearing about the results of those tests. If there is anything at all that you can publish in terms of test builds and datasets, that would also be very welcome. I’m interested in benchmarking nixbuild.net with these kinds of builds. The “lazy” way we handle build inputs should be beneficial for builds that use large input files (but there might of course be other bottlenecks). On the other hand, builds that produce huge output files are probably a bit more problematic, since we run all builds entirely in tmpfs.

2 Likes