Flakes and database management systems

I’ve been working on getting neo4j (a graph DBMS) working with flakes, although I believe the basic idea extends to any DBMS. The main issue I’ve had is where to put the DB, the logs, and the run directory. I’m trying to find a solution that fits the principles of flakes, keeping the flake computer-independent so that it should be possible to reproduce the DB on another computer just by running the flake.

I’ve made some progress with my neo4j flake, which installs the community edition of neo4j and provides a wrapper to allow project-specific plugin management and DBs. This requires supplying a path to the project’s DB, which means it’s dependent on the DB already existing on the computer and is therefore not a proper flake solution (worse yet, this generally requires hard-coding a user’s home directory into the path).

The ideal solution

I would like to have the DBs stored within /nix/store. In this way we could add a derivation maker for DBs that would describe how to generate the database in a reproducible manner (such as downloading data and importing it into neo4j). You would then write a flake for each DB and use that flake as another input for the neo4j wrapper. This would allow multiple flakes to use the same DB, and an individual flake could have a normal DB and a separate test DB provided to checkInputs (and, if needed, it would be easy to swap the normal DB for the test DB to interactively play around on a smaller DB).
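To make the shape concrete, here’s roughly what I’m imagining (everything here is hypothetical: the zkc-db and neo4j-env flakes and the mkEnv function don’t exist yet, this is just the layout I’d like to end up with):

{
  inputs = {
    # A hypothetical flake that builds one specific DB with the
    # mkDatabase builder shown further down this thread.
    zkc-db.url = "github:example/zkc-db";
    # A hypothetical flake providing the neo4j wrapper.
    neo4j-env.url = "github:example/neo4j-env";
  };

  outputs = { self, zkc-db, neo4j-env }: {
    # Pair the wrapper with a DB; swapping in a test DB would be a
    # one-line change here.
    packages.x86_64-linux.default = neo4j-env.lib.mkEnv {
      db = zkc-db.packages.x86_64-linux.default;
    };
  };
}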

Side note on storage

This raises the issue of storage: data is generally going to be much larger than software, so having multiple copies in /nix/store is potentially unacceptable. The data, however, should be largely static; once it has been properly set up, nix should have no reason to regenerate it unless it’s been specifically modified. So I don’t believe this will actually be a problem, and it shouldn’t require any more rebuilds than would be needed outside of nix management, though it would potentially warrant a method for aggressively garbage collecting old DBs.

Problem

The Nix store is read-only to normal users, which makes it unsuitable for storing the logs and the run directory. Additionally, neo4j will not run if the DB is read-only. (Though I’m leaning towards preferring a read-only DB, since running any command that modifies the DB makes it non-reproducible, so I’m hoping there’s a solution on the neo4j side that allows me to use a read-only DB.) One solution would be to run as root, but for obvious reasons I don’t think neo4j should be run with root privileges; it really shouldn’t need that much power.

Is there a way to get systemctl to handle this while keeping everything contained in a flake? Are there any ideas from other DBMSs for where we could put the logs, the run directory, and the DB such that a normal user would be able to access them? Could I make a directory under /var that a normal user would own, put everything there, and have the DB derivations be symlinks to the DBs in /var/neo4j/db, or something like that?
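For the systemctl idea, the best I can picture so far is the flake generating a systemd user unit that keeps all mutable state out of the store. A sketch only; cfg (a neo4j.conf pointing data/run/logs under the state directory) and the unit layout are assumptions on my part:

{ writeTextFile, neo4j, cfg }:

writeTextFile {
  name = "neo4j-user-service";
  destination = "/share/systemd/user/neo4j.service";
  text = ''
    [Service]
    # For user units, StateDirectory lands under the user's state
    # directory ($XDG_STATE_HOME in recent systemd), giving the DB a
    # writable home without root or writes to /nix/store.
    StateDirectory=neo4j
    Environment=NEO4J_CONF=${cfg}
    ExecStart=${neo4j}/bin/neo4j console
  '';
}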


A mix of thoughts and questions (none of which really solve your issue, some tangential…):

  1. Are you aspiring to a read-only database for reasons intrinsic to your app/service/use-case, or is this just an outgrowth of trying to square it with Nix?

  2. A lot of the difficulties persistent servers pose for the Nix model are generally wrangled by modules. I’m not sure if this opinion will be unpopular, but I’ve wondered if modules are mis-abstracted, and if most of the work they do (i.e., generating some config) would be better supported by functions attached directly to the passthru of the relevant packages (there’s a sketch of what I mean after this list). I think this might make it easier to build better levers and experiment more?

  3. I had a half-baked thought a while back… roughly whether we could handle the persistent-storage question with mountable disk images/volumes/etc. With a persistent absolute path you have to worry about all of the sad problems of statefulness, like whether different versions of the database server will share and corrupt the data store. Images could be UUID-identified so that an unintentional clash is less likely.

    We could use an expression for selecting the image with instructions for creating it deterministically if there’s no UUID specified. If the UUID is specified but the image is missing I suppose it should fail, though it might be nice to have bolt-on fetchers for pulling the image from elsewhere? Hash check could be optional if it makes sense for the use-case. Seems like there might be a way to combine this with #2 to have functions for applying ~version-compat migrations to the image?
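To make #2 a little more concrete, here’s the rough shape I have in mind; mkConfig and its arguments are made up purely for illustration:

{ neo4j, writeText }:

# Sketch: the package itself carries a function for rendering a config
# file, so any environment can generate one without a NixOS module.
neo4j.overrideAttrs (old: {
  passthru = (old.passthru or { }) // {
    mkConfig = { dataDir, logsDir }:
      writeText "neo4j.conf" ''
        dbms.directories.data=${dataDir}
        dbms.directories.logs=${logsDir}
      '';
  };
})

A wrapper (or your flake) could then call neo4j.mkConfig directly instead of routing everything through module options.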

  1. Overnight I was thinking that my use-case probably isn’t the norm, where you may have new customer (or whatever) data being added to the DB regularly. Instead, I generally receive complete data that I then convert to neo4j, so the same data should always produce the same DB (minus some randomly generated IDs). My desire for a read-only database stems from the same reason any analyst should move from working directly in a REPL to writing everything down as a script.

And in fact, for my current project, I received the DB with the note that the guy who produced it did so interactively and didn’t keep track of how he created it, so I ended up having to start from scratch when we needed to make some changes. Similarly, when I had a younger guy working with me, even after I specifically told him not to modify the DB itself, the first thing he did was run a bunch of write commands to generate some intermediate values (good thing I can easily rebuild the DB 🙂). So enforcing read-only could be good, but I don’t think neo4j will allow it (especially since I’m pretty certain it caches data and creates indices).

  2. Modules have been the suggested solution to past questions I’ve asked on the subject, but I don’t think they’re a good solution here. I have far from a great understanding of modules, but we shouldn’t be doing this at the OS level. First, these DBs should be local to the project they’re in, not globally available. Second, they should require as little outside work as possible to get running, so that you can simply run the flake to get the environment up without having to also set up a module. Third, modules require the user to be running NixOS, making it harder to share.

  3. As for versions of the DBMS/server: if we could get the DBs described in flakes, then when neo4j releases version 5.0.0 and I update my neo4j flake to it, and then update a project depending on it, it should complain that the DB was built with version 4 and is not compatible. At that point I can update the neo4j input in the DB flake and the DB will be rebuilt with version 5, and everything should work. This would also let us test version 5 outside of the production server before updating the real thing. Then you don’t have to worry about different versions of neo4j trying to open the same DB; there is a separate one for each version.
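The check itself could be tiny if the DB derivation recorded the neo4j it was imported with (neo4jVersion is a made-up attribute here, and this is only a sketch):

{ lib, neo4j, db }:

# Refuse to pair a DB with a neo4j whose major version differs from
# the one the DB was imported with.
assert lib.assertMsg
  (lib.versions.major neo4j.version == lib.versions.major db.neo4jVersion)
  "${db.name} was imported with neo4j ${db.neo4jVersion}, not ${neo4j.version}";
db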

As far as the hash check goes, if you’re talking about hashing the entire DB, I don’t believe that will work: at least for neo4j, the importer is non-deterministic (it runs in parallel, so the order of entries will change and the generated IDs will differ), and even if we’re only trying to determine whether we’re loading the same already-built DB, I still think neo4j adds caches and metadata to the DB as it’s used. But a nix expression describing how to import the data should produce a hash to check against.

Another thing I was thinking about is creating a neo4j group and having the DBs installed to /var/neo4j/dbs (prefixed with their nix store hash), which the neo4j group can write to. Then a user would just have to be added to the neo4j group. Though I’m not positive it’s possible (if a group doesn’t have write access to /var, can it have write access to /var/neo4j?). Plus this would require a bit of external work to set up the group and /var/neo4j/dbs, but at least it would only have to be done once. It also has the issue of not being garbage collected by nix.

Actually, if putting DBs in /var/neo4j/dbs would work, I could put together a general neo4j module (or maybe even a general DB module that’s passed a DBMS) which would create the directories/gids to do the setup part. That way someone not using NixOS could still use the flake, they would just have to set up their environment manually, while NixOS users could have an explicit setup.
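On NixOS that setup-only module could be almost nothing (a sketch; the paths and group name are just the ones I’ve been using above):

{ config, lib, pkgs, ... }:

{
  # Create the group and a group-writable directory for the DBs; the
  # setgid bit makes new files inherit the neo4j group.
  users.groups.neo4j = { };
  systemd.tmpfiles.rules = [ "d /var/neo4j/dbs 2775 root neo4j -" ];
}

Non-NixOS users would just run the equivalent groupadd/mkdir/chmod once by hand.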

I tried using /var/neo4j/${hash}-${name} for the DB-specific paths, making /var/neo4j writable by the normal user (i.e., me), and then creating symlinks back to them from /nix/store/${hash}-mkDatabase/share/neo4j/{data,run,logs} (where neo4j expects the DB-dependent paths). This has the advantage of not needing to patch the paths in the neo4j conf.

This ended up failing because builds run as the nixbld user inside a sandbox that can’t see anything outside /nix/store, so it gives a “could not find path” error when trying to create or move things under /var.

This leads to a theoretical solution: change the DB location in the neo4j conf to /var/neo4j/${hash}-${name}, then have the neo4j wrapper copy the DB symlinked into the current neo4j environment to that /var/neo4j location. But this requires making a full copy of the DB, which is far from ideal. I’m hoping there’s still some way to take advantage of makeWrapper to get this working. It seems pretty equivalent to a package searching for configuration files outside of /nix/store, which nix is okay with, so it should be possible without breaking nix philosophy.
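Roughly what I have in mind for that wrapper (a sketch: db, hash, and name would come from the mkDatabase builder below, and it assumes the conf in the store already points data/run/logs at the /var location):

{ writeShellScriptBin, neo4j, db, hash, name }:

writeShellScriptBin "neo4j" ''
  varDir=/var/neo4j/${hash}-${name}
  if [ ! -e "$varDir"/data ]; then
    # First launch: copy the immutable DB out of the store so neo4j
    # can open it read-write.
    mkdir -p "$varDir"
    cp -r ${db}/share/neo4j/data "$varDir"/data
    chmod -R u+w "$varDir"/data
  fi
  NEO4J_HOME=${db}/share/neo4j exec ${neo4j}/bin/neo4j "$@"
''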

For reference, here’s the current attempt for the DB builder:

{ symlinkJoin, makeWrapper, neo4j, jre, unzip }:

{ src ? "", pname, version, importPhase ? "", preImport ? "", postImport ? ""
, meta ? { } }:

symlinkJoin rec {
  inherit src pname version meta;

  name = "${pname}-${version}";
  preferLocalBuild = true;
  allowSubstitutes = false;
  paths = [ neo4j ];
  nativeBuildInputs = [ neo4j makeWrapper unzip ];
  # Hash of this builder's source directory; used to give each DB a
  # stable, unique directory name outside the store. Passing `name`
  # makes the match independent of what this directory is called.
  hash = builtins.head (builtins.match "/nix/store/(.*)-mkDatabase"
    (builtins.path { path = ./.; name = "mkDatabase"; }));

  # By default, when `importPhase` is empty, do not generate a
  # db. This is equivalent to a fresh install of neo4j.
  inherit importPhase preImport postImport;

  # symlinkJoin only accepts postBuild so hack into this to make it an
  # entry point for everything I need to do.
  postBuild = ''
    runHook configPhase
    runHook importPhase
    runHook linkPhase
  '';

  configPhase = ''
    rm -r "$out"/share/neo4j/{data,run,logs}
    rm "$out"/bin/neo4j-admin

    makeWrapper "$out"/share/neo4j/bin/neo4j-admin \
        "$out"/bin/neo4j-admin \
        --set JAVA_HOME "${jre}" \
        --set NEO4J_HOME "$out"/share/neo4j

    # Replace the symlinked config with a real copy so it can be
    # modified without touching the neo4j package.
    conf=$out/share/neo4j/conf/neo4j.conf
    origconf=$(readlink $conf)
    rm $conf
    cp $origconf $conf

    # Make the wrapped neo4j-admin visible to the importPhase.
    PATH="$out"/bin:$PATH
  '';

  linkPhase = ''
    varDir=/var/neo4j/${hash}-${name}
    mkdir -p $varDir

    # Move the mutable directories out of the store and leave symlinks
    # behind where neo4j expects them.
    for d in data run logs
    do
        mv "$out"/share/neo4j/$d $varDir
        ln -s $varDir/$d "$out"/share/neo4j/
    done
  '';
}

This fails in the link phase since nixbld can’t make the var directory.

Then an example graph would be packaged with:

{ mkDatabase, fetchurl, lib, unzip }:

mkDatabase {
  pname = "zkc";
  version = "0.0.1";

  src = fetchurl {
    url = "https://aaronclauset.github.io/data/zkcc-77.zip";
    sha256 = "0g4wgarlp0hva4lmcdl8dc5c0hqxrdhxxsjv01m064izzjqrgxzx";
  };

  # Convert the raw karate club data into the node/edge TSVs that
  # neo4j-admin import expects.
  preImport = ''
    ${unzip}/bin/unzip $src

    mkdir -p import
    echo -e "personId:ID\tgroup" > import/nodes.tsv
    cat zkcc-77/karate_groups.txt >> import/nodes.tsv

    echo -e ":START_ID\t:END_ID" > import/edges.tsv
    cat zkcc-77/karate_edges_77.txt >> import/edges.tsv
  '';

  importPhase = ''
    runHook preImport

    neo4j-admin import --delimiter="\t" \
                       --nodes=member=import/nodes.tsv \
                       --relationships=interactions=import/edges.tsv

    runHook postImport
  '';

  meta = with lib; {
    description = "Zachary karate club graph";
    homepage = "http://www.konect.cc/networks/ucidata-zachary/";
  };
}