I’m trying to use Nix as a “polyglot” build automation tool for data science. It’s mostly exploration, but it could be nice if I got it to work!
Essentially, I want some Python code to generate some data, R code to then generate a plot, and finally Quarto to convert a Markdown file to a PDF. Here is my `default.nix`:
let
  pkgs = import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/27285241da3bb285155d549a11192e9fdc3a0d04.tar.gz") {};

  tex = (pkgs.texlive.combine {
    inherit (pkgs.texlive) scheme-small;
  });

  # Derivation to generate the CSV file using Python
  generateCsv = pkgs.stdenv.mkDerivation {
    name = "generate-csv";
    src = ./.;
    buildInputs = with pkgs.python312Packages; [ scikit-learn pandas ];
    buildPhase = ''
      python -c "
      from sklearn.datasets import load_iris
      import pandas as pd
      iris = load_iris(as_frame=True)
      df = iris['frame']
      df.to_csv('iris.csv', index=False)
      "
    '';
    installPhase = ''
      mkdir -p $out
      cp iris.csv $out/
    '';
  };

  # Derivation to generate the plot from the CSV using R
  generatePlot = pkgs.stdenv.mkDerivation {
    name = "generate-plot";
    src = ./.;
    buildInputs = with pkgs; [ R rPackages.ggplot2 rPackages.janitor ];
    buildPhase = ''
      Rscript -e "
      library(ggplot2)
      library(janitor)
      iris <- read.csv('${generateCsv}/iris.csv') |>
        clean_names() |>
        transform(species = as.character(target))
      p <- ggplot(iris, aes(x = sepal_length_cm, y = sepal_width_cm, color = species)) +
        geom_point(size = 3) +
        labs(title = 'Sepal Length vs Sepal Width',
             x = 'Sepal Length (cm)',
             y = 'Sepal Width (cm)') +
        theme_minimal() +
        theme(plot.title = element_text(hjust = 0.5))
      ggsave('plot.png', plot = p, width = 6, height = 4, dpi = 300)
      "
    '';
    installPhase = ''
      mkdir -p $out
      cp plot.png $out/
    '';
  };

  # Derivation to render the PDF report from the Quarto Markdown file
in
pkgs.stdenv.mkDerivation {
  name = "generate-report";
  src = ./.;
  buildInputs = [ pkgs.quarto tex ];
  buildPhase = ''
    cp ${generatePlot}/plot.png .
    # Deno needs to add stuff to $HOME/.cache
    # so we give it a home to do this
    mkdir home
    export HOME=$PWD/home
    quarto render report.Qmd --to pdf
  '';
  installPhase = ''
    mkdir -p $out
    cp report.pdf $out/
  '';
}
And here is the `report.Qmd`:
---
title: "My document"
format:
  html:
    toc: true
    embed_resources: true
---
# Data Report
This report shows a scatter plot of the generated random data.
![Scatter Plot](plot.png)
This works: I do get a nice PDF. However, each time I call `nix-build`, the whole thing runs again and produces another PDF in another store path, even though the inputs are deterministic.
What am I missing? I also tried adding the previous derivations, `generateCsv` and `generatePlot`, as `buildInputs` of the document, to no avail.
@waffle8946 The tip in that link is not relevant, since it’s the same source directory in use.
That said, @brodriguesco, pointing `src` at `./.` is often a cause of issues. In this case, I wager it’s because of the `result` symlink. When you first build, `./.` contains no `result` symlink. When you build again, it’s a different source directory, because this one has a `result` symlink pointing at the last result. That produces a new build with a new `result` symlink. Since the `result` symlink is different now, the next build will have a different source again. Ad infinitum.
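One way to check this diagnosis is to ask Nix which store path the directory maps to before and after a build. The following is only a sketch: `probe.nix` is a hypothetical file name, assumed to sit next to `default.nix`, and the attribute names are made up for illustration.

```nix
# probe.nix (hypothetical helper file, placed next to default.nix)
{
  # Store path that the whole directory is copied to. It is derived from
  # the directory's contents, including any `result` symlink.
  raw = "${./.}";

  # Whether the directory currently contains an entry called `result`.
  hasResult = (builtins.readDir ./.) ? result;
}
```

Evaluating it with `nix-instantiate --eval --strict probe.nix` once before and once after a `nix-build` should, if the explanation above is right, print a different `raw` path each time.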
The fix is to filter the source. `nixpkgs` has a lib function called `cleanSource` that will filter out stuff like this and your `.git` directory, so you can do `src = pkgs.lib.cleanSource ./.;`. Or you can use `src = builtins.fetchGit ./.;` to only get the files tracked by git (meaning you’ll have to `git add` any new files before building); providing no `rev` argument means it’ll get the dirty worktree. This is actually what flakes do, so using flakes is another option.
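If `cleanSource` filters too little for a given project, nixpkgs also provides `lib.cleanSourceWith`, which takes a custom filter. Here is a minimal sketch that reuses the nixpkgs pin from the question. The extra rule that drops `.pdf` files is purely an illustrative assumption, not something this thread requires.

```nix
let
  pkgs = import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/27285241da3bb285155d549a11192e9fdc3a0d04.tar.gz") {};
  inherit (pkgs) lib;
in
lib.cleanSourceWith {
  src = ./.;
  # `path` is the absolute path as a string; `type` is "regular",
  # "directory", "symlink" or "unknown".
  filter = path: type:
    # Keep only what lib.cleanSource would keep, so `result` symlinks
    # and `.git` are dropped...
    lib.cleanSourceFilter path type
    # ...and additionally drop PDFs lying around in the working
    # directory (assumed rule, for illustration only).
    && !(lib.hasSuffix ".pdf" (baseNameOf path));
}
```

The value this evaluates to can be used as `src` in the same way as the result of `cleanSource`.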
Oops, I didn’t read my own link; I thought it mentioned `filterSource` and friends. I must’ve been thinking of “Working with local files” in the nix.dev documentation.
From the Nix code, it looks like you don’t actually need any `src` for your derivations. In that case you can do this instead:
stdenv.mkDerivation {
  # ...
  dontUnpack = true;
}
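For derivations like these, which have no source at all, `pkgs.runCommand` is another option: it runs a single script instead of the usual unpack, build and install phases. Below is a rough sketch of how the CSV step might look that way. It was not tried in this thread, and it swaps in `python312.withPackages` to get an interpreter that can see both libraries, which is an assumption on my part rather than what the original expression does.

```nix
let
  pkgs = import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/27285241da3bb285155d549a11192e9fdc3a0d04.tar.gz") {};

  # Python interpreter bundled with the two libraries the script imports.
  pythonEnv = pkgs.python312.withPackages (ps: [ ps.scikit-learn ps.pandas ]);
in
# runCommand has no `src`, so there is nothing to unpack.
pkgs.runCommand "generate-csv" { nativeBuildInputs = [ pythonEnv ]; } ''
  mkdir -p $out
  python -c "
  from sklearn.datasets import load_iris
  import pandas as pd
  iris = load_iris(as_frame=True)
  df = iris['frame']
  df.to_csv('$out/iris.csv', index=False)
  "
''
```

The Python lines are kept at the same indentation as the `python -c` line on purpose: Nix strips the common leading whitespace from the string, so any extra indentation would reach Python.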
Thank you all, that was indeed the issue, @ElvishJerricco! Because `result` was always changing, the whole derivation was getting rebuilt. Using `cleanSource` worked wonders! And indeed, as you point out, @Infinisil, `src` is actually only needed for the very last derivation, so that `quarto` finds the input `.Qmd` file.
And thanks, @waffle8946, this looks like a nice website; I’ve added it to my bookmarks.
For reference, here is what the expression looks like now:
let
  pkgs = import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/27285241da3bb285155d549a11192e9fdc3a0d04.tar.gz") {};

  tex = (pkgs.texlive.combine {
    inherit (pkgs.texlive) scheme-small;
  });

  # Derivation to generate the CSV file using Python
  generateCsv = pkgs.stdenv.mkDerivation {
    name = "generate-csv";
    buildInputs = with pkgs.python312Packages; [ scikit-learn pandas ];
    dontUnpack = true;
    buildPhase = ''
      python -c "
      from sklearn.datasets import load_iris
      import pandas as pd
      iris = load_iris(as_frame=True)
      df = iris['frame']
      df.to_csv('iris.csv', index=False)
      "
    '';
    installPhase = ''
      mkdir -p $out
      cp iris.csv $out/
    '';
  };

  # Derivation to generate the plot from the CSV using R
  generatePlot = pkgs.stdenv.mkDerivation {
    name = "generate-plot";
    buildInputs = with pkgs; [ R rPackages.ggplot2 rPackages.janitor ];
    dontUnpack = true;
    buildPhase = ''
      Rscript -e "
      library(ggplot2)
      library(janitor)
      iris <- read.csv('${generateCsv}/iris.csv') |>
        clean_names() |>
        transform(species = as.character(target))
      p <- ggplot(iris, aes(x = sepal_length_cm, y = sepal_width_cm, color = species)) +
        geom_point(size = 3) +
        labs(title = 'Sepal Length vs Sepal Width',
             x = 'Sepal Length',
             y = 'Sepal Width') +
        theme_minimal() +
        theme(plot.title = element_text(hjust = 0.5))
      ggsave('plot.png', plot = p, width = 6, height = 4, dpi = 300)
      "
    '';
    installPhase = ''
      mkdir -p $out
      cp plot.png $out/
    '';
  };

  # Derivation to render the PDF report from the Quarto Markdown file
in
pkgs.stdenv.mkDerivation {
  name = "generate-report";
  buildInputs = [ pkgs.quarto tex ];
  src = pkgs.lib.cleanSource ./.;
  buildPhase = ''
    cp ${generatePlot}/plot.png .
    # Deno needs to add stuff to $HOME/.cache
    # so we give it a home to do this
    mkdir home
    export HOME=$PWD/home
    quarto render report.Qmd --to pdf
  '';
  installPhase = ''
    mkdir -p $out
    cp report.pdf $out/
  '';
}
https://nix.dev is another official website, FYI, though there’s a desire to move it to docs.nixos.org. @waffle8946 linked to a tutorial for `lib.fileset`, which you could also use here to include exactly the file you need, and not potentially many others:
pkgs.stdenv.mkDerivation {
  name = "generate-report";
  buildInputs = [ pkgs.quarto tex ];
  src = pkgs.lib.fileset.toSource {
    root = ./.;
    # Only include report.Qmd in the source
    fileset = ./report.Qmd;
  };
}
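If the report later needs more than one file, `lib.fileset.unions` can combine several paths into one file set. This is a sketch under assumptions: `references.bib` is a hypothetical extra file that would have to exist in the directory, and the pinned nixpkgs revision is assumed to be recent enough to ship `lib.fileset`.

```nix
let
  pkgs = import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/27285241da3bb285155d549a11192e9fdc3a0d04.tar.gz") {};
  fs = pkgs.lib.fileset;
in
fs.toSource {
  root = ./.;
  # Only the listed files end up in the source; everything else in the
  # directory, including the `result` symlink, is left out.
  fileset = fs.unions [
    ./report.Qmd
    ./references.bib  # hypothetical second input, for illustration only
  ];
}
```

Because the file set names its members explicitly, new files in the directory do not change the source unless they are added to the list.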
That’s really useful, thanks! I realize I did know about https://nix.dev, but only ever found the reference manual, not the tutorials (which also look different). I’ll definitely check out the tutorials!