Derivation always gets rebuilt

I’m trying to use Nix as a “polyglot” build automation tool for data science. It’s mostly exploration, but it could be nice if I got it to work!

Essentially, I want some Python code to generate some data, R code to then generate a plot, and finally convert a Markdown file to a PDF using Quarto. Here is my default.nix:

let
 pkgs = import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/27285241da3bb285155d549a11192e9fdc3a0d04.tar.gz") {};

 tex = (pkgs.texlive.combine {
   inherit (pkgs.texlive) scheme-small;
 });

  # Derivation to generate the CSV file using Python
  generateCsv = pkgs.stdenv.mkDerivation {
    name = "generate-csv";
    src = ./.;
    buildInputs = with pkgs.python312Packages; [ scikit-learn pandas ];
    buildPhase = ''
      python -c "

from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris(as_frame=True)

df = iris['frame']


df.to_csv('iris.csv', index=False)

"
    '';
    installPhase = ''
      mkdir -p $out
      cp iris.csv $out/
    '';
  };

  # Derivation to generate the plot from the CSV using R
  generatePlot = pkgs.stdenv.mkDerivation {
    name = "generate-plot";
    src = ./.;
    buildInputs = with pkgs; [ R rPackages.ggplot2 rPackages.janitor ];
    buildPhase = ''
      Rscript -e "

library(ggplot2)
library(janitor)

iris <- read.csv('${generateCsv}/iris.csv') |>
  clean_names() |>
  transform(species = as.character(target))

p <- ggplot(iris, aes(x = sepal_length_cm, y = sepal_width_cm, color = species)) +
    geom_point(size = 3) +                
    labs(title = 'Sepal Length vs Sepal Width',
         x = 'Sepal Length (cm)',           
         y = 'Sepal Width (cm)') +           
    theme_minimal() +                         
    theme(plot.title = element_text(hjust = 0.5)) 


ggsave('plot.png', plot = p, width = 6, height = 4, dpi = 300)

"
    '';
    installPhase = ''
      mkdir -p $out
      cp plot.png $out/
    '';
  };

  # Derivation to generate the PDF report from Markdown
in
  pkgs.stdenv.mkDerivation {
    name = "generate-report";
    src = ./.;
    buildInputs = [ pkgs.quarto tex ];

    buildPhase = ''

cp ${generatePlot}/plot.png .

# Deno needs to add stuff to $HOME/.cache
# so we give it a home to do this
mkdir home
export HOME=$PWD/home
quarto render report.Qmd --to pdf

    '';

    installPhase = ''
      mkdir -p $out
      cp report.pdf $out/
    '';
}

and here is the report.Qmd:

---
title: "My document"
format:
  html:
    toc: true
    embed-resources: true
---

# Data Report

This report shows a scatter plot of the generated random data.

![Scatter Plot](plot.png)

This works, as I do get a nice PDF. However, each time I call nix-build the whole thing runs again and produces another PDF in another store path, even though the inputs are deterministic.

What am I missing? I also tried putting the previous derivations generateCsv and generatePlot as buildInputs to the document, to no avail.

https://nix.dev/guides/best-practices.html#reproducible-source-paths


@waffle8946 The tip in that link is not relevant, since it’s the same source directory in use.

That said, @brodriguesco, pointing src at ./. is often a cause of issues. In this case, I wager it’s because of the result symlink. When you first build, ./. contains no result symlink. When you build again, it’s a different source directory, because it now contains a result symlink pointing at the last result. This triggers a new build, which produces a new result symlink. Since the result symlink is different now, the next build will have a different source again. Ad infinitum.
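
One way to check this, as a minimal sketch: the expression below lists the project directory much as src = ./. would see it and reports whether a result entry is present. The file name check-source.nix and the attribute names are only illustrative; builtins.readDir is a standard Nix builtin.

```nix
# check-source.nix (hypothetical file name, placed next to default.nix)
let
  # readDir maps each entry name to its type:
  # "regular", "directory", "symlink" or "unknown"
  entries = builtins.readDir ./.;
in
{
  # true once a previous nix-build has left its result symlink behind
  hasResult = entries ? result;
  # the entry's type when present, null otherwise
  resultType = entries.result or null;
}
```

Evaluating it with nix-instantiate --eval --strict check-source.nix before and after a build should show hasResult change, which is the kind of change that gives an unfiltered ./. a new hash.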

The fix is to filter the source. nixpkgs has a lib function called cleanSource that filters out things like the result symlink and your .git directory, so you can write src = pkgs.lib.cleanSource ./.;. Alternatively, src = builtins.fetchGit ./.; includes only the files tracked by git (meaning you’ll have to git add any new files before building); with no rev argument, it uses the dirty worktree. This is actually what flakes do, so using flakes is another option.
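
As a rough sketch of the two options side by side (the attribute names filtered and tracked are only illustrative, and the second one assumes the project directory is a git repository):

```nix
let
  # Same pinned nixpkgs as in the question
  pkgs = import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/27285241da3bb285155d549a11192e9fdc3a0d04.tar.gz") {};
in
{
  # Option 1: copy ./. but skip result symlinks, .git and similar clutter
  filtered = pkgs.lib.cleanSource ./.;

  # Option 2: only files tracked by git; untracked files,
  # such as the result symlink, are left out
  tracked = builtins.fetchGit ./.;
}
```

Either value can be used as the src of a derivation in place of ./.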


Oops, I didn’t read my own link; I thought it mentioned filterSource and friends. I must have been thinking of “Working with local files” in the nix.dev documentation.


From the Nix code, it looks like you don’t actually need any src for your derivations. In that case you can do this instead:

stdenv.mkDerivation {
  # ...
  dontUnpack = true;
}

Thank you all, that was indeed the issue, @ElvishJerricco! Because result was always changing, the whole derivation was rebuilt every time. Using cleanSource worked wonders! And indeed, as you point out, @Infinisil, src is only needed for the very last derivation, so that quarto can find the input .Qmd file.

And thanks @waffle8946, this looks like a nice website; I’ve added it to my bookmarks.

For reference, here is what the expression looks like now:

let
 pkgs = import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/27285241da3bb285155d549a11192e9fdc3a0d04.tar.gz") {};

 tex = (pkgs.texlive.combine {
   inherit (pkgs.texlive) scheme-small;
 });

  # Derivation to generate the CSV file using Python
  generateCsv = pkgs.stdenv.mkDerivation {
    name = "generate-csv";
    buildInputs = with pkgs.python312Packages; [ scikit-learn pandas ];
    dontUnpack = true;
    buildPhase = ''
      python -c "

from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris(as_frame=True)

df = iris['frame']


df.to_csv('iris.csv', index=False)

"
    '';
    installPhase = ''
      mkdir -p $out
      cp iris.csv $out/
    '';
  };

  # Derivation to generate the plot from the CSV using R
  generatePlot = pkgs.stdenv.mkDerivation {
    name = "generate-plot";
    buildInputs = with pkgs; [ R rPackages.ggplot2 rPackages.janitor ];
    dontUnpack = true;
    buildPhase = ''
      Rscript -e "

library(ggplot2)
library(janitor)

iris <- read.csv('${generateCsv}/iris.csv') |>
  clean_names() |>
  transform(species = as.character(target))

p <- ggplot(iris, aes(x = sepal_length_cm, y = sepal_width_cm, color = species)) +
    geom_point(size = 3) +                
    labs(title = 'Sepal Length vs Sepal Width',
         x = 'Sepal Length',           
         y = 'Sepal Width') +           
    theme_minimal() +                         
    theme(plot.title = element_text(hjust = 0.5)) 


ggsave('plot.png', plot = p, width = 6, height = 4, dpi = 300)

"
    '';
    installPhase = ''
      mkdir -p $out
      cp plot.png $out/
    '';
  };

  # Derivation to generate the PDF report from Markdown
in
  pkgs.stdenv.mkDerivation {
    name = "generate-report";
    buildInputs = [ pkgs.quarto tex ];
    src = pkgs.lib.cleanSource ./.;
    buildPhase = ''

cp ${generatePlot}/plot.png .

# Deno needs to add stuff to $HOME/.cache
# so we give it a home to do this
mkdir home
export HOME=$PWD/home
quarto render report.Qmd --to pdf

    '';

    installPhase = ''
      mkdir -p $out
      cp report.pdf $out/
    '';
 }


https://nix.dev is another official website, FYI, though there’s a desire to move it to docs.nixos.org. @waffle8946 linked to a tutorial for lib.fileset, which you could also use here to include exactly the file you need and nothing else:

  pkgs.stdenv.mkDerivation {
    name = "generate-report";
    buildInputs = [ pkgs.quarto tex ];
    src = pkgs.lib.fileset.toSource {
      root = ./.;
      # Only include report.Qmd in the source
      fileset = ./report.Qmd;
    };
  }
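
If the report later needs more than one file, lib.fileset.unions can combine them. A sketch, assuming the same pinned nixpkgs as in the question and a hypothetical references.bib next to report.Qmd (evaluation fails if a listed file does not exist):

```nix
let
  pkgs = import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/27285241da3bb285155d549a11192e9fdc3a0d04.tar.gz") {};
  fs = pkgs.lib.fileset;
in
# The resulting value can be used as src in the report derivation
fs.toSource {
  root = ./.;
  fileset = fs.unions [
    ./report.Qmd
    ./references.bib # hypothetical extra input, not part of the original project
  ];
}
```

Anything not listed, including the result symlink, stays out of the source, so it cannot trigger a rebuild.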

That’s really useful, thanks! I realize I did know about https://nix.dev, but I only ever found the reference manual, not the tutorials (which also look different). I’ll definitely check out the tutorials!
