Haunted nix build breaks isolation

I’m trying to build a package from a modified nixpkgs folder. Here are the modified contents of the package I’m building:

{ lib, stdenv, fetchzip, makeWrapper, jre, python3Packages, coreutils, hadoop_3_1, pkg-config, curl, autoPatchelfHook, cacert
, RSupport? true, R
}:

with lib;

stdenv.mkDerivation rec {

  pname = "spark";
  version = "3.1.2";

  src = fetchzip {
    url    = "mirror://apache/spark/${pname}-${version}/${pname}-${version}.tgz";
    sha256 = "0npld46bz20ixwaqsg65j6xwlb8nar1dhz2yfd0z5vzhy10qcq11";
  };

  nativeBuildInputs = [ makeWrapper pkg-config autoPatchelfHook ];
  buildInputs = [ jre python3Packages.python python3Packages.numpy curl ]
    ++ optional RSupport R;

  buildPhase = ''
    seq 3 | while read line; do
      patchShebangs .
      autoPatchelf .
      export SSL_CERT_FILE=${cacert}/etc/ssl/certs/ca-bundle.crt
      export HOME=$out
      HOME=$out ./build/mvn -DskipTests clean package -Phive -Dhadoop.version=3.0.0-cdh6.3.1 -Dhive.version=2.1.1-cdh6.3.1 -Dlibthrift.version=0.9.3-1 -Dmaven.repo.local=$out/.m2 || true
    done
  '';

  outputHashAlgo = "sha256";
  outputHashMode = "recursive";
  outputHash = "0npld46bz20ixwaqsg65j6xwlb8nar1dhz2yfd0z5vzhy10qcqqq";

  #untarDir = "${pname}-${version}-bin-without-hadoop";
  #installPhase = ''
    #mkdir -p $out/{lib/${untarDir}/conf,bin,/share/java}
    #mv * $out/lib/${untarDir}

    #sed -e 's/INFO, console/WARN, console/' < \
    #   $out/lib/${untarDir}/conf/log4j.properties.template > \
    #   $out/lib/${untarDir}/conf/log4j.properties

    #cat > $out/lib/${untarDir}/conf/spark-env.sh <<- EOF
    #export JAVA_HOME="${jre}"
    #export SPARK_HOME="$out/lib/${untarDir}"
    #export SPARK_DIST_CLASSPATH=$(${hadoop_3_1}/bin/hadoop classpath)
    #export PYSPARK_PYTHON="${python3Packages.python}/bin/${python3Packages.python.executable}"
    #export PYTHONPATH="\$PYTHONPATH:$PYTHONPATH"
    #${optionalString RSupport
    #  ''export SPARKR_R_SHELL="${R}/bin/R"
    #    export PATH=$PATH:"${R}/bin/R"''}
    #EOF

    #for n in $(find $out/lib/${untarDir}/bin -type f ! -name "*.*"); do
    #  makeWrapper "$n" "$out/bin/$(basename $n)"
    #  substituteInPlace "$n" --replace dirname ${coreutils.out}/bin/dirname
    #done
    #ln -s $out/lib/${untarDir}/lib/spark-assembly-*.jar $out/share/java
  #'';

  meta = {
    description      = "Apache Spark is a fast and general engine for large-scale data processing";
    homepage         = "http://spark.apache.org";
    license          = lib.licenses.asl20;
    platforms        = lib.platforms.all;
    maintainers      = with maintainers; [ thoughtpolice offline kamilchm ];
    repositories.git = "git://git.apache.org/spark.git";
  };
}

I’m building it by running:

nix-shell -I nixpkgs=$(pwd) -p spark

If I understood correctly, nix builds should not be aware of the existence of my user, as it happens in an isolated environment. But the build error logs show my username.
Even if I run this as root after purging all env vars and other references to my user, mvn is somehow able to pick up my non-root username.
Here’s the full build log:
https://asciinema.org/a/4QrAgofnrBCFCQmHW4YFgdpP5

How is this possible? Am I missing something obvious?

Did you do a single-user installation?

This is on NixOS, so it should be a multi-user installation. Builds are running as nixbldx users.

Why did you specify the outputHash? Could it be that fixed output derivations don’t run sandboxed? (They normally need network access.)

Maven builds in nix are very jank. Maven packages in nixpkgs do the build in two stages. The first is a fixed output derivation that fetches maven dependencies, and the second does an offline build using these dependencies. See https://github.com/NixOS/nixpkgs/blob/4afb7c8e2c631f5c191aaf586673430b72e4b15e/pkgs/applications/networking/cluster/hadoop/default.nix for an example of what this looks like. Fixed output derivations might not be running sandboxed, but it should still be running as the nixbld user. The error it throws is

[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:4.3.0:compile (scala-compile-first) on project spark-tags_2.12: wrap: java.io.IOException:
 Could not create directory /home/illustris/.sbt/1.0/zinc/org.scala-sbt: java.nio.file.AccessDeniedException: /home -> [Help 1]

even when I’m running nix-build as root.

No, it should not run as nixbld but as the invoking user. That is crucial in case your build needs things like SSH keys. You would not want to expose these to all nixbld users.

Also did you read the Maven section in Nixpkgs 23.11 manual | Nix & NixOS?

Even if it was running as the invoking user, the invoking user is root, not illustris. But as you can see in the screenshot above, it runs as nixbld, not the invoking user. Please try building a fixed output derivation on your system to confirm, not sure if this is just me. The error log shouldn’t be complaining about /home/illustris/.sbt either way.

Yes, I have read the maven section. I can’t use mvn2nix as mentioned in 15.16.1.1. buildMaven with NixOS/mvn2nix-maven-plugin for reasons outside the scope of this thread. Using a fixed output derivation is in accordance with 15.16.1.2. Double Invocation. This is how packages like dbeaver, hadoop and exhibitor fetch their dependencies.
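
For anyone unfamiliar with the double-invocation approach, here is a rough sketch of the two Maven command lines involved. The helper name maven_stage_cmd and the repository path are made up for illustration. --offline and -Dmaven.repo.local are regular Maven options, but the exact flags a given nixpkgs package passes may differ.

```shell
# Print the Maven command line for one stage of a two-stage build.
# $1 = stage ("fetch" or "build"), $2 = local repository directory.
# Hypothetical helper: it only prints the command, it does not run Maven.
maven_stage_cmd() {
  stage="$1"
  repo="$2"
  case "$stage" in
    fetch)
      # Stage 1: fixed output derivation, network allowed, fills the repo.
      echo "mvn package -DskipTests -Dmaven.repo.local=$repo"
      ;;
    build)
      # Stage 2: normal derivation, no network, reuses the fetched repo.
      echo "mvn package --offline -DskipTests -Dmaven.repo.local=$repo"
      ;;
    *)
      echo "unknown stage: $stage" >&2
      return 1
      ;;
  esac
}

maven_stage_cmd fetch /tmp/example-m2
maven_stage_cmd build /tmp/example-m2
```

Only the first stage needs an outputHash, since it is the only one that talks to the network.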

That’s what I get:

[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:4.3.0:compile (scala-compile-first) on project spark-tags_2.12: wrap: java.io.IOException: Could not create directory /var/empty/.sbt/1.0/zinc/org.scala-sbt: java.nio.file.FileSystemException: /var/empty/.sbt: Operation not permitted -> [Help 1]

/var/empty is the home of the nixbld* users.

What does your /etc/passwd look like?

By the way, if you want to work around the issue, one way is to point the HOME environment variable to something that exists like $NIX_BUILD_TOP

I tried setting HOME to export HOME=$out and export HOME=$(pwd).

Java appears to be doing something weird for detecting your home dir, according to ubuntu - Java: System.getProperty("user.home") returns "?" - Stack Overflow
The logs showed Reading user settings from /home/illustris/.m2/settings.xml. I set -Duser.home=$(pwd), after which the log line changed to Reading user settings from /build/source/.m2/settings.xml.
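
A minimal sketch of that workaround, assuming the build runs somewhere NIX_BUILD_TOP is set (the fallback to a temporary directory is only there so the snippet runs outside a build). The variable names build_home and jvm_home_flags are invented for this example.

```shell
# Assumption: NIX_BUILD_TOP is set inside a Nix build; fall back to a fresh
# temporary directory so the snippet also runs outside of one.
build_home="${NIX_BUILD_TOP:-$(mktemp -d)}/home"
mkdir -p "$build_home"

# HOME is enough for tools that respect the environment.
export HOME="$build_home"

# The JVM appeared to ignore HOME in this thread, so user.home is set
# explicitly as well. Assumes build_home contains no whitespace, because
# the flags are kept in a single string.
jvm_home_flags="-Duser.home=$build_home -Dmaven.repo.local=$build_home/.m2"

# Show the resulting command line instead of running Maven.
echo "./build/mvn $jvm_home_flags -DskipTests clean package"
```

This only sidesteps the symptom. It does not explain why the JVM sees the invoking user's home directory in the first place.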

This gets weirder the more I look into it…

Running the build as a fixed output derivation definitely has some weird behavior. I added the following lines to buildPhase:

id
touch idtest

For a normal derivation, id shows uid=1000(nixbld) gid=100(nixbld) groups=100(nixbld) and the idtest file is owned by nixbld1: -rw-r--r-- 1 nixbld1 nixbld 0 Jul 2 12:28 nix-build-spark-3.1.2.drv-13/source/idtest. /etc/passwd contains

root:x:0:0:Nix build user:/build:/noshell
nixbld:x:1000:100:Nix build user:/build:/noshell
nobody:x:65534:65534:Nobody:/:/noshell

For a fixed output derivation,
/etc/passwd shows

root:x:0:0:Nix build user:/build:/noshell
nixbld:x:1000:100:Nix build user:/build:/noshell
nobody:x:65534:65534:Nobody:/:/noshell

id shows uid=1000(illustris) gid=100(users) groups=100(users)
the test file created is still owned by nixbld1: -rw-r--r-- 1 nixbld1 nixbld 0 Jul 2 12:36 nix-build-spark-3.1.2.drv-15/source/idtest

Despite /etc/passwd containing

root:x:0:0:Nix build user:/build:/noshell
nixbld:x:1000:100:Nix build user:/build:/noshell
nobody:x:65534:65534:Nobody:/:/noshell

in the fixed output derivation, and despite it running inside a namespace with UID mappings for 1000 to nixbld1, getent passwd 1000 inside the build env returns
illustris:x:1000:100::/home/illustris:/run/current-system/sw/bin/bash. This is confusing Java. Is this a bug or is it expected behavior for nix build? It seems to have no purpose. Contrary to what @hmenke suggested, ssh keys don’t seem to be passed automatically from the invoking user to the fixed output build environment, as the build is still running in an isolated mount namespace.

There are multiple issues intertwined here:

Fixed-output-derivations aren’t sandboxed, so that they can talk to the network. There is no sandbox-but-with-network-access capability in Nix.

The Java tooling which depends on having a home folder, and bypasses the env vars that every other sane tool respects.

For some reason, your system has nixbld with UID 1000, which is the same as your user. In most setups, the nixbld users get a UID in the 30k+ range.

This is certainly an issue, and there are java-ish ways of dealing with the issues java creates, such as passing -Duser.home=. But in this thread, the strange inconsistency in nix’s build environment is far more interesting to explore.

Not exactly…

[illustris@illustris-thinkpad:~]$ id nixbld1
uid=30001(nixbld1) gid=30000(nixbld) groups=30000(nixbld)
[illustris@illustris-thinkpad:~]$ id
uid=1000(illustris) gid=100(users) groups=100(users),1(wheel),17(audio),57(networkmanager),131(docker),302(kvm),998(adbusers)

From the system’s perspective, nixbld users have UIDs starting at 30k+.
When you run a program in a user namespace the way nix seems to do, a process running inside this namespace might think its user’s UID is 1000, while that UID is actually mapped to a different UID outside. Even on your system, if you were to run the id command inside a nix build, you would find that the UID reported from inside is 1000. If you run ps aux and observe the process from outside the namespace, you will correctly see that the process reporting its UID as 1000 actually has a UID of 30k+.

Even fixed output derivations are built using the nixbldx users. You can verify experimentally that some level of sandboxing is taking place.

These are the outputs of some commands from inside a fixed output derivation’s buildPhase:

++ getent passwd
root:x:0:0:Nix build user:/build:/noshell
nixbld:x:1000:100:Nix build user:/build:/noshell
nobody:x:65534:65534:Nobody:/:/noshell
++ getent passwd 1000
illustris:x:1000:100::/home/illustris:/run/current-system/sw/bin/bash
++ /nix/store/kxgj8zz555z8wdf0jgbbkn2rrlwr80vf-procps-3.3.16/bin/ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
illustr+       1  4.2  0.0   4604  2956 ?        Ss   14:40   0:00 bash -e /nix/
illustr+      19 82.3  0.0   6684  3696 ?        S    14:40   0:02 bash -e /nix/
illustr+     560  0.0  0.0   6192  1528 ?        R    14:40   0:00 /nix/store/kx
++ getent passwd 1000
illustris:x:1000:100::/home/illustris:/run/current-system/sw/bin/bash
++ getent passwd 30001
nixbld1:x:30001:30000:Nix build user 1:/var/empty:/run/current-system/sw/bin/nologin
++ sleep 1000

getent passwd appears to show the right mapping any process inside the build env should see, 1000->nixbld. Yet getent passwd 1000 returns the passwd entry from outside the build env. ps aux also does the same for finding UID->name mappings inside the build env.
Now let’s see what processes inside the build env look like from the outside:

[illustris@illustris-thinkpad:/tmp]$ ps aux | grep sleep
nixbld1  1867724  0.0  0.0   4260   324 ?        S    20:04   0:00 sleep 1000

and here are the UID/GID maps for that process:

[illustris@illustris-thinkpad:/tmp]$ cat /proc/1867724/uid_map
      1000      30001          1
[illustris@illustris-thinkpad:/tmp]$ cat /proc/1867724/gid_map
       100      30000          1

As you can see, UID 1000 inside the build env maps to UID 30001 (nixbld1) on my system. From outside, ps aux and every other way of inspecting the process report the correct UID according to the UID map.
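
To make the mapping concrete, here is a small sketch that translates a UID seen inside a namespace to the UID outside, using the "inside outside length" triples from a uid_map file. The function name map_uid is made up for this example, and the sample file just reproduces the uid_map shown above.

```shell
# Translate a UID as seen inside a user namespace to the UID outside it.
# $1 = UID inside the namespace, $2 = path to a uid_map-style file whose
# lines are "first-inside-uid first-outside-uid range-length".
map_uid() {
  awk -v uid="$1" '
    uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
    END { if (!found) exit 1 }   # non-zero status when the UID is unmapped
  ' "$2"
}

# Sample data copied from the uid_map output in this thread.
sample=$(mktemp)
printf '      1000      30001          1\n' > "$sample"
map_uid 1000 "$sample"   # prints 30001
rm -f "$sample"
```

On a live system the same function can be pointed at /proc/PID/uid_map for a process inside the build.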

The issue here seems to be that getent is reporting mappings from outside the build env when running a fixed output derivation. This doesn’t seem intentional, given the contents of /etc/passwd inside the fixed-output-derivation shows UID 1000 → username nixbld. The build environment is itself inconsistent about what the UID to username mappings are. getent passwd and /etc/passwd show different UID to username mappings.
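
One way to spot this kind of inconsistency is to compare a direct read of /etc/passwd with what libc's name service lookup returns for the same UID. This is only a sketch: the function names name_from_file and name_from_nss are invented here, and it assumes awk, cut and getent are available in the environment.

```shell
# Look up the user name for a UID directly in a passwd-format file,
# bypassing libc's name service lookup entirely.
# $1 = UID, $2 = passwd file (defaults to /etc/passwd).
name_from_file() {
  awk -F: -v uid="$1" '$3 == uid { print $1; exit }' "${2:-/etc/passwd}"
}

# Look up the same UID through libc (what getpwuid would return).
name_from_nss() {
  getent passwd "$1" | cut -d: -f1
}

# Compare the two views for the current UID.
uid=$(id -u)
file_name=$(name_from_file "$uid" 2>/dev/null || true)
nss_name=$(name_from_nss "$uid" 2>/dev/null || true)
if [ "$file_name" = "$nss_name" ]; then
  echo "consistent: uid=$uid -> ${file_name:-<none>}"
else
  echo "inconsistent: /etc/passwd says '$file_name', getent says '$nss_name'"
fi
```

Run inside the fixed output derivation described above, this would be expected to report nixbld from the file and illustris from getent.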

Nice analysis. This indeed seems like a consistency bug with Nix and something that could be reported upstream.

I’ve managed to reproduce this with a minimal example and to me it looks like this is an issue with the libc.

with import <nixpkgs> {};

stdenv.mkDerivation {
  name = "impure-test";

  src = builtins.toFile "id.c" ''
    #include <pwd.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>
    int main() {
        uid_t euid = geteuid();
        struct passwd *pwd = getpwuid(euid);
        /* getpwuid returns NULL when the UID has no passwd entry */
        printf("uid=%d(%s)\n", (int)euid, pwd ? pwd->pw_name : "?");
    }
  '';

  outputHashAlgo = "sha256";
  outputHashMode = "recursive";
  outputHash = lib.fakeSha256;
  
  phases = [ "installPhase" ];

  installPhase = ''
    gcc -Wall -Wextra -Wpedantic -o $out $src
    $out
  '';
}

Don’t pay attention to the hash mismatch; it is deliberate, so that the derivation is rebuilt every time.

$ nix-build default.nix 
this derivation will be built:
  /nix/store/ykxrsljrrgi522k1gg9cip0fi48nw92h-impure-test.drv
building '/nix/store/ykxrsljrrgi522k1gg9cip0fi48nw92h-impure-test.drv'...
installing
uid=1000(henri)
error: hash mismatch in fixed-output derivation '/nix/store/ykxrsljrrgi522k1gg9cip0fi48nw92h-impure-test.drv':
         specified: sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
            got:    sha256-bjFWsNyoEqGCVr7RwXCFi0nzGMIlUuVhoy+cqxa6kX0=

I have solved the problem. The offender here is nscd, the name service cache daemon. Its socket is bind mounted into the sandbox when building a fixed output derivation. This is probably because nscd handles DNS requests and since the build has access to the network this functionality is desirable.

However, when we look at the manpage of nscd we find

Nscd provides caching for accesses of the passwd(5), group(5), and hosts(5) databases through standard libc interfaces, such as getpwnam(3), getpwuid(3), getgrnam(3), getgrgid(3), gethostbyname(3), and others.

So the problem is that inside the sandbox the libc function getpwuid queries the nscd socket to resolve the UID to a user name, but because this socket is bind mounted from the outside, it resolves to the normal user. We can easily verify this:

with import <nixpkgs> {};

stdenv.mkDerivation {
  name = "impure-test";

  outputHashAlgo = "sha256";
  outputHashMode = "recursive";
  outputHash = lib.fakeSha256;
  
  phases = [ "installPhase" ];

  installPhase = ''
    id | tee $out
  '';
}

$ nix-build default.nix 
this derivation will be built:
  /nix/store/r5fcdj9ml8ihkw6mzrcp8bwcj1i5mb70-impure-test.drv
building '/nix/store/r5fcdj9ml8ihkw6mzrcp8bwcj1i5mb70-impure-test.drv'...
installing
uid=1000(henri) gid=100(users) groups=100(users)
error: hash mismatch in fixed-output derivation '/nix/store/r5fcdj9ml8ihkw6mzrcp8bwcj1i5mb70-impure-test.drv':
         specified: sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
            got:    sha256-pBb0hEWC1O+U9wJxq4t9z938FQllLlyYQAJqyikNLk4=
$ sudo systemctl stop nscd.service
$ nix-build default.nix 
this derivation will be built:
  /nix/store/r5fcdj9ml8ihkw6mzrcp8bwcj1i5mb70-impure-test.drv
building '/nix/store/r5fcdj9ml8ihkw6mzrcp8bwcj1i5mb70-impure-test.drv'...
installing
uid=1000(nixbld) gid=100(nixbld) groups=100(nixbld)
error: hash mismatch in fixed-output derivation '/nix/store/r5fcdj9ml8ihkw6mzrcp8bwcj1i5mb70-impure-test.drv':
         specified: sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
            got:    sha256-JDNqG4zEVVglXODhsZPKFQ/uYg3BVzda49ka9UN/32I=
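
A quick way to check from inside a build whether the nscd socket is visible is sketched below. The path /var/run/nscd/socket is an assumption (glibc's conventional location, not something stated in this thread), and nscd_status is a made-up helper name.

```shell
# Assumption: glibc's conventional nscd socket location; override via
# the NSCD_SOCKET variable if your system uses a different path.
NSCD_SOCKET="${NSCD_SOCKET:-/var/run/nscd/socket}"

# Print "present" if $1 is an existing Unix domain socket, else "absent".
nscd_status() {
  if [ -S "$1" ]; then
    echo "present"
  else
    echo "absent"
  fi
}

echo "nscd socket at $NSCD_SOCKET: $(nscd_status "$NSCD_SOCKET")"
```

If it reports "present" inside a fixed output derivation, user name lookups through libc may be answered by the host's nscd rather than by the sandbox's /etc/passwd.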

I’ve also posted an issue:

https://github.com/NixOS/nix/issues/4991


Kudos for digging into the issue and fixing the root cause! :+1:
