Strange remote build issue

Hi All,

I’ve been seeing some rather strange behavior with my remote builder: when I build any package that requires the kvm feature, every open ssh connection to the remote builder is terminated.

Here’s what I’ve been doing so far:

I use the NixOS 22.11 minimal ISO image to create a bootable USB and boot the system from it. Once booted, I run the script below, which installs NixOS to the local disk and reboots the system when done.

#!/usr/bin/env bash
# Install NixOS on to a 2013 macmini host

set -o errexit
set -o nounset
set -o pipefail

# Need root privilege to run
if [[ "$EUID" -ne 0 ]]; then
  echo "Must run as root" >&2
  exit 1
fi

disk="/dev/sda"

echo "Partitioning ${disk}..."
parted -s "${disk}" -- mklabel gpt
parted -s -a optimal "${disk}" -- mkpart ESP fat32 0% 512MiB
parted -s -a optimal "${disk}" -- mkpart primary 512MiB -0
parted "${disk}" -- set 1 boot on

echo "Waiting until the partitions are available in /dev..."
systemctl restart systemd-udev-trigger.service
until [[ -e "${disk}1" && -e "${disk}2" ]]; do sleep 1; done

echo "Creating filesystems on $disk..."
mkfs.fat -F 32 -n boot "${disk}1"
mkfs.ext4 -L nixos "${disk}2"

echo "Waiting until the filesystems are available in /dev..."
systemctl restart systemd-udev-trigger.service
until [[ -e /dev/disk/by-label/boot && -e /dev/disk/by-label/nixos ]]; do sleep 1; done

echo "Mounting filesystems..."
mount /dev/disk/by-label/nixos /mnt
mkdir -p /mnt/boot
mount /dev/disk/by-label/boot /mnt/boot

echo "Generating NixOS configuration (/mnt/etc/nixos/*.nix) ..."
nixos-generate-config --root /mnt
mv /mnt/etc/nixos/configuration.nix /mnt/etc/nixos/default-configuration.nix

echo "Writing custom NixOS configuration to /mnt/etc/nixos/ ..."
cat <<EOF >/mnt/etc/nixos/configuration.nix
{ pkgs, ...}:
{
  imports = [
    ./hardware-configuration.nix
    ./default-configuration.nix
  ];

  environment.systemPackages = with pkgs; [
    coreutils
    htop
    less
    vim
    which
  ];

  i18n.defaultLocale = "en_US.UTF-8";

  # Needed for Broadcom drivers
  nixpkgs.config.allowUnfree = true;

  nix = {
    gc = {
      automatic = true;
      dates = "weekly";
      options = "--max-freed $((64 * 1024 ** 3))";
    };

    optimise = {
      automatic = true;
      dates = [ "weekly" ];
    };

    settings.trusted-users = [ "@nixbld" "@wheel" ];
  };

  security.sudo.wheelNeedsPassword = false;

  services.openssh = {
    enable = true;
    permitRootLogin = "no";
    passwordAuthentication = false;
    hostKeys = [{
      path = "/data/etc/ssh/ssh_host_ed25519_key";
      type = "ed25519";
    }];
  };

  time.timeZone = "UTC";

  users = {
    mutableUsers = false;

    users = {
      anand = {
        isNormalUser = true;
        createHome = true;
        extraGroups = [ "nixbld" "sudo" "wheel" ];
        group = "users";
        uid = 1000;
        home = "/home/anand";
        useDefaultShell = true;
        openssh.authorizedKeys.keys = [ "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMozgKcmC5KdPFteZey9Ov45/inEfg/PCdSaZKd582tb" ];
      };

      root.hashedPassword = "\$6\$Y07SFMN5XPG6fJgw\$FGGBCduL4Bdg55vGoHAnQgDl5MqVqoIzgWlQoMXXrpm.2nmhTgivPJMNEpPclh064or/eM8.6GruCnFttZvPW0";
    };
  };
}
EOF

echo "Installing NixOS to /mnt ..."
nixos-install -I "nixos-config=/mnt/etc/nixos/configuration.nix" --no-root-passwd

echo "Installation succeeded! Rebooting ..."
reboot

Once the system reboots, I am able to log in over ssh using the user-account created by the script without any issue.

However, when I start running a build that uses the Macmini as the remote builder, something really odd happens. When connected over ssh, the connection (seemingly) randomly terminates. What’s even stranger is that it’s not just the remote-build ssh connection that terminates… EVERY open ssh connection to the remote builder is terminated, including those that have nothing to do with the remote build! E.g. I have two terminal windows open: one is connected over ssh to the remote builder and running some program (htop, or journalctl -f), while in the other I run a nix-build command that uses the remote builder to build some package. Both connections are terminated.
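
For anyone trying to reproduce this: the post does not show how the client is pointed at the builder, so the following is only a sketch of one way to do it, using the `builders` setting. The helper name `builder_spec` is made up for illustration, and the max-jobs and speed-factor values of 1 are assumptions; only the host name `macmini1`, the `x86_64-linux` system and the `kvm` feature come from the post.

```shell
# builder_spec HOST [FEATURES]: print one "builders" entry for an x86_64-linux machine.
# Field order: URI, system, ssh identity file, max jobs, speed factor, supported features.
# A "-" leaves that field at its default.
builder_spec() {
  host="$1"
  features="${2:--}"
  printf 'ssh://%s x86_64-linux - 1 1 %s\n' "$host" "$features"
}

builder_spec macmini1 kvm

# Hypothetical invocation, assuming "macmini1" resolves through ~/.ssh/config:
#   nix-build -A macmini1 --builders "$(builder_spec macmini1 kvm)"
```

The same line can instead be placed in /etc/nix/machines if you prefer a persistent setup.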

The following is a NixOS system closure I’m trying to build using the remote builder.

$ cat default.nix
let
  pkgs = import ./nix { system = "x86_64-linux"; };

in

{
  macmini1 = pkgs.nixos [({ lib, pkgs, ... }: {
    imports = [
      ./nix/modules/remote-builder.nix
      ./nix/modules/ssh.nix
      ./nix/modules/users.nix
    ];

    boot.loader.grub.device = "/dev/disk/by-label/boot";

    fileSystems = {
      "/" = { fsType = "ext4"; device = "/dev/disk/by-label/nixos";};
      "/boot" = { fsType = "vfat"; device = "/dev/disk/by-label/boot"; };
    };

    networking.hostName = "macmini1";

    swapDevices = [ ];
    })];
}

$ nix-build -A macmini1
these 3 derivations will be built:
  /nix/store/86f9riy4fhbx91xkkyrqka547yy7m0r8-nixos-boot-disk.drv
  /nix/store/yhp44j5bgpcyv5qh5yvdbxjsknrvz9cm-run-nixos-vm.drv
  /nix/store/77rq8smg20nr1j03scz1gm132l68vnvq-nixos-vm.drv
building '/nix/store/86f9riy4fhbx91xkkyrqka547yy7m0r8-nixos-boot-disk.drv' on 'ssh://macmini1'...
copying 0 paths...
Connection to 192.168.1.120 closed by remote host.
error: unexpected end-of-file
error: builder for '/nix/store/86f9riy4fhbx91xkkyrqka547yy7m0r8-nixos-boot-disk.drv' failed with exit code 1;
       last 1 log lines:
       > Connection to 192.168.1.120 closed by remote host.
       For full logs, run 'nix log /nix/store/86f9riy4fhbx91xkkyrqka547yy7m0r8-nixos-boot-disk.drv'.
error: 1 dependencies of derivation '/nix/store/yhp44j5bgpcyv5qh5yvdbxjsknrvz9cm-run-nixos-vm.drv' failed to build
error: 1 dependencies of derivation '/nix/store/77rq8smg20nr1j03scz1gm132l68vnvq-nixos-vm.drv' failed to build

Of course, this makes it rather hard to get logs from a remote builder, and I am forced to log into the remote builder at the console to check the logs. This is what I see…

Jan 03 01:00:20 macmini1 sshd[3122]: Accepted publickey for anand from 192.168.1.201 port 55039 ssh2: ED25519 SHA256:j7ZZi4D7e8N+5oxe0VlEvGwksNuI1Ihq6yuwlmgljPA
Jan 03 01:00:20 macmini1 sshd[3122]: pam_unix(sshd:session): session opened for user anand(uid=1000) by (uid=0)
Jan 03 01:00:20 macmini1 systemd[1]: Starting User Runtime Directory /run/user/1000...
Jan 03 01:00:20 macmini1 systemd-logind[874]: New session 19 of user anand.
Jan 03 01:00:20 macmini1 systemd[1]: Finished User Runtime Directory /run/user/1000.
Jan 03 01:00:20 macmini1 systemd[1]: Starting User Manager for UID 1000...
Jan 03 01:00:20 macmini1 systemd[3125]: pam_unix(systemd-user:session): session opened for user anand(uid=1000) by (uid=0)
Jan 03 01:00:20 macmini1 systemd[3125]: Queued start job for default target Main User Target.
Jan 03 01:00:20 macmini1 systemd[3125]: Created slice User Application Slice.
Jan 03 01:00:20 macmini1 systemd[3125]: Reached target Paths.
Jan 03 01:00:20 macmini1 systemd[3125]: Reached target Timers.
Jan 03 01:00:20 macmini1 systemd[3125]: Starting D-Bus User Message Bus Socket...
Jan 03 01:00:20 macmini1 systemd[3125]: Listening on D-Bus User Message Bus Socket.
Jan 03 01:00:20 macmini1 systemd[3125]: Reached target Sockets.
Jan 03 01:00:20 macmini1 systemd[3125]: Reached target Basic System.
Jan 03 01:00:20 macmini1 systemd[1]: Started User Manager for UID 1000.
Jan 03 01:00:20 macmini1 systemd[3125]: Starting Run user-specific NixOS activation...
Jan 03 01:00:20 macmini1 systemd[1]: Started Session 19 of User anand.
Jan 03 01:00:20 macmini1 systemd[3125]: Finished Run user-specific NixOS activation.
Jan 03 01:00:20 macmini1 systemd[3125]: Reached target Main User Target.
Jan 03 01:00:20 macmini1 systemd[3125]: Startup finished in 95ms.
Jan 03 01:08:41 macmini1 sshd[3198]: Accepted publickey for anand from 192.168.1.201 port 55053 ssh2: ED25519 SHA256:j7ZZi4D7e8N+5oxe0VlEvGwksNuI1Ihq6yuwlmgljPA
Jan 03 01:08:41 macmini1 sshd[3198]: pam_unix(sshd:session): session opened for user anand(uid=1000) by (uid=0)
Jan 03 01:08:41 macmini1 systemd-logind[874]: New session 21 of user anand.
Jan 03 01:08:41 macmini1 systemd[1]: Started Session 21 of User anand.
Jan 03 01:08:41 macmini1 nix-daemon[2752]: accepted connection from pid 3201, user anand (trusted)
Jan 03 01:08:42 macmini1 sshd[3198]: pam_unix(sshd:session): session closed for user anand
Jan 03 01:08:42 macmini1 sshd[3122]: pam_unix(sshd:session): session closed for user anand
Jan 03 01:08:42 macmini1 systemd[1]: session-21.scope: Deactivated successfully.
Jan 03 01:08:42 macmini1 nix-daemon[3203]: unexpected Nix daemon error: error: writing to file: Broken pipe
Jan 03 01:08:42 macmini1 systemd[1]: user@1000.service: Main process exited, code=killed, status=9/KILL
Jan 03 01:08:42 macmini1 systemd[1]: user@1000.service: Failed with result 'signal'.
Jan 03 01:08:42 macmini1 systemd[1]: session-19.scope: Deactivated successfully.
Jan 03 01:08:42 macmini1 systemd-logind[874]: Session 21 logged out. Waiting for processes to exit.
Jan 03 01:08:42 macmini1 systemd[1]: Stopping User Runtime Directory /run/user/1000...
Jan 03 01:08:42 macmini1 systemd-logind[874]: Session 19 logged out. Waiting for processes to exit.
Jan 03 01:08:42 macmini1 systemd-logind[874]: Removed session 21.
Jan 03 01:08:42 macmini1 systemd-logind[874]: Removed session 19.
Jan 03 01:08:42 macmini1 systemd[1]: run-user-1000.mount: Deactivated successfully.
Jan 03 01:08:42 macmini1 systemd[1]: user-runtime-dir@1000.service: Deactivated successfully.
Jan 03 01:08:42 macmini1 systemd[1]: Stopped User Runtime Directory /run/user/1000.
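
The excerpt above is long, and the interesting part is the single second at 01:08:42: both ssh sessions (19 and 21) close, and user@1000.service is reported as killed with status=9/KILL. A small filter like the one below can help pull such lines out of a larger journal. This is just a sketch; the function name `kill_events` and the choice of patterns are my own, not anything provided by systemd or Nix.

```shell
# kill_events: read journal text on stdin and keep only lines that record
# a closed session, a killed or failed unit, or a nix-daemon error.
kill_events() {
  grep -E 'session closed|code=killed|Failed with result|nix-daemon.*error'
}

# Demo on three lines taken from the log above; the first line is filtered out.
kill_events <<'EOF'
Jan 03 01:08:41 macmini1 nix-daemon[2752]: accepted connection from pid 3201, user anand (trusted)
Jan 03 01:08:42 macmini1 sshd[3198]: pam_unix(sshd:session): session closed for user anand
Jan 03 01:08:42 macmini1 systemd[1]: user@1000.service: Main process exited, code=killed, status=9/KILL
EOF

# On the builder itself, something like this should work for the current boot:
#   journalctl -b --no-pager | kill_events
```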

I’ve tried digging into it some more with debug logs enabled for ssh, and haven’t been able to find a smoking gun yet. I have experienced this behavior on NixOS 21.11, 22.05 and now 22.11 on multiple machines. I’m hoping I’m not the only one seeing this.

This seems to affect only builds that need the kvm feature, so that might be a hint.

Happy to post any additional details as required. Any/all help is appreciated.

Thanks

Update: This does not have anything to do with the kvm feature. I built packages that don’t require that feature, and the build still terminates because the ssh connection is closed.

Should someone come following in my footsteps, here is what worked for me (though I’m still not sure why; there are not many options for debug logging in nix-daemon beyond the core dump documented below!).

  • Use nix v2.3.16 everywhere. My intent here was to figure out if this was a flake/non-flake issue. In retrospect, I think this may not be the root cause.

    I uninstalled nix and downgraded to this version on my local machine (macOS Catalina), and updated the NixOS configuration of the remote builder to use the same version using nix.package = pkgs.nixVersions.nix_2_3;. This seemed to make progress, but the build then failed with Too many open files.

  • Try setting up your remote-builder host to ssh in as root. Not sure why the remote builder does not allow logging in using a regular user account that has been added to nix.settings.trusted-users.

With these two fixes, I am able to use a remote NixOS builder with my MacOS (Catalina).

nix-daemon Core-dump

Jan 17 00:34:55 macmini1 systemd-coredump[4192]: [🡕] Process 4175 (nix-daemon) of user 0 dumped core.

                                                 Module linux-vdso.so.1 with build-id 7c32d043d05f492840de8a25bd791610c693e20d
                                                 Module libresolv.so.2 with build-id 88f3bb8423742b08f89ecabb2800b03d17d06e6c
                                                 Module libkeyutils.so.1 without build-id.
                                                 Module libkrb5support.so.0 without build-id.
                                                 Module libcom_err.so.3 without build-id.
                                                 Module libk5crypto.so.3 without build-id.
                                                 Module libkrb5.so.3 without build-id.
                                                 Module libunistring.so.2 without build-id.
                                                 Module libbrotlicommon.so.1 without build-id.
                                                 Module libaws-c-common.so.1 without build-id.
                                                 Module libaws-checksums.so.1.0.0 without build-id.
                                                 Module libaws-c-sdkutils.so.1.0.0 without build-id.
                                                 Module libaws-c-cal.so.1.0.0 without build-id.
                                                 Module libaws-c-compression.so.1.0.0 without build-id.
                                                 Module libs2n.so.1 without build-id.
                                                 Module libaws-c-io.so.1.0.0 without build-id.
                                                 Module libaws-c-http.so.1.0.0 without build-id.
                                                 Module libaws-c-auth.so.1.0.0 without build-id.
                                                 Module libaws-c-s3.so.0unstable without build-id.
                                                 Module libaws-c-event-stream.so.1.0.0 without build-id.
                                                 Module libaws-c-mqtt.so.1.0.0 without build-id.
                                                 Module libaws-crt-cpp.so without build-id.
                                                 Module libzstd.so.1 without build-id.
                                                 Module libgssapi_krb5.so.2 without build-id.
                                                 Module libssl.so.3 with build-id 46c077175f93159c72c2b94bc0d99e424d0a3943
                                                 Module libssh2.so.1 without build-id.
                                                 Module libidn2.so.0 without build-id.
                                                 Module libnghttp2.so.14 without build-id.
                                                 Module libz.so.1 without build-id.
                                                 Module libbrotlidec.so.1 without build-id.
                                                 Module libbrotlienc.so.1 without build-id.
                                                 Module liblzma.so.5 without build-id.
                                                 Module libseccomp.so.2 without build-id.
                                                 Module libaws-cpp-sdk-core.so without build-id.
                                                 Module libaws-cpp-sdk-s3.so without build-id.
                                                 Module libaws-cpp-sdk-transfer.so without build-id.
                                                 Module libcurl.so.4 with build-id 62a281f05b0bc7fe7c9370b4c786525772390fdd
                                                 Module libbz2.so.1 without build-id.
                                                 Module libsqlite3.so.0 with build-id 377f9d7f0fb8f5896be673d87eb739eb7866db92
                                                 Module libcrypto.so.3 with build-id 5ba9c3862d2fed33339255247444fc34d53cb4cc
                                                 Module librt.so.1 with build-id 7c9aae26f0646a27bf0f7c49c914b3258c5fa43e
                                                 Module ld-linux-x86-64.so.2 with build-id db50353a26600bb848b9a5541b1506e0a24cb34b
                                                 Module libc.so.6 with build-id 2bb226bc600b443958c7566207d0d02f8345e6ea
                                                 Module libgcc_s.so.1 without build-id.
                                                 Module libm.so.6 with build-id b8454b40db819599169f3a948939aed4b3fc7f82
                                                 Module libstdc++.so.6 without build-id.
                                                 Module libnixutil.so with build-id adfefeb4fc88723a6f6c945931c613b22f096a9b
                                                 Module libnixstore.so with build-id 083da4a622d28a84b028a5eb53b51ee0a4945c66
                                                 Module libnixmain.so with build-id 9bfdc9810a2f045f6d041ac9c191f42be346180d
                                                 Module libdl.so.2 with build-id 67c430223def0be24c4ae1a4c3985f26566b8831
                                                 Module libpthread.so.0 with build-id 85431f01160c3de171d3baeb3f8cf1c9578dc441
                                                 Module libgc.so.1 with build-id 81b9e14b499120e3d847c03953a091bb727320f7
                                                 Module libnixexpr.so with build-id 7cf0f2fb2c4c52032906bd966c289aa6db32a710
                                                 Module libboost_system.so.1.79.0 without build-id.
                                                 Module libboost_thread.so.1.79.0 without build-id.
                                                 Module libboost_context.so.1.79.0 without build-id.
                                                 Module libeditline.so.1 without build-id.
                                                 Module libsodium.so.23 with build-id 7849bac332c57fb7b1df7d7cbc8445a2f3181334
                                                 Module nix with build-id 8a3fcc74601fb0e018bf207761ee2b893717249e
                                                 Stack trace of thread 4175:
                                                 #0  0x00007f45ead94bc7 __pthread_kill_implementation (libc.so.6 + 0x8abc7)
                                                 #1  0x00007f45ead47b46 raise (libc.so.6 + 0x3db46)
                                                 #2  0x00007f45ead324b5 abort (libc.so.6 + 0x284b5)
                                                 #3  0x00007f45eb0b896a _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold (libstdc++.so.6 + 0xa996a)
                                                 #4  0x00007f45eb0c3f9a _ZN10__cxxabiv111__terminateEPFvvE (libstdc++.so.6 + 0xb4f9a)
                                                 #5  0x00007f45eb0c2ff9 __cxa_call_terminate (libstdc++.so.6 + 0xb3ff9)
                                                 #6  0x00007f45eb0c3727 __gxx_personality_v0 (libstdc++.so.6 + 0xb4727)
                                                 #7  0x00007f45eaf25da3 _Unwind_RaiseException_Phase2 (libgcc_s.so.1 + 0x10da3)
                                                 #8  0x00007f45eaf26625 _Unwind_Resume (libgcc_s.so.1 + 0x11625)
                                                 #9  0x00007f45eb261287 _ZN3nix15ignoreExceptionEv (libnixutil.so + 0x3c287)
                                                 #10 0x00007f45eb360c72 _ZN3nix14DerivationGoalD2Ev.cold (libnixstore.so + 0x84c72)
                                                 #11 0x00007f45eb3a78ba _ZN3nix6WorkerD1Ev (libnixstore.so + 0xcb8ba)
                                                 #12 0x00007f45eb3bc55c _ZN3nix10LocalStore15buildDerivationERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_15BasicDerivationENS_9BuildModeE (libnixstore.so + 0xe055c)
                                                 #13 0x0000558ba2aef61a _ZL9performOpP12TunnelLoggerN3nix3refINS1_5StoreEEEbjRNS1_6SourceERNS1_4SinkEj.constprop.0 (nix + 0xa161a)
                                                 #14 0x0000558ba2af2386 _ZL17processConnectionbRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEj (nix + 0xa4386)
                                                 #15 0x0000558ba2af2c25 _ZZL10daemonLoopPPcENKUlvE_clEv (nix + 0xa4c25)
                                                 #16 0x0000558ba2af2c4c _ZNSt17_Function_handlerIFvvEZL10daemonLoopPPcEUlvE_E9_M_invokeERKSt9_Any_data (nix + 0xa4c4c)
                                                 #17 0x00007f45eb2ac63c _ZNSt17_Function_handlerIFvvEZN3nix12startProcessESt8functionIS0_ERKNS1_14ProcessOptionsEEUlvE_E9_M_invokeERKSt9_Any_data (libnixutil.so + 0x8763c)
                                                 #18 0x00007f45eb2a8c39 _ZN3nixL6doForkEbSt8functionIFvvEE (libnixutil.so + 0x83c39)
                                                 #19 0x00007f45eb2ac599 _ZN3nix12startProcessESt8functionIFvvEERKNS_14ProcessOptionsE (libnixutil.so + 0x87599)
                                                 #20 0x0000558ba2aedebf _ZL10daemonLoopPPc (nix + 0x9febf)
                                                 #21 0x0000558ba2af2a04 _ZL5_mainiPPc (nix + 0xa4a04)
                                                 #22 0x0000558ba2b6ef3c _ZN3nix11mainWrappedEiPPc (nix + 0x120f3c)
                                                 #23 0x00007f45eb57331f _ZN3nix16handleExceptionsERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt8functionIFvvEE (libnixmain.so + 0x2031f)
                                                 #24 0x0000558ba2aa85e4 main (nix + 0x5a5e4)
                                                 #25 0x00007f45ead3324e __libc_start_call_main (libc.so.6 + 0x2924e)
                                                 #26 0x00007f45ead33309 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x29309)
                                                 #27 0x0000558ba2aa9355 _start (nix + 0x5b355)
                                                 ELF object binary architecture: AMD x86-64

I am encountering the same issue with non-root users as well. The configuration works when I set the ssh user to root instead of rnixbld (the user I created). Although rnixbld is in the nixbld group, I get the following error in the systemd journal:

Apr 27 21:02:36 nexus nix-daemon[1180]: error: error processing connection: user 'rnixbld' is not allowed to connect to the Nix daemon

This happens even though the user has access to the socket and is in trusted-users (and allowed-users is set to *).

I’m going to investigate that error message in the source code and see what might cause this.

EDIT: Looks like the solution is to find a different group for this role account. The nixbld group is apparently blocked.
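
A quick way to check whether an ssh build user is affected is to look at its group list on the builder. Note that the installer script in the first post also puts its ssh user into nixbld via extraGroups. The sketch below is only a diagnostic aid under that assumption; the function names `has_group` and `check_builder_user` are hypothetical, and it does not prove that nixbld membership is the cause on any given machine.

```shell
# has_group GROUP NAME...: succeed if GROUP appears among the given group names.
has_group() {
  wanted="$1"
  shift
  for name in "$@"; do
    if [ "$name" = "$wanted" ]; then
      return 0
    fi
  done
  return 1
}

# check_builder_user USER: report whether USER is a member of nixbld.
# Returns 0 if not a member, 1 if a member, 2 if the user cannot be looked up.
check_builder_user() {
  user="$1"
  group_list="$(id -nG "$user" 2>/dev/null)" || return 2
  # Word splitting of $group_list is intentional: one argument per group name.
  if has_group nixbld $group_list; then
    echo "$user is in nixbld"
    return 1
  fi
  echo "$user is not in nixbld"
}

# Run on the builder, for example:
#   check_builder_user rnixbld
```

If the user turns out to be in nixbld, moving it to a separate group (and keeping it in trusted-users) is the change described in the EDIT above.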