Hi there,
i am quite a fan of the nix integration test framework and use it for my own software projects.
I did however run into a two machines of my posession (with relatively old chipsets) where even simple integration tests fail, and i had a hard time debugging that, before i found a solution.
This thread shall note the symptoms along with the solution so others find the solution faster. But i would also like to reach out for opinions on how this might be solved in a more generic way.
Symptoms:
When executing nix-build <nixpkgs/nixos/tests/login.nix>
(which is kind of my minimal test to see if integration tests work), i expect the VM that is started to work through the test and then getting the results in ./result
.
What does happen:
I get this log: Output of "nix-build $nixpkgs/nixos/tests/login.nix" · GitHub
With the following interesting excerpts:
...
machine# [ 61.495417] systemd-udevd[90]: seq 611 '/devices/pci0000:00/0000:00:03.0' is taking a long time
machine# [ 61.505228] systemd-udevd[90]: seq 620 '/devices/pci0000:00/0000:00:0c.0' is taking a long time
machine# [ 61.508168] systemd-udevd[90]: seq 619 '/devices/pci0000:00/0000:00:0b.0' is taking a long time
machine# [ 61.510333] systemd-udevd[90]: seq 618 '/devices/pci0000:00/0000:00:0a.0' is taking a long time
machine# [ 61.512473] systemd-udevd[90]: seq 612 '/devices/pci0000:00/0000:00:04.0' is taking a long time
machine# [ 61.514660] systemd-udevd[90]: seq 617 '/devices/pci0000:00/0000:00:09.0' is taking a long time
machine# [ 61.516840] systemd-udevd[90]: seq 616 '/devices/pci0000:00/0000:00:08.0' is taking a long time
machine# [ 61.519017] systemd-udevd[90]: seq 613 '/devices/pci0000:00/0000:00:05.0' is taking a long time
machine# [ 61.521210] systemd-udevd[90]: seq 615 '/devices/pci0000:00/0000:00:07.0' is taking a long time
machine# [ 61.523388] systemd-udevd[90]: seq 614 '/devices/pci0000:00/0000:00:06.0' is taking a long time
...
machine# The file /dev/vda does not exist and no size was specified.
machine# waiting for device /dev/vda to appear...
...
machine# Timed out waiting for device /dev/vda, trying to mount anyway.
machine# [ 181.501302] systemd-udevd[90]: seq 611 '/devices/pci0000:00/0000:00:03.0' killed
machine# [ 181.509780] systemd-udevd[90]: seq 620 '/devices/pci0000:00/0000:00:0c.0' killed
machine# [ 181.511947] systemd-udevd[90]: seq 612 '/devices/pci0000:00/0000:00:04.0' killed
machine# [ 181.515584] systemd-udevd[90]: seq 614 '/devices/pci0000:00/0000:00:06.0' killed
machine# [ 181.519645] systemd-udevd[90]: seq 618 '/devices/pci0000:00/0000:00:0a.0' killed
machine# [ 181.523659] systemd-udevd[90]: seq 619 '/devices/pci0000:00/0000:00:0b.0' killed
machine# [ 181.527608] systemd-udevd[90]: seq 615 '/devices/pci0000:00/0000:00:07.0' killed
machine# [ 181.531615] systemd-udevd[90]: seq 617 '/devices/pci0000:00/0000:00:09.0' killed
machine# [ 181.534594] systemd-udevd[90]: seq 616 '/devices/pci0000:00/0000:00:08.0' killed
machine# [ 181.538962] systemd-udevd[90]: seq 613 '/devices/pci0000:00/0000:00:05.0' killed
...
machine# cannot check filesystem with type "auto"!
machine# mounting /dev/vda on /...
machine# mount: mounting /dev/vda on /mnt-root/ failed: No such file or directory
machine# [ 186.631495] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
machine# [ 186.631495]
machine# [ 186.633298] CPU: 0 PID: 1 Comm: init Not tainted 4.14.78 #1-NixOS
machine# [ 186.634513] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
machine# [ 186.636818] Call Trace:
machine# [ 186.637375] dump_stack+0x5c/0x85
machine# [ 186.638059] panic+0xd5/0x232
machine# [ 186.638701] do_exit+0xad0/0xad0
machine# [ 186.639392] ? handle_mm_fault+0xaa/0x1e0
machine# [ 186.640171] do_group_exit+0x3a/0xa0
machine# [ 186.640880] SyS_exit_group+0x10/0x10
machine# [ 186.641624] do_syscall_64+0x67/0x120
machine# [ 186.642394] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
machine# [ 186.643400] RIP: 0033:0x7f2783771676
machine# [ 186.644110] RSP: 002b:00007ffe664f0088 EFLAGS: 00000202 ORIG_RAX: 00000000000000e7
machine# [ 186.645603] RAX: ffffffffffffffda RBX: 000000000232e704 RCX: 00007f2783771676
machine# [ 186.647263] RDX: 0000000000000001 RSI: 000000000000003c RDI: 0000000000000001
machine# [ 186.648689] RBP: 0000000000000001 R08: 00000000000000e7 R09: ffffffffffffff80
machine# [ 186.650084] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffe664f0418
machine# [ 186.651497] R13: 00007ffe664f0410 R14: 0000000000000000 R15: 0000000000000000
machine# [ 186.653019] Kernel Offset: 0x13800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
machine# [ 186.655107] Rebooting in 1 seconds..
vde_switch: EOF data port: Interrupted system call
error: the VM quit before connecting
the VM quit before connecting
cleaning up
killing machine (pid 2675)
vde_switch: EOF on stdin, cleaning up and exiting
Solution:
So in the end i found out that the host machine is running out of entropy (cat /proc/sys/kernel/random/entropy_avail
→ got single-digit values while the test was running), and then the guest stalled on this. (Great thanks to andi-
in #nixos-de
!!)
nix integration tests worked again when i removed the -device virtio-rng-pci
, but this is of course just a hacky solution.
A better solution is to install haveged
as a daemon to fill up the entropy once it runs empty. On nixos this is of course easy: Just put services.haveged.enable = true
into the /etc/nixos/configuration.nix
. On other distros similar ways exist.
Long term solution:
Does anyone have an idea how to fix that longterm? I mean, the whole idea behind nix is that things “just work”, but in this case they don’t. On non-NixOS-systems this is harder to achieve because nix has less control, but i guess on NixOS it would be feasible to have something, right?