Failing nix integration tests due to lack of entropy

tfc · November 4, 2018, 3:18pm

Hi there,

i am quite a fan of the nix integration test framework and use it for my own software projects.
I did however run into a two machines of my posession (with relatively old chipsets) where even simple integration tests fail, and i had a hard time debugging that, before i found a solution.

This thread shall note the symptoms along with the solution so others find the solution faster. But i would also like to reach out for opinions on how this might be solved in a more generic way.

Symptoms:

When executing nix-build <nixpkgs/nixos/tests/login.nix> (which is kind of my minimal test to see if integration tests work), i expect the VM that is started to work through the test and then getting the results in ./result.

What does happen:

I get this log: Output of "nix-build $nixpkgs/nixos/tests/login.nix" · GitHub

With the following interesting excerpts:

...
machine# [   61.495417] systemd-udevd[90]: seq 611 '/devices/pci0000:00/0000:00:03.0' is taking a long time
machine# [   61.505228] systemd-udevd[90]: seq 620 '/devices/pci0000:00/0000:00:0c.0' is taking a long time
machine# [   61.508168] systemd-udevd[90]: seq 619 '/devices/pci0000:00/0000:00:0b.0' is taking a long time
machine# [   61.510333] systemd-udevd[90]: seq 618 '/devices/pci0000:00/0000:00:0a.0' is taking a long time
machine# [   61.512473] systemd-udevd[90]: seq 612 '/devices/pci0000:00/0000:00:04.0' is taking a long time
machine# [   61.514660] systemd-udevd[90]: seq 617 '/devices/pci0000:00/0000:00:09.0' is taking a long time
machine# [   61.516840] systemd-udevd[90]: seq 616 '/devices/pci0000:00/0000:00:08.0' is taking a long time
machine# [   61.519017] systemd-udevd[90]: seq 613 '/devices/pci0000:00/0000:00:05.0' is taking a long time
machine# [   61.521210] systemd-udevd[90]: seq 615 '/devices/pci0000:00/0000:00:07.0' is taking a long time
machine# [   61.523388] systemd-udevd[90]: seq 614 '/devices/pci0000:00/0000:00:06.0' is taking a long time
...
machine# The file /dev/vda does not exist and no size was specified.
machine# waiting for device /dev/vda to appear...
...
machine# Timed out waiting for device /dev/vda, trying to mount anyway.
machine# [  181.501302] systemd-udevd[90]: seq 611 '/devices/pci0000:00/0000:00:03.0' killed
machine# [  181.509780] systemd-udevd[90]: seq 620 '/devices/pci0000:00/0000:00:0c.0' killed
machine# [  181.511947] systemd-udevd[90]: seq 612 '/devices/pci0000:00/0000:00:04.0' killed
machine# [  181.515584] systemd-udevd[90]: seq 614 '/devices/pci0000:00/0000:00:06.0' killed
machine# [  181.519645] systemd-udevd[90]: seq 618 '/devices/pci0000:00/0000:00:0a.0' killed
machine# [  181.523659] systemd-udevd[90]: seq 619 '/devices/pci0000:00/0000:00:0b.0' killed
machine# [  181.527608] systemd-udevd[90]: seq 615 '/devices/pci0000:00/0000:00:07.0' killed
machine# [  181.531615] systemd-udevd[90]: seq 617 '/devices/pci0000:00/0000:00:09.0' killed
machine# [  181.534594] systemd-udevd[90]: seq 616 '/devices/pci0000:00/0000:00:08.0' killed
machine# [  181.538962] systemd-udevd[90]: seq 613 '/devices/pci0000:00/0000:00:05.0' killed
...
machine# cannot check filesystem with type "auto"!
machine# mounting /dev/vda on /...
machine# mount: mounting /dev/vda on /mnt-root/ failed: No such file or directory
machine# [  186.631495] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
machine# [  186.631495] 
machine# [  186.633298] CPU: 0 PID: 1 Comm: init Not tainted 4.14.78 #1-NixOS
machine# [  186.634513] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
machine# [  186.636818] Call Trace:
machine# [  186.637375]  dump_stack+0x5c/0x85
machine# [  186.638059]  panic+0xd5/0x232
machine# [  186.638701]  do_exit+0xad0/0xad0
machine# [  186.639392]  ? handle_mm_fault+0xaa/0x1e0
machine# [  186.640171]  do_group_exit+0x3a/0xa0
machine# [  186.640880]  SyS_exit_group+0x10/0x10
machine# [  186.641624]  do_syscall_64+0x67/0x120
machine# [  186.642394]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
machine# [  186.643400] RIP: 0033:0x7f2783771676
machine# [  186.644110] RSP: 002b:00007ffe664f0088 EFLAGS: 00000202 ORIG_RAX: 00000000000000e7
machine# [  186.645603] RAX: ffffffffffffffda RBX: 000000000232e704 RCX: 00007f2783771676
machine# [  186.647263] RDX: 0000000000000001 RSI: 000000000000003c RDI: 0000000000000001
machine# [  186.648689] RBP: 0000000000000001 R08: 00000000000000e7 R09: ffffffffffffff80
machine# [  186.650084] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffe664f0418
machine# [  186.651497] R13: 00007ffe664f0410 R14: 0000000000000000 R15: 0000000000000000
machine# [  186.653019] Kernel Offset: 0x13800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
machine# [  186.655107] Rebooting in 1 seconds..
vde_switch: EOF data port: Interrupted system call
error: the VM quit before connecting
the VM quit before connecting
cleaning up
killing machine (pid 2675)
vde_switch: EOF on stdin, cleaning up and exiting

Solution:

So in the end i found out that the host machine is running out of entropy (cat /proc/sys/kernel/random/entropy_avail → got single-digit values while the test was running), and then the guest stalled on this. (Great thanks to andi- in #nixos-de!!)

nix integration tests worked again when i removed the -device virtio-rng-pci, but this is of course just a hacky solution.

A better solution is to install haveged as a daemon to fill up the entropy once it runs empty. On nixos this is of course easy: Just put services.haveged.enable = true into the /etc/nixos/configuration.nix. On other distros similar ways exist.

Long term solution:

Does anyone have an idea how to fix that longterm? I mean, the whole idea behind nix is that things “just work”, but in this case they don’t. On non-NixOS-systems this is harder to achieve because nix has less control, but i guess on NixOS it would be feasible to have something, right?

Mic92 · November 4, 2018, 10:26pm

It also looks like systemd-udevd was killed due to running out of memory: Output of "nix-build $nixpkgs/nixos/tests/login.nix" · GitHub

tfc · November 5, 2018, 8:01am

I checked for memory, too. I don’t understand where this comes from, as:

the host machine had still more than 6 gigs free at that point
the qemu standard setting for guest memory is 384 megabytes, and when i increased it to 2 gigs (without fixing the entropy thing), this message still occured.

This message however completely disappeared, too, after i enabled haveged.

eonpatapon · December 10, 2018, 10:07am

Thanks for this.

I has also this issue and now it’s working properly. Some other option is to install rng-tools

More info on linux - Is it appropriate to use haveged as a source of entropy on virtual machines? - Information Security Stack Exchange

cyounkins · December 18, 2018, 5:45pm

I just ran into this confusing issue, so thanks for posting.

It seems like adding virtio-rng-pci is not a good idea for determinism to start with. Features/VirtIORNG - QEMU indicates you can pass a device path to the option, so maybe we can feed it a deterministic RNG based on constant seed?

evenbrenden · November 18, 2019, 8:47pm

Ran into this too, thanks! I had just switched the default terminal for i3 and I noticed that it took minutes to start it up (the terminal, not i3) unless I banged my keyboard fast enough - guess that was a hint for the missing entropy. I gather some programs depend on services that are blocked while filling /dev/random. Took me a minute to arrive here, so hope this comment makes it more searchable. Sticking to Haveged for now.

(NixOS 19.09 on VirtualBox 6.0.12)