Infrequent and mysterious build errors using nix build

Recently I’ve been doing a lot of cross-compiling, and hence a lot of building.

I have noticed that sometimes, when building things using the flake interface, i.e. nix build, derivations will randomly fail to build. If I follow up with nix-store --realize <derivation>, then the build will complete successfully.
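For concreteness, the pattern looks roughly like this (the attribute name and store path are placeholders, not the real ones):

nix build .#some-package
# ...fails partway through with what looks like an ordinary build error...
nix-store --realize /nix/store/<hash>-some-package.drv
# ...the very same derivation now builds fine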

There doesn’t seem to be any pattern, and the failures look like standard build failures, except that when I try to scrutinize them, the build will work.

These build failures tend to happen more when --max-jobs is large.

Has anyone else noticed this? Is building through the flake interface haunted / cursed in some way?

and the failures look like standard build failures

What does “standard build failure” mean?

I think some logs might help us shed some light. My gut reaction, though, is that if it’s really independent of multi-threading problems or memory exhaustion due to too many jobs… then you might have some sort of latent hardware issue. (Or potentially you’re building things that are flaky, but I have really only encountered that once or twice over many, many years of locally building lots of packages in nixpkgs.)

@colemickens: by standard build failure, I mean some test fails, or some compilation action fails. When I start trying to probe using nix-shell, the build will succeed.

The machine in question has 64GB of memory (with 100GB swap) and a standard 4-core Intel CPU. No other signs of hardware issues.

I wish I had more information, but the stochastic nature of the failures makes it really hard to gather evidence. I’m going to walk back my claim that I’ve seen this with --max-jobs = 1, but I am 100% sure I have seen builds fail under nix build --max-jobs 8 and then succeed straight afterwards when run individually.

Do we have much understanding of the effects of running multiple large builds at once (i.e. --max-jobs > 1) and how this interacts with concurrency within a single build? Is it possible that this is just triggering latent concurrency bugs in build systems?
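As a rough worked example (the per-build figure is a guess): if --max-jobs is 8 and each build’s own build system spawns, say, 4 parallel compile jobs, that’s up to 8 × 4 = 32 compiler processes, plus link steps and test harnesses, competing for 4 physical cores and a shared pool of memory.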

I can’t really give hard data, but I would say that I’m normally building on much, much smaller machines than that, and not seeing this issue to the extent you seem to be.

Like, I take patches to gnupg, which requires a tremendous rebuild, and it just chugs away on my 2+ year old ultrabook laptop “server”, running in some basement, and it generally works. Granted, those builds run in CI and so they benefit from basically iterative retries, but still, I’m quite surprised to hear what you’re reporting with a machine like that.

I think it would be hard, for me at least, to speculate further without logs.

I was going to say that maybe your core count is high enough to trigger some extreme degree of parallelism, but if anything you’re in sort of an ideal state: tons of RAM, with a rather conservative core count.

Though I do wonder if “max-jobs = 8” might be a bit aggressive on a 4-core CPU? Maybe review and revise your cores / max-jobs settings? See “Tuning Cores and Jobs” in the Nix Reference Manual.
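Something along these lines in nix.conf, with illustrative numbers for a 4-core machine (adjust to taste), would be a reasonable starting point:

# /etc/nix/nix.conf
max-jobs = 1    # build one derivation at a time
cores = 4       # let that build’s make/ninja use all four cores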

But still, I think some logs would help folks advise further.


@colemickens: yeah, things have felt more reliable with cores=4 and max-jobs=1. It seems that the default max-jobs is set to your thread count, which is 8 on my machine.

I’ll be more careful to collect logs next time I see this happen.


OK, I was running a big build overnight with --max-jobs 1, and observed a test failure while building ell. Full log here: gist 76693e1a3b4b346934a3be93b030e4a8 on GitHub.

The important bit is:

FAIL: unit/test-dbus-message-fds
================================

launching dbus-daemon
dbus-daemon process 8515 created
dbus-daemon[8515]: Failed to start message bus: Failed to bind socket "/tmp/ell-test-bus": Address already in use
process 8515 terminated with status=256

Disconnected from DBus
FAIL unit/test-dbus-message-fds (exit status: 134)

I rebuilt exactly the same derivation this morning, and it succeeded. Maybe this one is just a flaky test?
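If it is flaky, I guess I can force the same derivation to be rebuilt a few times and see how often it fails; if I understand the flag correctly, something like this should do it (store path is a placeholder):

nix-store --realise --check /nix/store/<hash>-ell.drv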

From the gist of the full log:

./build-aux/test-driver: line 112:  8514 Aborted                 (core dumped) "$@" >> "$log_file" 2>&1

I suspect that’s more likely the cause. Unfortunately, I’m still not really sure how to investigate or speculate further.


@colemickens: good spot! That definitely looks suspicious. I’ll keep reporting here as I see more.

OK, so all these crashes are being logged by coredumpctl, but currently no core dumps have been saved, so I can’t spin them up in gdb to see what is happening. I have hundreds of pretty suspicious-looking crashes since the start of March, all during building.
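For reference, the workflow I was hoping to use is roughly this, using the PID from the log above (the last step is where it falls down, since no core is actually stored):

coredumpctl list           # shows the crashes, mostly test binaries during builds
coredumpctl info 8514      # metadata for one particular crash
coredumpctl gdb 8514       # would load it in gdb, if a core had been saved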

Core dumps outside of the build environment do get saved. If a derivation build segfaults, do we have any mechanism for extracting the core dump before it is lost when the build environment is destroyed?
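One thing I plan to try is --keep-failed (-K), which, as far as I understand, keeps the failed build’s temporary directory instead of deleting it, so at least the test logs and whatever the crashing process left behind stay inspectable (store path is a placeholder again):

nix-store --realise --keep-failed /nix/store/<hash>-ell.drv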