Nix caches successful builds. Why doesn't it cache unsuccessful builds?

@edolstra Is there any way we could get it back? I would find immense value in this. The transient/permanent failure issue seems resolvable with a flag to simply force a rebuild.

1 Like

These are legitimate, but I believe surmountable, concerns. A few possible mitigations:

  • Flag/option to force CI to rebuild
  • Make it dead simple to remove failure entries from the Nix store. This should be easy since no other packages or store entries can depend on a failure log, so the GC story is very straightforward.

I agree with @samuela. When trying to bisect a failure (like the recent `python39Packages.sentry-sdk` build failure on x86_64-linux as of `891d0226` · Issue #169130 · NixOS/nixpkgs · GitHub), it takes ages to find the offending commit, or to fail in the attempt.
Caching failures would go a long way toward cutting down wasted time and resources when debugging.

3 Likes

Caching failures alone wouldn’t help you here (or only to an extent), since the failures would also need to be »substituted« from a binary cache.

Sure, I suppose that’s what I really meant by “cached”. In any case, there’s no reason to fail on the same build twice, even across machines. (Yes, I know we need to allow retries…)

This would probably be a huge pain. It is not uncommon for builds to fail on Hydra for “flaky” reasons: timeouts, OOM, flaky test suites, … These failures often need to be flagged manually and restarted, but the reverse dependencies of those jobs often are not, because Hydra’s UI is lacking in this area and it is not possible to mass-restart a job together with all of its reverse dependencies. Of course the “main” jobs are usually not affected by this, especially if they guard channel advancement, but for niche package sets and platforms like aarch64-linux and x86_64-darwin it happens regularly enough.

Practically this means that users would occasionally substitute failures for packages that would build perfectly well, albeit requiring a bit of local compilation. So before introducing such a feature, we would need to make our binary cache generation much more reliable, to avoid blocking users from building derivations that actually work fine.

3 Likes

nixbuild.net caches failed builds if you don’t opt out: Settings - nixbuild.net documentation. This works across user accounts, so if somebody else has built the exact same thing you are building, using the exact same (bit-for-bit identical) inputs, the failure log is replayed for you immediately.

3 Likes

@rickynils How do you know if a failure is transient or not?

@khaled In my experience, transient build failures are caused by:

  • Network access. There can be temporary issues with the local network or with remote hosts used by the build. Historically, all network access has been blocked for builds running on nixbuild.net, so this hasn’t been a cause of transient failures. However, we now support network access for fixed-output derivations (essentially how the Nix sandbox works), and we plan on simply turning off cached failures for fixed-output derivations to avoid caching transient network failures. Nix could perhaps do something similar: only cache failures when builds run in sandboxed mode, and never for fixed-output derivations (a rough sketch of such a policy follows this list).

  • Running out of memory or disk space. In such situations, nixbuild.net can reliably detect that the failure is transient since we run every build inside a virtualized (KVM) sandbox and can monitor resource usage. If a build runs out of memory we will simply restart it with more memory. All builds use tmpfs, so running out of disk space is the same as running out of memory. I believe Nix has some way to detect if a build failure was caused by lack of disk space, but I’m not sure how reliable it is. Presumably these things could be made pretty reliable inside the Nix sandbox, maybe by using cgroups.

  • Bugs in the sandbox. We’ve had cases where bugs or missing features in our sandbox have caused some builds to fail. These failures are trickier, but once we find such a failure we fix the sandbox and then simply invalidate all cached failures (including those that had nothing to do with the sandbox bug). During our first year this happened several times, but now things are pretty stable. Nix could also have a mechanism for invalidating cached failures during updates (if an update includes any such fixes). Back when Nix actually cached failures, I believe it also had a way to clear all cached failures manually.
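To make this concrete, here is a rough sketch, in Python-style pseudocode, of the kind of caching policy described above. None of these types, fields, or functions exist in Nix or nixbuild.net; they are purely illustrative assumptions about how such a policy could be expressed.

```python
from dataclasses import dataclass

# Hypothetical sketch only: none of these types, fields, or functions exist
# in Nix or nixbuild.net. They just illustrate the policy described above.

CURRENT_SANDBOX_VERSION = 3  # bumped whenever a failure-causing sandbox bug is fixed


@dataclass
class BuildResult:
    sandboxed: bool       # did the build run in the sandbox?
    fixed_output: bool    # fixed-output derivations may access the network
    timed_out: bool       # the builder hit the build-level timeout
    out_of_memory: bool   # resource exhaustion, e.g. detected via cgroups or a VM monitor
    out_of_disk: bool


@dataclass
class CachedFailure:
    drv_hash: str         # hash of the derivation's bit-for-bit inputs
    sandbox_version: int  # sandbox version that produced this cached failure


def should_cache_failure(result: BuildResult) -> bool:
    """Cache a failure only when it is very likely deterministic."""
    if not result.sandboxed:
        return False  # unsandboxed builds can fail for environmental reasons
    if result.fixed_output:
        return False  # network access makes these failures transient
    if result.timed_out or result.out_of_memory or result.out_of_disk:
        return False  # resource exhaustion says nothing about the derivation itself
    return True


def cached_failure_still_valid(entry: CachedFailure) -> bool:
    """When replaying a cached failure, discard entries from a since-fixed sandbox."""
    return entry.sandbox_version == CURRENT_SANDBOX_VERSION
```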

8 Likes

  1. Some builds are time-sensitive, especially checkPhase. If the build server is overloaded and cannot assign enough CPU cycles to the build, it can fail with a timeout.

Yes, but this is detected as a build timeout, and Nix can decide to not cache such failures.

1 Like

What they meant is that the derivation’s build system fails because it hit a timeout internally. Nix can’t detect that.

3 Likes

Ah, I see. If a test is time-sensitive (with a shorter timeout than the overall build timeout), it can introduce transient failures.
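To illustrate the distinction made in the last few posts: Nix knows when it killed a build for exceeding its own build timeout, but a timeout inside the derivation’s test suite just surfaces as an ordinary non-zero exit status. A toy sketch (hypothetical Python, not actual Nix code):

```python
import subprocess

# Toy illustration of the distinction discussed above; not Nix code.
# The build manager runs the builder as a child process with its own
# wall-clock timeout.

def run_builder(cmd: list[str], build_timeout: int) -> str:
    try:
        proc = subprocess.run(cmd, timeout=build_timeout)
    except subprocess.TimeoutExpired:
        # The build manager itself enforced the timeout, so it *knows* this
        # failure is time-related and could decide not to cache it.
        return "timed out (safe to treat as transient)"
    if proc.returncode != 0:
        # A test suite that hit its own internal timeout exits non-zero,
        # which is indistinguishable here from any other test failure.
        return "failed (cannot tell whether a flaky internal timeout was the cause)"
    return "succeeded"
```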

Here is one common use case that would really benefit from this feature: getting notified, before two hours of compilation, when switching to a broken nixos-unstable commit that Hydra already knows will fail to build: Stop a build that is known to fail (since hydra failed to build some dependencies) · Issue #7722 · NixOS/nix · GitHub

2 Likes