Optimize slow (4+ hour) evaluation

How do I approach optimizing an expression that takes a long time to evaluate? Is there any form of Nix profiler? Are there easy ways to experiment with techniques like memoization?

I mean this as a general issue but the expression that I am dealing with now is the hydraJobs attribute of flake github:lukego/nix-cl-report/ac1901afba8ca8501c12cfcd744b2ce0ae59a454.

This expression produces several thousand derivations that each reference large sets of build inputs. If I reduce the size of the build inputs then the expression evaluates much more quickly. I assume that a computation somewhere is being repeated.

I assume problems like this can be solved by strategically reusing/caching/memoizing values, but it’s not clear to me how to identify which values to cache or how to effect the caching.
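For what it’s worth, a minimal sketch (hypothetical names, not taken from the flake itself) of the kind of sharing I have in mind: bind an expensive value once in a `let`, and Nix’s thunk sharing means it is evaluated only once, however many jobs reference it:

```nix
let
  # Evaluated at most once; every reference below shares the same thunk.
  expensive = builtins.foldl' builtins.add 0 (builtins.genList (i: i) 100000);
in {
  jobA = expensive + 1;
  jobB = expensive + 2;  # reuses the already-forced value, no recomputation
}
```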

EDIT: My problem turns out to be specific to Hydra evaluating expressions using hydra-eval-jobs processes (see the discussion below).


AFAIK the Nix evaluator memoizes everything inherently. It’s guaranteed not to evaluate identical expressions twice.


I have not actively monitored the time spent on evaluation, but I ran the following command, and by the time I came back from a short discussion with a coworker it was already querying substituters. So evaluation didn’t take longer than ~15 minutes.

$ NIXPKGS_ALLOW_UNSUPPORTED_SYSTEM=1 nix build --impure github:lukego/nix-cl-report/ac1901afba8ca8501c12cfcd744b2ce0ae59a454#hydraJobs._000-report

That is very interesting. This expression takes hours for me to evaluate on my Hydra server running on an otherwise-idle bare metal Intel server.

Maybe the slow performance is Hydra-specific then?

I have to be honest: after printing the initial list of ~27000 items to build, it takes another 20 minutes before the first thing actually gets substituted.

After that, though, I have a constant flow of active downloads.

I do not know, though, what Hydra actually counts as “evaluation time”.

It might be that it counts the related builds as part of the evaluation here, since it can’t start a new evaluation as long as the related builds haven’t finished anyway.

I also seem to remember that, back when I still had a Hydra running, the displayed “eval time” did indeed depend on the amount of stuff to build.

Interesting. I don’t think that’s the situation here.

My “evaluation” has been running for over an hour now. I don’t see any downloads in the Hydra logs (`journalctl -f`) as I usually would.

I do see a procession of `hydra-eval-jobs` processes using ~100% CPU and ~4GB RAM. Running `top`, the picture is consistent, but there is a new PID on each refresh.

I wonder what’s going on? Is Hydra running a long series of sub-evaluations in separate processes before kicking off substitutions and builds?

I have no clue about hydra internals, sorry.

And my build has now failed, after an overall runtime of 45 minutes, as I do not have an aarch64-darwin builder available…

Though at a quick glance I can’t find an aarch64-darwin reference in your code… Weird…


`if d ? variant then d.variant else "base"` can be shortened to `d.variant or "base"`
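A small sketch (with a hypothetical attribute set `d`) showing the two spellings are equivalent; `or` falls back to the default when the attribute is missing:

```nix
let
  d = { name = "foo"; };  # no `variant` attribute
in {
  long  = if d ? variant then d.variant else "base";  # → "base"
  short = d.variant or "base";                        # → "base"
}
```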

Thanks for your help! This has narrowed the issue down significantly.

What you are experiencing is normal when building a staging branch that has no cache.

There is nix/stack-collapse.py at 2affb19c921c3c5c3f7628a65dbcdb74f36f5222 · NixOS/nix · GitHub, but it probably takes 100GB of disk space and at least 50GB of RAM.

I would first search through your code for anywhere you iterate over all your nixosConfigurations and perhaps recurse into them to find attributes about machines, e.g. to generate update scripts or the like.

I suspect the issue is that Hydra is evaluating job derivations separately and that this inhibits memoization.

I say this because running `top` on the Hydra server shows a series of short-lived (~1 sec) `hydra-eval-jobs` processes running at maximum CPU.

It seems unfortunate if (a) these processes are repeating a lot of evaluation work already performed by the previous process and (b) running one-at-a-time despite the machine having many CPU cores.

Maybe? The answers seem to live in this C++ code inside Hydra: hydra/hydra-eval-jobs.cc at d1fac69c213002721971cd983e2576b784677d40 · NixOS/hydra · GitHub

I thought that as well after reading Eelco’s thesis, but then I ran a test calling a function that performs a long-running computation twice with the same arguments, and from the time it took to run I’m 100% sure it ran the computation twice.
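A minimal sketch of that kind of test (hypothetical `slow` function, not the actual code I ran): calling the function twice with the same argument does the work twice, whereas a `let`-bound thunk is forced once and shared:

```nix
let
  # An artificially slow pure function.
  slow = n: builtins.foldl' builtins.add 0 (builtins.genList (i: i) n);
  shared = slow 1000000;
in {
  twice = slow 1000000 + slow 1000000;  # two full evaluations: applications are not memoized
  once  = shared + shared;              # one evaluation; the thunk's result is shared
}
```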