Some libraries/packages can be quite RAM-hungry during compilation (for example, scientific/CUDA packages). Unfortunately, Linux is generally not great at handling memory pressure. This isn't really a nix-specific problem, but nix/NixOS appears to handle it especially poorly.
Case in point, just now I was rebuilding my NixOS system after bumping the nixpkgs version, which triggered most of my overlayed scientific packages to be rebuilt. Unsurprisingly, this build eventually ran out of memory.
To add insult to injury, the OOM killer for some reason decided to kill .kwin_x11-wrapped, .plasmashell-wrapped, tilda (my terminal emulator), and a bunch of other random service processes instead of the build daemon.
Unsaved documents were lost.
My laptop has 32 GB of RAM, so this isn't exactly a "thin" machine. I use ZFS, so it's possible that the memory pressure was made slightly worse by the ARC not shrinking fast enough; however, zfs_arc_max defaults to 50% of RAM, so at least ~16 GB should still have been available for the build.
I should also mention that this is not the first time I have run into this issue (although none of the previous times resulted in a dead graphical session). After my first run-in with this problem, I started running all my full system builds with nix build --cores 10 (down from the default core count of 20) in the hopes of reducing the number of simultaneously built packages. I am hesitant to reduce this setting any further, as I actually want to build most packages in parallel. It's just the huge scientific packages that are "poor tenants".
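For reference, these knobs can also be set declaratively in the NixOS configuration. A sketch (the values are illustrative, not recommendations): note that `cores` limits threads *within* one build, while `max-jobs` bounds how many derivations build *concurrently*, which is usually the bigger lever for memory pressure.

```nix
# Sketch: `cores` is what `nix build --cores` overrides; `max-jobs`
# bounds the number of derivations built at the same time.
{
  nix.settings = {
    cores = 10;    # exported to builders as NIX_BUILD_CORES
    max-jobs = 4;  # concurrent derivations; the bigger lever for RAM pressure
  };
}
```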
So I guess there are two issues here:
Some derivations can be exceptionally memory-hungry. Is there any way to make nix play more nicely with such derivations (especially when there are multiple such packages in a closure)?
Assuming that an OOM does occur during a build, can we reduce the blast radius of the OOM killer (at least on NixOS)?
Some possible ideas:
Assign build processes higher OOM killer scores (or maybe even monitor the memory usage in the daemon and eagerly kill any runaway builds before they trigger the OOM)?
Enforce resource limits via cgroups (or similar)?
Assuming that we can contain/detect the OOM condition, maybe we could optionally retry building "fat" derivations one at a time (without resource contention from parallel builds).
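On the cgroups idea: since nix-daemon already runs as a systemd service on NixOS, one crude containment sketch is to cap the daemon's cgroup, so that runaway builds compete with each other rather than with the desktop session. MemoryHigh/MemoryMax are standard systemd resource-control settings; the 20G/24G values are purely illustrative for a 32 GB machine.

```nix
# Sketch: bound the whole build subtree's memory via systemd's cgroup
# controls on nix-daemon. Values are illustrative, not recommendations.
{
  systemd.services.nix-daemon.serviceConfig = {
    MemoryHigh = "20G"; # start reclaiming/throttling here
    MemoryMax = "24G";  # hard cap; OOM kills stay inside this cgroup
  };
}
```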
It adds an experimental feature cgroups that causes builds to be executed in a cgroup. This allows getting some statistics from a build (such as CPU time) and in the future may allow setting resource limits. But it mainly exists because the uid-range feature requires it.
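For completeness, opting into that experimental feature looks roughly like this (setting names follow recent Nix releases and may change while the feature is experimental):

```nix
# Sketch: enabling the experimental cgroups support via nix.settings.
{
  nix.settings = {
    experimental-features = [ "cgroups" ];
    use-cgroups = true;
  };
}
```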
Bazel has a concept of marking specific tasks as "big" (as well as small), which has implications for scheduling. I think it only does this for tests; presumably it has a fine-grained enough understanding of the build process itself to avoid scheduling too much memory-heavy work.
Since nix is not particularly fine-grained, it might be able to schedule things a bit better if derivations could contain a hint for how "large" a build is. We already have some control over scheduling with runCommandLocal and the like, so it's not entirely far-fetched.
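The existing big-parallel mechanism already looks a lot like such a hint; a derivation can declare it today via requiredSystemFeatures (though currently that only gates *which* machines may build the derivation, not how builds are scheduled against each other):

```nix
# Sketch: tagging a heavy derivation. A hypothetical "big-memory" feature
# could work the same way; "big-parallel" is the one that exists today.
stdenv.mkDerivation {
  pname = "heavy-scientific-package"; # hypothetical package
  version = "1.0";
  src = ./.;
  requiredSystemFeatures = [ "big-parallel" ];
}
```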
Can't you configure the OOM killer to favour your most critical processes and kill the build processes, like the children of the nix build daemon?
I think this would be the best solution. There is also kind of precedent for this with the big-parallel nix feature.
Nix could be made to assume big-parallel builds will use all the available resources and therefore schedule fewer builds while a big-parallel build is running (or even none).
cgroups won't really help here. Limiting the memory of a build will only serve to get it killed and therefore fail.
The issue is that we must tell the build process (as in: make or ninja) how parallel it can be once, up front, while the actually permissible parallelism during the build is very dynamic.
@Artturin yeah, I remember reading about this. However, I am not very familiar with cgroups, so I would prefer if there was a built-in way to configure this from nix rather than having to bodge it together myself.
@Mic92 @nixinator I am assuming that you are suggesting to set systemd.services.nix-daemon.serviceConfig.OOMScoreAdjust? I am not sure if that would do the "right thing"™. Wouldn't this result in killing the nix-daemon process itself instead of the "RAM-hungry" build process?
If this does indeed work as intended, maybe it should be exposed as nix.daemonOOMScoreAdjust and set to a sane default value out of the box?
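For reference, the setting in question looks like this. One relevant detail: oom_score_adj is inherited by forked children, so build processes spawned by the daemon would also become more attractive OOM victims; whether the daemon process itself then survives is exactly the open question here.

```nix
# Sketch: make the daemon (and, by inheritance, its build children)
# more likely OOM-killer targets. Valid range is -1000..1000;
# 250 is an illustrative value, not a tested recommendation.
{
  systemd.services.nix-daemon.serviceConfig.OOMScoreAdjust = 250;
}
```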
This should work as a short-term fix for the "don't let builds kill the graphical session" problem, but we might also want to improve "soft" OOM handling (not letting this happen in the first place OR automatically retrying/recovering when the build failed due to parallel-build-induced memory pressure).
@SergeK I think I remember trying earlyoom back when I was using Arch. I'll take another look at it. Thank you for the recommendation.
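For anyone else trying this, NixOS ships an earlyoom module, so no manual bodging is needed. A minimal sketch (option names from the current NixOS module; defaults may differ between releases):

```nix
# Sketch: enable the userspace earlyoom daemon, which kills the largest
# process before the kernel OOM killer has to step in.
{
  services.earlyoom = {
    enable = true;
    freeMemThreshold = 5; # act when less than 5% of RAM is free
  };
}
```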
@TLATER @Atemu building big-parallel (or maybe some new class like big-memory) tasks one at a time actually makes a lot of sense.
Also, regarding cgroups/kernel OOM killer/userspace OOM killer configurations, I wouldn't write them off so quickly. It's true that they won't allow the currently failing builds to succeed. However, it is inevitable that there will always be some derivations that occasionally run out of memory, despite our best efforts.
I think that constraining such build processes ought to be included in the "build sandboxing" that nix provides (I am aware that the nix build sandboxing is meant more for reproducibility than as a security feature, but my point still stands).
An additional optimisation that just sprang to mind: builds marked preferLocalBuild could be run with a higher job count than the system default, as they're assumed to be tiny.
Certainly. For properly killing a whole build, cgroups are an obvious boon. This would also allow the likes of systemd-oomd to kill builds pre-emptively.
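On NixOS, systemd-oomd can already be enabled today; a sketch (it acts on whole cgroups based on PSI memory-pressure signals, so it would pair naturally with running builds in their own cgroups):

```nix
# Sketch: enable systemd-oomd, which watches cgroup memory pressure (PSI)
# and kills entire cgroups before the kernel OOM killer fires.
{
  systemd.oomd.enable = true;
}
```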
Now, ideally, nix could even get smart enough to restart such a failed build, since it didn't actually fail because of some property of the build itself but rather because of an "environmental" factor; an impurity.