I’ve been hunting down a performance issue when running CPU-bound benchmarks via nix-build. I have a somewhat minimal reproducer at eugeneia/reproduce-nix-build-perf (github.com) which documents the setup.
Note, however, that I can only reproduce this on one particular server… So at this point I feel like I could probably purge the server and replace it and maybe never see this issue again, but I would like to understand it. So here goes…
On the server where this seems to be 100% deterministically reproducible, I’m basically seeing this: when I run the workload under nix-build I get ~1.3 instructions/cycle, and when I run the same executable manually I get ~2 instructions/cycle.
Some hints I have found while looking into this:
- I seem to need to do some pointer chasing to trigger this, i.e.:
- I have also looked at what’s actually happening on the CPU via AMDuProf, and Instruction Based Sampling (IBS) tells me that tag-to-retire of instructions surrounding loads/stores is heavily inflated, matching the tag-to-complete measurements of the interleaved loads/stores. (Are loads/stores slower for some reason?)
- Trying to understand data cache access using AMDuProf, I see different distributions of unaligned access ratios over the hot functions; maybe the slower run has slightly more total unaligned accesses. (I’m not super sure about this one: the differences are rather slight, this is sampling-based, and I am not super experienced in reading these tea leaves…)
To test my hunches I tried manually aligning the hot memory locations, but that didn’t change anything.
Being at a bit of a dead end now, I would like to attack the issue from the other side. What is it that nix-build does to trigger this issue on this particular machine? As far as I understand, the only difference would be that the workload uses a different linker when run under nix-build, but that doesn’t seem to be it:
max@spare-NFG2:~/test-nix$ time /nix/store/f3qlm2873bxlhxns4lrmrinvbzn933pj-glibc-2.34-210/lib/ld-linux-x86-64.so.2 result/bin/test
Received 1000000000 packets (fl.nfree=100000)
max@spare-NFG2:~/test-nix$ time result/bin/test
Received 1000000000 packets (fl.nfree=100000)
What did I miss? Any ideas how I can further reproduce the nix-build environment without actually invoking nix-build? Are there any resources I could read to better understand how nix-build creates the environment in which the executable is run? (I.e., how does it make it use the nix linker and dynamic shared objects?)
Thanks in advance!
I suppose you already looked at this, but sometimes obvious causes are worth asking about.
Are you sure nix-build is starting only one instance of your benchmark?
Did you check with top/htop that nothing else was using some CPU while your benchmark is running through nix-build?
On NixOS at least, you will also by default be building inside a sandbox: nix.conf
If you haven’t disabled it yet, that might be a quick way to test one of the environment changes that might cause issues here.
The rest of the environment can be reproduced using nix-shell; see e.g. the nix shell examples section for a quick look at how you would do that. I don’t think this sets up the sandbox by default, but once you’ve disabled sandboxing you’ll know if it’s the sandbox that causes these differences.
Sadly can’t find any particularly good docs otherwise. The nix pills cover the build process in a lot of detail, and may ultimately be what you need to understand: NixOS - Nix Pills
On top of that, finding stdenv.mkDerivation in nixpkgs and reading the script might help.
I think that I covered those bases by having
~$ cat /etc/nix/nix.conf
sandbox = false
max-jobs = 1
And the CPU should be isolated and otherwise idle. I have checked via htop and not seen anything suspicious, but that seems error-prone. However, the default.nix I use shouldn’t allow for any parallelism either, I think?
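Since eyeballing htop is error-prone, one alternative is to sample /proc/stat twice and compute how busy each core was in between. The following is only a rough sketch: the function names are made up for illustration, and it assumes the standard Linux /proc/stat layout (user, nice, system, idle, iowait, irq, softirq, steal).

```python
import sys
import time


def parse_proc_stat(text):
    """Return {cpu_name: (busy_ticks, total_ticks)} for each per-CPU line."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-CPU lines look like "cpu3 user nice system idle iowait irq ...".
        # Skip the aggregate "cpu" line and every non-CPU line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        ticks = [int(x) for x in fields[1:]]
        # Count idle and iowait as "not busy".
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
        # Only the first eight columns: guest time is already part of user/nice.
        total = sum(ticks[:8])
        result[fields[0]] = (total - idle, total)
    return result


def busy_fraction(before, after):
    """Return {cpu_name: fraction of time spent busy between two snapshots}."""
    fractions = {}
    for cpu, (busy1, total1) in after.items():
        busy0, total0 = before.get(cpu, (0, 0))
        elapsed = total1 - total0
        fractions[cpu] = (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return fractions


def snapshot():
    with open("/proc/stat") as f:
        return parse_proc_stat(f.read())


# Usage (hypothetical file name): python3 cpu_busy.py <interval-seconds>
if __name__ == "__main__" and len(sys.argv) == 2:
    first = snapshot()
    time.sleep(float(sys.argv[1]))
    for cpu, fraction in sorted(busy_fraction(first, snapshot()).items()):
        print(f"{cpu}: {fraction:.1%} busy")
```

Running this while the benchmark executes under nix-build, and again while it runs manually, should show whether any core other than the benchmark's is unexpectedly busy in one of the two cases.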
I can’t reproduce this in
Very weird! Any chance it’s a file system thing? The only remaining potential difference I know of off the top of my head is that nix-build will by default create a temporary working directory in /tmp, though maybe you did so when using
Could you monitor the CPU frequency? (with s-tui for example)
Maybe the CPU frequency isn’t scaling exactly the same when it’s done by the builder (maybe due to cgroups, nice, namespaces or whatever that would limit its impact on the system by throttling performance)
Frequency is the same: scaling is disabled in the BIOS, and perf also reports identical frequencies in both cases. I was thinking of something like this in the beginning, but there is no nice value set as per htop, and at least forcing the process into my user’s cgroup/slice doesn’t make a difference.
Hmm the program shouldn’t touch the filesystem. strace also doesn’t show anything unusual as far as I can tell.
You could try manually running the benchmark from inside the build env, see if that makes a difference. To get an interactive shell inside the build container, put a long sleep at the end of the install phase and run
nsenter --target $(pidof sleep) --all -- /bin/sh
Interesting! However in my case this doesn’t do anything since the sandbox is disabled. (I’ve checked and the namespaces are the same in both environments.)
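For anyone wanting to repeat the namespace comparison, here is a minimal sketch of one way to do it. It is not necessarily how it was checked above. It relies on the /proc/<pid>/ns symlinks that Linux exposes; the helper names are invented for this example, and reading another user's process may require root.

```python
import os
import sys


def read_namespaces(pid):
    """Map namespace name -> identifier such as 'mnt:[4026531840]'."""
    ns_dir = f"/proc/{pid}/ns"
    return {
        name: os.readlink(os.path.join(ns_dir, name))
        for name in sorted(os.listdir(ns_dir))
    }


def diff_maps(a, b):
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Usage (hypothetical file name): python3 ns_diff.py <pid-a> <pid-b>
if __name__ == "__main__" and len(sys.argv) == 3:
    differences = diff_maps(
        read_namespaces(sys.argv[1]), read_namespaces(sys.argv[2])
    )
    for name, (left, right) in differences.items():
        print(f"{name}: {left} != {right}")
    if not differences:
        print("all namespaces identical")
```

The two PIDs would be the benchmark process started by nix-build and the one started by hand; identical identifiers for every entry mean both processes share all namespaces.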
Uhm… so I’ve tried to compare a single-user installation of nix vs. a multi-user/daemon installation. Those themselves don’t seem to differ. However, on the way I noticed that the issue goes away if I run:
nix-build --no-sandbox --allow-new-privileges
Neither --no-sandbox nor --allow-new-privileges has the effect when supplied individually without the other.
So as long as I have both I get the “good” perf, once I remove either I get the bad perf. (In my current test I don’t escape the sandbox or elevate privileges either fwiw.)
Umm… I guess --allow-new-privileges isn’t a no-op, because it allows getting new privileges in the current namespace maybe?
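One way to see what the flag actually changes for the running process might be to compare a few per-process fields from /proc/<pid>/status between the fast and slow runs. This is a sketch under assumptions: that --allow-new-privileges maps to the kernel's no_new_privs flag is a guess, and which of the listed fields (all documented in proc(5)) matter here is not established in this thread. The function names are illustrative only.

```python
import sys

# Fields from /proc/<pid>/status (see proc(5)). Which of them are relevant to
# this issue is an assumption, not something established in the thread.
FIELDS = ("NoNewPrivs", "Seccomp", "Speculation_Store_Bypass", "Cpus_allowed_list")


def parse_status(text, fields=FIELDS):
    """Return {field: value} for the selected 'Key:<tab>value' lines."""
    found = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in fields:
            found[key] = value.strip()
    return found


def read_status(pid):
    with open(f"/proc/{pid}/status") as f:
        return parse_status(f.read())


# Usage (hypothetical file name): python3 status_diff.py <pid-a> <pid-b>
if __name__ == "__main__" and len(sys.argv) == 3:
    left, right = read_status(sys.argv[1]), read_status(sys.argv[2])
    for field in FIELDS:
        marker = "  " if left.get(field) == right.get(field) else "!="
        print(f"{marker} {field}: {left.get(field)} | {right.get(field)}")
```

Lines marked with != are fields that differ between the two processes and would be natural candidates to investigate further.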
This is very weird. Does gcc make use of anything exotic? Is this some overhead in perf itself?
I can reproduce this regardless of using perf or not. Can’t really comment on gcc. Should be default nixpkgs gcc and a rather simple C program.