Pre-RFC: Gradual Transition of NixOS x86_64 Baseline to x86-64-v3 with an Intermediate Step to x86-64-v2

There’s rarely such a thing as “optimising one package”. The vast majority of packages use libraries for much (if not most) of their work. This means these libraries also influence the package’s performance.

For example, optimising ffmpeg alone would likely gain very little (if any) Vorbis encoding performance, because ffmpeg delegates the performance-critical part to a dedicated library (libvorbis).

*no ability to use the caches.

You likely wouldn’t even be able to boot an installer ISO on an unsupported CPU.

Raising the default baseline will exclude the majority of users of then-unsupported hardware. There’s no way around it.

Caching just the stdenv wouldn’t really help anyone. If you’re going to build your average desktop closure, having the stdenv cached or not doesn’t make a huge difference.

Suffice it to say, this is not a feasible option if you want users of then-unsupported hardware to still be able to reasonably use NixOS, especially considering that their hardware is likely poorly suited to compiling modern software.

If anyone should be building their own packages, it’s the users of powerful modern CPUs who need the handful of % mean improvement for …what reason exactly?


Anyhow, I think it’s reasonably well established that:

  • There is a point to this. With v3, the increase is small and highly package-dependent, but it’s there, it’s real, and likely desirable, especially for those few packages.
  • There are many users who would be effectively excluded from using the distro were the baseline raised to v3 and even some who would be excluded by v2.
  • We can have our cake and eat it too at the expense of a closure size increase (how much?) by using glibc-hwcaps in addition to the $PATH workaround for the executables themselves.

The next step is to build the code infrastructure for glibc-hwcaps usage and use it on packages and libraries that benefit.
Until that is done and well established, I think we can pause any discussion on raising the baseline.
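
For a rough idea of what that could look like on the packaging side: glibc ≥ 2.33 searches a glibc-hwcaps/x86-64-v3 (or -v2/-v4) subdirectory of each library path before falling back to the baseline library. Below is a minimal Nix sketch of the idea, assuming the glibc in the closure has hwcaps lookup enabled; nothing here is existing nixpkgs infrastructure, and libvorbis is only a stand-in example:

```nix
# Rough sketch (no such helper exists in nixpkgs yet): ship the baseline
# library together with an x86-64-v3 build placed in the glibc-hwcaps
# search path. glibc >= 2.33 checks <dir>/glibc-hwcaps/x86-64-v3/ before
# <dir>/ itself on CPUs that support that level.
{ stdenv, libvorbis }:

let
  # The same library, rebuilt with -march=x86-64-v3 (illustrative flag choice).
  libvorbis-v3 = libvorbis.overrideAttrs (old: {
    NIX_CFLAGS_COMPILE =
      toString (old.NIX_CFLAGS_COMPILE or "") + " -march=x86-64-v3";
  });
in
stdenv.mkDerivation {
  pname = "libvorbis-with-hwcaps";
  inherit (libvorbis) version;
  dontUnpack = true;
  installPhase = ''
    mkdir -p $out/lib/glibc-hwcaps/x86-64-v3
    # The baseline build remains the default.
    ln -s ${libvorbis}/lib/* $out/lib/
    # The v3 variant is only picked up by the dynamic loader on capable CPUs.
    ln -s ${libvorbis-v3}/lib/libvorbis*.so* $out/lib/glibc-hwcaps/x86-64-v3/
  '';
}
```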

7 Likes

Would this change only apply to Linux or also include Darwin?

Darwin has two levels of support:

  • x86_64: Implies SSE3 but not necessarily SSE4.2.
  • x86_64h: This appears equivalent to x86-64-v3. It targets Haswell CPUs or newer.

The only caveat is that Rosetta 2 does not support AVX, so it can’t really benefit from x86_64h. However, the most likely approach would be to build fat binaries supporting both architectures, so it could continue to run x86_64 code. And if you’re really concerned about performance on an aarch64 Mac, you should be using native code anyway.

(Whether it’s worth the effort is a separate question. My inclination is to say it’s not.)

Seems SerpentOS is going to use a custom superset of v3 called v3x.

Can we add an option, perhaps to stdenv, that selects the x86_64 baseline? Then we could have multiple versions of each package with different baselines in the cache and let the user choose based on their hardware.

The setting already exists: gcc.arch = "x86-64-v2". Caching multiple architecture levels would require a lot of additional cache space and build-farm time.
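
For reference, a minimal sketch of how that setting is applied system-wide, assuming the nixpkgs.hostPlatform module option; with no official cache for non-baseline levels, this implies rebuilding everything locally:

```nix
# configuration.nix (sketch): build the whole system for the x86-64-v2 baseline.
{
  nixpkgs.hostPlatform = {
    system = "x86_64-linux";
    gcc.arch = "x86-64-v2"; # passed to the compiler as -march=x86-64-v2
  };
}
```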

3 Likes

Considering Nix is source-based, wouldn’t it be possible to still support older microarchitectures by just not using the cache?

Older microarchitectures have less processing power for compiling software. Therefore they need cached builds the most.

1 Like

See discussion in Pre-RFC: Gradual Transition of NixOS x86_64 Baseline to x86-64-v3 with an Intermediate Step to x86-64-v2 - #51 by Atemu

It appears I must retract a statement of mine again in light of new data.

I’ve recently been reminded of a talk I saw a few years ago where evidence was presented that LLVM 3.1’s -O3 shows no significant improvement over -O2 if you continuously randomise the memory layout.

-O2 vs -O3 is not the topic of this thread of course but it’s closely related. An important detail I had forgotten was just how significant the measurement bias caused by memory layout is and how easily it’s caused.

In short: even extremely minor changes, such as a different-length UNIX user name or a different working directory, can cause very significant changes in the performance of the same binary.
If @nyanbinary and I ran the same executable in the same hardware and software environment, we could realistically see something on the order of a 10% performance difference just because my UNIX user name is shorter.

This is a huge issue because the slightly different memory layout due to slightly different code generation (i.e. different -march) can cause significant performance differentials even if that slightly different code itself causes no significant difference at all.

The paper which first reported on this issue claims variance of upwards of 20% in industry standard benchmarks (though usually lower) just due to such measurement biases.

Any experiment which shows a significant but not immense increase from changing the compiler’s µ-arch target could be fully explained by changes in memory layout alone. Something that looks like a slight improvement could in actuality be a slight regression masked by the measurement bias.

The authors of the original paper propose a method to control for some of the biases and have shown that a sufficiently diverse set of benchmarks can mitigate the measurement bias to some degree.
The presenter of the talk I watched (and author of the paper it’s based upon) developed an effective technique to control for this bias called Stabilizer, which continuously re-randomises the memory layout down to individual functions at runtime, thereby removing any possibility of their placement impacting performance.
Sadly, the original project has not been developed further and is not compatible with modern clang or gcc. There are two forks which have attempted to support newer versions of clang (12, 16), but they do not appear to be very mature.

Stabilizer talk recording on YouTube: https://youtu.be/r-TLSBdHe1A (well worth watching)
Stabilizer paper: https://people.cs.umass.edu/~emery/pubs/stabilizer-asplos13.pdf
“Producing Wrong Data Without Doing Anything Obviously Wrong!”: https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf


Given how none of the data presented so far even attempts to control for this bias, I am no longer sure this is worth pursuing at all, even in the “lite” form of glibc-hwcaps.

I now also need to question the majority of benchmarks of software I see anywhere because, to my knowledge, none of those control for this bias either.

8 Likes

I decided to try rebuilding my laptop with various arch flags and found a lot of build failures, such as libvorbis segfaulting with a zen4 -march, X.509 certificate validators failing on x86-64-v3, and more. I think having a parallel build pipeline for at least a core subset of packages across different architectures would be useful to shake out these failures early.

There are a lot of derivations in the store that don’t care what platform they’re running on (think pure-Python, pure-Ruby, npm packages), which get this implied dependency on the architecture because they’re force-built against a particular version of the host interpreter just in case it relies on a C library, which itself is arch-dependent.

The design of Nix makes this really hard to do, but surely there’s got to be a way to split these derivations out into architecture-agnostic package sets. Perhaps content-addressed packages could provide a way to collapse truly arch-independent packages down automatically, by collapsing N different derivations that all build to the same set of files into a single output?
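
For what it’s worth, a minimal sketch of opting a single (hopefully architecture-independent) derivation into content addressing; ca-derivations is still an experimental Nix feature, and mypkg is a placeholder. Two builds whose outputs are bit-identical then resolve to the same store path:

```nix
# Requires `experimental-features = ca-derivations` in nix.conf.
mypkg.overrideAttrs (old: {
  __contentAddressed = true;    # make the store path depend on the output's contents
  outputHashMode = "recursive";
  outputHashAlgo = "sha256";
})
```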

2 Likes

So even for computationally demanding software, the performance difference between v2 and v3 is mostly within the error margins.

My takeaway: Stick with current settings and add a good guide that:

  • explains how to optimize individual packages by setting architecture feature flags (a minimal sketch follows after this list)
  • explains how to confirm that there is an actual speed-up (which is not guaranteed, see zstd compression in the CachyOS benchmarks) on the target machine and with the expected workload.
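
A minimal sketch of the first point, using an overlay; the package and -march value are just examples, and whether it actually helps still has to be measured on the target machine, per the second point:

```nix
# overlay.nix (sketch): optimise a single package for a specific CPU family
# without touching the global baseline.
final: prev: {
  ffmpeg-znver3 = prev.ffmpeg.overrideAttrs (old: {
    NIX_CFLAGS_COMPILE =
      toString (old.NIX_CFLAGS_COMPILE or "") + " -march=znver3";
  });
}
```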

But I guess people that have a real requirement for these optimizations (e.g. in the HPC context) are already tuning their packages.

5 Likes

Came here to post this too.

The general observation seems to be that there’s not much difference except in some particular cases, which matches the discussion so far.

But who would have guessed that PHP was one of those cases?!

Yep, I had this intuition when I was doing performance diagnostics on Nextcloud deployments and looking at perf for a while. Alas, I think PHP is not exploiting the hardware in any serious way via cpuid.

But this is an interesting observation; it makes a compelling case for encouraging people to use PHP built for a higher tier if they need that performance boost on otherwise modest CPUs that nevertheless support the newer instruction sets.

Though, I guess now it’d be interesting to do real world benchmarks on PHP applications. :slight_smile:

Just a footnote:

performance difference between v2 and v3 is mostly within the error margins.

Ok, but we’re not even enabling v2 yet IIRC. It seems an easy win to enable compilation for v2 given its broad compatibility (roughly any post-2009 CPU). This would allow us to test the waters and identify misbehaving packages/toolchains early, instead of relying on random end-user builds to discover packages that break with arch-specific flags and then having to patch in exceptions for those packages.

Also, eventually I’d love to see PGO/BOLT-enabled packages on NixOS, but it seems unlikely we’d get there if we can’t stabilize builds that enable post-2009 CPU features.

I don’t see any data suggesting a significant “win” if you account for the insane error margins that the measurement bias imposes.

PGO/BOLT don’t exclude any hardware to my knowledge. They are an entirely tangential topic.

2 Likes

Pre-RFC: Gradual Transition of NixOS x86_64 Baseline to x86-64-v3 with an Intermediate Step to x86-64-v2 - #38 by riceicetea
we are not using -O3 because it sometimes decreases performance (through aggressive unrolling, etc.).

But has anyone ever benchmarked a full -O3 system against an -O2 system?

I’ve asked the CachyOS people (ricers, I know!) about this kind of thing over Telegram, and their response is that 2020s CPUs actually respond well to -O3, but only if you also use x86-64-v3.

See also Sunnyflunk.

Pre-RFC: Gradual Transition of NixOS x86_64 Baseline to x86-64-v3 with an Intermediate Step to x86-64-v2 - #39 by Atemu
-O3 potentially includes optimisations that produce unsafe and/or wrong code.

Are you confusing -O3 with -Ofast, or are you talking about undefined behavior getting surfaced? Either way, it’s not entirely the compiler’s fault, and buggy packages can be marked as -O2-only anyway until more investigation is done.

(There are some more commonly relied-upon forms of UB, but compilers have accommodating flags like -fwrapv; some packages use -fno-strict-aliasing too.)
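
To make the “marked as -O2-only” idea a bit more concrete, a rough, untested overlay sketch; it assumes nixpkgs’ stdenvAdapters.withCFlags adapter (argument order worth double-checking), and the package names are placeholders:

```nix
# overlay.nix (rough sketch): opt selected packages into -O3, while anything
# known to misbehave simply stays on the default -O2 stdenv.
final: prev:
let
  # A stdenv that appends -O3 to the compiler flags of packages built with it.
  stdenvO3 = prev.stdenvAdapters.withCFlags [ "-O3" ] prev.stdenv;
in {
  x265 = prev.x265.override { stdenv = stdenvO3; };
  zstd = prev.zstd.override { stdenv = stdenvO3; };
  # packages that break under -O3 are just not listed here
}
```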

That’s nice but that’s not data.

There are a ton of things you can unknowingly do wrong when producing such data (see my previous post), so even if it were data, it would have to be extremely clear and plentiful, not just a minor difference in a subset of benchmarks.

I’m not entirely sure where I got that from, so it might not be true (anymore?), but I don’t mean -Ofast. -Ofast enables things that knowingly break certain aspects of specifications/standards.

Long ago, gcc -O3 really used to produce buggy code relatively commonly, IIRC. I don’t think that holds anymore. But if -O3 were a good default in general (e.g. for a whole distro), I wonder why -O2 is still the conventional default.

3 Likes