Pre-RFC: Gradual Transition of NixOS x86_64 Baseline to x86-64-v3 with an Intermediate Step to x86-64-v2

There’s rarely such a thing as “optimising one package”. The vast majority of packages use libraries for much (if not most) of their work. This means these libraries also influence the package’s performance.

For example, optimising ffmpeg alone would likely gain very little (if any) Vorbis encoding performance, because ffmpeg delegates the performance-critical part to a dedicated library (libvorbis).

*no ability to use the caches.

You likely wouldn’t even be able to boot an installer ISO on an unsupported CPU.

Raising the default baseline will exclude the majority of users of then-unsupported hardware. There’s no way around it.

Caching just the stdenv wouldn’t really help anyone. If you’re going to build your average desktop closure, having the stdenv cached or not doesn’t make a huge difference.

Suffice it to say, this is not a feasible option if you want users of then-unsupported hardware to still be able to reasonably use NixOS, especially considering that their hardware is likely poorly suited to compiling modern software.

If anyone should be building their own packages, it’s the users of powerful modern CPUs who need the handful of % mean improvement for …what reason exactly?


Anyhow, I think it’s reasonably well established that:

  • There is a point to this. With v3, the increase is small and highly package-dependent, but it’s there, it’s real, and likely desirable, especially for those few packages.
  • There are many users who would be effectively excluded from using the distro were the baseline raised to v3 and even some who would be excluded by v2.
  • We can have our cake and eat it too at the expense of a closure size increase (how much?) by using glibc-hwcaps in addition to the $PATH workaround for the executables themselves.

The next step is to build the code infrastructure for glibc-hwcaps usage and use it on packages and libraries that benefit.
Until that is done and well established, I think we can pause any discussion on raising the baseline.
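
For a rough idea of what that could look like on the packaging side: glibc ≥ 2.33 searches a glibc-hwcaps/x86-64-v3 (or -v2/-v4) subdirectory of each library path before falling back to the baseline library. Below is a minimal Nix sketch of the idea, assuming the glibc in the closure has hwcaps lookup enabled; nothing here is existing nixpkgs infrastructure, and libvorbis is only a stand-in example:

```nix
# Rough sketch (no such helper exists in nixpkgs yet): ship the baseline
# library together with an x86-64-v3 build placed in the glibc-hwcaps
# search path. glibc >= 2.33 checks <dir>/glibc-hwcaps/x86-64-v3/ before
# <dir>/ itself on CPUs that support that level.
{ stdenv, libvorbis }:

let
  # The same library, rebuilt with -march=x86-64-v3 (illustrative flag choice).
  libvorbis-v3 = libvorbis.overrideAttrs (old: {
    NIX_CFLAGS_COMPILE =
      toString (old.NIX_CFLAGS_COMPILE or "") + " -march=x86-64-v3";
  });
in
stdenv.mkDerivation {
  pname = "libvorbis-with-hwcaps";
  inherit (libvorbis) version;
  dontUnpack = true;
  installPhase = ''
    mkdir -p $out/lib/glibc-hwcaps/x86-64-v3
    # The baseline build remains the default.
    ln -s ${libvorbis}/lib/* $out/lib/
    # The v3 variant is only picked up by the dynamic loader on capable CPUs.
    ln -s ${libvorbis-v3}/lib/libvorbis*.so* $out/lib/glibc-hwcaps/x86-64-v3/
  '';
}
```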

7 Likes

Would this change only apply to Linux or also include Darwin?

Darwin has two levels of support:

  • x86_64: Implies SSE3 but not necessarily SSE4.2.
  • x86_64h: This appears equivalent to x86-64-v3. It targets Haswell CPUs or newer.

The only caveat is that Rosetta 2 does not support AVX, so it can’t really benefit from x86_64h. However, the most likely approach would be to build fat binaries supporting both architectures, so it could continue to run x86_64 code. And if you’re really concerned about performance on an aarch64 Mac, you should be using native code anyway.

(Whether it’s worth the effort is a separate question. My inclination is to say it’s not.)

Seems SerpentOS is going to use a custom superset of v3 called v3x.

Can we add an option, perhaps to stdenv, that selects the x86_64 baseline? Then we could have multiple versions of each package with different baselines in the cache and let the user choose based on their hardware.

The setting already exists: gcc.arch = "x86-64-v2". Caching multiple architecture levels would require a lot of additional cache space and build-farm time.
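
For reference, a minimal sketch of how that setting is applied system-wide, assuming the nixpkgs.hostPlatform module option; with no official cache for non-baseline levels, this implies rebuilding everything locally:

```nix
# configuration.nix (sketch): build the whole system for the x86-64-v2 baseline.
{
  nixpkgs.hostPlatform = {
    system = "x86_64-linux";
    gcc.arch = "x86-64-v2"; # passed to the compiler as -march=x86-64-v2
  };
}
```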

3 Likes

Considering Nix is source-based, wouldn’t it be possible to still support older microarchitectures by just not using the cache?

Older microarchitectures have less processing power for compiling software. Therefore they need cached builds the most.

1 Like

See discussion in Pre-RFC: Gradual Transition of NixOS x86_64 Baseline to x86-64-v3 with an Intermediate Step to x86-64-v2 - #51 by Atemu

It appears I must retract a statement of mine again in light of new data.

I’ve recently been reminded of a talk I saw a few years ago where evidence was presented that LLVM 3.1’s -O3 shows no significant improvement over -O2 if you continuously randomise the memory layout.

-O2 vs -O3 is not the topic of this thread of course but it’s closely related. An important detail I had forgotten was just how significant the measurement bias caused by memory layout is and how easily it’s caused.

In short: even extremely minor changes, such as a different-length UNIX user name or a different working directory, can cause very significant changes in the performance of the same binary.
If @nyanbinary and I ran the same executable in the same hardware and software environment, we could realistically see something on the order of a 10% performance difference just because my UNIX user name is shorter.

This is a huge issue because the slightly different memory layout due to slightly different code generation (i.e. different -march) can cause significant performance differentials even if that slightly different code itself causes no significant difference at all.

The paper which first reported on this issue claims variance of upwards of 20% in industry standard benchmarks (though usually lower) just due to such measurement biases.

Any experiment which shows a significant but not immense increase from changing the compiler’s µ-arch target could be fully explained by changes in memory layout alone. Something that looks like a slight improvement could in actuality be a slight regression masked by the measurement bias.

The authors of the original paper propose a method to control for some of the biases and have shown that a sufficiently diverse set of benchmarks can mitigate the measurement bias to some degree.
The presenter of the talk I watched (and author of the paper it’s based upon) developed an effective technique to control for this bias called Stabilizer, which continuously re-randomises the memory layout down to individual functions at runtime, thereby removing any possibility of their placement impacting performance.
Sadly, the original project has not been developed further and is not compatible with modern clang or gcc. There are two forks which have attempted to support newer versions of clang (12, 16), but they do not appear to be very mature.

Stabilizer talk recording on YouTube: https://youtu.be/r-TLSBdHe1A (well worth watching)
Stabilizer paper: https://people.cs.umass.edu/~emery/pubs/stabilizer-asplos13.pdf
“Producing Wrong Data Without Doing Anything Obviously Wrong!”: https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf


Given how none of the data presented so far even attempts to control for this bias, I am no longer sure this is worth pursuing at all, even in the “lite” form of glibc-hwcaps.

I now also need to question the majority of benchmarks of software I see anywhere because, to my knowledge, none of those control for this bias either.

8 Likes

I decided to try rebuilding my laptop with various arch flags and found a lot of build failures, such as libvorbis segfaulting with a zen4 -march, X.509 certificate validators failing on x86-64-v3, and more. I think having a parallel build pipeline for at least a core subset of packages across different architectures would be useful to shake out these failures early.

There are a lot of derivations in the store that don’t care what platform they’re running on (think pure-Python, pure-Ruby, npm packages), which get this implied dependency on the architecture because they’re force-built against a particular version of the host interpreter just in case it relies on a C library, which itself is arch-dependent.

The design of Nix makes this really hard to do, but surely there’s got to be a way to split these derivations out into architecture-agnostic package sets. Perhaps content-addressed packages could provide a way to collapse truly arch-independent packages down automatically, by collapsing N different derivations that all build to the same set of files into a single output?
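
For what it’s worth, a minimal sketch of opting a single (hopefully architecture-independent) derivation into content addressing; ca-derivations is still an experimental Nix feature, and mypkg is a placeholder. Two builds whose outputs are bit-identical then resolve to the same store path:

```nix
# Requires `experimental-features = ca-derivations` in nix.conf.
mypkg.overrideAttrs (old: {
  __contentAddressed = true;    # make the store path depend on the output's contents
  outputHashMode = "recursive";
  outputHashAlgo = "sha256";
})
```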

2 Likes

So even for computationally demanding software, the performance difference between v2 and v3 is mostly within the error margins.

My takeaway: Stick with current settings and add a good guide that:

  • explains how to optimize individual packages by setting architecture feature flags (a minimal sketch follows after this list)
  • explains how to confirm that there is an actual speed-up (which is not guaranteed, see zstd compression in the CachyOS benchmarks) on the target machine and with the expected workload.
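
A minimal sketch of the first point, using an overlay; the package and -march value are just examples, and whether it actually helps still has to be measured on the target machine, per the second point:

```nix
# overlay.nix (sketch): optimise a single package for a specific CPU family
# without touching the global baseline.
final: prev: {
  ffmpeg-znver3 = prev.ffmpeg.overrideAttrs (old: {
    NIX_CFLAGS_COMPILE =
      toString (old.NIX_CFLAGS_COMPILE or "") + " -march=znver3";
  });
}
```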

But I guess people that have a real requirement for these optimizations (e.g. in the HPC context) are already tuning their packages.

5 Likes

Came here to post this too.

The general observation seems to be that there’s not much difference except in some particular cases, which matches the discussion so far.

But who would have guessed that PHP was one of those cases?!

Yep, I had this intuition when I was doing performance diagnostics on Nextcloud deployments and looking at perf for a while. Alas, I think PHP is not exploiting the hardware in any serious way via cpuid.

But this is an interesting observation; it makes a compelling case for encouraging people to use PHP built for a higher tier if they need that performance boost on otherwise modest CPUs that nevertheless support the newer instruction sets.

Though, I guess now it’d be interesting to do real world benchmarks on PHP applications. :slight_smile:

Just a footnote:

performance difference between v2 and v3 is mostly within the error margins.

Ok, but we’re not even enabling v2 yet IIRC. It seems an easy win to enable compilation for v2 given its broad compatibility (roughly any post-2009 CPU). This would allow us to test the waters and identify misbehaving packages/toolchains early, instead of relying on random end-user builds to discover packages that break with arch-specific flags and then having to patch in exceptions for those packages.

Also, eventually I’d love to see PGO/BOLT-enabled packages on NixOS, but it seems unlikely we’d get there if we can’t stabilize builds that enable post-2009 CPU features.

I don’t see any data suggesting a significant “win” if you account for the insane error margins that the measurement bias imposes.

PGO/BOLT don’t exclude any hardware to my knowledge. They are an entirely tangential topic.

2 Likes

Pre-RFC: Gradual Transition of NixOS x86_64 Baseline to x86-64-v3 with an Intermediate Step to x86-64-v2 - #38 by riceicetea
we are not using -O3 because it sometimes decreases performance (through aggressive unrolling, etc.).

But has anyone ever benchmarked a full -O3 system against an -O2 system?

I’ve asked the CachyOS people (ricers, I know!) about this kind of thing over Telegram, and their response is that 2020s CPUs actually respond well to -O3, but only if you also use x86-64-v3.

See also Sunnyflunk.

Pre-RFC: Gradual Transition of NixOS x86_64 Baseline to x86-64-v3 with an Intermediate Step to x86-64-v2 - #39 by Atemu
-O3 potentially includes optimisations that produce unsafe and/or wrong code.

Are you confusing -O3 with -Ofast, or are you talking about undefined behavior getting surfaced? Either way, it’s not entirely the compiler’s fault, and buggy packages can be marked as -O2-only anyway until more investigation is done.

(There are some more commonly relied-upon forms of UB, but compilers have accommodating flags like -fwrapv; some packages use -fno-strict-aliasing too.)
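
To make the “marked as -O2-only” idea a bit more concrete, a rough, untested overlay sketch; it assumes nixpkgs’ stdenvAdapters.withCFlags adapter (argument order worth double-checking), and the package names are placeholders:

```nix
# overlay.nix (rough sketch): opt selected packages into -O3, while anything
# known to misbehave simply stays on the default -O2 stdenv.
final: prev:
let
  # A stdenv that appends -O3 to the compiler flags of packages built with it.
  stdenvO3 = prev.stdenvAdapters.withCFlags [ "-O3" ] prev.stdenv;
in {
  x265 = prev.x265.override { stdenv = stdenvO3; };
  zstd = prev.zstd.override { stdenv = stdenvO3; };
  # packages that break under -O3 are just not listed here
}
```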

That’s nice but that’s not data.

There are a ton of things you can unknowingly do wrong when producing such data (see my previous post), so even if it were data, it would have to be extremely clear and plentiful, not just a minor difference in a subset of benchmarks.

I’m not entirely sure where I got that from, so it might not be true (anymore?), but I don’t mean -Ofast. -Ofast enables things that knowingly break certain aspects of specifications/standards.

Long ago, gcc -O3 really used to produce buggy code relatively commonly, IIRC. I don’t think that holds anymore. But if -O3 were a good default in general (e.g. for a whole distro), I wonder why -O2 is still the conventional default.

3 Likes