Pre-RFC: Gradual Transition of NixOS x86_64 Baseline to x86-64-v3 with an Intermediate Step to x86-64-v2

TL;DR Ubuntu is looking into this topic as well and has come to the same conclusion as we have: There is no reliable data on the performance benefit and we need actual real-world people to test their workloads on real systems to find out.

My router is a firewall appliance with a Celeron J3060 running NixOS. I could set up a builder if the change is made, but I’d really prefer not to have to do that.

Phoronix has a decent set of benchmarks on Ubuntu’s experimental x86_64-v3 ISO:

Summary:

The Ubuntu x86-64-v3 performance benefits overall were typically small but consistent. In some workloads the x86-64-v3 applications obtained from the archive could be a great deal faster but ultimately it comes down to a subset of software that will really benefit.

Now Fedora is looking into providing optimized packages too.

RHEL 10 will also go for it, apparently: https://developers.redhat.com/articles/2024/01/02/exploring-x86-64-v3-red-hat-enterprise-linux-10

But note that distros with very long support cycles (5–10 years) can afford to be more aggressive about this than e.g. NixOS/Nixpkgs: their existing releases keep shipping compatible binaries for years, whereas with NixOS you’d be left without compatible binaries on any maintained version after just several months.

From the Fedora 40 announcement:

Systemd will be modified to insert the additional directories into the $PATH environment variable (affecting all programs on the system) and the equivalent internal mechanism in systemd (affecting what executables are used by services). Individual packages can provide optimized libraries via the glibc-hwcaps mechanism and optimized executables via the extended search path.

This is an interesting workaround for the “doesn’t work with static executables” limitation of glibc-hwcaps that we should be able to trivially replicate.
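A rough sketch of the lookup order this implies, in Python for illustration only. glibc searches the `glibc-hwcaps/x86-64-vN` subdirectories of each library directory from the highest supported level downwards before falling back to the plain directory; applying the same layout to every `$PATH` entry is an assumption made here, not necessarily what Fedora or systemd do. The function names are hypothetical.

```python
import posixpath

# Subdirectory names used by glibc-hwcaps on x86_64, lowest level first.
LEVELS = ("x86-64-v2", "x86-64-v3", "x86-64-v4")


def hwcaps_search_dirs(base_dir, supported_level):
    """Return the directories to try, most optimized first.

    supported_level is 1 (plain x86-64) to 4 (x86-64-v4).
    """
    usable = LEVELS[: max(supported_level - 1, 0)]
    dirs = [posixpath.join(base_dir, "glibc-hwcaps", name)
            for name in reversed(usable)]
    dirs.append(base_dir)  # the baseline build is always the last fallback
    return dirs


def extended_path(path, supported_level):
    """Expand every PATH entry the same way (assumed layout for executables)."""
    entries = []
    for entry in path.split(":"):
        entries.extend(hwcaps_search_dirs(entry, supported_level))
    return ":".join(entries)


if __name__ == "__main__":
    print(extended_path("/usr/bin", 3))
```

On a v3-capable machine the v3 directory is tried first, then v2, then the unoptimized baseline, so packages that ship no optimized variant keep working unchanged.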

tbf we could also do that but just announce the changes now and implement them 5 years later (e.g. announcing a v2 baseline this year and then actually implementing it in 2029)

I’m afraid I fail to see an advantage in announcing such a thing years in advance.

Providing specific packages that make good use of the optimizations would be valuable for saving build time and storage space. However, being Nix, it’s already possible to build packages for specific CPU micro-architectures. This prevents using the cache as would be expected, but it’s also just really confusing to use, and depending on how you do it, it could involve rebuilding everything all the way down to the bootstrap tools just to optimize one package.

Before changing the default, I’d suggest we make it easier for users to select which micro-architecture or level to build for and to apply that to individual packages. If the package in question is a library, then dependent packages could use it with overlays, though this would also require rebuilding those dependent packages.

Another advantage we have being Nix is that changing the default does not mean completely dropping support. Much like how you can already set a micro-architecture higher than the baseline, increasing the baseline should still allow using older hardware at the cost of specifying so explicitly and less ability to use the caches. Assuming we can actually make setting the micro-architecture simple enough so people would actually understand how to do it, we could just say to do so in case one has older hardware. They’d need to compile manually, but it should be possible for Hydra to build at least stdenv for multiple micro-architectures, even if not all of nixpkgs.

TLDR: Increasing the baseline isn’t a compatibility problem for us like it is with other distros, but rather a documentation problem.

Also, Nixpkgs will simply deny v4 optimizations when using GCC 12, which is the default on unstable right now. Staging has updated to GCC 13, but it could be some time before that propagates to other channels.

There’s rarely such a thing as “optimising one package”. The vast majority of packages use libraries for much (if not most) of their work. This means these libraries also influence the package’s performance.

I.e. optimising ffmpeg would likely gain very little (if any) vorbis encoding performance because ffmpeg uses a dedicated library for the performance-critical part of that.

*no ability to use the caches.

You likely wouldn’t even be able to boot an installer ISO on an unsupported CPU.

Raising the default baseline will exclude the majority of users of then-unsupported hardware. There’s no way around it.

Just the stdenv wouldn’t really help anyone. If you’re going to be building your average desktop closure, having the stdenv cached or not doesn’t make a huge difference.

Suffice to say, this is not a feasible option if you want users of then-unsupported hardware to still be able to reasonably use NixOS. Especially considering their hardware likely being highly unsuited for compiling modern software.

If anyone should be building their own packages, it’s the users of powerful modern CPUs who need the handful of % mean improvement for …what reason exactly?


Anyhow, I think it’s reasonably well established that:

  • There is a point to this. With v3, the increase is small and highly package-dependent but it’s there, it’s real and likely desirable, especially for those few packages.
  • There are many users who would be effectively excluded from using the distro were the baseline raised to v3 and even some who would be excluded by v2.
  • We can have our cake and eat it too at the expense of a closure size increase (how much?) using glibc-hwcaps in addition to the $PATH workaround for the executables themselves.

The next step is to build code infrastructure for glibc-hwcaps usage and use it on the packages and libraries that benefit.
Until that is done and well established, I think we can pause any discussion on raising the baseline.

Would this change only apply to Linux or also include Darwin?

Darwin has two levels of support:

  • x86_64: Implies SSE3 but not necessarily SSE4.2.
  • x86_64h: This appears equivalent to x86-64-v3. It targets Haswell CPUs or newer.

The only caveat is that Rosetta 2 does not support AVX, so it can’t really benefit from x86_64h. However, the most likely approach would be to build fat binaries supporting both architectures, so it could continue to run x86_64 code. And if you are really concerned about performance on an aarch64 Mac, you should be using native code anyway.

(Whether it’s worth the effort is a separate question. My inclination is to say it’s not.)

Seems SerpentOS is going to use a custom superset of v3 called v3x.

Can we add an option, perhaps to stdenv?, that selects the x86_64 baseline? Then we can have multiple versions of each package with different baselines in the cache, and let the user choose based on their hardware.

The setting exists, gcc.arch = "x86-64-v2" (note the hyphens: the value follows GCC’s -march naming). Caching multiple architecture levels requires a lot of additional cache space and build farm time.
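For picking a level based on the hardware, here is a minimal Python sketch that derives the highest x86-64 level from the `flags` line of `/proc/cpuinfo` on Linux. The flag sets follow the psABI level definitions using the kernel's flag spellings (`pni` for SSE3, `abm` for LZCNT, `xsave` standing in for OSXSAVE); the function names are made up for this example and it is not part of Nixpkgs.

```python
# Flags as spelled in /proc/cpuinfo, grouped by the level that requires them.
BASELINE = {"lm", "cmov", "cx8", "fpu", "fxsr", "mmx", "syscall", "sse2"}
V2 = {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"}
V3 = {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "abm", "movbe", "xsave"}
V4 = {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"}


def x86_64_level(flags):
    """Return 0 (not even baseline x86-64) up to 4 (x86-64-v4)."""
    level = 0
    for required in (BASELINE, V2, V3, V4):
        if not required <= flags:
            break  # levels are cumulative, so stop at the first gap
        level += 1
    return level


def read_cpu_flags(path="/proc/cpuinfo"):
    """Read the flag set of the first CPU listed (Linux only)."""
    with open(path) as cpuinfo:
        for line in cpuinfo:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()


if __name__ == "__main__":
    level = x86_64_level(read_cpu_flags())
    print("x86-64" if level <= 1 else f"x86-64-v{level}")
```

A CPU with SSE4.2 and POPCNT but no AVX would come out as v2 here, which is exactly the class of hardware a v3 baseline would exclude.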

Considering Nix is source-based, wouldn’t it be possible to still support older microarchitectures by just not using the cache?

Older microarchitectures have less processing power for compiling software. Therefore they need cached builds the most.

See discussion in Pre-RFC: Gradual Transition of NixOS x86_64 Baseline to x86-64-v3 with an Intermediate Step to x86-64-v2 - #51 by Atemu

It appears I must retract a statement of mine again in light of new data.

I’ve recently been reminded of a talk I saw a few years ago where evidence was presented that LLVM 3.1’s -O3 shows no significant improvement over -O2 if you continuously randomise the memory layout.

-O2 vs -O3 is not the topic of this thread of course but it’s closely related. An important detail I had forgotten was just how significant the measurement bias caused by memory layout is and how easily it’s caused.

In short: Even extremely minor changes such as a different length of UNIX user name or different working directory can cause very significant changes in performance of the same binary.
If @nyanbinary and I ran the same executable on the same hardware and software environment, we could realistically see something on the order of a 10% performance difference just because my UNIX user name is shorter.

This is a huge issue because the slightly different memory layout due to slightly different code generation (i.e. different -march) can cause significant performance differentials even if that slightly different code itself causes no significant difference at all.

The paper which first reported on this issue claims variance of upwards of 20% in industry standard benchmarks (though usually lower) just due to such measurement biases.

Any experiment which shows a significant but not immense increase by changing the compiler optimisation µ-arch target could be fully explained by just changes in memory layout alone. Something that looks like a slight increase could be a slight regression that is masked by the measurement bias in actuality.
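To get a feel for how sensitive a given benchmark is to this, one can time the same command while only varying the size of the environment, which is one of the biases the paper describes. The following Python sketch does that; `BENCH_PADDING` is an arbitrary variable name chosen here, and the padding sizes are guesses. It only exposes the sensitivity, it does not control for it the way Stabilizer does.

```python
import os
import statistics
import subprocess
import sys
import time


def time_command(cmd, env_padding, repeats=5):
    """Median wall-clock time of cmd with env_padding extra bytes of environment."""
    env = dict(os.environ)
    env["BENCH_PADDING"] = "x" * env_padding  # shifts the initial stack layout
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, env=env, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def layout_sensitivity(cmd, paddings=(0, 64, 512, 4096), repeats=5):
    """Return per-padding medians and the relative spread between them."""
    medians = {p: time_command(cmd, p, repeats) for p in paddings}
    fastest = min(medians.values())
    slowest = max(medians.values())
    return medians, (slowest - fastest) / fastest


if __name__ == "__main__":
    medians, spread = layout_sensitivity(sys.argv[1:] or [sys.executable, "-c", "pass"])
    for padding, seconds in medians.items():
        print(f"padding={padding:5d}  median={seconds:.4f}s")
    print(f"relative spread: {spread:.1%}")
```

If the spread between paddings is of the same order as the difference measured between two -march targets, that difference says little about the code generation itself. Wall-clock noise from process startup also ends up in the spread, so short-running commands need many repeats.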

The authors of the original paper do propose a method to control for some of the biases and have shown that a sufficiently diverse set of benchmarks can mitigate the measurement bias to some degree.
The presenter of the talk I watched, who is also an author of the paper it’s based upon, has developed an effective technique to control this bias called Stabilizer. It continuously re-randomises the memory layout at runtime, down to the placement of individual functions, so that their position can no longer systematically impact performance.
Sadly the original project has not been developed further and is not compatible with modern clang or gcc. There are two forks which have attempted to support newer versions of clang (12, 16) but they do not appear to be very mature.

Stabilizer talk recording on Youtube: https://youtu.be/r-TLSBdHe1A (well worth watching)
Stabilizer paper: https://people.cs.umass.edu/~emery/pubs/stabilizer-asplos13.pdf
“Producing Wrong Data Without Doing Anything Obviously Wrong!”: https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf


Given how none of the data presented so far even attempts to control for this bias, I am no longer sure this is worth pursuing at all, even in the “lite” form of glibc-hwcaps.

I now also need to question the majority of benchmarks of software I see anywhere because, to my knowledge, none of those control for this bias either.

I decided to try and rebuild my laptop with various arch flags and found a lot of build failures, such as libvorbis segfaulting on a zen4 arch, x509 certificate validators failing on x86_64-v3 and more. I think having a parallel build pipeline for at least a core subset of packages for different architectures would be useful to shake out these failures early.

There are a lot of derivations in the store that don’t care what platform they’re running on (think pure-Python, pure-Ruby, npm packages) but get an implied dependency on the architecture because they’re force-built against a particular version of the host interpreter just in case they rely on a C lib, which itself is arch-dependent.

The design of Nix makes this really hard to do, but surely there’s got to be a way to split these derivations out into architecture-agnostic package-sets. Perhaps content-addressed packages may provide a way to collapse truly-arch-independent packages down automatically, by collapsing N different derivations that all build to the same set of files into a single output?
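The collapsing idea can be illustrated with a toy model in Python: hash each build output by its file contents and group derivations whose outputs are byte-identical. This is only a sketch of the principle. Real content-addressed derivations in Nix hash a NAR serialisation and have to deal with self-references, and the derivation names below are invented.

```python
import hashlib
from collections import defaultdict


def tree_hash(files):
    """Hash a {relative_path: bytes} mapping independent of insertion order."""
    digest = hashlib.sha256()
    for path in sorted(files):
        name = path.encode()
        data = files[path]
        # Length prefixes keep (path, data) boundaries unambiguous.
        digest.update(len(name).to_bytes(8, "big") + name)
        digest.update(len(data).to_bytes(8, "big") + data)
    return digest.hexdigest()


def collapse_by_content(outputs):
    """Group derivation names whose outputs are byte-for-byte identical."""
    groups = defaultdict(list)
    for name, files in outputs.items():
        groups[tree_hash(files)].append(name)
    return dict(groups)


if __name__ == "__main__":
    outputs = {
        "purepkg-x86-64-v2": {"lib/mod.py": b"print('hi')\n"},
        "purepkg-x86-64-v3": {"lib/mod.py": b"print('hi')\n"},
        "native-x86-64-v3": {"lib/mod.so": b"\x7fELF...avx2"},
    }
    for digest, names in collapse_by_content(outputs).items():
        print(digest[:12], names)
```

In this model the two pure-Python builds collapse into one stored output, while the native library stays separate. Each variant still has to be built (or substituted) once before its output hash is known.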
