Transparently Supporting PGO/FDO/PLO Optimized Builds

This thread is an early exploration of how to employ heavily optimized binaries that require multi-step build workflows while using Nix.

Goal

Support enabling use of such binaries with a single configuration option on NixOS, or with one of two attributes (profile versus optimize) on many derivations in nixpkgs. When enabled on NixOS, the user’s machine would automatically use the profiling builds first and switch the derivation to an optimized build once one is available.
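To make the intended UX concrete, here is a sketch of what the single NixOS option might look like. The option names are purely hypothetical; no such module exists today:

```nix
# Hypothetical NixOS configuration; option names are illustrative only.
{
  services.profileOptimizedBuilds = {
    enable = true;
    # Only these packages go through the profile -> optimize lifecycle.
    packages = [ pkgs.emacs pkgs.firefox ];
  };
}
```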

Implementation Sketch

The high-level problem is that profiling requires two distinct derivations. However, while the user would opt into the workflow, they would still declare only one thing installed. I believe the correct high-level model for NixOS is binary substitution: the derivation the user specifies does not change; the binary used to satisfy that specification does.

Runtime questions created:

  • How do we decide if we have the profiling data necessary to do substitution?
  • How do programs know where to output their profiling data?
  • How does the user know the current state?
  • How does substitution occur?

On NixOS

First, each profiling binary needs an environment variable telling it where to output its profile data. I do not think wrapping is the right approach, though I have not considered it thoroughly either. With Clang, it seems a pattern can be used so that one environment variable automatically covers all profiling binaries.
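For illustration, this is the kind of single setting I have in mind. The directory is an assumption; the `%m` and `%p` specifiers are real features of Clang’s instrumentation runtime, expanding per binary and per process so one variable gives every instrumented program an unambiguous output file:

```shell
# Sketch: one environment variable covering all instrumented binaries.
# %m expands to a per-binary profile signature, %p to the process ID,
# so profiles from different programs and runs never collide.
# The /var/lib path is an illustrative assumption.
export LLVM_PROFILE_FILE="/var/lib/pgo-profiles/%m-%p.profraw"
echo "$LLVM_PROFILE_FILE"
```

Any Clang-instrumented binary run in this environment then writes its raw profile under that directory without per-program wrappers.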

Second, we have to make the profile data available for building substitutes and activating them. On NixOS, I believe a daemon is the right approach: it would merge this data and place it into the Nix store.

When enough profile data is available, measured against a configurable threshold in megabytes, the daemon should build the target binary and either notify the user or perform some enabling operation automatically.
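A minimal sketch of that decision step, assuming raw profiles land in a fixed directory. `llvm-profdata` is the real LLVM tool for merging `.profraw` files; the directory, variable names, and threshold logic are illustrative:

```shell
# Hypothetical daemon step: merge raw profiles and check the result
# against a configurable size threshold before triggering a rebuild.
PROFILE_DIR="${PROFILE_DIR:-/var/lib/pgo-profiles}"
THRESHOLD_MB="${THRESHOLD_MB:-64}"

merge_profiles() {
  # Merge all raw profiles into one .profdata the optimized build consumes.
  llvm-profdata merge -output="$PROFILE_DIR/merged.profdata" "$PROFILE_DIR"/*.profraw
}

enough_profile_data() {
  # $1: size of the merged profile in bytes; compare against the MB threshold.
  [ $(( $1 / 1024 / 1024 )) -ge "$THRESHOLD_MB" ]
}

# Example: 128 MiB of merged profile data passes the default 64 MB threshold.
if enough_profile_data $((128 * 1024 * 1024)); then
  echo "ready to build optimized substitute"
fi
```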

In Nix

So, how do these nuts and bolts become expressions?

Specifying Use of Binaries

The user likely wants only key binaries to be optimized. Each derivation should have a convenient way to be marked as one that will be optimized. The optimization-capable derivation should further support being built in both its profile and optimize variants.
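As a sketch of the attribute shape this suggests, assuming hypothetical helper names (`withPGO`, `profileVariant`, `optimizeVariant`) that do not exist in nixpkgs:

```nix
# Illustrative only: all helper names here are hypothetical.
let
  emacsPGO = withPGO pkgs.emacs;
in {
  # Instrumented build that emits profile data at runtime.
  profiling = emacsPGO.profileVariant;
  # Build compiled against previously collected profile data.
  optimized = emacsPGO.optimizeVariant { profileData = ./merged.profdata; };
}
```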

Deriving Profile vs Optimize Builds

Since extra compiler flags are sufficient to convert many vanilla derivations into profile-capable ones, I believe overriding them with a modified stdenv and exposing that override as an attribute is acceptable. A stdenv adapter would also enable manually creating both variants of a derivation for testing and development.
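For example, nixpkgs already ships a `stdenvAdapters.withCFlags` adapter, which could derive the two variants from one package. The flags are the standard Clang PGO flags; the choice of `clangStdenv`, the package, and the profile path are illustrative:

```nix
# Sketch: derive profile and optimize variants via a stdenv adapter.
let
  profileStdenv  = pkgs.stdenvAdapters.withCFlags
    [ "-fprofile-generate" ] pkgs.clangStdenv;
  optimizeStdenv = pkgs.stdenvAdapters.withCFlags
    [ "-fprofile-use=${./merged.profdata}" ] pkgs.clangStdenv;
in {
  profiling = pkgs.hello.override { stdenv = profileStdenv; };
  optimized = pkgs.hello.override { stdenv = optimizeStdenv; };
}
```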

Selecting Optimized Builds

This should be done by simply supplying an extra argument to the nix CLI, or manually to the derivation, so that the optimized variant of the profile-capable derivation is built.

Driving the Work Forward

This work is of obvious commercial significance, both for enthusiasts and for “warehouse” applications. Given that, I believe that instead of waiting for myself or volunteers to write what is necessary, it may be both viable and significantly faster to gather interested parties to coordinate paying sizeable tips to developers who make progress on the design or implementation, so as to accelerate both.

As the viability grows, the chance to obtain a well-implemented result may attract other users to support the collective effort. This is a new model of commercial open source I’m presenting via https://prizeforge.com, where I am developing the means of coordinating the tips. In any case, I will be opening a stream for NixOS soon as I start attracting users.

I’m currently doing FDO/PLO on Emacs binaries to work out the basics of how the variant derivations actually differ and how to provide the data to the derivation source.


Got some FDO builds running. Was nice. Emacs benefits intensely from -Os with FDO, running around twice as fast as the defaults in the Nix community emacs-overlay.

For PLO, it actually looks like more work to get LLVM to have Propeller support. It is still out of tree and seems to need patching and some rearrangement to work with LLVM. I don’t want to just hack it in, but I also won’t invest in doing it properly, since I’m focused on the automation, not on making LLVM with Propeller support.


Before taking a break, I’m going to collect the documents that are making me understand this all:

The kernel uses AutoFDO because, understandably, we want to measure performance counters, not fully instrument kernels:

https://docs.kernel.org/dev-tools/propeller.html

I’m going to embrace this pathway for all binaries because it is lightweight and can be done with a vanilla binary. It is also a better fit because Propeller likewise consumes perf data. However, it seems three binaries are required in the end: vanilla, FDO, and FDO+Propeller. LTO can be applied to all or none, thin or fat, but should be kept consistent across variants.
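To make the AutoFDO pathway concrete, here is a dry-run outline (commands are echoed, not executed). It assumes Linux perf, the `create_llvm_prof` tool from google/autofdo, and an LBR-capable CPU; `./app` and `app.c` are placeholders:

```shell
# Dry-run sketch of the AutoFDO pipeline: each command is printed
# rather than run, since the real tools and hardware are assumed.
run() { echo "+ $*"; }

# 1. Sample a vanilla binary under a real workload. -b records branch
#    data; the exact perf event to sample is CPU-specific.
run perf record -b -o perf.data -- ./app

# 2. Convert the perf samples into an LLVM sample profile.
run create_llvm_prof --profile=perf.data --binary=./app --out=app.afdo

# 3. Rebuild using the sample profile; no instrumented build required.
run clang -O2 -fprofile-sample-use=app.afdo -o app app.c
```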

Building the bolt tools gives some idea of what using LLVM source will look like in nixpkgs:

This document best indicates what actual binaries are used when, and what they might depend on (WIP):

https://lists.llvm.org/pipermail/llvm-dev/2019-September/135393.html

Reconciling this with the bolt nix file provides some insights. It’s not 1:1. From the previous link:

Our system [propeller] leverages the existing capabilities of the compiler tool-chain and is not a stand alone tool.

This suggests that using Propeller depends more tightly on LLVM itself. Perhaps after I get it building, I will understand more, but what is concerning for now:

-DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt"

I will need to determine if this results in a different clang and lld or if those sources are only used to build the create_llvm_prof binary.

It seems most or all of the Propeller work has landed upstream, enough to just use the built-in support, which the LLVM in 25.05 already has.

I have a remaining bit of uncertainty around whether the basic block reordering can all be done during LTO (thin vs full?) or if the propeller tool in google/llvm-propeller still serves some unique purpose.

In any case, I decided that Propeller is a distraction and that I should just build the best binaries I can get from stock LLVM. What really must be figured out is collecting profile data and swapping derivations out.

At an extremely basic level, we could demonstrate binary substitution with -march flags: downloading the vanilla binary and swapping in the CPU-tuned binary when it becomes available.

The key point remains that the same declaration has to point to 2-4 different binaries and the files linked into place are dynamic. Only more preparation to implement will reveal if mutating a profile (re-linking new files into place) or generating a new profile (perhaps with the user in the loop) is the cleaner route.

Very important bit:

Because the profile is stale relative to the newest source, we have developed compiler techniques to tolerate staleness and deliver stable performance for the typical amount of code change between releases. On benchmarks, AutoFDO achieves 85% of the gains of traditional FDO.

Basically, once a user has some AutoFDO profile, they can reuse that profile data on the next release’s source code, and the compiler will do almost as good a job. That does indeed take a lot of the pain out of the lifecycle.

Another lifecycle improvement comes from using perf instead of instrumentation. With instrumented programs, we have to set LLVM_PROFILE_FILE, which would mean wrappers to create unambiguous output paths. With perf, we can use a daemon to monitor the programs of interest and collect performance data to a path derived from the derivation hash, preventing any unintended mixing of profile data.
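A toy illustration of keying profile output on the derivation hash. The store path below is a made-up, shortened placeholder; in a real daemon the path would come from observing which store binaries are running:

```shell
# Extract the hash prefix from a store path and use it to build a
# per-derivation profile directory, so data from different builds
# never mixes. "cafe1234" is a shortened placeholder hash.
store_path="/nix/store/cafe1234-emacs-31.0"
drv_hash="$(basename "$store_path" | cut -d- -f1)"
profile_dir="/var/lib/autofdo/$drv_hash"
echo "$profile_dir"
```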

I had a leftover time box to Nixify the propeller profile generator tool. What I found made me decide to leave this sub-task alone for now and hope someone more familiar with cmake finds Propeller to be useful.

I cannot use the tool until I upgrade from my zen2 machine anyway. We need zen3 at minimum (zen4 preferred) for the hardware support perf needs to record the profiles. This reminds me that I need zen3+ for AutoFDO as well.

The google/llvm-propeller repo appears to be one of the more annoying kinds of projects to nixify. Every cmake file is going to attempt to download dependencies, some of which are themselves not yet in nixpkgs:

https://github.com/google/llvm-propeller/blob/main/CMake/Quipper/Quipper.cmake

I’m quite unfamiliar with nixifying this kind of repo, and that is why I’m shoving Propeller onto the back burner for now.

Since AutoFDO and Propeller are all blocked on me getting a zen3+ machine, and since I think we should focus on AutoFDO, I guess I am completely blocked :no_good_woman:

do you have a branch up somewhere with your attempt to make propeller work? I’d like to give propellerizing emacs via nix a shot.

Hi! Very excited to learn about this project and about your persistent progress! Managing optimized builds within the “derivation model” is, intuitively, hard. At Nixpkgs we currently dismiss the challenge due to dubious expected benefits (e.g. Pre-RFC: Gradual Transition of NixOS x86_64 Baseline to x86-64-v3 with an Intermediate Step to x86-64-v2 - #59 by Atemu) and due to the complexity required to implement optimized builds without compromising “purity” (basic repeatability and being able to identify different builds, e.g. distinguish between builds with different profiles).

One trivial solution is to “rebuild the world”: override stdenv globally, injecting profile data as pure inputs, passing the appropriate compiler and linker flags, and setting requiredSystemFeatures to distinguish between derivations destined for sufficiently different hardware. This is an expensive solution, which, if I understand correctly, is what leads to your following points:

  • Quick check: do you mean two instantiations of the same expression, with a lag and different profile data inputs?
  • Question: do you mean the “nixlang expression does not change”, or the “.drv does not change”? In the latter case, do you suggest “impurely” overwriting (overlaying) the untampered derivation’s outPaths?

Optimized builds could well be a downstream application of the entire range of “late (delayed) binding” problems and efforts:


Yes.

the “nixlang expression does not change”

Yes. I believe a desirable user experience is to have one expression that evaluates to different derivations as data becomes available. The implementation of that UX may force many other decisions, but I think beginning with that UX as a design goal will subtly avoid future problems for most users.

Optimized builds could well be a downstream application of the entire range of “late (delayed) binding” problems and efforts

Agreed, it sounds similar. Reading about dynamic derivations might also apply to keeping evaluation clean.

Some facts on my mind:

  • One derivation per optimization step, depending on optimization data in the store, appears to be the only way to achieve reproducibility
  • The same optimized build will be distributed to many machines. For the “warehouse applications” Propeller was designed for, that is definitely the case.
  • While we can put profile content into the Nix store to gain reproducibility, we don’t know its path until the profile data already exists, and it doesn’t exist at all in some cases and may be stale in others.
  • During evaluation, we have to decide which branch of Nix evaluation to take, selecting the derivation before the profile data exists to tell us which store path it would occupy

All this makes me believe we will check for optimization data using a fetch-style solution, where we know what we are looking for but not what we will find or the store path that will result. Unfortunately, the fetch is impure, though not so impure that --impure should be the only way to support it.
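As a sketch of how that selection might read, assuming a hypothetical lookup helper `profileDataFor` and the variant attributes from earlier posts; a real design would have to keep this pure, e.g. via a fixed-output fetch or dynamic derivations:

```nix
# Illustrative only: profileDataFor, profileVariant, and
# optimizeVariant are hypothetical names.
let
  profile = profileDataFor pkg;   # null until the daemon has produced data
in
  if profile == null
  then pkg.profileVariant                          # no data yet: instrumented build
  else pkg.optimizeVariant { profileData = profile; }  # data found: optimized build
```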

do you have a branch up somewhere with your attempt to make propeller work? I’d like to give propellerizing emacs via nix a shot.

No. I did get instrumented profile working. AutoFDO should work similarly as long as you have a CPU that can support the necessary perf measurements. For propeller, as far as I can tell, we need to nixify google/llvm-propeller. This is the “profile conversion tool” mentioned in Linux propeller docs, which still links to the older google/autofdo repo.

Specific to Emacs, note that I used the IGC branch and -Os optimization. The performance is around double, which is very uncommon, but not unheard of for interpreters and runtimes.