Why is the nix-compiled Python slower?

alan · April 19, 2022, 10:40pm

Hello friends,

We’re trying to replace a custom-built toolchain package manager with nix. So far it’s going well, but we hit a pretty large roadblock with the performance of the Python interpreter that nix provides. On average, running the nix Python 3.8 interpreter results in a ~20% performance penalty for most operations.

I’ve narrowed this down to GCC or something that deals with the interpreter & its compilation. However, the compilation flags used for the nix Python and the Ubuntu Python are the same. Any guidance or insight into why the Python interpreter is slower with nix would be much appreciated.

With a little benchmark script, we see the difference more clearly:

Python 3.8.0 (via apt-get) + Ubuntu 18.04 + GCC 7.5.0

7.592594s (oct)                
 1.802671s (iter str)           
 1.616609s (list str)           
 1.426657s (map)                
 1.073226s (gen 1000000)        
11.625021s (gen 10000000)       
 1.694831s (small json dump)    
 0.945864s (small json load)    
16.688743s (big json dump)      
11.152032s (big json load)      
 0.357358s (small pickle dump)  
 0.384992s (small pickle load)  
 3.645764s (big pickle dump)    
 4.759644s (big pickle load)

Python 3.8.12 (via python38) + Nix 2.3.15 + Nixpkgs 21.11 + GCC 10.3.0

10.383724s (oct)
 2.246434s (iter str)
 2.022653s (list str)
 1.788243s (map)
 1.342692s (gen 1000000)
14.281956s (gen 10000000)
 2.212560s (small json dump)
 0.956572s (small json load)
21.813559s (big json dump)
11.343651s (big json load)
 0.325251s (small pickle dump)
 0.425216s (small pickle load)
 3.472391s (big pickle dump)
 5.079145s (big pickle load)
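
To put a number on the gap, the two result sets above can be compared test by test. This is only a sketch: it copies the timings from the two listings and prints the ratio for each benchmark. The names `ubuntu`, `nix`, and `slowdown` are made up for illustration.

```python
# Timings in seconds, copied from the two listings above.
ubuntu = {
    "oct": 7.592594, "iter str": 1.802671, "list str": 1.616609,
    "map": 1.426657, "gen 1000000": 1.073226, "gen 10000000": 11.625021,
    "small json dump": 1.694831, "small json load": 0.945864,
    "big json dump": 16.688743, "big json load": 11.152032,
    "small pickle dump": 0.357358, "small pickle load": 0.384992,
    "big pickle dump": 3.645764, "big pickle load": 4.759644,
}
nix = {
    "oct": 10.383724, "iter str": 2.246434, "list str": 2.022653,
    "map": 1.788243, "gen 1000000": 1.342692, "gen 10000000": 14.281956,
    "small json dump": 2.212560, "small json load": 0.956572,
    "big json dump": 21.813559, "big json load": 11.343651,
    "small pickle dump": 0.325251, "small pickle load": 0.425216,
    "big pickle dump": 3.472391, "big pickle load": 5.079145,
}


def slowdown(base, other):
    """Return {test: other / base}; values above 1.0 mean `other` is slower."""
    return {name: other[name] / base[name] for name in base}


ratios = slowdown(ubuntu, nix)
# Print the worst offenders first.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<18} {ratio:5.2f}x  ({(ratio - 1) * 100:+.1f}%)")

mean_ratio = sum(ratios.values()) / len(ratios)
print(f"mean ratio: {mean_ratio:.2f}x")
```

In these numbers the largest gaps are in oct and the JSON dumps, while the pickle dumps actually come out slightly faster under nix.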

This benchmark script runs a bunch of tests in a loop for built-in modules; there are no 3rd party imports yet.
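
The script itself is not shown in this thread, so the following is only a guess at its general shape: a minimal sketch that times a few standard-library operations with timeit and prints them in the same `seconds (label)` layout. The labels, statements, and iteration counts are assumptions, not the original benchmark.

```python
import timeit

# Hypothetical test cases: (label, statement, setup). Not the original script.
CASES = [
    ("oct", "for i in range(100): oct(i)", "pass"),
    ("small json dump", "json.dumps(data)",
     "import json; data = {'a': [1, 2, 3]}"),
    ("small pickle dump", "pickle.dumps(data)",
     "import pickle; data = {'a': [1, 2, 3]}"),
]


def run_cases(cases, number=10_000, repeat=3):
    """Time each case and return {label: best time in seconds}."""
    results = {}
    for label, stmt, setup in cases:
        timings = timeit.repeat(stmt, setup, number=number, repeat=repeat)
        # Keep the minimum: it is the sample least disturbed by other load.
        results[label] = min(timings)
    return results


if __name__ == "__main__":
    for label, seconds in run_cases(CASES).items():
        print(f"{seconds:10.6f}s ({label})")
```

Running the same file under both interpreters gives directly comparable listings, since only standard-library modules are involved.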

To isolate things further, I downloaded the official Python 3.8.13 source from python.org and compiled it both inside and outside a nix-shell. The results were the same, with the nix version being slower. This points to something in how nix compiles Python. You can reproduce this on an Ubuntu client with the following, which uses the same benchmark as (oct) above:

mkdir /tmp/py38; cd /tmp/py38
wget https://www.python.org/ftp/python/3.8.13/Python-3.8.13.tgz
tar -xvf Python-3.8.13.tgz
cd Python-3.8.13
./configure --enable-optimizations
make -s -j
./python -c "import timeit; print(timeit.Timer('for i in range(100): oct(i)', 'gc.enable()').repeat(5))"

On my machine, with an Intel® Xeon® Gold 5118 @ 2.30GHz, this results in an average of about 7.6-7.7s.
Then running the same test on nix:

cd /tmp/py38/Python-3.8.13
make clean
nix-shell --pure -I nixpkgs=http://nixos.org/channels/nixos-21.11/nixexprs.tar.xz -p stdenv
./configure --enable-optimizations
make -s -j
./python -c "import timeit; print(timeit.Timer('for i in range(100): oct(i)', 'gc.enable()').repeat(5))"

The results will be around 10.4s on average when doing this inside nix-shell.

This all points to something below the Python layer, maybe how GCC is invoked. We don’t really understand why arithmetic operations are impacted this much if both are compiled with all optimizations.

Thanks in advance, and if any info is needed / tests suggested, I’m happy to try!

knedlsepp · April 20, 2022, 6:33am

Maybe it’s related to the hardening flags? You could try disabling those for Python using hardeningDisable = [ "all" ];

See Nixpkgs 23.11 manual | Nix & NixOS

alan · April 20, 2022, 11:02pm

Thank you for the suggestion @knedlsepp, hardening definitely made a difference! While it is still slower, the no-hardening self-compiled Python38 is about 3% faster:

9.358782s (oct)              
2.140991s (iter str)
1.950351s (list str)         
1.742722s (map)              
1.369890s (gen 1000000)      
14.595787s (gen 10000000)     
2.128350s (small json dump)  
0.964804s (small json load)  
20.911116s (big json dump)    
11.364416s (big json load)    
0.328704s (small pickle dump)
0.387544s (small pickle load)
3.445015s (big pickle dump)  
4.884283s (big pickle load)

I realize that synthetic benchmarks are not always representative of actual performance, but it is concerning that the same bits behave differently inside & outside nix. I would’ve expected a newer compiler (GCC 7.5.0 vs. nix’s GCC 10.3.0) to be the same or better from a performance perspective.

Also, it’s worth noting that using nixpkgs.python38 versus compiling the code inside of nix-shell resulted in the same performance difference. Meaning:

nixpkgs.python38 = 10.4s
nix-shell + compiling python38 = 9.7s
nix-shell + hardeningDisable all + compiling python38 = 9.4s

Furthermore, profiling the math operations with Python’s cProfile module shows that essentially 100% of the execution time (as reported by timeit) is spent in the timed loop and its oct calls:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   20.459   20.459 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 <timeit-src>:2(<module>)
        1   10.813   10.813   20.455   20.455 <timeit-src>:2(inner)
        1    0.000    0.000    0.004    0.004 timeit.py:101(__init__)
        1    0.000    0.000   20.455   20.455 timeit.py:163(timeit)
        2    0.000    0.000    0.000    0.000 timeit.py:79(reindent)
        3    0.004    0.001    0.004    0.001 {built-in method builtins.compile}
      2/1    0.000    0.000   20.459   20.459 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.globals}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
100000000    9.642    0.000    9.642    0.000 {built-in method builtins.oct}
        1    0.000    0.000    0.000    0.000 {built-in method gc.disable}
        2    0.000    0.000    0.000    0.000 {built-in method gc.enable}
        1    0.000    0.000    0.000    0.000 {built-in method gc.isenabled}
        2    0.000    0.000    0.000    0.000 {built-in method time.perf_counter}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'format' of 'str' objects}
        2    0.000    0.000    0.000    0.000 {method 'replace' of 'str' objects}
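
A profile like the one above can be collected roughly as follows. This is a sketch, not necessarily how the table was produced: `profile_oct` is a hypothetical helper, and the iteration count is far smaller than timeit's default of 1,000,000 runs, which is what the 100000000 oct calls in the listing are consistent with.

```python
import cProfile
import pstats
import timeit


def profile_oct(number=10_000):
    """Profile the oct benchmark and return the resulting pstats.Stats."""
    timer = timeit.Timer("for i in range(100): oct(i)")
    profiler = cProfile.Profile()
    profiler.enable()
    timer.timeit(number=number)  # each run calls oct() 100 times
    profiler.disable()
    return pstats.Stats(profiler)


if __name__ == "__main__":
    stats = profile_oct()
    # Show the five entries with the most time spent in the function itself.
    stats.sort_stats("tottime").print_stats(5)
```

Note that the profiler adds overhead to every call, so absolute times under cProfile are higher than the plain timeit figures.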

Funnily enough, researching oct led me to the following benchmark script from CPython: https://github.com/python/cpython/blob/f4c03484da59049eb62a9bf7777b963e2267d187/Tools/scripts/var_access_benchmark.py

I’ll do some more testing with that.

knedlsepp · April 21, 2022, 5:24am

It could also be that the Ubuntu Python has some custom patches that improve performance.

See e.g. debian/patches/lto-link-flags.diff · master · Python Interpreter / python3 · GitLab

A very interesting read: All Pythons are slow, but some are faster than others

knedlsepp · April 21, 2022, 5:39am

It also seems we don’t build with profile-guided optimizations, for reproducibility reasons:

https://github.com/NixOS/nixpkgs/blob/d56ccde39ad2a5f439d980a27d43cddcde7f4f4b/pkgs/development/interpreters/python/default.nix#L230

            configd = null;
            tzdata = null;
            libffi = pkgs.libffiBoot; # without test suite
            stripConfig = true;
            stripIdlelib = true;
            stripTests = true;
            stripTkinter = true;
            rebuildBytecode = false;
            stripBytecode = true;
            includeSiteCustomize = false;
            enableOptimizations = false;
            enableLTO = false;
            mimetypesSupport = false;
          } // sources.python39)).overrideAttrs(old: {
            pname = "python3-minimal";
            meta = old.meta // {
              maintainers = [];
            };
          });
          
          
pypy27 = callPackage ./pypy {

https://github.com/NixOS/nixpkgs/blob/d56ccde39ad2a5f439d980a27d43cddcde7f4f4b/pkgs/development/interpreters/python/cpython/default.nix#L42

          , passthruFun
          , bash
          , stripConfig ? false
          , stripIdlelib ? false
          , stripTests ? false
          , stripTkinter ? false
          , rebuildBytecode ? true
          , stripBytecode ? true
          , includeSiteCustomize ? true
          , static ? stdenv.hostPlatform.isStatic
          , enableOptimizations ? false
          # enableNoSemanticInterposition is a subset of the enableOptimizations flag that doesn't harm reproducibility.
          # clang starts supporting `-fno-sematic-interposition` with version 10
          , enableNoSemanticInterposition ? (!stdenv.cc.isClang || (stdenv.cc.isClang && lib.versionAtLeast stdenv.cc.version "10"))
          # enableLTO is a subset of the enableOptimizations flag that doesn't harm reproducibility.
          # enabling LTO on 32bit arch causes downstream packages to fail when linking
          # enabling LTO on *-darwin causes python3 to fail when linking.
          , enableLTO ? stdenv.is64bit && stdenv.isLinux
          , reproducibleBuild ? false
          , pythonAttr ? "python${sourceVersion.major}${sourceVersion.minor}"
          }:

However, there currently seems to be an issue (with a fix available) with enabling optimizations: python3: Enabling optimisations as documented leads to duplicated packages when pulling in aiohttp or filelock · Issue #163639 · NixOS/nixpkgs · GitHub

Anyway this sounds very much like the root cause to me.

Edit: OK, it may not be the reason, since you already pointed out that you built both “by hand” with optimizations on.

FRidh · April 21, 2022, 7:03am

It is indeed slower because enabling all optimizations affects reproducibility.

wkral · April 21, 2022, 5:16pm

Is it possible that the baseline tests, since they’re run with Ubuntu 18.04, do not include Spectre/Meltdown mitigations? It probably doesn’t account for the full 20% but might be a contributor.

alan · April 21, 2022, 7:32pm

Thank you all for the info; I believe I have figured out the performance difference. TL;DR - I believe that the optimization flags explain the difference between the nixpkgs Python and the one I compiled within a nix-shell. Specifically --enable-optimizations --with-lto when compiling Python.

I think the article about Python performance posted by @knedlsepp definitely highlighted some potential things we could do. First, while investigating the PGO & LTO optimizations, I had a thought: is the libc that Nix provides also lacking optimizations? I did a test by compiling Python outside of a nix-shell, but using the same GCC (and by extension the same libc?), and the results were the same as within a nix-shell.

Digging deeper, using something like libc-bench (https://git.musl-libc.org/cgit/libc-bench/), and compiling that inside and outside a nix-shell provided no insights. The performance, on average, was the same between the two; and if anything GCC 10.3 (nix) was slightly faster in some cases than GCC 7.5 (ubuntu). Therefore, libc / gcc are probably not suspects in this issue.

Next, a few other bugs/issues/forum threads led me to test with Python 3.10 and 3.8, and I found that compiling Python 3.8 inside nix-shell with --enable-optimizations did not actually use PGO or LTO by default, but 3.10 did use PGO.

I erroneously assumed that LTO and PGO would be enabled with the --enable-optimizations flag, but that is not true: according to https://bugs.python.org/issue28032#msg275182, you have to specify LTO directly.

However, manually specifying ./configure --enable-optimizations --with-lto for compilation resulted in nearly the same performance as Ubuntu’s, within 3%. Additionally, I did not investigate the PGO difference I observed with 3.8 vs 3.10, but it’s probably an environment issue or user error.
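
One way to check what a given interpreter was actually built with is to ask it. The sketch below assumes a POSIX build, where sysconfig exposes the configure arguments as CONFIG_ARGS. The helper name `build_flags` and the list of flag variables are illustrative choices, and which variables carry `-flto` varies between Python versions and distributions, so a negative result is not proof that LTO was off.

```python
import sysconfig

# Build variables that may carry compiler/linker flags; which ones are
# populated differs between Python versions and distributions (assumption).
FLAG_VARS = ("CFLAGS", "PY_CFLAGS", "PY_CFLAGS_NODIST",
             "LDFLAGS", "PY_LDFLAGS", "PY_LDFLAGS_NODIST")


def build_flags():
    """Summarise how the running interpreter was configured."""
    config_args = sysconfig.get_config_var("CONFIG_ARGS") or ""
    flags = " ".join(sysconfig.get_config_var(name) or "" for name in FLAG_VARS)
    return {
        "config_args": config_args,
        "enable_optimizations": "--enable-optimizations" in config_args,
        "with_lto": "--with-lto" in config_args,
        "flto_in_flags": "-flto" in flags,
    }


if __name__ == "__main__":
    for key, value in build_flags().items():
        print(f"{key}: {value}")
```

Running this with each interpreter under test makes it easy to spot when two builds that were assumed to be configured identically are not.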

I can now get similar numbers inside & outside a nix-shell when I compile Python. I appreciate all the help thus far! I’m going to run some tests to get real world numbers before continuing to dig deeper. We may be OK with the performance penalty that the nixpkgs.python38 incurs due to the lack of optimizations. At least we understand why!

@wkral: Ubuntu 18.04 does have Spectre & Meltdown mitigations enabled on the more recent kernels, which we are using. Enabling the mitigations did indeed cost us about 20% in performance.
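
Whether mitigations are active on a given machine can be read from sysfs on reasonably recent Linux kernels. A minimal sketch, assuming the standard /sys/devices/system/cpu/vulnerabilities directory exists; `mitigation_status` is a hypothetical helper name:

```python
from pathlib import Path

VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def mitigation_status(vuln_dir=VULN_DIR):
    """Return {vulnerability: kernel status string}; empty if unavailable."""
    if not vuln_dir.is_dir():
        return {}
    status = {}
    for entry in sorted(vuln_dir.iterdir()):
        try:
            status[entry.name] = entry.read_text().strip()
        except OSError:
            status[entry.name] = "unreadable"
    return status


if __name__ == "__main__":
    report = mitigation_status()
    if not report:
        print("no vulnerability information exposed by this kernel")
    for name, state in report.items():
        print(f"{name}: {state}")
```

Comparing this output between the baseline machine and the nix machine rules mitigations in or out as a factor before looking at compiler flags.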

knedlsepp · April 21, 2022, 7:53pm

“We may be OK with the performance penalty that the nixpkgs.python38 incurs due to the lack of optimizations. At least we understand why!”

No need to live with that! Since @FRidh merged python3: fix overriding of interpreters, closes #163639 · NixOS/nixpkgs@ba02fd0 · GitHub, you can again use what is described in the manual. Instead of simply using python38 in your expressions, you “just” have to use python38.override { enableOptimizations = true; reproducibleBuild = false; ... }, as described in “15.25.2.2. Optimizations” of the nixpkgs manual:

https://nixos.org/manual/nixpkgs/stable/#reference

knedlsepp · April 21, 2022, 8:05pm

Here’s an example default.nix that you can put in your project and use nix-shell.

{ nixpkgs ? (builtins.fetchTarball {
    url = "https://github.com/NixOS/nixpkgs/archive/ba02fd0434ed92b7335f17c97af689b9db1413e0.tar.gz";
    sha256 = "1pjx78qb3k4cjkbwiw9v0wd545h48fj4criazijwds53l0q4dzn1";
  })
, src ? builtins.fetchGit ./.
}:
let
  pkgs = import nixpkgs { };
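  python38Optimized = pkgs.python38.override {
    # Turn on the optimized build (off by default in nixpkgs for reproducibility).
    enableOptimizations = true;
    reproducibleBuild = false;
    # Point self at the overridden interpreter so passthru attributes use it too.
    self = python38Optimized;
  };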
  pyPkgs = python38Optimized.pkgs;
in
python38Optimized.pkgs.buildPythonPackage rec {
  name = "python-example";
  inherit src;
  propagatedBuildInputs = with pyPkgs; [
    numpy
  ];
  nativeBuildInputs = with pyPkgs; pkgs.lib.optionals (pkgs.lib.inNixShell) [
    ipython
  ];
}

Notes:

The expression requires the directory to be versioned with git
Fetch a hot beverage while it compiles Python 3.8 for you on the first launch

alexv · April 22, 2022, 12:23am

Would self =... be needed if I create an overlay and override the original attribute, say

python39 = super.python39.override {
    enableOptimizations = true;
    reproducibleBuild = false;
};

?

FRidh · April 22, 2022, 6:23am

self is passed on to some of the passthru attributes such as buildEnv and withPackages. So if you need those, then the answer is yes. Best to just follow the manual.

infogulch · April 23, 2022, 7:35pm

There seems to be an interesting history of the tension between enabling python optimizations and providing reproducibility.

https://github.com/NixOS/nixpkgs/pull/84072

https://github.com/NixOS/nixpkgs/pull/107965

As someone unfamiliar, I’m curious why enabling optimizations makes the builds nondeterministic. Is this true generally, or just for Python? Does the optimizer inject randomness directly? Does it use a Monte Carlo simulation for optimization or something similar, and could it accept a pre-selected seed for a PRNG instead?

FRidh · April 24, 2022, 2:29pm

--enable-optimizations also enables profile-guided optimizations, and it is my understanding that generally speaking it is difficult to get such builds to be reproducible. I have no idea why though.

Other than that flag, the Python builds are not yet reproducible because of Issue 34093: Reproducible pyc: FLAG_REF is not stable. - Python tracker.

alan · April 26, 2022, 3:46pm

Following up on the real world performance difference vs. synthetic benchmarks! Running a set of unit tests (our test runner is Python based) and linting our entire codebase (~6100 Python files) is slower, but not dramatically so.

Some of the linters (e.g. the autoformatter) are about 25% slower, but others are faster. Overall, the time to lint the entire codebase went from 2m55s to 3m05s. The actual time as reported by time went from 3m19s to 3m33s. Overall, not a show-stopping performance issue.
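
For reference, those wall-clock figures can be turned into percentages with a few lines. This is only a sketch of the arithmetic; `to_seconds` and `percent_slower` are hypothetical helpers, and the durations are the ones quoted above.

```python
import re


def to_seconds(text):
    """Convert a duration such as '2m55s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + float(seconds)


def percent_slower(before, after):
    """Relative increase from `before` to `after`, in percent."""
    base = to_seconds(before)
    return (to_seconds(after) - base) / base * 100


print(f"lint total: {percent_slower('2m55s', '3m05s'):.1f}% slower")
print(f"wall clock: {percent_slower('3m19s', '3m33s'):.1f}% slower")
```

Both work out to single-digit percentages, well below the gap seen in the synthetic oct benchmark.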

I think the end result of this is that if you want the absolute best Python performance you can do:

Update to a newer Python version. There’s a significant difference between 3.6 → 3.8 → 3.10.
Use --enable-optimizations and --with-lto when compiling the Python interpreter.
Disable hardening on your nix derivation. Hardening deliberately disables a few optimizations, although the performance gain from disabling it was minimal, coming in at ~3%.

Leaving things as is will be fine for now; it means not having to rebuild the interpreter on install. Really appreciate everyone’s help!

TLATER · April 26, 2022, 3:51pm

Maybe consider adding that to https://nixos.wiki/wiki/Python, this is very good science. I’m sure someone will ask this question again, and the machine learning folks might actually care about ~5%. Wouldn’t want this to be buried deep inside discourse

There’s also an “Optimizations” heading in the nixpkgs manual, but I guess that has as much detail as is relevant for a manual?

alan · April 26, 2022, 3:53pm

Good suggestion; I’ll look into doing that!

alan · April 29, 2022, 6:42pm

Done: Python - NixOS Wiki

Feel free to edit, give feedback, etc.

sternenseemann · May 6, 2022, 3:39pm

In principle PGO is reproducible, since the profiles contain no timing information and are not machine/CPU-specific. As long as you make sure that a) the software being profiled is deterministic and b) the input used for profiling is always the same, the profiles and compilation output should be reproducible as well.

Common culprits for reproducibility problems when using PGO are random inputs and (I assume at least) race conditions in the build system.

If Python is indeed not reproducible with PGO enabled, likely the build system or profiling instrumentation is the culprit for this, not PGO itself. FWIW gcc has similar issues concerning PGO and reproducibility, but it should be possible to solve them for gcc and python (probably requires patching). A derivation that uses PGO in nixpkgs and is reproducible is foot at this time.

danieldk · December 19, 2022, 2:35pm

We are bumping into this issue as well, but I see up to a 40% slowdown on a GPU-driven load (basically Amdahl’s law in effect: the less we spend in single-threaded Python, the better). We can override Python to use optimizations, but this is quite inconvenient, since we’d need a binary cache to avoid build times. Also, a handful of derivations fail to build, probably because our builds have more idle CPU cores available, so we hit more concurrency issues in Python tests (hope to provide some PRs later).