Tensorflow slower as NixOS native than inside a Docker container

Hey all! I’m trying to make our data science infrastructure more pure. We have a Tensorflow project that does some computation. At the moment, the computation is done inside a Docker container. I’d like to do the computation natively on NixOS so I can get rid of Docker. The problem is that the computation is about 10% slower natively than inside the Docker container. I can’t figure out why, and am looking for ideas on what else to try. I’m doing the testing on a single g4dn.xlarge AWS instance.
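
For reference, I compare the two environments by simply timing the script (hypothetical invocation; sample.py is the script whose logs appear below):

$ time python /home/test/sample.py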

NixOS environment

Tensorflow is installed in NixOS like so:

{ config, pkgs, lib, ... }:
let

  nixos1903 = fetchTarball {
    url = "https://github.com/nixos/nixpkgs/archive/34c7eb7545d155cc5b6f499b23a7cb1c96ab4d59.tar.gz";
    sha256 = "11z6ajj108fy2q5g8y4higlcaqncrbjm3dnv17pvif6avagw4mcb";
  };

  olderPkgs = import nixos1903 {
    inherit (config.nixpkgs) config;
    overlays = [(self: super: {
      linuxPackages = config.boot.kernelPackages;
      inherit (pkgs) cudatoolkit cudnn;
    })];
  };

  python = olderPkgs.python36.withPackages (ps: with ps; [
    (fire.overrideAttrs (old: { doInstallCheck = false; }))
    regex
    requests
    tqdm
    numpy
    tensorflowWithCuda
  ]);

in
{
  system.stateVersion = "20.03";
  ec2.hvm = true;

  imports = [
    <nixpkgs/nixos/modules/virtualisation/amazon-image.nix>
    # [... snip ... ]
  ];

  nixpkgs.config.allowUnfree = true;
  services.xserver.videoDrivers = [ "nvidia" ];
  hardware.opengl.enable = true;
  environment.systemPackages = with pkgs; [ python ];

  # [ ... snip ...]
}

Versions of packages:

$ python --version
Python 3.6.9
$ ls /nix/store/35vk801xg9yv5crbsh716w4h3xjwapkb-python3-3.6.9-env/lib/python3.6/site-packages/ | grep "tensorflow\|numpy"
numpy
numpy-1.16.1.dist-info
tensorflow
tensorflow_estimator
tensorflow_estimator-1.13.0.dist-info
tensorflow_gpu-1.13.1.dist-info

Output of running the computation:

2020-11-11 15:42:58.362216: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-11-11 15:42:58.489764: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-11 15:42:58.490740: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x175bd80 executing computations on platform CUDA. Devices:
2020-11-11 15:42:58.490775: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2020-11-11 15:42:58.492574: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2020-11-11 15:42:58.492744: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x2b97ed0 executing computations on platform Host. Devices:
2020-11-11 15:42:58.492765: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-11-11 15:42:58.492868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
totalMemory: 14.75GiB freeMemory: 14.66GiB
2020-11-11 15:42:58.492889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-11-11 15:42:58.493876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-11 15:42:58.493893: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2020-11-11 15:42:58.493906: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2020-11-11 15:42:58.493969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14257 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
WARNING:tensorflow:From /nix/store/35vk801xg9yv5crbsh716w4h3xjwapkb-python3-3.6.9-env/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /home/test/sample.py:55: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /home/test/sample.py:57: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
WARNING:tensorflow:From /nix/store/35vk801xg9yv5crbsh716w4h3xjwapkb-python3-3.6.9-env/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2020-11-11 15:43:58.489199: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally

Docker environment

I run the tensorflow/tensorflow:1.13.2-gpu-py3 container with the --gpus flag on the same NixOS machine. Docker with NVIDIA support is enabled by adding the following to configuration.nix:

virtualisation.docker.enable = true;
virtualisation.docker.enableNvidia = true;
hardware.opengl.driSupport32Bit = true;
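
The container itself is started along these lines (a sketch; the exact mount and script path are assumptions based on the logs below):

$ docker run --gpus all -v "$PWD":/home/test \
    tensorflow/tensorflow:1.13.2-gpu-py3 \
    python /home/test/sample.py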

Versions of packages:

root@e412bbf7b3c7:/# python --version
Python 3.6.8
root@e412bbf7b3c7:/# pip list | grep "tensorflow\|numpy"
numpy                1.16.4 
tensorflow-estimator 1.13.0 
tensorflow-gpu       1.13.2 

Output of running the computation:

2020-11-11 16:12:05.715050: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-11-11 16:12:05.851874: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-11 16:12:05.854790: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3768f80 executing computations on platform CUDA. Devices:
2020-11-11 16:12:05.854820: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2020-11-11 16:12:05.883564: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2020-11-11 16:12:05.883836: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3cbb870 executing computations on platform Host. Devices:
2020-11-11 16:12:05.883858: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-11-11 16:12:05.884004: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
totalMemory: 14.75GiB freeMemory: 14.66GiB
2020-11-11 16:12:05.884029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-11-11 16:12:05.886038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-11 16:12:05.886072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2020-11-11 16:12:05.886089: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2020-11-11 16:12:05.886182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14257 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /home/test/sample.py:55: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /home/test/sample.py:57: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2020-11-11 16:12:44.117649: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally

What I have tried so far

  • Using the 20.09 channel for everything except Tensorflow → no difference
  • Using a newer Python 3.7 → no difference
  • Using a more recent channel for Tensorflow, to get the 1.14 or 1.15 release of Tensorflow → computation even slower
  • Diffing the outputs in the hope of finding something actionable → no luck

My first intuition is that the Tensorflow packaged in nixpkgs isn’t built with any CPU extensions.

You may want to try:

  (python3Packages.tensorflow.override {
    sse42Support = true;
    avx2Support = true;
    fmaSupport = true;
  })

This will cause a rebuild, but the final build output should use those extensions.
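
For example, as a minimal shell.nix sketch (the Python wrapper and CUDA variant here are placeholders; adjust to your pin):

  with import <nixpkgs> { config.allowUnfree = true; };
  mkShell {
    buildInputs = [
      (python3.withPackages (ps: [
        # same override flags as above, applied to the CUDA variant
        (ps.tensorflowWithCuda.override {
          sse42Support = true;
          avx2Support = true;
          fmaSupport = true;
        })
      ]))
    ];
  }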

2 Likes

The Docker version doesn’t use them either, according to the cpu_feature_guard line in its output.

Upstream compiled 1.13.2 with CUDA capabilities 3.5 and 7.0, while we compile Tensorflow 1 with capabilities 3.5 and 5.2.

The T4 has CUDA capability 7.5, so most likely you are benefiting from the higher CUDA capability (which enables newer GPU features) in the upstream/Docker build. You can try to compile Tensorflow with that capability, using something like:

  (python3Packages.tensorflow.override {
    cudaCapabilities = [ "3.5" "7.5" ];
  })

(If that old checkout’s CUDA supports that capability.)
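
To double-check which compute capability the runtime sees on the device side, something like this works on TF 1.x (this confirms the device’s capability; whether the binary actually ships kernels for it is determined by the build flags above):

  # TF 1.x: print each GPU's description, e.g. "... compute capability: 7.5"
  from tensorflow.python.client import device_lib

  for dev in device_lib.list_local_devices():
      if dev.device_type == "GPU":
          print(dev.physical_device_desc)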

4 Likes

If you want to apply this to any build, you can also leverage https://github.com/NixOS/nixpkgs/pull/61019 and add this to your ~/.config/nixpkgs/config.nix (I have a 3900X, so a zen2 architecture):

{
  platform.gcc.arch = "znver2";
}

or pass it as the config argument:

nix-repl> :l <nixpkgs>
Added 12685 variables.

nix-repl> hostPlatform.avxSupport
false

nix-repl> archPkgs = import ./. { config = { platform.gcc.arch = "znver2"; }; }

nix-repl> archPkgs.hostPlatform.avxSupport
true

Available options:

nix-repl> lib.attrNames lib.systems.architectures.features
[ "armv5te" "armv6" "armv7-a" "armv8-a" "bdver1" "bdver2" "bdver3" "bdver4" "broadwell" "btver1" "btver2" "default" "haswell" "ivybridge" "loongson2f" "mips32" "sandybridge" "skylake" "skylake-avx512" "westmere" "znver1" "znver2" ]

NOTE: these options default to false in normal nixpkgs, so any package that checks for these optimizations will need to be rebuilt.
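
To see which ISA features a given arch implies, you can query the same attribute set (output may vary by nixpkgs revision):

nix-repl> lib.systems.architectures.features.znver2
[ "sse3" "ssse3" "sse4_1" "sse4_2" "sse4a" "aes" "avx" "avx2" "fma" ]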

5 Likes

Hi,

The GitHub link is dead now (did the pull request disappear?). Your suggestion

{
  platform.gcc.arch = "znver2";
}

fails with
error: a 'x86_64-linux' with features {gccarch-znver2} is required to build '/nix/store/4nl4p6hv3gbv98bqgmn757f7z4haw02r-bootstrap-stage0-glibc.drv', but I am a 'x86_64-linux' with features {benchmark, big-parallel, kvm, nixos-test}

Do you happen to remember what other changes were specified in that pull request?

Thanks!

You need to tell Nix that this machine can build for the arch (well, every x86 machine can build for it, but not every one can run the checkPhase).
Nix does not yet look into /proc/cpuinfo, so this has to be done manually:

nix.systemFeatures = [ "nixos-test" "benchmark" "big-parallel" "kvm" ] ++ [ "gccarch-znver2" ];

(the first four are the default value of nix.systemFeatures)
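
Outside of NixOS, the equivalent is the system-features setting in nix.conf:

# /etc/nix/nix.conf
system-features = nixos-test benchmark big-parallel kvm gccarch-znver2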

1 Like

Thanks, rebuilding the system worked now (although only a few packages were actually recompiled, for instance LibreOffice and some smaller things). The remaining issue is that when I run:

$ gcc -march=native -Q --help=target | grep march
  -march=                               x86-64
  Known valid arguments for -march= option:

it doesn’t return znver2 as I would hope (Zen 3 CPU).

-march=native is irrelevant here; it means “try to fill gcc flags from /proc/cpuinfo”. And it has some bugs with AMD, and with older CPUs too.

The Nix ideology is not to rely on such things, as it is impure (the builder might be Skylake while the target is an older Westmere, the oldest CPU in your fleet; the danger of -march=native is that it would act like -march=skylake or whatever machine gcc happens to run on).
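
The reproducible spelling is to name the arch explicitly; the result then does not depend on the machine gcc runs on (output abbreviated):

$ gcc -march=znver2 -Q --help=target | grep march
  -march=                               znver2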

1 Like

That’s weird, because /proc/cpuinfo shows the correct info, and gcc inside any other Linux distro (Ubuntu, Arch Linux; I run several LXD containers) detects everything correctly. My problem is not rebuilding the system or building nix packages with -march=native, but being able to create a correct nix-shell development environment for C++ where the software gets built with the correct native flags.

Yes, but filling CPU flags from /proc/cpuinfo is what GCC does, not Nix.
If you have a badly supported CPU, report it.

So, it was not a “badly supported CPU”. It turns out NixOS sets an environment variable NIX_ENFORCE_NO_NATIVE=1, which causes -march=native to be ignored.

Solution:

NIX_ENFORCE_NO_NATIVE=0 nix-shell
[nix-shell:~/projects]$ gcc -march=native -Q --help=target | grep march
  -march=                               znver3
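
If you want this for a dev shell without prefixing every invocation, the variable can also be set in the shell derivation itself (a minimal sketch):

# shell.nix: let -march=native through for this environment
with import <nixpkgs> {};
mkShell {
  NIX_ENFORCE_NO_NATIVE = "0";
}
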
3 Likes

-march=native is harmful: the resulting derivations carry no mark of which arch they were built for, and they will crash on older CPUs with “Illegal instruction”.

That is what platform.gcc.arch = "znver3"; is there to solve.

3 Likes

I fully agree it can be harmful in certain situations, but if one knows what they’re doing, one would expect the toolchain to behave predictably. IMHO Nix has no business interfering with the output of gcc, much less so without any message or warning.

For whoever stumbles upon this, @volth is right that this can also be specified in your own shell as:

pkgs = import <nixos> {
  localSystem = {
    gcc.arch = "znver2";
    gcc.tune = "znver2";
    system = "x86_64-linux";
  };
};

However, depending on your build inputs it might trigger a very lengthy build process.

6 Likes

Shouldn’t -march imply -mtune?

It depends on the architecture, and perhaps even the compiler. On x86, that should be true, to my knowledge.
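
You can check with gcc directly; with an explicit -march, the tune target follows it (output abbreviated):

$ gcc -march=znver2 -Q --help=target | grep -E 'march=|mtune='
  -march=                               znver2
  -mtune=                               znver2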

Is there a way I can compile all my programs with SSE4.2 and AVX2 globally, without compiling everything from stage0?

5 Likes

+1 to the question, and I also think we absolutely need an sse4.2+avx package set cached by Hydra (regardless of whether that also rebuilds stage0).

4 Likes