TensorFlow slower running natively on NixOS than inside a Docker container

Hey all! I’m trying to make our data science infrastructure more pure. We have a TensorFlow project that does some computation. At the moment, the computation runs inside a Docker container. I’d like to run it natively on NixOS so I can get rid of Docker. The problem is that the computation is about 10% slower natively than inside the Docker container. I can’t figure out why, and I’m looking for ideas on what else to try. I’m testing on a single g4dn.xlarge AWS instance.

NixOS environment

TensorFlow is installed on NixOS like so:

{ config, pkgs, lib, ... }:
let

  # Pin nixpkgs 19.03, which still carries TensorFlow 1.13
  nixos1903 = fetchTarball {
    url = "https://github.com/nixos/nixpkgs/archive/34c7eb7545d155cc5b6f499b23a7cb1c96ab4d59.tar.gz";
    sha256 = "11z6ajj108fy2q5g8y4higlcaqncrbjm3dnv17pvif6avagw4mcb";
  };

  olderPkgs = import nixos1903 {
    inherit (config.nixpkgs) config;
    overlays = [(self: super: {
      # Reuse the running system's kernel packages and the current CUDA stack
      linuxPackages = config.boot.kernelPackages;
      inherit (pkgs) cudatoolkit cudnn;
    })];
  };

  python = olderPkgs.python36.withPackages (ps: with ps; [
    (fire.overrideAttrs (old: { doInstallCheck = false; }))
    regex
    requests
    tqdm
    numpy
    tensorflowWithCuda
  ]);

in
{
  system.stateVersion = "20.03";
  ec2.hvm = true;

  imports = [
    <nixpkgs/nixos/modules/virtualisation/amazon-image.nix>
    # [... snip ... ]
  ];

  nixpkgs.config.allowUnfree = true;
  services.xserver.videoDrivers = [ "nvidia" ];
  hardware.opengl.enable = true;
  environment.systemPackages = with pkgs; [ python ];

  # [ ... snip ...]
}

Versions of packages:

$ python --version
Python 3.6.9
$ ls /nix/store/35vk801xg9yv5crbsh716w4h3xjwapkb-python3-3.6.9-env/lib/python3.6/site-packages/ | grep "tensorflow\|numpy"
numpy
numpy-1.16.1.dist-info
tensorflow
tensorflow_estimator
tensorflow_estimator-1.13.0.dist-info
tensorflow_gpu-1.13.1.dist-info

Output of running the computation:

2020-11-11 15:42:58.362216: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-11-11 15:42:58.489764: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-11 15:42:58.490740: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x175bd80 executing computations on platform CUDA. Devices:
2020-11-11 15:42:58.490775: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2020-11-11 15:42:58.492574: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2020-11-11 15:42:58.492744: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x2b97ed0 executing computations on platform Host. Devices:
2020-11-11 15:42:58.492765: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-11-11 15:42:58.492868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
totalMemory: 14.75GiB freeMemory: 14.66GiB
2020-11-11 15:42:58.492889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-11-11 15:42:58.493876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-11 15:42:58.493893: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2020-11-11 15:42:58.493906: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2020-11-11 15:42:58.493969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14257 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
WARNING:tensorflow:From /nix/store/35vk801xg9yv5crbsh716w4h3xjwapkb-python3-3.6.9-env/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /home/test/sample.py:55: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /home/test/sample.py:57: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
WARNING:tensorflow:From /nix/store/35vk801xg9yv5crbsh716w4h3xjwapkb-python3-3.6.9-env/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2020-11-11 15:43:58.489199: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally

Docker environment

I run the tensorflow/tensorflow:1.13.2-gpu-py3 container with the --gpus flag on the same NixOS machine, after adding the following to configuration.nix:

virtualisation.docker.enable = true;
virtualisation.docker.enableNvidia = true;
hardware.opengl.driSupport32Bit = true;

Versions of packages:

root@e412bbf7b3c7:/# python --version
Python 3.6.8
root@e412bbf7b3c7:/# pip list | grep "tensorflow\|numpy"
numpy                1.16.4 
tensorflow-estimator 1.13.0 
tensorflow-gpu       1.13.2 

Output of running the computation:

2020-11-11 16:12:05.715050: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-11-11 16:12:05.851874: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-11 16:12:05.854790: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3768f80 executing computations on platform CUDA. Devices:
2020-11-11 16:12:05.854820: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2020-11-11 16:12:05.883564: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2020-11-11 16:12:05.883836: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3cbb870 executing computations on platform Host. Devices:
2020-11-11 16:12:05.883858: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-11-11 16:12:05.884004: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
totalMemory: 14.75GiB freeMemory: 14.66GiB
2020-11-11 16:12:05.884029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-11-11 16:12:05.886038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-11 16:12:05.886072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2020-11-11 16:12:05.886089: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2020-11-11 16:12:05.886182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14257 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /home/test/sample.py:55: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /home/test/sample.py:57: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2020-11-11 16:12:44.117649: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally

What I have tried so far

  • Using the 20.09 channel for everything except TensorFlow → no difference
  • Using a newer Python 3.7 → no difference
  • Using a more recent channel for TensorFlow, to get the 1.14 or 1.15 release → computation even slower
  • Diffing the two log outputs in the hope of finding something actionable → no luck

My first intuition is that the TensorFlow packaged in nixpkgs isn’t built with any CPU extensions.

You may want to try:

  (python3Packages.tensorflow.override {
    sse42Support = true;
    avx2Support = true;
    fmaSupport = true;
  })

This will cause a rebuild, but the final build output should be using those extensions.
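
In the original post’s setup, this could be applied to the pinned package set’s Python environment; a sketch, assuming the 19.03 tensorflow derivation accepts the same override flags:

  python = olderPkgs.python36.withPackages (ps: with ps; [
    # ... other packages as before ...
    (tensorflowWithCuda.override {
      sse42Support = true;
      avx2Support = true;
      fmaSupport = true;
    })
  ]);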


The Docker version doesn’t use them either (according to the log output above).

Upstream compiled 1.13.2 with CUDA compute capabilities 3.5 and 7.0. In nixpkgs, we are compiling TensorFlow 1 with capabilities 3.5 and 5.2.

The T4 has CUDA compute capability 7.5, so most likely the upstream/Docker build is benefiting from the higher capability (which enables newer GPU features). You can try to compile TensorFlow with that capability, using something like:

  (python3Packages.tensorflow.override {
    cudaCapabilities = [ "3.5" "7.5" ];
  })

(If the CUDA toolkit in that old checkout supports that capability.)
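
Both suggestions could also be combined into a single override; a sketch, with the same caveat about whether the old checkout’s CUDA stack accepts capability 7.5:

  (python3Packages.tensorflow.override {
    sse42Support = true;
    avx2Support = true;
    fmaSupport = true;
    cudaCapabilities = [ "3.5" "7.5" ];
  })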


If you want to apply this to any build, you can also leverage https://github.com/NixOS/nixpkgs/pull/61019 and add this to your ~/.config/nixpkgs/config.nix (I have a 3900X, so a Zen 2 architecture):

{
  platform.gcc.arch = "znver2";
}

or pass it as the config argument:

nix-repl> :l <nixpkgs>
Added 12685 variables.

nix-repl> hostPlatform.avxSupport
false

nix-repl> archPkgs = import ./. { config = { platform.gcc.arch = "znver2"; }; }

nix-repl> archPkgs.hostPlatform.avxSupport
true

available options:

nix-repl> lib.attrNames lib.systems.architectures.features
[ "armv5te" "armv6" "armv7-a" "armv8-a" "bdver1" "bdver2" "bdver3" "bdver4" "broadwell" "btver1" "btver2" "default" "haswell" "ivybridge" "loongson2f" "mips32" "sandybridge" "skylake" "skylake-avx512" "westmere" "znver1" "znver2" ]

NOTE: these options default to false in normal nixpkgs, so any package that checks for these optimizations will have to be rebuilt locally.
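
For context, a package “listens” for these optimizations by branching on the platform attributes; a hypothetical fragment (the avx2Support check and the configure flag are illustrative, not taken from a real package):

  # Hypothetical derivation fragment: take the AVX2 code path only when the
  # host platform advertises AVX2 support (false by default, true when e.g.
  # platform.gcc.arch = "znver2" is set)
  configureFlags = lib.optionals stdenv.hostPlatform.avx2Support [ "--enable-avx2" ];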


Hi,

The GitHub link is dead now (did the pull request disappear?). Your suggestion

{
  platform.gcc.arch = "znver2";
}

fails with
error: a 'x86_64-linux' with features {gccarch-znver2} is required to build '/nix/store/4nl4p6hv3gbv98bqgmn757f7z4haw02r-bootstrap-stage0-glibc.drv', but I am a 'x86_64-linux' with features {benchmark, big-parallel, kvm, nixos-test}

Do you happen to remember what other changes were specified in that pull request?

Thanks!

You need to tell Nix that this machine can build for that arch (well, every x86 machine can build for it, but not every one can run the checkPhase).
Nix does not yet look into /proc/cpuinfo, so this has to be done manually:

nix.systemFeatures = [ "nixos-test" "benchmark" "big-parallel" "kvm" ] ++ [ "gccarch-znver2" ];

(the first four are the default value of nix.systemFeatures)
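
On NixOS, both pieces could live together in configuration.nix; a sketch, assuming platform.gcc.arch is also honored when set through nixpkgs.config:

{
  nixpkgs.config.platform.gcc.arch = "znver2";
  # The default feature set, plus the gccarch feature the error message asks for
  nix.systemFeatures = [ "nixos-test" "benchmark" "big-parallel" "kvm" "gccarch-znver2" ];
}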


Thanks, with that the system rebuild worked (although only a few packages were actually recompiled, for instance LibreOffice and some smaller things). The remaining issue is that when I run:

$ gcc -march=native -Q --help=target | grep march
  -march=                               x86-64
  Known valid arguments for -march= option:

it doesn’t return znver2 as I would hope (this is a Zen 3 CPU).

-march=native is irrelevant here; it means “try to fill the gcc flags from /proc/cpuinfo”.
And it has some bugs with AMD, and with older CPUs too.

The Nix ideology is not to rely on such things, as it is impure (the builder might be Skylake while the target is an older Westmere, the oldest CPU in your realm; the danger of -march=native is that it would act like -march=skylake or whatever machine gcc happens to run on).


That’s weird, because /proc/cpuinfo shows the correct info, and gcc inside any other Linux distro (Ubuntu, Arch Linux; I run several LXD containers) detects everything correctly. My problem is not rebuilding the system or building Nix packages with -march=native, but being able to create a correct nix-shell development environment for C++ where the software gets built with the correct native flags.

Yes, but filling the CPU flags from /proc/cpuinfo is what GCC does, not Nix.
If you have a badly supported CPU, report it.

So it was not a “badly supported CPU” after all. It turns out NixOS sets an environment variable NIX_ENFORCE_NO_NATIVE=1, which causes -march=native to be ignored.

Solution:

NIX_ENFORCE_NO_NATIVE=0 nix-shell
[nix-shell:~/projects]$ gcc -march=native -Q --help=target | grep march
  -march=                               znver3
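
To bake this into a project’s development environment instead of remembering to set the variable, a minimal shell.nix sketch (the gcc build input is just an example):

with import <nixpkgs> {};

mkShell {
  buildInputs = [ gcc ];
  # Re-allow -march=native inside this shell; the cc-wrapper strips the flag
  # whenever NIX_ENFORCE_NO_NATIVE=1 is in the environment.
  shellHook = ''
    export NIX_ENFORCE_NO_NATIVE=0
  '';
}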

-march=native is harmful: the resulting derivations carry no mark of which arch they were built for, and they will crash on older CPUs with “Illegal instruction”.

That is what platform.gcc.arch = "znver3"; is meant to solve.


I fully agree it can be harmful in certain situations, but if one knows what they’re doing, one would expect the toolchain to behave predictably. IMHO Nix has no business interfering with the output of gcc, let alone without any message or warning.

For whoever stumbles upon this: @volth is right that this can also be specified in your own shell as:

pkgs = import <nixos> {
  localSystem = {
    gcc.arch = "znver2";
    gcc.tune = "znver2";
    system = "x86_64-linux";
  };
};

However, depending on your build inputs it might trigger a very lengthy build process.
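
For example, a hypothetical shell.nix built on top of that package set, in which the toolchain and every dependency are compiled for znver2:

let
  pkgs = import <nixos> {
    localSystem = {
      gcc.arch = "znver2";
      gcc.tune = "znver2";
      system = "x86_64-linux";
    };
  };
in
pkgs.mkShell {
  # Example build inputs; everything here comes from the znver2-tuned set
  buildInputs = with pkgs; [ cmake boost ];
}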


Shouldn’t -march imply -mtune?

Depends on the architecture and perhaps even the compiler. On x86, that should be true to my knowledge.

Is there a way I can compile all my programs with sse4.2 and avx2 globally without compiling everything from stage0?


+1 to the question, and I also think we absolutely need an sse4.2+avx package set cached by Hydra (regardless of whether that also rebuilds stage0).


Thanks, I missed that :rocket:
That reduces object-detection CPU utilization in Frigate from about 45% down to 15%.