Tensorflow slower natively on NixOS than inside a Docker container

Hey all! I’m trying to make our data science infrastructure more pure. We have a Tensorflow project that does some computation. At the moment, the computation runs inside a Docker container. I’d like to run it natively on NixOS, so I can get rid of Docker. The problem is that the computation is about 10% slower natively than it is inside the Docker container. I can’t figure out why and am looking for ideas on what else to try. I’m doing the testing on a single g4dn.xlarge AWS instance.

NixOS environment

Tensorflow is installed in NixOS like so:

{ config, pkgs, lib, ... }:
let

  nixos1903 = fetchTarball {
    url = "https://github.com/nixos/nixpkgs/archive/34c7eb7545d155cc5b6f499b23a7cb1c96ab4d59.tar.gz";
    sha256 = "11z6ajj108fy2q5g8y4higlcaqncrbjm3dnv17pvif6avagw4mcb";
  };

  olderPkgs = import nixos1903 {
    inherit (config.nixpkgs) config;
    overlays = [(self: super: {
      linuxPackages = config.boot.kernelPackages;
      inherit (pkgs) cudatoolkit cudnn;
    })];
  };

  python = olderPkgs.python36.withPackages (ps: with ps; [
    (fire.overrideAttrs (old: { doInstallCheck = false; }))
    regex
    requests
    tqdm
    numpy
    tensorflowWithCuda
  ]);

in
{
  system.stateVersion = "20.03";
  ec2.hvm = true;

  imports = [
    <nixpkgs/nixos/modules/virtualisation/amazon-image.nix>
    # [... snip ... ]
  ];

  nixpkgs.config.allowUnfree = true;
  services.xserver.videoDrivers = [ "nvidia" ];
  hardware.opengl.enable = true;
  environment.systemPackages = with pkgs; [ python ];

  # [ ... snip ...]
}

Versions of packages:

$ python --version
Python 3.6.9
$ ls /nix/store/35vk801xg9yv5crbsh716w4h3xjwapkb-python3-3.6.9-env/lib/python3.6/site-packages/ | grep "tensorflow\|numpy"
numpy
numpy-1.16.1.dist-info
tensorflow
tensorflow_estimator
tensorflow_estimator-1.13.0.dist-info
tensorflow_gpu-1.13.1.dist-info

Output of running the computation:

2020-11-11 15:42:58.362216: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-11-11 15:42:58.489764: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-11 15:42:58.490740: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x175bd80 executing computations on platform CUDA. Devices:
2020-11-11 15:42:58.490775: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2020-11-11 15:42:58.492574: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2020-11-11 15:42:58.492744: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x2b97ed0 executing computations on platform Host. Devices:
2020-11-11 15:42:58.492765: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-11-11 15:42:58.492868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
totalMemory: 14.75GiB freeMemory: 14.66GiB
2020-11-11 15:42:58.492889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-11-11 15:42:58.493876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-11 15:42:58.493893: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2020-11-11 15:42:58.493906: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2020-11-11 15:42:58.493969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14257 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
WARNING:tensorflow:From /nix/store/35vk801xg9yv5crbsh716w4h3xjwapkb-python3-3.6.9-env/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /home/test/sample.py:55: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /home/test/sample.py:57: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
WARNING:tensorflow:From /nix/store/35vk801xg9yv5crbsh716w4h3xjwapkb-python3-3.6.9-env/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2020-11-11 15:43:58.489199: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally

Docker environment

I run the tensorflow/tensorflow:1.13.2-gpu-py3 container with the --gpus flag on the same NixOS machine by adding the following to configuration.nix:

virtualisation.docker.enable = true;
virtualisation.docker.enableNvidia = true;
hardware.opengl.driSupport32Bit = true;

Versions of packages:

root@e412bbf7b3c7:/# python --version
Python 3.6.8
root@e412bbf7b3c7:/# pip list | grep "tensorflow\|numpy"
numpy                1.16.4 
tensorflow-estimator 1.13.0 
tensorflow-gpu       1.13.2 

Output of running the computation:

2020-11-11 16:12:05.715050: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-11-11 16:12:05.851874: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-11 16:12:05.854790: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3768f80 executing computations on platform CUDA. Devices:
2020-11-11 16:12:05.854820: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2020-11-11 16:12:05.883564: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2020-11-11 16:12:05.883836: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3cbb870 executing computations on platform Host. Devices:
2020-11-11 16:12:05.883858: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-11-11 16:12:05.884004: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
totalMemory: 14.75GiB freeMemory: 14.66GiB
2020-11-11 16:12:05.884029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-11-11 16:12:05.886038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-11 16:12:05.886072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2020-11-11 16:12:05.886089: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2020-11-11 16:12:05.886182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14257 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /home/test/sample.py:55: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /home/test/sample.py:57: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2020-11-11 16:12:44.117649: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally

What I have tried so far

  • Using the 20.09 channel for everything except Tensorflow -> no difference
  • Using a newer Python 3.7 -> no difference
  • Using a more recent channel for Tensorflow, to get the 1.14 or 1.15 release -> computation even slower
  • Diffing the two outputs in the hope of finding something actionable -> no luck
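For the log-diffing attempt, the two outputs only become directly comparable after stripping the run-dependent noise (timestamps, heap addresses, and the differing site-packages prefixes seen in the logs above). A small sketch of such a normalizer — the path patterns are taken from the two logs in this thread and may need adjusting:

```python
import difflib
import re

def normalize(line):
    """Strip run-dependent noise so native and Docker TF logs can be diffed."""
    line = re.sub(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+: ", "", line)  # log timestamps
    line = re.sub(r"0x[0-9a-f]+", "0xADDR", line)                            # heap addresses
    # Align the two site-packages locations to a common placeholder:
    line = re.sub(r"/nix/store/[^/]+/lib/python3\.6/site-packages", "SITE", line)
    line = re.sub(r"/usr/local/lib/python3\.6/dist-packages", "SITE", line)
    return line

def diff_logs(native_log, docker_log):
    """Unified diff of two logs after normalization; [] means no real difference."""
    a = [normalize(l) for l in native_log.splitlines()]
    b = [normalize(l) for l in docker_log.splitlines()]
    return list(difflib.unified_diff(a, b, lineterm=""))
```

With this, only substantive differences survive — e.g. dso_loader lines naming different CUDA libraries would still show up in the diff.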

My first intuition is that the Tensorflow packaged in nixpkgs doesn’t use any CPU extensions.

You may want to try:

  (python3Packages.tensorflow.override {
    sse42Support = true;
    avx2Support = true;
    fmaSupport = true;
  })

This will cause a rebuild, but the final build output should use those extensions.
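Before enabling these, it's worth confirming which of them the host CPU actually supports; the flag list in /proc/cpuinfo can be checked with a small helper like this (the mapping from override names to kernel flag spellings is my assumption about which flags these overrides correspond to):

```python
import re

# nixpkgs tensorflow override flag -> flag name as /proc/cpuinfo spells it
FLAG_MAP = {
    "sse42Support": "sse4_2",
    "avx2Support": "avx2",
    "fmaSupport": "fma",
}

def supported_overrides(cpuinfo_text):
    """Return which tensorflow override flags this CPU can actually use."""
    m = re.search(r"^flags\s*:\s*(.*)$", cpuinfo_text, re.MULTILINE)
    flags = set(m.group(1).split()) if m else set()
    return {override: kernel_flag in flags for override, kernel_flag in FLAG_MAP.items()}

# On a live system:
#   supported_overrides(open("/proc/cpuinfo").read())
```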

1 Like

The Docker version also doesn’t (according to the cpu_feature_guard line in the output).

Upstream compiled 1.13.2 with CUDA capabilities 3.5 and 7.0.

We are compiling Tensorflow 1 with capabilities 3.5 and 5.2.

The T4 has CUDA capability 7.5, so most likely the upstream/Docker build is benefiting from the higher CUDA capability (which enables newer GPU features). You can try to compile Tensorflow with that capability, using something like:

  (python3Packages.tensorflow.override {
    cudaCapabilities = [ "3.5" "7.5" ];
  })

(If that old checkout’s CUDA supports that capability.)
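Applied to the overlay from the original post, the capability and CPU-extension overrides could be combined in one place, so that withPackages picks up the overridden build. A sketch — this assumes the pinned 19.03 tensorflow derivation accepts these override arguments, which should be verified against that checkout:

```nix
olderPkgs = import nixos1903 {
  inherit (config.nixpkgs) config;
  overlays = [(self: super: {
    linuxPackages = config.boot.kernelPackages;
    inherit (pkgs) cudatoolkit cudnn;
    # Override the Python package set so withPackages sees the rebuilt Tensorflow:
    python36 = super.python36.override {
      packageOverrides = pyself: pysuper: {
        tensorflowWithCuda = pysuper.tensorflowWithCuda.override {
          cudaCapabilities = [ "3.5" "7.5" ];  # match the T4
          avx2Support = true;
          fmaSupport = true;
        };
      };
    };
  })];
};
```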

3 Likes

If you want to apply this to any build, you can also leverage https://github.com/NixOS/nixpkgs/pull/61019 and add this to your ~/.config/nixpkgs/config.nix (I have a 3900X, so a zen2 architecture):

{
  platform.gcc.arch = "znver2";
}

or pass it as the config argument:

nix-repl> :l
Added 12685 variables.

nix-repl> hostPlatform.avxSupport
false

nix-repl> archPkgs = import ./. { config = { platform.gcc.arch = "znver2"; }; }

nix-repl> archPkgs.hostPlatform.avxSupport
true

available options:

nix-repl> lib.attrNames lib.systems.architectures.features
[ "armv5te" "armv6" "armv7-a" "armv8-a" "bdver1" "bdver2" "bdver3" "bdver4" "broadwell" "btver1" "btver2" "default" "haswell" "ivybridge" "loongson2f" "mips32" "sandybridge" "skylake" "skylake-avx512" "westmere" "znver1" "znver2" ]

NOTE: these options default to false in normal nixpkgs, so enabling them will trigger rebuilds of any package that listens for these optimizations.
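For the g4dn instance in the original post, the CPU is (as far as I know) a Cascade Lake Xeon — consistent with the AVX512F in the cpu_feature_guard log line — so the closest entry in that list would presumably be skylake-avx512:

```nix
{
  platform.gcc.arch = "skylake-avx512";
}
```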

1 Like