Hey all! I’m trying to make our data science infrastructure more pure. We have a TensorFlow project that does some computation. At the moment, the computation runs inside a Docker container. I’d like to run it natively on NixOS so I can get rid of Docker. The problem is that the computation is about 10% slower natively than it is inside the Docker container. I can’t figure out why and am looking for ideas on what else to try. I’m doing the testing on a single g4dn.xlarge AWS instance.
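A comparison like this can be timed with a small harness along the following lines. This is only a sketch: the helper names `time_command` and `relative_slowdown` are illustrative, and the command shown is a placeholder that would be replaced by the real invocation of the computation in each environment.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Run cmd `runs` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        durations.append(time.perf_counter() - start)
    return durations


def relative_slowdown(baseline, candidate):
    """Return how much slower candidate is than baseline, as a fraction of baseline.

    Medians are used so that a single outlier run does not dominate.
    """
    base = statistics.median(baseline)
    return (statistics.median(candidate) - base) / base


if __name__ == "__main__":
    # Placeholder command (assumption): swap in the real computation,
    # once for the native environment and once for the Docker one.
    cmd = [sys.executable, "-c", "pass"]
    print(time_command(cmd, runs=3))
```

Running the harness once per environment and passing the two lists of durations to `relative_slowdown` gives a number that can be compared against the roughly 10% figure.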
NixOS environment
TensorFlow is installed on NixOS like so:
{ config, pkgs, lib, ... }:
let
  # Pinned nixpkgs revision used for Python and TensorFlow.
  nixos1903 = fetchTarball {
    url = "https://github.com/nixos/nixpkgs/archive/34c7eb7545d155cc5b6f499b23a7cb1c96ab4d59.tar.gz";
    sha256 = "11z6ajj108fy2q5g8y4higlcaqncrbjm3dnv17pvif6avagw4mcb";
  };
  olderPkgs = import nixos1903 {
    inherit (config.nixpkgs) config;
    # Reuse the running system's kernel packages, cudatoolkit and cudnn
    # inside the pinned package set.
    overlays = [(self: super: {
      linuxPackages = config.boot.kernelPackages;
      inherit (pkgs) cudatoolkit cudnn;
    })];
  };
  python = olderPkgs.python36.withPackages (ps: with ps; [
    # Skip the install check for fire.
    (fire.overrideAttrs (old: { doInstallCheck = false; }))
    regex
    requests
    tqdm
    numpy
    tensorflowWithCuda
  ]);
in
{
  system.stateVersion = "20.03";
  ec2.hvm = true;
  imports = [
    <nixpkgs/nixos/modules/virtualisation/amazon-image.nix>
    # [... snip ... ]
  ];
  nixpkgs.config.allowUnfree = true;
  services.xserver.videoDrivers = [ "nvidia" ];
  hardware.opengl.enable = true;
  environment.systemPackages = with pkgs; [ python ];
  # [ ... snip ...]
}
Versions of packages:
$ python --version
Python 3.6.9
$ ls /nix/store/35vk801xg9yv5crbsh716w4h3xjwapkb-python3-3.6.9-env/lib/python3.6/site-packages/ | grep "tensorflow\|numpy"
numpy
numpy-1.16.1.dist-info
tensorflow
tensorflow_estimator
tensorflow_estimator-1.13.0.dist-info
tensorflow_gpu-1.13.1.dist-info
Output of running the computation:
2020-11-11 15:42:58.362216: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-11-11 15:42:58.489764: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-11 15:42:58.490740: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x175bd80 executing computations on platform CUDA. Devices:
2020-11-11 15:42:58.490775: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2020-11-11 15:42:58.492574: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2020-11-11 15:42:58.492744: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x2b97ed0 executing computations on platform Host. Devices:
2020-11-11 15:42:58.492765: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2020-11-11 15:42:58.492868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
totalMemory: 14.75GiB freeMemory: 14.66GiB
2020-11-11 15:42:58.492889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-11-11 15:42:58.493876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-11 15:42:58.493893: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2020-11-11 15:42:58.493906: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2020-11-11 15:42:58.493969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14257 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
WARNING:tensorflow:From /nix/store/35vk801xg9yv5crbsh716w4h3xjwapkb-python3-3.6.9-env/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /home/test/sample.py:55: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /home/test/sample.py:57: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
WARNING:tensorflow:From /nix/store/35vk801xg9yv5crbsh716w4h3xjwapkb-python3-3.6.9-env/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2020-11-11 15:43:58.489199: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
Docker environment
I run the tensorflow/tensorflow:1.13.2-gpu-py3 container with the --gpus flag on the same NixOS machine, with the following added to configuration.nix:
virtualisation.docker.enable = true;
virtualisation.docker.enableNvidia = true;
hardware.opengl.driSupport32Bit = true;
Versions of packages:
root@e412bbf7b3c7:/# python --version
Python 3.6.8
root@e412bbf7b3c7:/# pip list | grep "tensorflow\|numpy"
numpy 1.16.4
tensorflow-estimator 1.13.0
tensorflow-gpu 1.13.2
Output of running the computation:
2020-11-11 16:12:05.715050: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-11-11 16:12:05.851874: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-11 16:12:05.854790: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3768f80 executing computations on platform CUDA. Devices:
2020-11-11 16:12:05.854820: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2020-11-11 16:12:05.883564: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2020-11-11 16:12:05.883836: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3cbb870 executing computations on platform Host. Devices:
2020-11-11 16:12:05.883858: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2020-11-11 16:12:05.884004: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
totalMemory: 14.75GiB freeMemory: 14.66GiB
2020-11-11 16:12:05.884029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-11-11 16:12:05.886038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-11 16:12:05.886072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2020-11-11 16:12:05.886089: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2020-11-11 16:12:05.886182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14257 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /home/test/sample.py:55: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /home/test/sample.py:57: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2020-11-11 16:12:44.117649: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
What I have tried so far
- Using the 20.09 channel for everything except TensorFlow → no difference
- Using a newer Python (3.7) → no difference
- Using a more recent channel for TensorFlow, to get the 1.14 or 1.15 release → computation was even slower
- Diffing the outputs in the hope of finding something actionable in there → no luck
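Beyond a plain diff, the timestamps in the two logs can be compared directly. The sketch below parses the `YYYY-MM-DD HH:MM:SS.ffffff:` prefix that appears on the TensorFlow log lines above and lists the gaps between consecutive timestamped lines, largest first. The helper name `log_gaps` is illustrative and not part of any tool; lines without a timestamp, such as the WARNING lines, are ignored.

```python
import re
from datetime import datetime

# Matches the timestamp prefix seen on the TensorFlow log lines in this post.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}):")


def log_gaps(lines):
    """Return (gap_seconds, line) pairs for consecutive timestamped lines.

    Each gap is paired with the later of the two lines, and the result is
    sorted with the largest gap first.
    """
    stamped = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            stamped.append((when, line))
    gaps = [
        ((later - earlier).total_seconds(), text)
        for (earlier, _), (later, text) in zip(stamped, stamped[1:])
    ]
    return sorted(gaps, key=lambda gap: gap[0], reverse=True)


if __name__ == "__main__":
    # Excerpts from the NixOS log above, shortened with "..." for readability.
    nixos_log = [
        "2020-11-11 15:42:58.493969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device ...",
        "WARNING:tensorflow:From /home/test/sample.py:55: to_float ...",
        "2020-11-11 15:43:58.489199: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally",
    ]
    for seconds, text in log_gaps(nixos_log):
        print(f"{seconds:8.3f}s before: {text[:80]}")
```

Applied to the two logs above, the largest gap in both is the one ending at the libcublas line: roughly 60 s natively versus roughly 38 s in Docker. That gap covers whatever the script does between device creation and the first cuBLAS library load, so it is a pointer for further digging rather than a diagnosis.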