Getting different results for the same build on two equally configured machines

I’m really stumped as to why this build fails remotely, but succeeds on my local machine.

  • Both machines are running ubuntu 20.04.
  • Both machines are checked out to the exact same commit of nixpkgs.
  • Both machines have the exact same overlay.
  • Both machines are x86.

It builds just fine for me locally and produces

/nix/store/525drsfp7wlbcc4rzgqfnq2wc97xglkm-python3.9-jax-0.3.1

but it fails on the CI machine trying to build

/nix/store/hpv1iy07p34xbxk8hj2rhiwqv3gcyrg8-python3.9-jax-0.3.1.drv

which notably has a different hash.

What am I missing here? How can one of these build while the other fails? What aspect of the environment have I failed to control for?

Could it be due to different CPU versions? Apparently GH Actions are run on

3rd Generation Intel® Xeon® Platinum 8370C (Ice Lake), Intel® Xeon® Platinum 8272CL (Cascade Lake), Intel® Xeon® 8171M 2.1GHz (Skylake), the Intel® Xeon® E5-2673 v4 2.3 GHz (Broadwell), or the Intel® Xeon® E5-2673 v3 2.4 GHz (Haswell) processors with Intel Turbo Boost Technology 2.0

(source1 and source2).

OTOH the build works just fine for me on an m6a class EC2 instance, which apparently runs on “3rd generation AMD EPYC processors (code named Milan) with an all-core turbo frequency of 3.6 GHz”.

Well, the .drv path is going to have a different hash than the actual output path. nix-instantiate "<nixpkgs>" -A hello will give you a drv path with a different hash than the output path printed by nix-build "<nixpkgs>" -A hello. So check that the drv paths themselves are the same on both machines. If they’re not, then that difference is probably the source of your failure. If they’re the same, then you’ve got a machine-level problem on your hands.
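
To make the distinction concrete, here is a minimal Python sketch that splits the two paths from the question into hash and name and checks whether they are the same kind of path at all. The helper names (parse_store_path, comparable) are made up for illustration and are not part of any Nix tooling; the sketch only assumes the usual /nix/store/<32-character hash>-<name> layout.

```python
from typing import NamedTuple

STORE_PREFIX = "/nix/store/"


class StorePath(NamedTuple):
    digest: str          # the 32-character hash part
    name: str            # everything after the first "-"
    is_derivation: bool  # True for .drv files, False for build outputs


def parse_store_path(path: str) -> StorePath:
    """Split a store path into hash and name (hypothetical helper)."""
    if not path.startswith(STORE_PREFIX):
        raise ValueError(f"not a store path: {path}")
    digest, sep, name = path[len(STORE_PREFIX):].partition("-")
    if not sep or len(digest) != 32:
        raise ValueError(f"unexpected store path layout: {path}")
    return StorePath(digest, name, name.endswith(".drv"))


def comparable(a: StorePath, b: StorePath) -> bool:
    """Hashes are only worth comparing between two paths of the same kind."""
    return a.is_derivation == b.is_derivation


# The two paths quoted in the question.
local = parse_store_path(
    "/nix/store/525drsfp7wlbcc4rzgqfnq2wc97xglkm-python3.9-jax-0.3.1"
)
ci = parse_store_path(
    "/nix/store/hpv1iy07p34xbxk8hj2rhiwqv3gcyrg8-python3.9-jax-0.3.1.drv"
)

# One is an output path and the other a .drv, so differing hashes are expected.
print(comparable(local, ci))
```

Comparing the hash of an output path against the hash of a .drv says nothing about whether the two machines evaluated the same derivation; only drv-to-drv (or output-to-output) comparisons are meaningful.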

Ah I didn’t notice that distinction. Thanks for pointing that out! On the working machine I’m getting

❯ nix-instantiate -A python3Packages.jax
warning: you did not specify '--add-root'; the result might be removed by the garbage collector
/nix/store/hpv1iy07p34xbxk8hj2rhiwqv3gcyrg8-python3.9-jax-0.3.1.drv

which is identical, so this must be some kind of machine-level problem :frowning:

The build fails in a Jax test with a slight numerical difference for a linear algebra operation.
I’m guessing this is coming from the difference in the number of cores on these two machines. Floating point results depend on the order of operations, so parallelism can cause such differences if the algorithms used are not specifically built for reproducibility.
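
A tiny Python sketch of the effect, not taken from the Jax test itself: the same three numbers are summed once sequentially and once as two partial sums that are then combined, which is roughly what a reduction split across two workers would do. The function names are made up for illustration.

```python
import math


def sequential_sum(values):
    """Add the values strictly left to right, as a single worker would."""
    total = 0.0
    for v in values:
        total += v
    return total


def two_worker_sum(values, split):
    """Sum two contiguous chunks separately, then combine the partial sums."""
    return sequential_sum(values[:split]) + sequential_sum(values[split:])


values = [0.1, 0.2, 0.3]

one_worker = sequential_sum(values)        # (0.1 + 0.2) + 0.3
two_workers = two_worker_sum(values, 1)    # 0.1 + (0.2 + 0.3)

print(one_worker == two_workers)               # the two results are not bit-identical
print(math.isclose(one_worker, two_workers))   # but they agree to within rounding
```

Both answers are correct to within rounding, but a test that compares against a reference value with a very tight tolerance can pass for one grouping and fail for the other.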

I guess this might be something worthwhile reporting upstream at jax with the exact hardware configuration.

In nixpkgs we might want to disable that test or change the number of cores that Jax will be using for that test, hopefully leading to the test succeeding. (Numpy parallelism can be changed via the OMP_NUM_THREADS environment variable, but there might be another setting for Jax.)

Oh! I see you’re using some sort of MKL overlay? Then you might want MKL_NUM_THREADS instead.
Also I’m not exactly sure if you might need to tweak some other setting so that the Intel MKL library will run “as desired” on an AMD machine…
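
For anyone wanting to try the thread-count idea, a rough Python sketch follows. It builds an environment with OMP_NUM_THREADS and MKL_NUM_THREADS pinned and runs a child process in it. The helper name pinned_thread_env is made up, and whether Jax itself honors these variables is an open question in this thread; the sketch only shows that the variables reach the child process, where a test command would go.

```python
import os
import subprocess
import sys

# Variables named in this thread; Jax may need a different setting (assumption).
THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS")


def pinned_thread_env(n_threads=1, base=None):
    """Return a copy of the environment with the thread-count variables set."""
    env = dict(os.environ if base is None else base)
    for name in THREAD_VARS:
        env[name] = str(n_threads)
    return env


env = pinned_thread_env(1)

# Stand-in for the real test command: the child just echoes what it sees.
probe = "import os; print(os.environ['OMP_NUM_THREADS'], os.environ['MKL_NUM_THREADS'])"
result = subprocess.run(
    [sys.executable, "-c", probe],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
seen = result.stdout.split()
print(seen)
```

The variables are set for the child process rather than in the already running interpreter because threaded math libraries typically read them when they are first loaded.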

Thanks for the pointers! I went ahead and disabled those tests on the nixpkgs side of things. I’ll have to give MKL_NUM_THREADS a try in the future.

MKL-powered results are known to be nondeterministic without additional configuration. See here: http://sc13.supercomputing.org/sites/default/files/WorkshopsArchive/pdfs/wp140s1.pdf

Thanks for these docs @alexv! This gives me pause in using MKL…

OOC, what implementations of BLAS/LAPACK does Nix use by default? I’m not able to find any info in the docs after some cursory googling.

According to https://github.com/NixOS/nixpkgs/blob/7a9ee0a0efeb4e28a8cfc58a65c3266260177ac1/pkgs/development/libraries/opencv/3.x.nix#L38 it seems to be OpenBLAS.

You may be able to make use of diffoscope as described on https://r13y.com/

Very cool, I was not aware of diffoscope!

Testing reproducibility with these numerical packages is a real mess since most of the tests look like gross_function_that_may_be_random() < epsilon. So there’s leeway for nondeterminism to go undetected.
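
A small Python illustration of that leeway, using made-up helper names and a made-up tolerance: two results that differ in their last bit both pass an epsilon check, while a bitwise comparison tells them apart.

```python
import struct


def passes_tolerance(result, expected, epsilon=1e-6):
    """The usual style of numerical test: close enough counts as a pass."""
    return abs(result - expected) < epsilon


def same_bits(a, b):
    """Compare the raw IEEE 754 double encodings of two floats."""
    return struct.pack("<d", a) == struct.pack("<d", b)


expected = 0.6
run_a = (0.1 + 0.2) + 0.3   # one evaluation order
run_b = 0.1 + (0.2 + 0.3)   # another evaluation order

both_pass = passes_tolerance(run_a, expected) and passes_tolerance(run_b, expected)
bit_identical = same_bits(run_a, run_b)

print(both_pass, bit_identical)
```

So a green test suite on two machines does not show that they produced the same numbers, only that each stayed inside the tolerance.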
