OTOH the build works just fine for me on an m6a class EC2 instance, which apparently runs on “3rd generation AMD EPYC processors (code named Milan) with an all-core turbo frequency of 3.6 GHz”.
Well, the .drv path is going to have a different hash than the actual output. nix-instantiate "<nixpkgs>" -A hello will give you a drv path with a different hash than nix-build "<nixpkgs>" -A hello. So check that the actual drv paths are the same. If they’re not, then that difference is probably the source of your failure. If they’re the same, then you’ve got a machine-level problem on your hands.
Ah, I didn’t notice that distinction. Thanks for pointing that out! On the working machine I’m getting:
❯ nix-instantiate -A python3Packages.jax
warning: you did not specify '--add-root'; the result might be removed by the garbage collector
/nix/store/hpv1iy07p34xbxk8hj2rhiwqv3gcyrg8-python3.9-jax-0.3.1.drv
which is identical to the drv path on the failing machine, so this must be some kind of machine-level problem.
The build fails in a Jax test with a slight numerical difference for a linear algebra operation.
I’m guessing this is coming from the difference in the number of cores on these two machines. Floating point math depends strongly on the order of operations, so parallelism can cause such differences if the algorithms used are not specifically built for reproducibility.
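To make the order-of-operations point concrete, here is a minimal Python sketch. It is not taken from the failing Jax test, and the helper names `sum_sequential` and `sum_two_workers` are made up for illustration. It adds the same three numbers once in a single pass and once split across two hypothetical workers, which is roughly what changes when a reduction runs on a different number of cores:

```python
import math

def sum_sequential(values):
    """Add values left to right, as a single thread would."""
    total = 0.0
    for v in values:
        total += v
    return total

def sum_two_workers(values, split):
    """Hypothetical two-worker reduction: each worker sums its own
    slice, then the two partial sums are combined."""
    return sum_sequential(values[:split]) + sum_sequential(values[split:])

values = [0.1, 0.2, 0.3]
one_pass = sum_sequential(values)       # (0.1 + 0.2) + 0.3
two_pass = sum_two_workers(values, 1)   # 0.1 + (0.2 + 0.3)

# The results are not bit-identical, but they agree to within rounding.
print(one_pass == two_pass)
print(math.isclose(one_pass, two_pass, rel_tol=1e-12))
```

The two sums differ only in the last bit here, but in a longer linear algebra computation such differences can accumulate enough to cross a tight test tolerance.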
I guess this might be worth reporting upstream to Jax, along with the exact hardware configuration.
In nixpkgs we might want to disable that test or change the number of cores that Jax uses for it, hopefully leading to the test succeeding. (Numpy parallelism can be changed via the OMP_NUM_THREADS environment variable, but there might be another setting for Jax.)
Oh! I see you’re using some sort of MKL overlay? Then you might want MKL_NUM_THREADS instead.
Also I’m not exactly sure if you might need to tweak some other setting so that the Intel MKL library will run “as desired” on an AMD machine…
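For a quick experiment outside the Nix build, the two thread-count variables named above can be pinned from Python before the numerical libraries load. This is only a sketch: `pin_blas_threads` is a hypothetical helper, and whether Jax’s CPU backend honors either variable is an assumption that would need checking.

```python
import os

THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS")

def pin_blas_threads(n_threads=1):
    """Set the OpenMP and MKL thread-count variables for this process
    and return what was set."""
    for name in THREAD_VARS:
        os.environ[name] = str(n_threads)
    return {name: os.environ[name] for name in THREAD_VARS}

# Call this before importing numpy (or anything else that loads
# OpenMP/MKL), since these libraries typically read the variables
# once, when they initialize.
settings = pin_blas_threads(1)
print(settings)
```

If the test passes with a single thread on the failing machine, that would support the theory that the reduction order is what differs between the two machines.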
Testing reproducibility with these numerical packages is a real mess since most of the tests look like gross_function_that_may_be_random() < epsilon. So there’s leeway for nondeterminism to go undetected.
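A toy version of that test shape, as a sketch: `noisy_mean` and `passes_test` are invented names, and the seed stands in for whatever differs between two machines. Both runs pass the epsilon check even though they do not produce the same number, which is exactly the leeway described above.

```python
import random

def noisy_mean(seed, n=10_000):
    """Stand-in for a numerical routine whose result varies between runs."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(n)) / n

def passes_test(value, expected=0.5, epsilon=0.05):
    """The typical tolerance-style assertion: |value - expected| < epsilon."""
    return abs(value - expected) < epsilon

a = noisy_mean(seed=0)
b = noisy_mean(seed=1)

print(passes_test(a), passes_test(b))  # both within tolerance
print(a == b)                          # yet the outputs are not identical
```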