I am trying to upgrade from NixOS 22.11 to 23.05, but I am running into an invalid opcode problem.
The issue I ran into was with tiledb as a dependency of gdal. I already raised a ticket on this, but I don’t think this is a package-specific problem; it looks more like a problem with how packages get built.
After some investigation I found out that the version of tiledb provided by the nixpkgs cache contains AVX instructions (see also the comment I made on this in the ticket). But according to this page, all binaries should have been built without those instructions.
I also tried to rebuild gdal without tiledb support, but then I ran into similar problems with the parquet dependency. So it doesn’t seem that this issue is isolated to a single package.
Did something change with NixOS 23.05 so that it now requires a higher minimum CPU instruction set, or is there an issue with the way nixpkgs packages are built?
The usual approach is for a package to build multiple versions of the functions where such instructions matter (normally the difference is significant only in a few small, very specific parts) and to choose the best version at runtime based on the CPU.
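To make the runtime-dispatch idea concrete, here is a minimal sketch in Python. It is only an illustration of the pattern, not how any particular package does it (real libraries do this in C or C++). The function names and the way the CPU feature set is passed in are made up for the example.

```python
from typing import Callable, Iterable, Sequence

# Hypothetical signature shared by every variant of the function.
DotFn = Callable[[Sequence[float], Sequence[float]], float]


def dot_generic(a: Sequence[float], b: Sequence[float]) -> float:
    """Baseline variant: must run on every CPU the package supports."""
    return sum(x * y for x, y in zip(a, b))


def dot_avx2(a: Sequence[float], b: Sequence[float]) -> float:
    """Stand-in for an AVX2-optimised variant.

    In a real library this would be compiled with AVX2 enabled; here it
    simply computes the same result so the sketch stays runnable.
    """
    return sum(x * y for x, y in zip(a, b))


def select_dot(cpu_features: Iterable[str]) -> DotFn:
    """Pick the best variant based on what the running CPU reports."""
    return dot_avx2 if "avx2" in set(cpu_features) else dot_generic
```

The point is that both variants ship in the same binary and the choice happens on the user's machine, so the result does not depend on the CPU of the machine that built the package.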
From what I can tell, TileDB is not performing such a runtime check.
Looking at the source package, it has a configure-time check to determine if AVX2 is supported.
This check is present in both 2.3.3 (NixOS 22.11) and 2.8.3 (NixOS 23.05) and doesn’t seem to have changed.
So has the build environment somehow changed between 22.11 and 23.05, so that the build of this package now suddenly determines that AVX2 is available? This is why I fear that this issue may now affect more packages.
In any case, disabling AVX2 for tiledb seems to be configurable, so I will see if I can provide a PR to fix this.
It’s certainly possible there’s some non-AVX2 builder in the build farm.
However, a build that depends on the builder’s CPU configuration would be considered an impurity and therefore a bug. We currently target an ancient x86 CPU µarch for all our packages by default; a package requiring any sort of vector extensions would be quite out of the ordinary.
Packages that provide optional vector extension support at runtime can have it enabled, as they fall back to non-vector code when running on a lesser CPU, but requiring vector extensions should not happen.
If you make a PR to fix that, please add an override flag for those who know they have current hardware and want better perf.
Try to get a backtrace for the illegal instruction in your program. That might help nail down where it comes from: the program itself or some library.
Sometimes libraries use the -march=native compiler flag and thereby encode assumptions about the builder machine. I think zig (as an example) does it by default unless overridden.
I still don’t see a backtrace there. Can you post an exact link? Sometimes the mere presence of unsupported instructions is fine, if the library supports multiversioning (examples would be glibc and lz4).
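For a quick presence check (which, as said, does not prove the instructions are actually reached on an old CPU), something like the following rough sketch can be run over the text output of objdump -d for the library in question. It is a heuristic under stated assumptions: it only looks for the 256-bit ymm and 512-bit zmm register names, so AVX instructions that operate on xmm registers are not counted, and the function name is made up for the example.

```python
import re

# ymm (256-bit) and zmm (512-bit) registers are only usable by AVX-family
# instructions, so a line mentioning them is a sign of AVX/AVX-512 code.
WIDE_REGISTER = re.compile(r"\b([yz]mm)\d+\b")


def count_wide_vector_lines(disassembly: str) -> dict[str, int]:
    """Count disassembly lines that reference ymm or zmm registers."""
    counts = {"ymm": 0, "zmm": 0}
    for line in disassembly.splitlines():
        # Collect each register kind at most once per line.
        kinds = {match.group(1) for match in WIDE_REGISTER.finditer(line)}
        for kind in kinds:
            counts[kind] += 1
    return counts
```

A non-zero count in a library such as glibc is expected and harmless because the fast paths are guarded by a runtime check. It becomes a problem only when the backtrace shows the faulting instruction sitting in code that runs unconditionally.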
As far as I can tell, tiledb does not use runtime checks for AVX; it is a configure/build-time option that is enabled by default. As mentioned earlier, the proposed fix (for the derivation) is to turn this option off by default, but allow it to be enabled again through a derivation setting.
On a related note, when I tried to build gdal without tiledb on the system without AVX, I got another illegal instruction during the test run:
ogr/ogr_parquet.py .Fatal Python error: /nix/store/yhzmjsnxrgy27fs7kfjl8klc2qigliyn-pytest-check-hook/nix-support/setup-hook: line 53: 13726 Illegal instruction (core dumped) /nix/store/763kk6xg6vaslzh1hgvwgdk1h582b7s3-python3-3.10.12/bin/python3.10 -m pytest -k "not test_jp2openjpeg_45 and not test_transformer_dem_overrride_srs and not test_osr_ct_options_area_of_interest and not test_sentinel2_zipped and not test_SetPROJAuxDbPaths" --ignore="gcore/vsis3.py" --ignore="gdrivers/gdalhttp.py" --ignore="gdrivers/wms.py"
/nix/store/91hwyc6c4r474zs56n9idbqj17dlcnna-stdenv-linux/setup: line 1604: pop_var_context: head of shell_variables not a function context
Since parquet is part of arrow, I think this is related to this post. However, I did not see a resolution to what was discussed there.
Interesting. I hadn’t looked at the code yet. It is probably not L98 though (since that is in a #if .. && 0 block), but rather L213.
So probably something else is going on. I would suggest continuing the discussion in the ticket. I will add a stack trace there too.
On legacy x86_64 platforms do not support SSE4.2, Arrow binary may fail with SIGILL (Illegal Instruction). User must rebuild Arrow and PyArrow from scratch by setting cmake option ARROW_SIMD_LEVEL=NONE.
That is kind of the inverse of what is stated at the top of this page.
If we want to follow the Nix conventions, then I would guess that the default for the nix package should be ARROW_SIMD_LEVEL=NONE, with a mechanism for users to override this and get an optimised version.
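To check whether a given machine falls into the "legacy" category from the Arrow note quoted above, here is a small sketch. It assumes Linux on x86_64, where /proc/cpuinfo contains a "flags" line and SSE4.2 is listed as sse4_2. The function names are made up for the example.

```python
def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    # No 'flags' line found (for example on a non-x86 machine).
    return set()


def missing_features(cpuinfo_text: str, required=("sse4_2",)) -> list[str]:
    """List the required features that the CPU does not report."""
    flags = cpu_flags(cpuinfo_text)
    return [name for name in required if name not in flags]
```

Calling missing_features on the contents of /proc/cpuinfo with required=("sse4_2", "avx", "avx2") shows which of those extensions the machine lacks. An empty list means none of them should trigger an illegal instruction there.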