As new users enter your ecosystem they will google things, and the first thing they find is the thing they’ll use. Also, I’d argue I’m following the recommendation of that link - it says # Set to false for proprietary drivers, and I am using the proprietary drivers, not Nouveau. Regardless, I have flipped to open = true.
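For reference, a minimal sketch of the setting as it now stands in my configuration (hardware.nvidia.open is the NixOS option that wiki comment refers to):

# configuration.nix - use NVIDIA's open kernel module rather than the fully proprietary one
hardware.nvidia.open = true;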
I understand that and have a reproducible case - I am looking for guidance on how to debug this further or what I can provide you to debug this further.
$ nix-channel --list
nixos https://nixos.org/channels/nixos-24.11 # System is using this.
nixos-unstable https://nixos.org/channels/nixos-unstable # only used for Wine 10.
I believe the commits you are asking about are provided in the output of my logs above; if not, please let me know which.
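In case it helps, this should print the exact nixpkgs revision the running system was built from (a standard nixos-version flag; it relies on the channel shipping its git revision, which the official channels do):

nixos-version --revision   # full git commit of the nixpkgs this system was built from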
And yes, I understand that. But, again, I have no SMART errors and no stability issues outside this one item. If there is some other way you’d like me to validate that the NVMe is behaving correctly, please provide it.
Yes, unfortunately. This is why I make sure to link such users to the real wiki every time this comes up, because the other one remains higher in page rank so Google lists it first; linking to the proper wiki in reply to a link to the wrong wiki at least keeps the score even. This also helps such users know that Google is unreliable in this case.
Right, so, nvidia’s driver is also open source now, and you are in fact not using a proprietary driver when you set that to true, but also not nouveau. Maybe that needs clarifying on the wiki.
Honestly IMO we should just set the default in nixpkgs to open and purge any reference to that option from the web.
Yep, that looks right. Something’s still very broken here. I guess it’s time to look at the build.
It could be a nix bug, but that seems really unlikely given the amount of testing nix has had between all its users, and how specific this bug would therefore need to be. Guess I’ll try building this driver until it fails…
Well, it happened again this morning when updating to 6.6.75.
3 failures with seg faults and similar errors at different points in the build, then magically on the 4th build it succeeded.
I was able to capture the full outputs of each run - Fail 1 · GitHub - these were all run within a few-minute time span.
I do agree it seems unlikely to be a nix bug - from my (basic) understanding of how this works, it shouldn’t happen - but here we are with such a strange inconsistency! Thanks for taking the time to look into this further.
The other errors are driver-internal, too, as if part of the driver’s sources were incompatible with the rest. I don’t see how an issue with the build could cause this.
Have you tried doing a nix store check/repair, in case your gcc is corrupted or such?
As noted above, I ran a memtest recently, but I will run one again. (And note that nothing else seems to fail this way - I use this machine all day building other packages!)
I do not know how to run this - can you provide the command? My best guess is sudo nix store verify /nix/store/*, which gives me 498 "... is untrusted" items and exits with 2.
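For reference, a sketch of what I believe the intended check/repair looks like with the classic CLI (these flags exist in stock nix; run as root since repairing rewrites the store):

# Verify every store path's contents against its recorded hash (can take a while):
sudo nix-store --verify --check-contents
# Same, but re-substitute/repair any path that fails verification:
sudo nix-store --verify --check-contents --repair
# The newer CLI equivalent is roughly: nix store verify --all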
#!/usr/bin/env bash
set -e

successful_builds=0
while sleep 1; do
  nix build --file - <<EOF
let
  pkgs = import (builtins.getFlake "github:NixOS/nixpkgs?rev=7672ef7452230390cc0e171e1ad0139f45736d4b") {
    system = "x86_64-linux";
    config.allowUnfree = true;
  };
in
pkgs.linuxKernel.packages.linux_6_6.nvidia_x11
EOF
  successful_builds=$((successful_builds + 1))
  echo "Built driver successfully ${successful_builds} times"
  # Delete the nvidia driver again
  store_path="$(readlink ./result-bin)"
  rm ./result-bin
  nix store delete "$store_path"
done
Edit: Ah, actually, this will require you not having that store path as part of your system closure. I’d suggest updating before you do that; the 6.12 kernel seems to have landed in stable, so that will bring you to a state where that nvidia kernel version can be deleted.
If you can’t update, pick a nixpkgs revision where you’ve had this failure before.
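If you want to check whether the driver path is still rooted by your system before running the loop, something like this should tell you (standard nix commands; the store path below is just a placeholder):

# Which GC roots keep a given store path alive? Empty output means it can be deleted.
nix-store --query --roots /nix/store/<hash>-nvidia-x11-<version>
# Or check whether the running system closure still references it:
nix why-depends "$(readlink -f /run/current-system)" /nix/store/<hash>-nvidia-x11-<version>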
Since I experienced this issue on 6.12 as well, I used 6.12 in the script instead of matching the system (I got a hard lockup on 6.12 and had to roll back), but I will also test it that way.
Memtest - 4 full runs, no errors
Results of the script with linux_6_12.nvidia_x11 instead of 6.6:
Failed the first time with a seg fault:
/build/NVIDIA-Linux-x86_64-565.77/kernel/nvidia-uvm/uvm_pmm_test.c: In function 'destroy_test_chunk':
/build/NVIDIA-Linux-x86_64-565.77/kernel/nvidia-uvm/uvm_pmm_test.c:468:1: internal compiler error: Segmentation fault
468 | }
Built 4 times successfully
Failed on the 5th with the same seg fault.
Now I will upgrade the system, reboot, and try 6.6 just to validate the same inconsistency.
edit: 6.6 building (from a 6.12 system)
Succeeded 2 times
Failed 3rd time
In file included from /build/NVIDIA-Linux-x86_64-565.77/kernel/nvidia-uvm/uvm_va_space.h:41,
                 from /build/NVIDIA-Linux-x86_64-565.77/kernel/nvidia-uvm/uvm_api.h:34,
                 from /build/NVIDIA-Linux-x86_64-565.77/kernel/nvidia-uvm/uvm_thread_context_test.c:23:
/build/NVIDIA-Linux-x86_64-565.77/kernel/nvidia-uvm/uvm_va_block.h: In function 'uvm_page_mask_complement':
/build/NVIDIA-Linux-x86_64-565.77/kernel/nvidia-uvm/uvm_va_block.h:1826:5: internal compiler error: Segmentation fault
 1826 |     bitmap_complement(mask_out->bitmap, mask_in->bitmap, PAGES_PER_UVM_VA_BLOCK);
      |     ^~~~~~~~~~~~~~~~~
edit again:
All just so random. I had 1 success, then 3 successes, then 11 successes before a failure. Again a seg fault.
Right, I had 100 consecutive successes and no failures. I asked for the specific commit/kernel to ensure nothing else differs software-wise, using flakes to get fully pure builds. Your error messages and the behavior are consistent with hardware failure.
Since memory and drive appear to be fine, I’m actually agreeing with @Postboote here. Despite the semi-deterministic behavior, it sounds like a faulty CPU. Do you happen to be on a somewhat recent Intel CPU? They’re facing a class action for knowingly selling faulty hardware; you may be affected by that.
Even if not, I don’t have any other explanation besides gremlins.
It’s an AMD 3900X that I’ve been running for years; it worked in Windows 10 that whole time and continues to work - just not when compiling nvidia. Did badblocks on the NVMe drive - no issues. I don’t know a way to test a CPU. Prime95? I really can’t understand why it’s just this one thing that shows the error.
It’s a relatively CPU intensive task, and takes quite a bit longer than normal builds. It just might push the CPU that little bit harder to tickle out a failure. Maybe part of a cache is broken and you don’t fully utilize the cache any other way. Perhaps it’s a specific instruction. Or maybe you just haven’t noticed other corruption and just think it’s normal.
Testing CPU health is indeed hard. I’d personally consider swapping it out for a known good one and repeating that test; if you continue to see failures after that it’s time to reinstall and follow the manual more closely, but I cannot see any way that would actually help - that nix build command should shield you from any OS installation issues. Could still be a faulty motherboard I guess? Personally, I’d be looking for an exorcist at that point.
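If you want something quick to run in the meantime, stress-ng from nixpkgs can hammer the CPU with self-verifying workloads - not a targeted test, just an easy sanity check (the flags below are standard stress-ng options):

# Run all CPU stressor methods on every thread for 30 minutes, verifying results:
nix-shell -p stress-ng --run "stress-ng --cpu 0 --cpu-method all --verify --timeout 30m"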
Maybe before buying a new CPU, try a full reinstallation of your system. If the failure then continues, it is time to test with a newer CPU.
But before that: I found a stress test that compiles GCC and takes around 1,000 seconds on a Ryzen 3900, so this test could indicate a problem with the CPU. To use it:
It doesn’t hurt to try, but nix build effectively is a full “reinstallation”.
It’d be easier to boot the NixOS installer and run the driver build script I shared from in there, if you do want to try that anyway. We have a reproduction scenario that doesn’t require a full OS at this point, just a copy of nix.
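A minimal sketch of what that could look like from the live ISO, assuming network access and the script saved as ./build-loop.sh (a name I’m making up here); NIX_CONFIG is a standard way to enable the experimental features in case the live image doesn’t already:

# In the NixOS live environment:
export NIX_CONFIG="experimental-features = nix-command flakes"
bash ./build-loop.sh   # the build-loop script from my earlier post, copied over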
Updating firmware might be a reasonable idea, too.
Updated BIOS - seems to have helped reduce hard crashes actually.
Realized the RAM wasn’t in the right banks for dual channel (doh!) - fixed.
Pulled off the CPU cooler and repasted - it was dry/crusty (also going to order some better paste; I only had some 10+ year old stuff).
Played with XMP, voltages, etc.; nothing really helped.
Now that the BIOS update stopped the hard crashes, I noticed a pattern of which CPUs would “die” during stress tests.
Walked through each core, disabling them while the nix build of nvidia was running (getting to at least 6 successes before proceeding).
Found that 3 cores (6 HT threads), or 25% of my cores, are unstable!
I’ll need to find a way to set this early on boot (any suggestions?) - something like echo 0 | tee /sys/devices/system/cpu/cpu{4,5,6,7,18,19}/online
Hopefully this limps me along for a while - I was planning on upgrading - but not this soon!
I want to say thanks to everyone who replied in this thread and offered guidance and suggestions - I learned a lot about how NixOS works while debugging this odd hardware failure, and I have not lost trust in the nix build system.
You wrote that you changed the voltage. Did you turn it down or increase it?
If these cores are unstable and causing the issue, it’s a matter of time before other cores might fail too. (If you overclocked the CPU and turned up the voltage, it’s more likely.) So I would recommend you buy a new CPU. You are on the AM4 platform, so switching to, let’s say, a 5900 is rather easy.
I did go up and down steps each way in voltage - nothing changed stability much.
When I first got the CPU 5-6 years ago I ran the AutoOC, as I was curious how fast it could go, but beyond that I left it stock with auto boost. I will likely buy a new 5000-series AM4 CPU, but there’s a lot going on, so if I can limp by for 6-12 months I’ll be happy. I’ll keep good backups and a close eye out for random failures.
I have a friend mailing me a known good AM4 CPU to test so I can 100% rule out the motherboard as well.
I had misread the option - the possible_cpus option specifies the number of possible CPUs. It might be better to use maxcpus=4 and online the rest later.
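A sketch of that approach, untested and assuming the stock 0-23 thread numbering of a 3900X with the flaky threads being 4-7, 18 and 19: limit boot to the first four CPUs via the kernel command line, then bring the known-good ones back online afterwards (as root):

# configuration.nix (assumption): boot.kernelParams = [ "maxcpus=4" ];
# After boot, online everything except cpu4-7, cpu18 and cpu19:
for cpu in {8..17} {20..23}; do
  echo 1 > /sys/devices/system/cpu/cpu${cpu}/online
done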
Unsure this will trigger by itself, but well, generally modifying device state should be done with udev. Not that I’ve done so for CPU hotplugging before, so share whatever works if you figure it out.
You can also turn that into a package and use the services.udev.packages option if you want a bit more control and separation between your udev rules.
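For completeness, a hedged sketch of what that could look like inline via services.udev.extraRules (neither of us has verified that a rule on the cpu subsystem actually fires early enough, so treat it as a starting point):

# configuration.nix - untested: offline the flaky threads as udev sees their cpu devices
services.udev.extraRules = ''
  SUBSYSTEM=="cpu", KERNEL=="cpu4|cpu5|cpu6|cpu7|cpu18|cpu19", ATTR{online}="0"
'';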