As new users enter your ecosystem they will google things, and the first thing they find is the thing they’ll use. Also, I’d argue I’m following the recommendation of that link - it says # Set to false for proprietary drivers, and I am using the proprietary drivers, not Nouveau. Regardless, I have flipped to open = true.
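For reference, a minimal sketch of the setting as it now stands in my configuration (hardware.nvidia.open is the NixOS option that wiki comment refers to):

# configuration.nix - use NVIDIA's open kernel module rather than the fully proprietary one
hardware.nvidia.open = true;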
I understand that and have a reproducible case - I am looking for guidance on how to debug this further or what I can provide you to debug this further.
$ nix-channel --list
nixos https://nixos.org/channels/nixos-24.11 # System is using this.
nixos-unstable https://nixos.org/channels/nixos-unstable # only used for Wine 10.
I believe the commits you are asking about are provided in the output of my logs above; if not, please let me know which.
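In case it helps, this should print the exact nixpkgs revision the running system was built from (a standard nixos-version flag; it relies on the channel shipping its git revision, which the official channels do):

nixos-version --revision   # full git commit of the nixpkgs this system was built from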
And yes, I understand that. But, again, I have no SMART errors and no stability issues outside this one item. If there is some other way you’d like me to validate that the NVMe is behaving correctly, please provide it.
Yes, unfortunately. This is why I make sure to link such users to the real wiki every time this comes up, because the other one remains higher in page rank so Google lists it first; linking to the proper wiki in reply to a link to the wrong wiki at least keeps the score even. This also helps such users know that Google is unreliable in this case.
Right, so, nvidia’s driver is also open source now, and you are in fact not using a proprietary driver when you set that to true, but also not nouveau. Maybe that needs clarifying on the wiki.
Honestly IMO we should just set the default in nixpkgs to open and purge any reference to that option from the web.
Yep, that looks right. Something’s still very broken here. I guess it’s time to look at the build.
It could be a nix bug, but that seems really unlikely given the amount of testing nix has had between all its users, and how specific this bug would therefore need to be. Guess I’ll try building this driver until it fails…
Well, it happened again this morning when updating to 6.6.75.
3 failures with seg faults and similar errors at different points in the build, then magically on the 4th build it succeeded.
I was able to capture the full outputs of each run - Fail 1 · GitHub - these were all run within a few-minute time span.
I do agree it seems unlikely to be a nix bug - from my (basic) understanding of how this works, it shouldn’t happen - but here we are with such a strange inconsistency! Thanks for taking the time to look into this further.
The other errors are driver-internal, too, as if part of the driver’s sources were incompatible with the rest. I don’t see how an issue with the build could cause this.
Have you tried doing a nix store check/repair, in case your gcc is corrupted or such?
As noted above, I ran a memtest recently, but I will run one again. (And note that nothing else seems to fail this way - I use this machine all day building other packages!)
I do not know how to run this - can you provide the command? My best guess is sudo nix store verify /nix/store/*, which gives me 498 "... is untrusted" items and exits with 2.
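For reference, a sketch of what I believe the intended check/repair looks like with the classic CLI (these flags exist in stock nix; run as root since repairing rewrites the store):

# Verify every store path's contents against its recorded hash (can take a while):
sudo nix-store --verify --check-contents
# Same, but re-substitute/repair any path that fails verification:
sudo nix-store --verify --check-contents --repair
# The newer CLI equivalent is roughly: nix store verify --all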
#!/usr/bin/env bash
set -e

successful_builds=0
while sleep 1; do
  nix build --file - <<EOF
let
  pkgs = import (builtins.getFlake "github:NixOS/nixpkgs?rev=7672ef7452230390cc0e171e1ad0139f45736d4b") {
    system = "x86_64-linux";
    config.allowUnfree = true;
  };
in
pkgs.linuxKernel.packages.linux_6_6.nvidia_x11
EOF
  successful_builds=$((successful_builds + 1))
  echo "Built driver successfully ${successful_builds} times"
  # Delete the nvidia driver again
  store_path="$(readlink ./result-bin)"
  rm ./result-bin
  nix store delete "$store_path"
done
Edit: Ah, actually, this will require you not having that store path as part of your system closure. I’d suggest updating before you do that; the 6.12 kernel seems to have landed in stable, so that will bring you to a state where that nvidia kernel version can be deleted.
If you can’t update, pick a nixpkgs revision where you’ve had this failure before.
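If you want to check whether the driver path is still rooted by your system before running the loop, something like this should tell you (standard nix commands; the store path below is just a placeholder):

# Which GC roots keep a given store path alive? Empty output means it can be deleted.
nix-store --query --roots /nix/store/<hash>-nvidia-x11-<version>
# Or check whether the running system closure still references it:
nix why-depends "$(readlink -f /run/current-system)" /nix/store/<hash>-nvidia-x11-<version>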
Since I experienced this issue on 6.12 as well, I used 6.12 in the script instead of matching the system (I got a hard lockup on 6.12 and had to roll back), but I will also test it that way.
Memtest - 4 full runs, no errors
Results of the script with linux_6_12.nvidia_x11 instead of 6.6:
Failed the first time with a seg fault:
/build/NVIDIA-Linux-x86_64-565.77/kernel/nvidia-uvm/uvm_pmm_test.c: In function 'destroy_test_chunk':
/build/NVIDIA-Linux-x86_64-565.77/kernel/nvidia-uvm/uvm_pmm_test.c:468:1: internal compiler error: Segmentation fault
468 | }
Built 4 times successfully
Failed on the 5th with the same seg fault.
Now I will upgrade the system, reboot, and try 6.6 just to validate the same inconsistency.
edit: 6.6 building (from a 6.12 system)
Succeeded 2 times
Failed 3rd time
In file included from /build/NVIDIA-Linux-x86_64-565.77/kernel/nvidia-uvm/uvm_va_space.h:41,
                 from /build/NVIDIA-Linux-x86_64-565.77/kernel/nvidia-uvm/uvm_api.h:34,
                 from /build/NVIDIA-Linux-x86_64-565.77/kernel/nvidia-uvm/uvm_thread_context_test.c:23:
/build/NVIDIA-Linux-x86_64-565.77/kernel/nvidia-uvm/uvm_va_block.h: In function 'uvm_page_mask_complement':
/build/NVIDIA-Linux-x86_64-565.77/kernel/nvidia-uvm/uvm_va_block.h:1826:5: internal compiler error: Segmentation fault
 1826 |     bitmap_complement(mask_out->bitmap, mask_in->bitmap, PAGES_PER_UVM_VA_BLOCK);
      |     ^~~~~~~~~~~~~~~~~
edit again:
All just so random. I had 1 success, then 3 successes, then 11 successes before a failure. Again a seg fault.
Right, I had 100 consecutive successes and no failures. I asked for the specific commit/kernel to ensure nothing else differs software-wise, using flakes to get fully pure builds. Your error messages and the behavior are consistent with hardware failure.
Since memory and drive appear to be fine, I’m actually agreeing with @Postboote here. Despite the semi-deterministic behavior, it sounds like a faulty CPU. Do you happen to be on a somewhat recent Intel CPU? They’re facing a class action for knowingly selling faulty hardware; you may be affected by that.
Even if not, I don’t have any other explanation besides gremlins.
It’s an AMD 3900X that I’ve been running for years; it worked in Windows 10 that whole time and continues to work - just not when compiling nvidia. Did badblocks on the NVMe drive - no issues. I don’t know a way to test a CPU. Prime95? I really can’t understand why it’s just this one thing that shows the error.
It’s a relatively CPU intensive task, and takes quite a bit longer than normal builds. It just might push the CPU that little bit harder to tickle out a failure. Maybe part of a cache is broken and you don’t fully utilize the cache any other way. Perhaps it’s a specific instruction. Or maybe you just haven’t noticed other corruption and just think it’s normal.
Testing CPU health is indeed hard. I’d personally consider swapping it out for a known good one and repeating that test; if you continue to see failures after that it’s time to reinstall and follow the manual more closely, but I cannot see any way that would actually help - that nix build command should shield you from any OS installation issues. Could still be a faulty motherboard I guess? Personally, I’d be looking for an exorcist at that point.
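If you want something quick to run in the meantime, stress-ng from nixpkgs can hammer the CPU with self-verifying workloads - not a targeted test, just an easy sanity check (the flags below are standard stress-ng options):

# Run all CPU stressor methods on every thread for 30 minutes, verifying results:
nix-shell -p stress-ng --run "stress-ng --cpu 0 --cpu-method all --verify --timeout 30m"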
Maybe before buying a new CPU, try a full reinstallation of your system. If the failure then continues, it is time to test with a newer CPU.
But before that: I found a stress test that compiles GCC and takes around 1,000 seconds on a Ryzen 3900, so this test could indicate a problem with the CPU. To use it:
It doesn’t hurt to try, but nix build effectively is a full “reinstallation”.
It’d be easier to boot the NixOS installer and run the driver build script I shared from in there, if you do want to try that anyway. We have a reproduction scenario that doesn’t require a full OS at this point, just a copy of nix.
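A minimal sketch of what that could look like from the live ISO, assuming network access and the script saved as ./build-loop.sh (a name I’m making up here); NIX_CONFIG is a standard way to enable the experimental features in case the live image doesn’t already:

# In the NixOS live environment:
export NIX_CONFIG="experimental-features = nix-command flakes"
bash ./build-loop.sh   # the build-loop script from my earlier post, copied over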
Updating firmware might be a reasonable idea, too.
Updated BIOS - seems to have helped reduce hard crashes actually.
Realized the RAM wasn’t in the right banks for dual channel (doh!) - fixed.
Pulled off the CPU cooler and repasted - it was dry/crusty (also going to order some better paste; I only had some 10+ year old stuff).
Played with XMP, voltages, etc.; nothing really helped.
Now that the BIOS update stopped the hard crashes, I noticed a pattern of which CPUs would “die” during stress tests.
Walked through each core, disabling them while the nix build of nvidia was running (getting to at least 6 successes before proceeding).
Found that 3 cores (6 HT threads), or 25% of my cores, are unstable!
I’ll need to find a way to set this early on boot (any suggestions?) - something like echo 0 | tee /sys/devices/system/cpu/cpu{4,5,6,7,18,19}/online
Hopefully this limps me along for a while - I was planning on upgrading - but not this soon!
I want to say thanks to everyone who replied in this thread and offered guidance and suggestions - I learned a lot about how NixOS works while debugging this odd hardware failure, and I have not lost trust in the nix build system.
You wrote that you changed the voltage. Did you turn it down or increase it?
If these cores are unstable and causing the issue, it’s a matter of time before other cores might fail too. (If you overclocked the CPU and turned up the voltage, it’s more likely.) So I would recommend you buy a new CPU. You are on the AM4 platform, so switching to, let’s say, a 5900 is rather easy.
I did go up and down steps each way in voltage - nothing changed stability much.
When I first got the CPU 5-6 years ago I ran the AutoOC, as I was curious how fast it could go, but beyond that I left it stock with auto boost. I will likely buy a new 5000-series AM4 CPU, but there’s a lot going on, so if I can limp by for 6-12 months I’ll be happy. I’ll keep good backups and a close eye out for random failures.
I have a friend mailing me a known good AM4 CPU to test so I can 100% rule out the motherboard as well.
I had misread the option - the possible_cpus option specifies the number of possible CPUs. It might be better to use maxcpus=4 and online the rest later.
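A sketch of that approach, untested and assuming the stock 0-23 thread numbering of a 3900X with the flaky threads being 4-7, 18 and 19: limit boot to the first four CPUs via the kernel command line, then bring the known-good ones back online afterwards (as root):

# configuration.nix (assumption): boot.kernelParams = [ "maxcpus=4" ];
# After boot, online everything except cpu4-7, cpu18 and cpu19:
for cpu in {8..17} {20..23}; do
  echo 1 > /sys/devices/system/cpu/cpu${cpu}/online
done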
Unsure this will trigger by itself, but well, generally modifying device state should be done with udev. Not that I’ve done so for CPU hotplugging before, so share whatever works if you figure it out.
You can also turn that into a package and use the services.udev.packages option if you want a bit more control and separation between your udev rules.
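For completeness, a hedged sketch of what that could look like inline via services.udev.extraRules (neither of us has verified that a rule on the cpu subsystem actually fires early enough, so treat it as a starting point):

# configuration.nix - untested: offline the flaky threads as udev sees their cpu devices
services.udev.extraRules = ''
  SUBSYSTEM=="cpu", KERNEL=="cpu4|cpu5|cpu6|cpu7|cpu18|cpu19", ATTR{online}="0"
'';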