Continuous Freezes and Machine Check Error

Hi!
I’ve been having PC problems for a few months now.
Continuous freezes (about once a day, in a “non-deterministic” way), with practically no logs about them.
Sometimes it freezes completely and I have to reboot. Other times it restarts by itself, and just before it asks me for the LUKS password I see the following errors:

[Hardware Error]: CPU 3: Machine Check: 0 Bank 0...
[Hardware Error]: TSC 0 MISC
[Hardware Error]: PROCESSOR 2:800F11 TIME 1639044401 SOCKET 0 APIC 8 microcode 8001130

I guess the problem is related to AMD… I have these settings in configuration.nix:


  boot.kernelParams = [ "amdgpu.ppfeaturemask=0xffffbffb" ];

  hardware = {
    enableRedistributableFirmware = true;
    cpu.amd.updateMicrocode = true;
    opengl.extraPackages = with pkgs; [
      amdvlk
    ];
    ## enable vulkan
    opengl.driSupport = true;
    opengl.driSupport32Bit = true;
  };

  services.xserver.videoDrivers = [ "amdgpu" ];

I’ve also tried different kernel versions, but I still get the same issue (currently on 5.15.4).

A bank is mentioned, maybe a RAM error?

Similar? “mce: hardware error cpu 0 machine check 0 bank 0. System just freezes, everything just stops including mouse on Linux Mint XFCE 20” (Unix & Linux Stack Exchange)

After a bit of googling I stumbled across this via a thread over on archland discussing similar issues: GitHub - suaefar/ryzen-test: Tools to reproduce randomly crashing processes under load on AMD Ryzen processors on Linux

It seems problems like this are a known issue with some AMD CPUs, and people are getting them RMA’d. I’d give that test a go and, if it comes back positive, check with support. Good luck!

I did a memtest a while ago and it was okay :confused:

OK, have you already tried resetting the BIOS to factory defaults?

Is there a newer BIOS?

CPU too hot?

And disable the microcode line in configuration.nix?
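
For the microcode test, a minimal sketch of the change, based on the settings posted above. It simply flips the existing option rather than deleting the line, so it is easy to revert:

```nix
{
  # Temporarily stop NixOS from loading the AMD microcode update at boot.
  # Set back to true once the test is done.
  hardware.cpu.amd.updateMicrocode = false;
}
```

After a rebuild and reboot, the microcode field in the [Hardware Error] lines (or in dmesg) should tell you whether a different revision is actually running.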

I have to try things one at a time… I’ve never tried removing the microcode line.
I’ve never tried resetting the BIOS to factory defaults either, and I just saw there’s a new BIOS version, BUT it’s in beta…

Resetting the BIOS to factory defaults is probably the easiest thing to try?

Does the error you see also occur when you switch the PC on after it has been off for a long time (when it is cold)? Or does it only happen when the PC has been in use for a certain time (warm or hot)?

You say the problems have been there for a few months now. Were there any (hardware, software, environmental) changes around then?

I’ve updated the BIOS to the latest version available (not the beta). The temperatures of both CPU and GPU are totally fine. And the freezes happen totally at random, even when I’ve just turned on the PC.

If there are 2 RAM banks, could you swap the memory modules? Just to see if the error changes… If there are more banks, swap those too, so that every module ends up in a different slot.

Just tried. And 5 min ago another crash. I’ll share what dmesg shows: https://clbin.com/MM8Np

You can get more info using rasdaemon.
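
On NixOS, a minimal sketch of turning it on could look like this. It assumes the hardware.rasdaemon module is present in your release, so check the option against the options search before relying on it:

```nix
{
  # Assumption: the hardware.rasdaemon module exists in your NixOS release.
  # Runs rasdaemon as a service so machine check events get decoded and logged.
  hardware.rasdaemon.enable = true;
}
```

After a rebuild, `ras-mc-ctl --summary` and `ras-mc-ctl --errors` should show whatever events the daemon managed to record.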

What started in your first post as hardware error bank 0 is now hardware error bank 5 (reading your dmesg), after you swapped the RAM into other sockets…

So I would try to find out whether one of your RAM modules is faulty.

Maybe the one that is now in bank 5? (not sure if that error about bank # refers to RAM bank, but trying!)

As for this part of the dmesg: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)

You could try disabling that in the BIOS, if possible.

https://bbs.archlinux.org/viewtopic.php?id=252238
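
If the BIOS has no switch for it, a rough kernel-side alternative is to limit the idle states with a boot parameter. This is only a sketch: `processor.max_cstate` is a standard kernel parameter, but the value 1 is my guess at a starting point, and whether it helps with this particular freeze is untested.

```nix
{
  boot.kernelParams = [
    # Assumption: keeping the ACPI idle driver out of deep C-states
    # is worth testing here; the value 1 is only a starting point.
    "processor.max_cstate=1"
  ];
}
```

If you already define boot.kernelParams in the same file, add the string to that existing list instead of defining the option a second time.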