Continuous Freezes and Machine Check Error

heph · December 19, 2021, 7:47am

Hi!
I’ve been having PC problems for a few months now.
It freezes constantly (about once a day, in a “non-deterministic” way), with practically no logs about it.
It freezes completely and I have to reboot. Other times, it restarts by itself, and just before it asks me for the LUKS password I see the following errors:

[Hardware Error]: CPU 3: Machine Check: 0 Bank 0...
[Hardware Error]: TSC 0 MISC
[Hardware Error]: PROCESSOR 2:800F11 TIME 1639044401 SOCKET 0 APIC 8 microcode 8001130
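
The last line of that report can be unpacked a little. Here is a minimal Python sketch, assuming the field layout the Linux kernel prints on that line (vendor:CPUID signature, then a Unix timestamp in seconds) and the kernel's vendor numbering (0 = Intel, 2 = AMD). The helper names `decode_cpuid` and `parse_mce_line` are made up for illustration:

```python
import re
from datetime import datetime, timezone

# Vendor numbers as used by the Linux kernel (subset; assumption: 0 = Intel, 2 = AMD).
VENDORS = {0: "Intel", 2: "AMD"}

# Matches the "PROCESSOR ..." line of a machine check report.
PROCESSOR_RE = re.compile(
    r"PROCESSOR (\d+):([0-9a-fA-F]+) TIME (\d+) SOCKET (\d+) "
    r"APIC ([0-9a-fA-F]+) microcode ([0-9a-fA-F]+)"
)


def decode_cpuid(eax):
    """Split a CPUID signature (leaf 1, EAX) into family, model and stepping."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF
    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model supplies the high nibble for base families 0x6 and 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


def parse_mce_line(line):
    """Return the decoded fields of a PROCESSOR line, or None if it does not match."""
    m = PROCESSOR_RE.search(line)
    if m is None:
        return None
    vendor_id, cpuid, ts, socket, apic, ucode = m.groups()
    family, model, stepping = decode_cpuid(int(cpuid, 16))
    return {
        "vendor": VENDORS.get(int(vendor_id), "unknown"),
        "family": family,
        "model": model,
        "stepping": stepping,
        # TIME is seconds since the Unix epoch; shown here in UTC.
        "time": datetime.fromtimestamp(int(ts), tz=timezone.utc),
        "socket": int(socket),
        "apic": int(apic, 16),
        "microcode": int(ucode, 16),
    }


if __name__ == "__main__":
    line = ("[Hardware Error]: PROCESSOR 2:800F11 TIME 1639044401 "
            "SOCKET 0 APIC 8 microcode 8001130")
    for key, value in parse_mce_line(line).items():
        print(key, value)
```

For the line above this yields vendor AMD, family 0x17, model 0x01, stepping 1, logged at 2021-12-09 10:06:41 UTC. If the decoding assumptions hold, that points to a first-generation Zen CPU.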

I guess the problem lies with AMD… I have these settings in configuration.nix:


  # Excerpt from configuration.nix, written with full option paths
  boot.kernelParams = [ "amdgpu.ppfeaturemask=0xffffbffb" ];

  hardware = {
    enableRedistributableFirmware = true;
    cpu.amd.updateMicrocode = true;
    opengl.extraPackages = with pkgs; [
      amdvlk
    ];
    ## enable vulkan
    opengl.driSupport = true;
    opengl.driSupport32Bit = true;
  };

  services.xserver.videoDrivers = [ "amdgpu" ];

I’ve also tried different kernel versions, but the issue is still the same (currently on 5.15.4).

M12 · December 19, 2021, 12:18pm

a bank is mentioned, maybe RAM error?

similar? mce: hardware error cpu 0 machine check 0 bank 0. System just freezes, everything just stops including mouse on Linux Mint XFCE 20 - Unix & Linux Stack Exchange

TLATER · December 19, 2021, 12:23pm

After a bit of googling I stumbled across this via a thread over on archland discussing similar issues: GitHub - suaefar/ryzen-test: Tools to reproduce randomly crashing processes under load on AMD Ryzen processors on Linux

Seems problems like this are a known issue with some AMD CPUs, and people are getting them RMA’d. I’d give that test a go and, if it comes back positive, check with support. Good luck!

heph · December 19, 2021, 12:38pm

I did a memtest a while ago and it was okay.

M12 · December 19, 2021, 12:44pm

OK. Have you already tried resetting the BIOS to factory defaults?

Is there a newer BIOS?

CPU too hot?

And disable the microcode line in configuration.nix?

heph · December 19, 2021, 1:00pm

I have to try things one at a time… I’ve never tried removing the microcode line.
I’ve also never reset the BIOS to factory defaults, and I just saw there’s a new BIOS version, BUT it’s in beta…

M12 · December 19, 2021, 1:02pm

Resetting the BIOS to factory defaults is probably the easiest thing to try?

Does the error you see also occur when you switch the PC on after it has been off for a long time (when it is cold)? Or does it only happen after the PC has been in use for a while (warm or hot)?

You say the problems have been there for a few months now. Were there any (hardware, software, environmental) changes around then?

heph · December 19, 2021, 2:17pm

I’ve updated the BIOS to the latest version available (not the beta). The temperatures of both CPU and GPU are totally fine. And the freezes happen totally at random, even when I’ve just turned on the PC.

M12 · December 19, 2021, 9:07pm

If there are 2 RAM banks, could you swap the memory modules? Just to see if the error changes… If there are more banks swap them too, so all of them take another place

heph · December 20, 2021, 5:07pm

Just tried. And 5 min ago, another crash. I’ll share what dmesg shows: https://clbin.com/MM8Np

TLATER · December 20, 2021, 5:17pm

You can get more info using rasdaemon.

M12 · December 20, 2021, 7:15pm

What started in your first post as a hardware error in bank 0 is now a hardware error in bank 5 (reading your dmesg), after you swapped the RAM into other sockets…

So I would try to see if one of your RAM modules is faulty?

Maybe the one that is now in bank 5? (not sure if that error about bank # refers to RAM bank, but trying!)
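
On the bank question: in a machine check line, "Bank" names one of the CPU's machine-check register banks rather than a DIMM slot, so it is worth checking whether the number tracks the RAM swap or simply varies from crash to crash. A rough Python sketch for tallying CPU/bank pairs across saved logs, assuming the `CPU n: Machine Check: ... Bank m` line format shown in the first post (`tally_banks` is a hypothetical helper name):

```python
import re
from collections import Counter

# Matches e.g. "[Hardware Error]: CPU 3: Machine Check: 0 Bank 0..."
# (older kernels print "Machine Check Exception", so that word is optional).
MCE_RE = re.compile(r"CPU (\d+): Machine Check(?: Exception)?: \S+ Bank (\d+)")


def tally_banks(log_text):
    """Count machine check reports per (cpu, bank) pair in a block of log text."""
    counts = Counter()
    for match in MCE_RE.finditer(log_text):
        cpu, bank = int(match.group(1)), int(match.group(2))
        counts[(cpu, bank)] += 1
    return counts
```

It could be fed saved `dmesg` output from several crashes, for example `tally_banks(open("dmesg.txt").read())`, where `dmesg.txt` is a placeholder file name. If the same bank keeps showing up regardless of where the modules sit, that would argue against a single bad DIMM.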

M12 · December 20, 2021, 8:13pm

for the dmesg part: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)

you could try disabling that in the BIOS if possible

https://bbs.archlinux.org/viewtopic.php?id=252238