riza
May 21, 2025, 8:39pm
1
Hi all,
Kernel 6.14.7 seems to have introduced a bug with AMD GPUs that constantly locks up the whole system:
Mai 21 18:36:44 puffy kernel: amdgpu 0000:2b:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Mai 21 18:36:44 puffy kernel: amdgpu 0000:2b:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Mai 21 18:36:44 puffy kernel: amdgpu 0000:2b:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Mai 21 18:36:44 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000012 SMN_C2PMSG_82:0x00000005
Mai 21 18:36:49 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000012 SMN_C2PMSG_82:0x00000005
Mai 21 18:36:49 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: Failed to disable gfxoff!
Mai 21 18:36:54 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000012 SMN_C2PMSG_82:0x00000005
Mai 21 18:36:54 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
Mai 21 18:36:59 puffy kernel: watchdog: CPU2: Watchdog detected hard LOCKUP on cpu 2
Mai 21 18:36:59 puffy kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Mai 21 18:36:59 puffy kernel: rcu: 2-...0: (0 ticks this GP) idle=f0f4/1/0x4000000000000002 softirq=29712/29712 fqs=3888
Mai 21 18:36:59 puffy kernel: rcu: (detected by 15, t=21002 jiffies, g=57805, q=567 ncpus=16)
Mai 21 18:36:59 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000012 SMN_C2PMSG_82:0x00000005
Mai 21 18:37:04 puffy kernel: watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [.gnome-shell-wr:2539]
Mai 21 18:37:05 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000012 SMN_C2PMSG_82:0x00000005
Mai 21 18:37:05 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: Failed to disable gfxoff!
Mai 21 18:37:10 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000012 SMN_C2PMSG_82:0x00000005
Mai 21 18:37:15 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000012 SMN_C2PMSG_82:0x00000005
Mai 21 18:37:15 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
Mai 21 18:37:20 puffy kernel: watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [IPC Launch:3353]
Mai 21 18:37:20 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000012 SMN_C2PMSG_82:0x00000005
Mai 21 18:37:20 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: Failed to disable gfxoff!
Mai 21 18:37:26 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000012 SMN_C2PMSG_82:0x00000005
Mai 21 18:37:26 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
Mai 21 18:37:31 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000012 SMN_C2PMSG_82:0x00000005
Mai 21 18:37:32 puffy kernel: watchdog: BUG: soft lockup - CPU#4 stuck for 48s! [.gnome-shell-wr:2539]
Mai 21 18:37:36 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000012 SMN_C2PMSG_82:0x00000005
Mai 21 18:37:36 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: Failed to disable gfxoff!
Mai 21 18:37:37 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: ring gfx_0.1.0 timeout, signaled seq=7713, emitted seq=7714
Mai 21 18:37:37 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: Process information: process .gnome-shell-wr pid 2539 thread gnome-shel:cs0 pid 2575
Mai 21 18:37:37 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: Starting gfx_0.1.0 ring reset
Mai 21 18:37:37 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: Ring gfx_0.1.0 reset failure
Mai 21 18:37:42 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000012 SMN_C2PMSG_82:0x00000005
Mai 21 18:37:47 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000012 SMN_C2PMSG_82:0x00000005
Mai 21 18:37:47 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
See the issue "[drm] ERROR dc_dmub_srv_log_diagnostic_data: DMCUB error" on the freedesktop GitLab.
Reverting to 6.14.6 seems to mitigate the issue for now until 6.14.8 is available:
boot.kernelPackages = pkgs.linuxPackagesFor (pkgs.linux_6_14.override {
  argsOverride = rec {
    src = pkgs.fetchurl {
      url = "mirror://kernel/linux/kernel/v6.x/linux-${version}.tar.xz";
      sha256 = "sha256-IYF/GZjiIw+B9+T2Bfpv3LBA4U+ifZnCfdsWznSXl6k=";
    };
    version = "6.14.6";
    modDirVersion = "6.14.6";
  };
});
Anyone else with similar issues?
Please note that 6.14.7 was released to both 24.11 and 25.05.
I haven’t got anything useful to add but I can never resist a direct question, so, yes.
riza
May 23, 2025, 8:34pm
3
6.14.8 didn’t solve the issue. Staying with 6.14.6 for now:
Mai 23 20:18:46 puffy kernel: Linux version 6.14.8 (nixbld@localhost) (gcc (GCC) 14.2.1 20250322, GNU ld (GNU Binutils) 2.44) #1-NixOS SMP PREEMPT_DYNAMIC Thu May 22 12:31:58 UTC 20>
....
Mai 23 20:58:15 puffy kernel: amdgpu 0000:2b:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Mai 23 20:58:15 puffy kernel: amdgpu 0000:2b:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Mai 23 20:58:15 puffy kernel: amdgpu 0000:2b:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Mai 23 20:58:21 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000012 SMN_C2PMSG_82:0x00000005
Mai 23 20:58:26 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000012 SMN_C2PMSG_82:0x00000005
Mai 23 20:58:31 puffy kernel: amdgpu 0000:2b:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000012 SMN_C2PMSG_82:0x00000005
Switching to default kernel 6.12.30 seems to be stable. Kernel versions 6.12.27 and later, as well as 6.14.7 and later, seem to be affected.
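For reference, a minimal sketch of pinning the 6.12 series explicitly in configuration.nix. This assumes the nixpkgs revision in use provides `pkgs.linuxPackages_6_12`; the exact patch level you get (6.12.30 or otherwise) depends on your channel, so treat it as an illustration rather than the exact config used here.

```nix
{ pkgs, ... }:
{
  # Assumption: linuxPackages_6_12 exists in the nixpkgs revision in use.
  # Removing any boot.kernelPackages override instead falls back to the
  # release's default kernel.
  boot.kernelPackages = pkgs.linuxPackages_6_12;
}
```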
This may be what happened to my system: a couple of days ago I updated my system, which has an AMD GPU, and that pulled in linux-6.14.7. I had a couple of hangs immediately after. I haven’t had time to look into it yet, so I just rolled back (love NixOS for that, of course) and left the problem for later. So I don’t have anything useful to add (either), but thanks for revealing the probable cause and the workaround!
riza
May 25, 2025, 11:39am
6
I applied the revert patch from sochotnicky to my config, and the system is now running stably with both 6.12.30 and 6.14.8.
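For anyone wondering how a patch like this gets wired into a NixOS config, here is a minimal sketch using `boot.kernelPatches`. The label and the file name `./amdgpu-revert.patch` are placeholders and not the actual patch from sochotnicky; you would save the real revert patch next to your configuration and point `patch` at it.

```nix
{ ... }:
{
  boot.kernelPatches = [
    {
      # Hypothetical label; any descriptive name works.
      name = "amdgpu-revert";
      # Hypothetical path: a local copy of the revert patch.
      patch = ./amdgpu-revert.patch;
    }
  ];
}
```

Note that adding a kernel patch means the kernel is rebuilt locally instead of being fetched from the binary cache, so the first rebuild takes a while.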
Thank you for sharing the problem AND the solution!
Saved me a day of debugging.
Not the first time amdgpu gets broken and no one notices… why does this keep happening to it?
I used a different workaround; see "RX 6500 XT freezes the image after login - #3 by shackra".
Edit: I had my first freeze today, many hours after posting this reply:
may 26 22:41:39 woody kernel: amdgpu 0000:0c:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
may 26 22:41:39 woody kernel: amdgpu 0000:0c:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
may 26 22:41:39 woody kernel: amdgpu 0000:0c:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
may 26 22:41:44 woody kernel: amdgpu 0000:0c:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000029 SMN_C2PMSG_82:0x00000000
may 26 22:41:44 woody kernel: amdgpu 0000:0c:00.0: amdgpu: Failed to disable gfxoff!
may 26 22:41:48 woody kernel: watchdog: Watchdog detected hard LOCKUP on cpu 6
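So I implemented patching of the amdgpu driver instead: Usa el mejor "workaround" para evitar que el GPU se congele y falle (809ef6af) · Commits · Jorge Javier Araya Navarro / puntoarchivos · GitLab (the commit title translates to: Use the best "workaround" to keep the GPU from freezing and failing)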
For me, upgrading to 6.14.7 started to cause problems with amdgpu lockups. Initially I figured the trigger might have been Electron, but after almost 40 hours of debugging over the course of 4 days I’m a bit delirious.
The crashes were basically sudden power-offs, before which the system would start locking up Electron-based apps.
The fix above (Lockups with kernel 6.14.7 and AMD GPUs - #6 by riza) stabilized the situation. I still won’t be using Electron-based apps for the foreseeable future, until a fix is posted.
The insidious thing, at least on my end, is that 6.14.6 and 6.14.5 had been working fine, but once .7 hit they became affected as well, and the same happened when .8 dropped, seemingly due to a regressive change. I then went to openSUSE, which is on .6, but because of how tightly snapper and backups are integrated there, that screwed me over even more: the system became unrecoverable after the first lockup (though not before I spent 3 hours trying to fix it and patching with .15-rc6/7).
I then went to NixOS 24.11. Kernel 6.6 was working alright there, although I had one crash of the same nature because of a different issue; I forget why. That was only once in around 4 hours, though, as opposed to crashing at least every half an hour on 6.14.7 and 6.14.8.
Currently, I’m on unstable, with 6.14.8 as my kernel, and the fix implemented. Stable as of now.
# Kernel selection: flip the comment to switch
boot.kernelPackages = pkgs.linuxPackages_latest; # Switch to 6.14+
# boot.kernelPackages = pkgs.linuxPackages_testing; # 6.15-rc6: bleeding edge, might fix Vulkan crashes
# boot.kernelPackages = pkgs.linuxPackages_6_6; # 6.6 LTS: fallback if 6.15 explodes the machine
boot.kernelParams = [
  "mem_sleep_default=s2idle"
  "amdgpu.noretry=0"
  "amdgpu.vm_update_mode=3"
  "amdgpu.sg_display=0"
  "amdgpu.preempt_mm=0" # the supposed magic fix
];