AMD APU: graphics driver resets periodically on Wayland after upgrading to 25.11

Symptom: The whole screen freezes for a few seconds, goes black briefly, then starts working again. KWin shows a notification about having to restart graphical effects due to a GPU reset.

The dmesg output from when it happens is below.

The easiest way to reproduce it is to open a maximized terminal (I tried kitty, xterm, and konsole) and run base64 on a large file to generate a lot of terminal output (1GB seems to be enough). It also happens every few hours under normal usage without doing this.
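
For anyone who wants to reproduce this without a large file on hand, here is a rough Python sketch of the same idea: it base64-encodes random bytes and writes the text to an output stream. The function name `flood` and its parameters are made up for illustration; only the approach (a large volume of base64 text in a maximized terminal) comes from the report above, and I have not confirmed that this variant triggers the reset.

```python
import base64
import os


def flood(total_bytes, out, chunk_size=57 * 1024):
    """Write at least total_bytes of base64 text to the binary stream `out`.

    chunk_size is a multiple of 57 so that every emitted line is a full
    76 characters, similar to the default wrapping of base64(1).
    Returns the number of bytes written.
    """
    written = 0
    while written < total_bytes:
        raw = os.urandom(chunk_size)        # random input instead of a large file
        text = base64.encodebytes(raw)      # 76-character lines, newline-terminated
        out.write(text)
        written += len(text)
    return written
```

Calling `flood(1 << 30, sys.stdout.buffer)` (after `import sys`) from a script run in a maximized terminal would emit about 1 GiB of output, which matches the amount that seemed to be enough above.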

When I run the above test, kitty and xterm both crash when the reset happens, but konsole seems to recover.

Hardware: AMD Ryzen 7 9700X, using the on-chip GPU.

dmesg logs when it happens (when running a non-GL program, kwin-wayland is logged instead of the kitty executable):

[94746.288695] amdgpu 0000:0c:00.0: amdgpu: Dumping IP State
[94746.289609] amdgpu 0000:0c:00.0: amdgpu: Dumping IP State Completed
[94746.299659] amdgpu 0000:0c:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=2812581, emitted seq=2812583
[94746.299664] amdgpu 0000:0c:00.0: amdgpu: Process information: process .kitty-wrapped pid 451698 thread kitty:cs0 pid 451700
[94746.478056] amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!
[94746.606830] amdgpu 0000:0c:00.0: amdgpu: MODE2 reset
[94746.614308] amdgpu 0000:0c:00.0: amdgpu: GPU reset succeeded, trying to resume
[94746.614596] [drm] PCIE GART of 1024M enabled (table at 0x000000F47FC00000).
[94746.614614] amdgpu 0000:0c:00.0: amdgpu: PSP is resuming...
[94746.636187] amdgpu 0000:0c:00.0: amdgpu: reserve 0xa00000 from 0xf47e000000 for PSP TMR
[94746.836187] amdgpu 0000:0c:00.0: amdgpu: RAS: optional ras ta ucode is not available
[94746.841884] amdgpu 0000:0c:00.0: amdgpu: RAP: optional rap ta ucode is not available
[94746.841886] amdgpu 0000:0c:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[94746.841889] amdgpu 0000:0c:00.0: amdgpu: SMU is resuming...
[94746.842204] amdgpu 0000:0c:00.0: amdgpu: SMU is resumed successfully!
[94746.842390] [drm] kiq ring mec 2 pipe 1 q 0
[94746.845647] [drm] DMUB hardware initialized: version=0x05002C00
[94746.933911] amdgpu 0000:0c:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[94746.933915] amdgpu 0000:0c:00.0: amdgpu: ring gfx_0.1.0 uses VM inv eng 1 on hub 0
[94746.933915] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 4 on hub 0
[94746.933916] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 5 on hub 0
[94746.933917] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[94746.933917] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[94746.933918] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[94746.933918] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[94746.933919] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[94746.933919] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[94746.933920] amdgpu 0000:0c:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 12 on hub 0
[94746.933921] amdgpu 0000:0c:00.0: amdgpu: ring sdma0 uses VM inv eng 13 on hub 0
[94746.933921] amdgpu 0000:0c:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[94746.933922] amdgpu 0000:0c:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
[94746.933922] amdgpu 0000:0c:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
[94746.933923] amdgpu 0000:0c:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[94746.935759] amdgpu 0000:0c:00.0: amdgpu: GPU reset(17) succeeded!
[94746.942425] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
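
Since this recurs every few hours, it may help to pull just the reset events out of a long dmesg capture. The following is a minimal sketch that assumes the message formats are exactly as in the log above (they may differ between kernel versions); the function name `summarize` and the regular expressions are hypothetical helpers, not part of any amdgpu tooling.

```python
import re

# Patterns modelled on the log lines above; treat them as assumptions.
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+amdgpu (?P<dev>\S+): amdgpu: "
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
PROCESS_RE = re.compile(r"Process information: process (?P<proc>\S+) pid (?P<pid>\d+)")
RESET_OK_RE = re.compile(r"\[\s*(?P<ts>\d+\.\d+)\].*GPU reset\((?P<count>\d+)\) succeeded!")


def summarize(lines):
    """Return one dict per ring timeout found in an iterable of dmesg lines."""
    events = []
    current = None
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            current = {
                "timestamp": float(m["ts"]),
                "device": m["dev"],
                "ring": m["ring"],
                # Submissions emitted to the ring but not yet signaled.
                "pending": int(m["emitted"]) - int(m["signaled"]),
                "process": None,
                "recovered_after": None,
            }
            events.append(current)
            continue
        if current is None:
            continue
        m = PROCESS_RE.search(line)
        if m:
            current["process"] = m["proc"]
        m = RESET_OK_RE.search(line)
        if m:
            # Seconds between the timeout message and the final reset message.
            current["recovered_after"] = float(m["ts"]) - current["timestamp"]
            current = None
    return events
```

Feeding it the lines of a saved dmesg capture (for example `summarize(open("dmesg.txt"))`, where `dmesg.txt` is a placeholder file name) gives a list with the ring, the offending process, and the recovery time for each reset.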

Hi there. This looks like a regression introduced in the latest linux-firmware update for amdgpu (https://gitlab.freedesktop.org/drm/amd/-/issues/4737). It has been reverted upstream, and there is also an issue open for it in Nixpkgs, which lists some workarounds for now.

Thank you so much; your bug-locating skills are superior to mine.