Getting AMDGPU error that crashes desktop

I’ve been getting a driver error since switching to new hardware. I’ve never had to debug anything like this, so please ask if you need more details. Here’s some info:

$ uname -a
Linux big-system 6.6.44 #1-NixOS SMP PREEMPT_DYNAMIC Sat Aug  3 06:54:42 UTC 2024 x86_64 GNU/Linux
$ dmesg | rg amdgpu
[    0.000000] Command line: initrd=\EFI\nixos\i6l1b3gwjhmgqfha8wirqnwwi2d7z5lw-initrd-linux-6.6.44-initrd.efi init=/nix/store/abdplibma8crxqczj3n3nisq8qzkb8zs-nixos-system-big-system-24.05.20240810.a781ff3/init amdgpu.runpm=0 nohibernate loglevel=4
[    0.044886] Kernel command line: initrd=\EFI\nixos\i6l1b3gwjhmgqfha8wirqnwwi2d7z5lw-initrd-linux-6.6.44-initrd.efi init=/nix/store/abdplibma8crxqczj3n3nisq8qzkb8zs-nixos-system-big-system-24.05.20240810.a781ff3/init amdgpu.runpm=0 nohibernate loglevel=4
[    0.529091] stage-1-init: [Mon Aug 12 15:52:45 UTC 2024] loading module amdgpu...
[    2.956418] [drm] amdgpu kernel modesetting enabled.
[    2.956542] amdgpu: Virtual CRAT table created for CPU
[    2.956560] amdgpu: Topology: Add CPU node
[    2.960616] amdgpu 0000:2b:00.0: No more image in the PCI ROM
[    2.960634] amdgpu 0000:2b:00.0: amdgpu: Fetched VBIOS from ROM BAR
[    2.960639] amdgpu: ATOM BIOS: 115-D632BP2-100
[    2.986743] amdgpu 0000:2b:00.0: vgaarb: deactivate vga console
[    2.986746] amdgpu 0000:2b:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[    2.986816] amdgpu 0000:2b:00.0: amdgpu: VRAM: 4080M 0x0000008000000000 - 0x00000080FEFFFFFF (4080M used)
[    2.986819] amdgpu 0000:2b:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    2.986821] amdgpu 0000:2b:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[    2.986943] [drm] amdgpu: 4080M of VRAM memory ready
[    2.986945] [drm] amdgpu: 7970M of GTT memory ready.
[    4.884787] amdgpu 0000:2b:00.0: amdgpu: STB initialized to 2048 entries
[    4.885580] amdgpu 0000:2b:00.0: amdgpu: Will use PSP to load VCN firmware
[    5.053960] amdgpu 0000:2b:00.0: amdgpu: RAS: optional ras ta ucode is not available
[    5.069648] amdgpu 0000:2b:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[    5.069672] amdgpu 0000:2b:00.0: amdgpu: smu driver if version = 0x0000000d, smu fw if version = 0x00000010, smu fw program = 0, version = 0x00492400 (73.36.0)
[    5.069675] amdgpu 0000:2b:00.0: amdgpu: SMU driver if version not matched
[    5.069708] amdgpu 0000:2b:00.0: amdgpu: use vbios provided pptable
[    5.112009] amdgpu 0000:2b:00.0: amdgpu: SMU is initialized successfully!
[    5.166577] amdgpu: HMM registered 4080MB device memory
[    5.167775] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    5.167797] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[    5.167992] amdgpu: Virtual CRAT table created for GPU
[    5.168158] amdgpu: Topology: Add dGPU node [0x743f:0x1002]
[    5.168160] kfd kfd: amdgpu: added device 1002:743f
[    5.168180] amdgpu 0000:2b:00.0: amdgpu: SE 1, SH per SE 2, CU per SH 8, active_cu_number 12
[    5.169022] amdgpu 0000:2b:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[    5.169025] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[    5.169026] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[    5.169028] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[    5.169030] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[    5.169031] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[    5.169033] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[    5.169034] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[    5.169036] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[    5.169037] amdgpu 0000:2b:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
[    5.169039] amdgpu 0000:2b:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[    5.169041] amdgpu 0000:2b:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[    5.170539] [drm] Initialized amdgpu 3.54.0 20150101 for 0000:2b:00.0 on minor 1
[    5.176756] fbcon: amdgpudrmfb (fb0) is primary device
[    5.268724] amdgpu 0000:2b:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[   10.829284] snd_hda_intel 0000:2b:00.1: bound 0000:2b:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
[  273.406616] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  273.406641] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x000080019560e000 from client 0x1b (UTCL2)
[  273.406645] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701031
[  273.406649] amdgpu 0000:2b:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
[  273.406652] amdgpu 0000:2b:00.0: amdgpu: 	 MORE_FAULTS: 0x1
[  273.406656] amdgpu 0000:2b:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  273.406658] amdgpu 0000:2b:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[  273.406661] amdgpu 0000:2b:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  273.406664] amdgpu 0000:2b:00.0: amdgpu: 	 RW: 0x0
[  273.406675] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  273.406680] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x0000800195612000 from client 0x1b (UTCL2)
[  273.406684] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  273.406686] amdgpu 0000:2b:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[  273.406689] amdgpu 0000:2b:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  273.406692] amdgpu 0000:2b:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  273.406694] amdgpu 0000:2b:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  273.406696] amdgpu 0000:2b:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  273.406700] amdgpu 0000:2b:00.0: amdgpu: 	 RW: 0x0
[  273.406706] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  273.406710] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x000080050560a000 from client 0x1b (UTCL2)
[  273.406714] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  273.406717] amdgpu 0000:2b:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[  273.406720] amdgpu 0000:2b:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  273.406722] amdgpu 0000:2b:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  273.406725] amdgpu 0000:2b:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  273.406728] amdgpu 0000:2b:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  273.406730] amdgpu 0000:2b:00.0: amdgpu: 	 RW: 0x0
[  273.406737] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  273.406741] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x0000800505606000 from client 0x1b (UTCL2)
[  273.406743] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  273.406746] amdgpu 0000:2b:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[  273.406749] amdgpu 0000:2b:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  273.406752] amdgpu 0000:2b:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  273.406755] amdgpu 0000:2b:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  273.406757] amdgpu 0000:2b:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  273.406760] amdgpu 0000:2b:00.0: amdgpu: 	 RW: 0x0
[  283.724101] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  283.724123] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x000080050560a000 from client 0x1b (UTCL2)
[  283.724128] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701031
[  283.724131] amdgpu 0000:2b:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
[  283.724134] amdgpu 0000:2b:00.0: amdgpu: 	 MORE_FAULTS: 0x1
[  283.724137] amdgpu 0000:2b:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  283.724139] amdgpu 0000:2b:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[  283.724141] amdgpu 0000:2b:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  283.724143] amdgpu 0000:2b:00.0: amdgpu: 	 RW: 0x0
[  283.724153] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  283.724157] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x0000800505606000 from client 0x1b (UTCL2)
[  283.724161] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  283.724163] amdgpu 0000:2b:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[  283.724166] amdgpu 0000:2b:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  283.724168] amdgpu 0000:2b:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  283.724170] amdgpu 0000:2b:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  283.724172] amdgpu 0000:2b:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  283.724175] amdgpu 0000:2b:00.0: amdgpu: 	 RW: 0x0
[  283.724181] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  283.724185] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x000080019560e000 from client 0x1b (UTCL2)
[  283.724188] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  283.724191] amdgpu 0000:2b:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[  283.724194] amdgpu 0000:2b:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  283.724196] amdgpu 0000:2b:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  283.724199] amdgpu 0000:2b:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  283.724202] amdgpu 0000:2b:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  283.724204] amdgpu 0000:2b:00.0: amdgpu: 	 RW: 0x0
[  283.724211] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  283.724216] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x0000800195612000 from client 0x1b (UTCL2)
[  283.724219] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  283.724221] amdgpu 0000:2b:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[  283.724224] amdgpu 0000:2b:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  283.724226] amdgpu 0000:2b:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  283.724228] amdgpu 0000:2b:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  283.724230] amdgpu 0000:2b:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  283.724232] amdgpu 0000:2b:00.0: amdgpu: 	 RW: 0x0
[  283.724239] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  283.724242] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x0000800195612000 from client 0x1b (UTCL2)
[  283.724244] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  283.724246] amdgpu 0000:2b:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[  283.724248] amdgpu 0000:2b:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  283.724250] amdgpu 0000:2b:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  283.724252] amdgpu 0000:2b:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  283.724254] amdgpu 0000:2b:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  283.724255] amdgpu 0000:2b:00.0: amdgpu: 	 RW: 0x0
[  283.724262] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  283.724265] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x0000800195612000 from client 0x1b (UTCL2)
[  283.724267] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  283.724269] amdgpu 0000:2b:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[  283.724271] amdgpu 0000:2b:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  283.724272] amdgpu 0000:2b:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  283.724274] amdgpu 0000:2b:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  283.724276] amdgpu 0000:2b:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  283.724278] amdgpu 0000:2b:00.0: amdgpu: 	 RW: 0x0
[  283.733953] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=49294, emitted seq=49296
[  283.734711] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072
[  283.735063] amdgpu 0000:2b:00.0: amdgpu: GPU reset begin!
[  283.916028] amdgpu 0000:2b:00.0: amdgpu: MODE1 reset
[  283.916038] amdgpu 0000:2b:00.0: amdgpu: GPU mode1 reset
[  283.916121] amdgpu 0000:2b:00.0: amdgpu: GPU smu mode1 reset
[  284.420071] amdgpu 0000:2b:00.0: amdgpu: GPU reset succeeded, trying to resume
[  284.600682] amdgpu 0000:2b:00.0: amdgpu: RAS: optional ras ta ucode is not available
[  284.616890] amdgpu 0000:2b:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[  284.616933] amdgpu 0000:2b:00.0: amdgpu: SMU is resuming...
[  284.616940] amdgpu 0000:2b:00.0: amdgpu: smu driver if version = 0x0000000d, smu fw if version = 0x00000010, smu fw program = 0, version = 0x00492400 (73.36.0)
[  284.616945] amdgpu 0000:2b:00.0: amdgpu: SMU driver if version not matched
[  284.616980] amdgpu 0000:2b:00.0: amdgpu: use vbios provided pptable
[  284.661035] amdgpu 0000:2b:00.0: amdgpu: SMU is resumed successfully!
[  284.744332] amdgpu 0000:2b:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[  284.744336] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[  284.744339] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[  284.744342] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[  284.744344] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[  284.744347] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[  284.744349] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[  284.744352] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[  284.744354] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[  284.744356] amdgpu 0000:2b:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
[  284.744359] amdgpu 0000:2b:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[  284.744361] amdgpu 0000:2b:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[  284.747468] amdgpu 0000:2b:00.0: amdgpu: recover vram bo from shadow start
[  284.752213] amdgpu 0000:2b:00.0: amdgpu: recover vram bo from shadow done
[  284.752276] amdgpu 0000:2b:00.0: amdgpu: GPU reset(2) succeeded!
[  284.779374] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

I don’t think anyone not employed at AMD can help you here.

You’re probably right, but given that this is an in-kernel driver, a more recent (or older?) kernel may not have whatever causes this.

Try setting one with boot.kernelPackages.

Okay, thank you for the advice. I’ve switched to 6.8 for now.

@Mr-Andersen did that end up solving it for you? Still seeing this issue on the latest kernel…

If my sleuthing isn’t off, it looks like this is fixed starting with kernel 6.7.2: “NULL pointer dereference in dma_resv_add_fence” (#2991) · Issues · drm / amd · GitLab
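To check whether a running kernel is at or past that release, a small version comparison can help. This is only a sketch: `version_ge` is a hypothetical helper name, it relies on `sort -V` from GNU coreutils, and 6.7.2 is taken from the guess above rather than from a confirmed fix.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Hypothetical helper; relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: 6.7.2 is the first release with the fix, as guessed in this post.
fixed_in="6.7.2"
running="$(uname -r)"

if version_ge "$running" "$fixed_in"; then
    echo "kernel $running is at or above $fixed_in"
else
    echo "kernel $running is older than $fixed_in"
fi
```

With the 6.6.44 kernel from the first post this would report an older kernel. Passing the check only means the kernel is new enough to include that particular fix, not that the page faults are gone.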

It seems like it was indeed fixed in the kernel. I am now at commit 06cf0e1da4208d3766d898b7fdab6513366d45b9 (Linux version 6.10.11) and am having no issues.

I don’t think I’ve noticed it since updating. Glad it was fixed!!! Thanks for the useful post.

And just an hour after writing the previous comment, I got my first page fault, just like before. :+1:

Could this issue be caused by wezterm or does it happen when using other applications as well?

It happens in different applications. The last instance happened in xfwm4, and before that I saw it in X.

The issue might be related to this one: “by default the kernel sets a maximum gpu clock that exceeds the manufacturers specifications, causing hardware crashes” (#3131) · Issues · drm / amd · GitLab

You can monitor your GPU with LACT or nvtop and check whether its clock frequency exceeds the maximum the manufacturer specifies. If that turns out to be the case, you could use LACT to lower the max frequency (and power?) as was suggested in the thread.
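If you would rather check from a terminal, the amdgpu driver also lists its shader clock levels in the sysfs file `pp_dpm_sclk`, with the level currently in use marked by `*`. Below is a rough sketch, assuming that file exists for your card and uses the usual `N: <freq>Mhz *` layout. `active_sclk` is a made-up helper name, and the card index varies between systems, so the loop just tries every card.

```shell
#!/usr/bin/env bash
# active_sclk: read pp_dpm_sclk-style text on stdin and print the MHz value
# of the level marked with "*" (the level currently in use).
# Hypothetical helper; assumes lines shaped like "1: 2400Mhz *".
active_sclk() {
    awk '/\*/ { gsub(/[^0-9]/, "", $2); print $2 }'
}

# Check every DRM card that exposes the file; the card index differs per system.
for f in /sys/class/drm/card*/device/pp_dpm_sclk; do
    [ -r "$f" ] || continue
    echo "$f: $(active_sclk < "$f") MHz"
done
```

Run it while the GPU is under load and compare the number against the boost clock the manufacturer lists for your card. A single reading is only a snapshot, so LACT or nvtop are still the better choice for watching the clock over time.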