Getting AMDGPU error that crashes desktop

I’ve been getting a driver error after switching to new hardware. I’ve never had to debug anything like this, so please ask if you need more details. Here’s some info:

$ uname -a
Linux big-system 6.6.44 #1-NixOS SMP PREEMPT_DYNAMIC Sat Aug  3 06:54:42 UTC 2024 x86_64 GNU/Linux
$ dmesg | rg amdgpu
[    0.000000] Command line: initrd=\EFI\nixos\i6l1b3gwjhmgqfha8wirqnwwi2d7z5lw-initrd-linux-6.6.44-initrd.efi init=/nix/store/abdplibma8crxqczj3n3nisq8qzkb8zs-nixos-system-big-system-24.05.20240810.a781ff3/init amdgpu.runpm=0 nohibernate loglevel=4
[    0.044886] Kernel command line: initrd=\EFI\nixos\i6l1b3gwjhmgqfha8wirqnwwi2d7z5lw-initrd-linux-6.6.44-initrd.efi init=/nix/store/abdplibma8crxqczj3n3nisq8qzkb8zs-nixos-system-big-system-24.05.20240810.a781ff3/init amdgpu.runpm=0 nohibernate loglevel=4
[    0.529091] stage-1-init: [Mon Aug 12 15:52:45 UTC 2024] loading module amdgpu...
[    2.956418] [drm] amdgpu kernel modesetting enabled.
[    2.956542] amdgpu: Virtual CRAT table created for CPU
[    2.956560] amdgpu: Topology: Add CPU node
[    2.960616] amdgpu 0000:2b:00.0: No more image in the PCI ROM
[    2.960634] amdgpu 0000:2b:00.0: amdgpu: Fetched VBIOS from ROM BAR
[    2.960639] amdgpu: ATOM BIOS: 115-D632BP2-100
[    2.986743] amdgpu 0000:2b:00.0: vgaarb: deactivate vga console
[    2.986746] amdgpu 0000:2b:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[    2.986816] amdgpu 0000:2b:00.0: amdgpu: VRAM: 4080M 0x0000008000000000 - 0x00000080FEFFFFFF (4080M used)
[    2.986819] amdgpu 0000:2b:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    2.986821] amdgpu 0000:2b:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[    2.986943] [drm] amdgpu: 4080M of VRAM memory ready
[    2.986945] [drm] amdgpu: 7970M of GTT memory ready.
[    4.884787] amdgpu 0000:2b:00.0: amdgpu: STB initialized to 2048 entries
[    4.885580] amdgpu 0000:2b:00.0: amdgpu: Will use PSP to load VCN firmware
[    5.053960] amdgpu 0000:2b:00.0: amdgpu: RAS: optional ras ta ucode is not available
[    5.069648] amdgpu 0000:2b:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[    5.069672] amdgpu 0000:2b:00.0: amdgpu: smu driver if version = 0x0000000d, smu fw if version = 0x00000010, smu fw program = 0, version = 0x00492400 (73.36.0)
[    5.069675] amdgpu 0000:2b:00.0: amdgpu: SMU driver if version not matched
[    5.069708] amdgpu 0000:2b:00.0: amdgpu: use vbios provided pptable
[    5.112009] amdgpu 0000:2b:00.0: amdgpu: SMU is initialized successfully!
[    5.166577] amdgpu: HMM registered 4080MB device memory
[    5.167775] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    5.167797] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[    5.167992] amdgpu: Virtual CRAT table created for GPU
[    5.168158] amdgpu: Topology: Add dGPU node [0x743f:0x1002]
[    5.168160] kfd kfd: amdgpu: added device 1002:743f
[    5.168180] amdgpu 0000:2b:00.0: amdgpu: SE 1, SH per SE 2, CU per SH 8, active_cu_number 12
[    5.169022] amdgpu 0000:2b:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[    5.169025] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[    5.169026] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[    5.169028] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[    5.169030] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[    5.169031] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[    5.169033] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[    5.169034] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[    5.169036] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[    5.169037] amdgpu 0000:2b:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
[    5.169039] amdgpu 0000:2b:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[    5.169041] amdgpu 0000:2b:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[    5.170539] [drm] Initialized amdgpu 3.54.0 20150101 for 0000:2b:00.0 on minor 1
[    5.176756] fbcon: amdgpudrmfb (fb0) is primary device
[    5.268724] amdgpu 0000:2b:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[   10.829284] snd_hda_intel 0000:2b:00.1: bound 0000:2b:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
[  273.406616] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  273.406641] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x000080019560e000 from client 0x1b (UTCL2)
[  273.406645] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701031
[  273.406649] amdgpu 0000:2b:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
[  273.406652] amdgpu 0000:2b:00.0: amdgpu: 	 MORE_FAULTS: 0x1
[  273.406656] amdgpu 0000:2b:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  273.406658] amdgpu 0000:2b:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[  273.406661] amdgpu 0000:2b:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  273.406664] amdgpu 0000:2b:00.0: amdgpu: 	 RW: 0x0
[  273.406675] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  273.406680] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x0000800195612000 from client 0x1b (UTCL2)
[  273.406684] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  273.406686] amdgpu 0000:2b:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[  273.406689] amdgpu 0000:2b:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  273.406692] amdgpu 0000:2b:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  273.406694] amdgpu 0000:2b:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  273.406696] amdgpu 0000:2b:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  273.406700] amdgpu 0000:2b:00.0: amdgpu: 	 RW: 0x0
[  273.406706] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  273.406710] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x000080050560a000 from client 0x1b (UTCL2)
[  273.406714] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  273.406717] amdgpu 0000:2b:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[  273.406720] amdgpu 0000:2b:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  273.406722] amdgpu 0000:2b:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  273.406725] amdgpu 0000:2b:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  273.406728] amdgpu 0000:2b:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  273.406730] amdgpu 0000:2b:00.0: amdgpu: 	 RW: 0x0
[  273.406737] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  273.406741] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x0000800505606000 from client 0x1b (UTCL2)
[  273.406743] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  273.406746] amdgpu 0000:2b:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[  273.406749] amdgpu 0000:2b:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  273.406752] amdgpu 0000:2b:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  273.406755] amdgpu 0000:2b:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  273.406757] amdgpu 0000:2b:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  273.406760] amdgpu 0000:2b:00.0: amdgpu: 	 RW: 0x0
[  283.724101] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  283.724123] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x000080050560a000 from client 0x1b (UTCL2)
[  283.724128] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701031
[  283.724131] amdgpu 0000:2b:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
[  283.724134] amdgpu 0000:2b:00.0: amdgpu: 	 MORE_FAULTS: 0x1
[  283.724137] amdgpu 0000:2b:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  283.724139] amdgpu 0000:2b:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[  283.724141] amdgpu 0000:2b:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  283.724143] amdgpu 0000:2b:00.0: amdgpu: 	 RW: 0x0
[  283.724153] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  283.724157] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x0000800505606000 from client 0x1b (UTCL2)
[  283.724161] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  283.724163] amdgpu 0000:2b:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[  283.724166] amdgpu 0000:2b:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  283.724168] amdgpu 0000:2b:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  283.724170] amdgpu 0000:2b:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  283.724172] amdgpu 0000:2b:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  283.724175] amdgpu 0000:2b:00.0: amdgpu: 	 RW: 0x0
[  283.724181] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  283.724185] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x000080019560e000 from client 0x1b (UTCL2)
[  283.724188] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  283.724191] amdgpu 0000:2b:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[  283.724194] amdgpu 0000:2b:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  283.724196] amdgpu 0000:2b:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  283.724199] amdgpu 0000:2b:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  283.724202] amdgpu 0000:2b:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  283.724204] amdgpu 0000:2b:00.0: amdgpu: 	 RW: 0x0
[  283.724211] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  283.724216] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x0000800195612000 from client 0x1b (UTCL2)
[  283.724219] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  283.724221] amdgpu 0000:2b:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[  283.724224] amdgpu 0000:2b:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  283.724226] amdgpu 0000:2b:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  283.724228] amdgpu 0000:2b:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  283.724230] amdgpu 0000:2b:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  283.724232] amdgpu 0000:2b:00.0: amdgpu: 	 RW: 0x0
[  283.724239] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  283.724242] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x0000800195612000 from client 0x1b (UTCL2)
[  283.724244] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  283.724246] amdgpu 0000:2b:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[  283.724248] amdgpu 0000:2b:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  283.724250] amdgpu 0000:2b:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  283.724252] amdgpu 0000:2b:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  283.724254] amdgpu 0000:2b:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  283.724255] amdgpu 0000:2b:00.0: amdgpu: 	 RW: 0x0
[  283.724262] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  283.724265] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x0000800195612000 from client 0x1b (UTCL2)
[  283.724267] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  283.724269] amdgpu 0000:2b:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[  283.724271] amdgpu 0000:2b:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  283.724272] amdgpu 0000:2b:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  283.724274] amdgpu 0000:2b:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  283.724276] amdgpu 0000:2b:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  283.724278] amdgpu 0000:2b:00.0: amdgpu: 	 RW: 0x0
[  283.733953] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=49294, emitted seq=49296
[  283.734711] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072
[  283.735063] amdgpu 0000:2b:00.0: amdgpu: GPU reset begin!
[  283.916028] amdgpu 0000:2b:00.0: amdgpu: MODE1 reset
[  283.916038] amdgpu 0000:2b:00.0: amdgpu: GPU mode1 reset
[  283.916121] amdgpu 0000:2b:00.0: amdgpu: GPU smu mode1 reset
[  284.420071] amdgpu 0000:2b:00.0: amdgpu: GPU reset succeeded, trying to resume
[  284.600682] amdgpu 0000:2b:00.0: amdgpu: RAS: optional ras ta ucode is not available
[  284.616890] amdgpu 0000:2b:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[  284.616933] amdgpu 0000:2b:00.0: amdgpu: SMU is resuming...
[  284.616940] amdgpu 0000:2b:00.0: amdgpu: smu driver if version = 0x0000000d, smu fw if version = 0x00000010, smu fw program = 0, version = 0x00492400 (73.36.0)
[  284.616945] amdgpu 0000:2b:00.0: amdgpu: SMU driver if version not matched
[  284.616980] amdgpu 0000:2b:00.0: amdgpu: use vbios provided pptable
[  284.661035] amdgpu 0000:2b:00.0: amdgpu: SMU is resumed successfully!
[  284.744332] amdgpu 0000:2b:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[  284.744336] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[  284.744339] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[  284.744342] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[  284.744344] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[  284.744347] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[  284.744349] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[  284.744352] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[  284.744354] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[  284.744356] amdgpu 0000:2b:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
[  284.744359] amdgpu 0000:2b:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[  284.744361] amdgpu 0000:2b:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[  284.747468] amdgpu 0000:2b:00.0: amdgpu: recover vram bo from shadow start
[  284.752213] amdgpu 0000:2b:00.0: amdgpu: recover vram bo from shadow done
[  284.752276] amdgpu 0000:2b:00.0: amdgpu: GPU reset(2) succeeded!
[  284.779374] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

I don’t think anyone not employed at AMD can help you here.


You’re probably right, but since this is an in-kernel driver, a different kernel version may not have the bug: a more recent one may have fixed it, or an older one may predate it.

Try setting one with boot.kernelPackages.
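
For reference, a minimal sketch of what that could look like in configuration.nix. It assumes you want the newest kernel packaged in your current channel; `pkgs.linuxPackages_latest` is a standard nixpkgs attribute, while version-pinned attributes (for example `pkgs.linuxPackages_6_8`) only exist while that kernel series is still packaged in your channel, so treat any specific version name as an assumption to check.

```nix
# configuration.nix (fragment)
{ pkgs, ... }:
{
  # Replace the default kernel (6.6.44 in the log above) with the
  # newest one available in the current nixpkgs channel.
  boot.kernelPackages = pkgs.linuxPackages_latest;
}
```

After changing this, something like `sudo nixos-rebuild boot` followed by a reboot should bring the system up on the new kernel, and `uname -r` will confirm which version is actually running.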

Okay, thank you for the advice. I’ve switched to 6.8 for now.