Yet another GCVM_L2_PROTECTION_FAULT_STATUS problem

hi every1!!! noob here… :sob:

ill try to be very short and very clear:

nixOS is installed on a steam deck LCD and i love it, btw. the system i mean, the device is a PITA, but thats beside the point. so anyway, i keep having this page fault every time i… simply exist - open wezterm, open mpv, open librewolf - ANYTHING that uses the GPU, except, INTERESTINGLY, games! the games dont crash. in fact, they are the only safe environment/condition where i can guarantee my deck isnt gonna crash.

by the way, when i say crash, i mean: first, the rendered frame and menu elements freeze and/or disappear (rarely corrupt), then the screen soon dims to about 75% (EDIT: hyprland’s unresponsive window dimming feature), then my mouse freezes, and finally it kicks me out of the session back to the login screen (tuigreet). but sometimes it cant exit the session and waits forever for some PID to finish (which doesnt exist anymore…), OR it doesnt actually exit the session at all, and resets its graphics successfully (hyprland)… its always a coin flip, a random number generator. there is no way to force it to happen, IT JUST HAPPENS when it wants to

so i had a look at dmesg

[ 5133.879897] amdgpu 0000:04:00.0: amdgpu: Dumping IP State
[ 5133.880808] amdgpu 0000:04:00.0: amdgpu: Dumping IP State Completed
[ 5133.880899] amdgpu 0000:04:00.0: amdgpu: ring sdma0 timeout, signaled seq=3810, emitted seq=3812
[ 5133.880905] amdgpu 0000:04:00.0: amdgpu: Starting sdma0 ring reset
[ 5134.078180] amdgpu 0000:04:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32778)
[ 5134.078193] amdgpu 0000:04:00.0: amdgpu:  in process .librewolf-wrap pid 11512 thread .librewolf:cs0 pid 11606
[ 5134.078199] amdgpu 0000:04:00.0: amdgpu:   in page starting at address 0x0000800001600000 from client 0x1b (UTCL2)
[ 5134.078204] amdgpu 0000:04:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00501430
[ 5134.078208] amdgpu 0000:04:00.0: amdgpu:      Faulty UTCL2 client ID: SQC (data) (0xa)
[ 5134.078212] amdgpu 0000:04:00.0: amdgpu:      MORE_FAULTS: 0x0
[ 5134.078216] amdgpu 0000:04:00.0: amdgpu:      WALKER_ERROR: 0x0
[ 5134.078219] amdgpu 0000:04:00.0: amdgpu:      PERMISSION_FAULTS: 0x3
[ 5134.078222] amdgpu 0000:04:00.0: amdgpu:      MAPPING_ERROR: 0x0
[ 5134.078225] amdgpu 0000:04:00.0: amdgpu:      RW: 0x0
[ 5134.078236] amdgpu 0000:04:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:5 pasid:32778)
[ 5134.078241] amdgpu 0000:04:00.0: amdgpu:  in process .librewolf-wrap pid 11512 thread .librewolf:cs0 pid 11606
[ 5134.078245] amdgpu 0000:04:00.0: amdgpu:   in page starting at address 0x000080010a060000 from client 0x1b (UTCL2)
[ 5134.078249] amdgpu 0000:04:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00541051
[ 5134.078252] amdgpu 0000:04:00.0: amdgpu:      Faulty UTCL2 client ID: TCP (0x8)
[ 5134.078255] amdgpu 0000:04:00.0: amdgpu:      MORE_FAULTS: 0x1
[ 5134.078258] amdgpu 0000:04:00.0: amdgpu:      WALKER_ERROR: 0x0
[ 5134.078261] amdgpu 0000:04:00.0: amdgpu:      PERMISSION_FAULTS: 0x5
[ 5134.078265] amdgpu 0000:04:00.0: amdgpu:      MAPPING_ERROR: 0x0
[ 5134.078268] amdgpu 0000:04:00.0: amdgpu:      RW: 0x1
[ 5134.078272] amdgpu 0000:04:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:5 pasid:32778)
[ 5134.078276] amdgpu 0000:04:00.0: amdgpu:  in process .librewolf-wrap pid 11512 thread .librewolf:cs0 pid 11606
[ 5134.078280] amdgpu 0000:04:00.0: amdgpu:   in page starting at address 0x000080010a061000 from client 0x1b (UTCL2)
[ 5134.078285] amdgpu 0000:04:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:5 pasid:32778)
[ 5134.078289] amdgpu 0000:04:00.0: amdgpu:  in process .librewolf-wrap pid 11512 thread .librewolf:cs0 pid 11606
[ 5134.078293] amdgpu 0000:04:00.0: amdgpu:   in page starting at address 0x000080010a061000 from client 0x1b (UTCL2)
[ 5134.078298] amdgpu 0000:04:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:5 pasid:32778)
[ 5134.078302] amdgpu 0000:04:00.0: amdgpu:  in process .librewolf-wrap pid 11512 thread .librewolf:cs0 pid 11606
[ 5134.078306] amdgpu 0000:04:00.0: amdgpu:   in page starting at address 0x000080010a060000 from client 0x1b (UTCL2)
[ 5134.078311] amdgpu 0000:04:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:5 pasid:32778)
[ 5134.078315] amdgpu 0000:04:00.0: amdgpu:  in process .librewolf-wrap pid 11512 thread .librewolf:cs0 pid 11606
[ 5134.078318] amdgpu 0000:04:00.0: amdgpu:   in page starting at address 0x000080010a060000 from client 0x1b (UTCL2)
[ 5134.078323] amdgpu 0000:04:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:5 pasid:32778)
[ 5134.078327] amdgpu 0000:04:00.0: amdgpu:  in process .librewolf-wrap pid 11512 thread .librewolf:cs0 pid 11606
[ 5134.078331] amdgpu 0000:04:00.0: amdgpu:   in page starting at address 0x000080010a061000 from client 0x1b (UTCL2)
[ 5134.078336] amdgpu 0000:04:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:5 pasid:32778)
[ 5134.078340] amdgpu 0000:04:00.0: amdgpu:  in process .librewolf-wrap pid 11512 thread .librewolf:cs0 pid 11606
[ 5134.078343] amdgpu 0000:04:00.0: amdgpu:   in page starting at address 0x000080010a061000 from client 0x1b (UTCL2)
[ 5134.078348] amdgpu 0000:04:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:5 pasid:32778)
[ 5134.078352] amdgpu 0000:04:00.0: amdgpu:  in process .librewolf-wrap pid 11512 thread .librewolf:cs0 pid 11606
[ 5134.078356] amdgpu 0000:04:00.0: amdgpu:   in page starting at address 0x000080010a062000 from client 0x1b (UTCL2)
[ 5134.078361] amdgpu 0000:04:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:5 pasid:32778)
[ 5134.078365] amdgpu 0000:04:00.0: amdgpu:  in process .librewolf-wrap pid 11512 thread .librewolf:cs0 pid 11606
[ 5134.078369] amdgpu 0000:04:00.0: amdgpu:   in page starting at address 0x000080010a062000 from client 0x1b (UTCL2)
[ 5144.119856] amdgpu 0000:04:00.0: amdgpu: Dumping IP State
[ 5144.120877] amdgpu 0000:04:00.0: amdgpu: Dumping IP State Completed
[ 5144.130894] amdgpu 0000:04:00.0: amdgpu: ring gfx_0.1.0 timeout, signaled seq=49196, emitted seq=49198
[ 5144.130899] amdgpu 0000:04:00.0: amdgpu: Process information: process .Hyprland-wrapp pid 1599 thread Hyprland:cs0 pid 1607
[ 5144.130903] amdgpu 0000:04:00.0: amdgpu: Starting gfx_0.1.0 ring reset
[ 5144.327026] amdgpu 0000:04:00.0: amdgpu: Ring gfx_0.1.0 reset failure
[ 5144.327029] amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
[ 5144.409954] amdgpu 0000:04:00.0: amdgpu: MODE2 reset
[ 5144.420100] amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 5144.420540] [drm] PCIE GART of 1024M enabled (table at 0x000000F43FC00000).
[ 5144.420572] amdgpu 0000:04:00.0: amdgpu: PSP is resuming...
[ 5144.442727] amdgpu 0000:04:00.0: amdgpu: reserve 0xa00000 from 0xf43e000000 for PSP TMR
[ 5145.318237] amdgpu 0000:04:00.0: amdgpu: SMU is resuming...
[ 5145.319255] amdgpu 0000:04:00.0: amdgpu: SMU is resumed successfully!
[ 5145.319673] [drm] kiq ring mec 2 pipe 1 q 0
[ 5145.332059] [drm] DMUB hardware initialized: version=0x0300000A
[ 5145.410386] [drm] Failed to add display topology, DTM TA is not initialized.
[ 5145.439757] amdgpu 0000:04:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 5145.439762] amdgpu 0000:04:00.0: amdgpu: ring gfx_0.1.0 uses VM inv eng 1 on hub 0
[ 5145.439765] amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 4 on hub 0
[ 5145.439767] amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 5 on hub 0
[ 5145.439770] amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 5145.439772] amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 5145.439774] amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 5145.439777] amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 5145.439779] amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 5145.439782] amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 5145.439784] amdgpu 0000:04:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 12 on hub 0
[ 5145.439787] amdgpu 0000:04:00.0: amdgpu: ring sdma0 uses VM inv eng 13 on hub 0
[ 5145.439789] amdgpu 0000:04:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[ 5145.439792] amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
[ 5145.439794] amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
[ 5145.439796] amdgpu 0000:04:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[ 5145.443634] amdgpu 0000:04:00.0: amdgpu: GPU reset(3) succeeded!

the logs are different and happen at random times, every single time. if i find a more interesting one, i will share! i can get THIS exact same scenario 10 seconds after boot, or maybe while watching a video, etc. etc.
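for anyone who wants to read the raw GCVM_L2_PROTECTION_FAULT_STATUS value by hand, here is a rough sketch of a decoder. the bit layout in `FIELDS` is an assumption on my part (i picked the gfx10-style layout because it reproduces exactly what the kernel printed in the log above), and the names `FIELDS` and `decode_fault_status` are made up for this post, not part of any tool.

```python
# Sketch only: the field layout below is an ASSUMPTION (gfx10-style register),
# chosen because it reproduces the values amdgpu itself printed in dmesg.
FIELDS = {
    # name: (bit shift, width in bits)
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),   # the "Faulty UTCL2 client ID" from the log
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status: int) -> dict:
    """Split a raw status word into its named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # The two status values that appear in the dmesg output above.
    for raw in (0x00501430, 0x00541051):
        print(hex(raw), decode_fault_status(raw))
```

both values come out with WALKER_ERROR and MAPPING_ERROR at zero and a non-zero PERMISSION_FAULTS, same as the driver's own printout, so at least the two agree with each other.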

so i looked up “nixos GCVM_L2_PROTECTION_FAULT_STATUS” and then i went to this page:

but i didnt want to install LACT, because i didnt have this issue before… besides, i can undervolt via BIOS (which is very dangerous and could lead to a black screen!!!), so then i found this site

but its like, definitely not an APU issue anymore then? these guys own actual dedicated GPUs! my gpu is an AMD Custom GPU 0405 (RADV VANGOGH) of PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU, by the way. idk if this is related or not, but i get this in vulkaninfo:

WARNING: [Loader Message] Code 0 : terminator_CreateInstance: Received return code -3 from call to vkCreateInstance in ICD /nix/store/j16gwk21wpzliw3slgglbdb7nk5hrcdm-mesa-25.1.2/lib/libvulkan_dzn.so. Skipping this driver.
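to make sense of that loader warning, here is a small sketch that pulls the return code and the ICD name out of the line and maps the code to its VkResult name. the numeric values are the standard VkResult ones (i only listed a few), while `explain_loader_warning` and the regex are just made up for this post. as far as i can tell `libvulkan_dzn.so` is mesa's Dozen driver (vulkan on top of Direct3D 12), so it failing to initialize on a linux box is probably expected and unrelated to the crashes, but treat that as my assumption.

```python
import os
import re

# A few standard VkResult values; deliberately not a complete list.
VK_RESULT_NAMES = {
    0: "VK_SUCCESS",
    -1: "VK_ERROR_OUT_OF_HOST_MEMORY",
    -2: "VK_ERROR_OUT_OF_DEVICE_MEMORY",
    -3: "VK_ERROR_INITIALIZATION_FAILED",
    -4: "VK_ERROR_DEVICE_LOST",
    -9: "VK_ERROR_INCOMPATIBLE_DRIVER",
}

# Hypothetical pattern written against the one warning line quoted above.
LOADER_RE = re.compile(
    r"Received return code (-?\d+) from call to (\w+) in ICD (\S+)\. Skipping"
)


def explain_loader_warning(line: str):
    """Return (result_name, failed_call, icd_file) or None if no match."""
    m = LOADER_RE.search(line)
    if m is None:
        return None
    code = int(m.group(1))
    name = VK_RESULT_NAMES.get(code, f"unknown VkResult {code}")
    return name, m.group(2), os.path.basename(m.group(3))
```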

EDIT: i just want to say really quick, that when i installed AMDVLK drivers (hardware.amdgpu.amdvlk.enable), these errors went away, BUT!!! they do NOT support APU 0405 (as written in the list of supported devices), so instead i stuck with regular mesa, i.e. hardware.graphics.enable

so yeah. something is terribly wrong and i dont know what i did or who did what. BY THE WAY, i DO have every power management setting maxed out for maximum performance (i use my steam deck in a dock station, like a PC!!!), but i dont think thats relevant because i still get the same hangs and freezes and crashes with or without TLP and/or powerManagement.cpuFreqGovernor = "performance" or services.tlp.settings with every CPU/GPU preference set to "performance"

should i finally try steam deck drivers from here?

OR PERHAPS try chaotic-nyx flake to install pkgs.mesa-git?

cos i cant use the system like this! its very annoying and still HASNT been fixed, despite some official claims it was fixed in an update… :sob:

P.S. sorry i keep forgetting to set the tag to “Help”

P.P.S. yes the output of dmesg is the one i got while writing this

hmm… i commented out this module (despite the instructions on the wiki):

boot.initrd.kernelModules = [ 
  # "amdgpu"  
];

to remove the amdgpu driver from loading at stage 1 and…

it works now? not sure, but doesnt crash immediately. DEFINITELY loads things faster, though!

hahahaha, NEVERMIND.

its gotten WORSE.

check this out, literally seconds after i said it was working, now its a segfault!

[ 3558.459271] amdgpu 0000:04:00.0: amdgpu: Dumping IP State
[ 3558.460172] amdgpu 0000:04:00.0: amdgpu: Dumping IP State Completed
[ 3558.460249] amdgpu 0000:04:00.0: amdgpu: ring sdma0 timeout, signaled seq=18222, emitted seq=18224
[ 3558.460255] amdgpu 0000:04:00.0: amdgpu: Starting sdma0 ring reset
[ 3559.745077] .xdg-desktop-po[1680]: segfault at 6c ip 00007f72bd8c9e5a sp 00007ffd04e95e70 error 6 in libwayland-client.so.0.23.1[be5a,7f72bd8c4000+7000] likely on CPU 2 (core 1, socket 0)
[ 3559.745093] Code: 00 01 81 fb 00 00 f0 00 77 66 48 8b 45 00 48 c1 e8 03 39 d8 72 6a 39 c3 74 26 43 8d 04 24 48 8b 55 10 83 e0 02 48 09 c1 31 c0 <48> 89 0c da 48 83 c4 10 5b 5d 41 5c 31 d2 31 c9 31 f6 31 ff c3 90
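a quick sketch for reading that segfault line: on x86 the `error` number is the page-fault error code, where bit 0 means the page was present, bit 1 means the access was a write and bit 2 means it came from user mode. the helper name `decode_segfault` and the "near null" cutoff of one 4 KiB page are my own assumptions for illustration.

```python
import re

# Matches the kernel's "segfault at ADDR ip IP sp SP error CODE in OBJECT[...]" line.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+)(?: in (?P<obj>[^\[\s]+))?"
)


def decode_segfault(line: str):
    """Decode the x86 page-fault error code from a dmesg segfault line."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    addr = int(m.group("addr"), 16)
    return {
        "address": addr,
        "object": m.group("obj"),
        "page_present": bool(err & 1),                # bit 0: 0 means page not mapped
        "access": "write" if err & 2 else "read",     # bit 1
        "mode": "user" if err & 4 else "kernel",      # bit 2
        "near_null": addr < 4096,                     # assumption: below one page
    }
```

for the line above this gives a user-mode write to an unmapped page at address 0x6c inside libwayland-client, which looks like a null-ish pointer being written through. my guess is that this is fallout from the GPU hang and not a separate bug, but i cant prove that from the log alone.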

WHAT IS GOING ON??? as i am typing this, it froze again, TWICE, with the screen dimming as usual (EDIT: hyprland’s unresponsive window dimming feature), then librewolf just… became transparent? and then it came back again. haha, okay, lets see the dmesg this time…

hmmmm… nothing new, actually. just way, way, WAAAAAAAAAAAAAAY more page fault errors and FIVE (5) whole resets in a row… it didnt kick me out of the session though, so thats progress!
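since counting page faults and resets by eye gets old fast, here is a rough sketch that tallies the interesting amdgpu events in a chunk of dmesg text. the patterns are written against the exact message wording in the logs i pasted in this thread, so they are an assumption and may not match other kernel versions. `PATTERNS` and `summarize` are names i made up.

```python
import re
from collections import Counter

# Patterns follow the message wording seen in this thread's logs (an assumption
# for any other kernel version).
PATTERNS = {
    "ring_timeout": re.compile(r"ring \S+ timeout"),
    "page_fault": re.compile(r"\[\w+\] page fault"),
    "ring_reset_failure": re.compile(r"Ring \S+ reset failure"),
    "gpu_reset_ok": re.compile(r"GPU reset\(\d+\) succeeded"),
}


def summarize(log_text: str) -> Counter:
    """Count how many log lines match each known amdgpu event pattern."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

to use it, save the dmesg output to a file and pass the file contents to `summarize`. comparing the counts between a "bad" session and a "good" one is a bit more honest than my WAAAAY estimate.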

maybe it is the same issue as "Possibly graphical problems with upgrading from 24.11 to 25.05" (post #8 by TLATER)

completely forgot to mention, i have been on nixos-unstable Xantusia (with flakes) the whole time, so i dont think i “upgraded”…

i am also using a boot.kernelPackages = pkgs.linuxPackages_zen; kernel…

what i also noticed is that i dont think i’ve ever gotten a crash in a x11 session… SO FAR. im about to stress test!

P.S. my xserver has a services.xserver.videoDrivers = [ "amdgpu" ]; though, could this driver be older and lacking this problem? gotta find out!

aaaand UPDATE!

xserver is also affected.

[ 3517.351791] amdgpu 0000:04:00.0: amdgpu: Dumping IP State
[ 3517.352640] amdgpu 0000:04:00.0: amdgpu: Dumping IP State Completed
[ 3517.352709] amdgpu 0000:04:00.0: amdgpu: ring sdma0 timeout, signaled seq=13627, emitted seq=13629
[ 3517.352715] amdgpu 0000:04:00.0: amdgpu: Starting sdma0 ring reset

fewer errors this time, tho!
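one thing i noticed when lining the logs up: every hang in this thread starts with the sdma0 ring timing out, and each time the emitted seq is exactly 2 ahead of the signaled seq. here is a small sketch that extracts those timeouts so they can be compared across sessions. `RingTimeout` and `parse_ring_timeouts` are made-up names, and the regex assumes the message format shown in my logs.

```python
import re
from typing import List, NamedTuple


class RingTimeout(NamedTuple):
    timestamp: float  # seconds since boot, from the dmesg prefix
    ring: str
    signaled: int
    emitted: int

    @property
    def pending(self) -> int:
        """Jobs submitted to the ring but not yet completed when it hung."""
        return self.emitted - self.signaled


# Assumes the "ring X timeout, signaled seq=N, emitted seq=M" wording seen above.
TIMEOUT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\].*ring (\S+) timeout, signaled seq=(\d+), emitted seq=(\d+)"
)


def parse_ring_timeouts(log_text: str) -> List[RingTimeout]:
    """Collect every ring timeout line found in a dmesg dump."""
    found = []
    for line in log_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:
            found.append(
                RingTimeout(float(m.group(1)), m.group(2), int(m.group(3)), int(m.group(4)))
            )
    return found
```

if sdma0 really is always the first ring to go, that at least narrows down which part of the GPU to search bug reports for.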

THIS IS THE ONE.

boot.kernelPackages = pkgs.linuxPackages_testing;

obviously, it was the kernel itself! they added some “freesync” and “vrr” stuff to the LTS kernel that my APU doesnt support (even though i was on the zen kernel), so OBVIOUSLY it kept crashing. this was fixed in the testing branch, NOT latest!!!

finally!!!

EDIT: HAHAHAHAHAHAHA. hahahaha. no. it didnt help. just prolonged the inevitable. same error. WHAT ELSE? microcode?