Unstable GPU and kernel panic every few boots on latest 23.05

On my EliteBook 855 G7 laptop, the graphics drivers seem to have become less and less stable over the last couple of months. I think it was during the upgrade to 23.05 that I started seeing errors like this:

jul 19 11:18:10 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: Will use PSP to load VCN firmware
jul 19 11:18:10 nixhpix kernel: [drm] reserve 0x400000 from 0xf41f800000 for PSP TMR
jul 19 11:18:10 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
jul 19 11:18:10 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
jul 19 11:18:10 nixhpix kernel: [drm] psp gfx command LOAD_TA(0x1) failed and response status is (0x7)
jul 19 11:18:10 nixhpix kernel: [drm] psp gfx command INVOKE_CMD(0x3) failed and response status is (0x4)
jul 19 11:18:10 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: Secure display: Generic Failure.
jul 19 11:18:10 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: query securedisplay TA failed. ret 0x0
jul 19 11:18:10 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: SMU is initialized successfully!

And now, since I last ran nix flake update, brave-browser cannot render images properly and is unusable.

I have also gotten quite a few kernel crashes. I managed to find the logs for one of them.

jul 19 10:48:45 nixhpix systemd[1]: Finished Permit User Sessions.
jul 19 10:48:45 nixhpix systemd[1]: Starting X11 Server...
jul 19 10:48:45 nixhpix kernel: [drm] kiq ring mec 2 pipe 1 q 0
jul 19 10:48:45 nixhpix kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
jul 19 10:48:45 nixhpix kernel: [drm] JPEG decode initialized successfully.
jul 19 10:48:45 nixhpix kernel: kfd kfd: amdgpu: Allocated 3969056 bytes on gart
jul 19 10:48:45 nixhpix kernel: amdgpu: sdma_bitmap: 3
jul 19 10:48:45 nixhpix kernel: amdgpu: SRAT table not found
jul 19 10:48:45 nixhpix kernel: amdgpu: Virtual CRAT table created for GPU
jul 19 10:48:45 nixhpix kernel: amdgpu: Topology: Add dGPU node [0x1636:0x1002]
jul 19 10:48:45 nixhpix kernel: kfd kfd: amdgpu: added device 1002:1636
jul 19 10:48:45 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: SE 1, SH per SE 1, CU per SH 8, active_cu_number 6
jul 19 10:48:45 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
jul 19 10:48:45 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_low uses VM inv eng 1 on hub 0
jul 19 10:48:45 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_high uses VM inv eng 4 on hub 0
jul 19 10:48:45 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 5 on hub 0
jul 19 10:48:45 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 6 on hub 0
jul 19 10:48:45 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 7 on hub 0
jul 19 10:48:45 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 8 on hub 0
jul 19 10:48:45 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 9 on hub 0
jul 19 10:48:45 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 10 on hub 0
jul 19 10:48:45 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 11 on hub 0
jul 19 10:48:45 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 12 on hub 0
jul 19 10:48:45 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 13 on hub 0
jul 19 10:48:45 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
jul 19 10:48:45 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 1
jul 19 10:48:45 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 1
jul 19 10:48:45 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 1
jul 19 10:48:45 nixhpix kernel: amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
jul 19 10:48:45 nixhpix kernel: [drm] Initialized amdgpu 3.52.0 20150101 for 0000:03:00.0 on minor 0
jul 19 10:48:45 nixhpix kernel: BUG: kernel NULL pointer dereference, address: 0000000000000012
jul 19 10:48:45 nixhpix kernel: #PF: supervisor read access in kernel mode

I have tried adding the nixos-hardware repository and using the profile for the hp-elitebook-845g9 laptop, which should be similar in hardware to my machine. It added an additional module that errors out on boot, so boot takes a bit longer, but the graphics issues in Brave are still there.
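
For anyone wanting to reproduce this setup, here is a minimal sketch of how a nixos-hardware profile is typically pulled into a flake. It is not the poster's actual flake: the nixpkgs branch, the ./configuration.nix path, and the x86_64-linux system are assumptions, the host name nixhpix is taken from the logs above, and the module attribute is assumed to match the profile name mentioned in the post.

```nix
{
  inputs = {
    # Assumed release branch, matching the 23.05 mentioned in the post.
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-23.05";
    nixos-hardware.url = "github:NixOS/nixos-hardware";
  };

  outputs = { self, nixpkgs, nixos-hardware, ... }: {
    # Host name taken from the journal output; adjust to your own machine.
    nixosConfigurations.nixhpix = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      modules = [
        ./configuration.nix  # hypothetical path to the existing system config
        # Profile for the similar 845 G9 model (assumed attribute name).
        nixos-hardware.nixosModules.hp-elitebook-845g9
      ];
    };
  };
}
```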

Additional info that might be important: I am running ZFS, so I have set boot.kernelPackages = mkForce config.boot.zfs.package.latestCompatibleLinuxPackages
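
Written out as a module, that setting would look roughly like the sketch below. The surrounding module arguments and the boot.supportedFilesystems line are assumptions added for completeness; only the boot.kernelPackages assignment comes from the post.

```nix
{ config, lib, ... }:
{
  # Assumed: ZFS is enabled as a supported filesystem elsewhere in the config.
  boot.supportedFilesystems = [ "zfs" ];

  # Pin the kernel to the newest one the packaged ZFS module supports.
  # mkForce overrides any kernel chosen by other modules, such as a
  # nixos-hardware profile.
  boot.kernelPackages =
    lib.mkForce config.boot.zfs.package.latestCompatibleLinuxPackages;
}
```

One consequence of this choice is that the kernel version is dictated by ZFS compatibility, so a kernel selected by a hardware profile will not take effect while the mkForce is in place.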

I have experienced the kernel crashes too on a few occasions. This is on an Acer laptop, also using amdgpu, but running unstable. I have tried going back to earlier kernel versions, but that doesn’t change it. The odd thing is that some days are perfectly fine. Then I get 4 reboots in a row after kernel panics. Then it’s fine for 5 days. Then forced reboots again.

I’m also having issues with Brave specifically as of today: things aren’t rendering properly and it’s unusable. Chrome and Firefox are both fine.

I tried updating again today, and it seems to be working again for me now. No idea what the problem with Brave was, but it seemed temporary.

How about the kernel panics?

Seems to be the same as before. Occasional crashes on boot or within 5 minutes of boot. If it survives more than 5 minutes, I have not seen any kernel panics either before or after the update.

Yeah, that’s pretty much my experience too.

I am considering hooking up netconsole, but strangely the issue hasn’t really manifested itself at home, where I have something I can log to.
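
A rough sketch of what a netconsole setup could look like on NixOS, assuming the kernel ships netconsole as a loadable module and the laptop is on a wired interface. Every address, port, interface name, and MAC address below is a hypothetical placeholder, not a value from this thread.

```nix
{ ... }:
{
  # Load netconsole at boot so early kernel messages are sent over UDP.
  boot.kernelModules = [ "netconsole" ];

  # Parameter format: <src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
  # All values here are placeholders; replace them with the laptop's
  # address and interface and the log receiver's address and MAC.
  boot.extraModprobeConfig = ''
    options netconsole netconsole=6666@192.168.1.10/enp2s0,6666@192.168.1.20/aa:bb:cc:dd:ee:ff
  '';
}
```

The receiving machine then only needs something listening for UDP on the target port. Since the crashes reported here happen at or shortly after boot, it matters that the module is loaded as early as possible, and whether it comes up before the panic would need to be checked on the affected machine.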