Constant crashes with amdgpu

I’ve installed NixOS on an ASUS Zephyrus G15 GA503RM.
I’m running the latest Linux kernel (currently 5.19) to get the Wi-Fi drivers required for the device.

I’m experiencing:

  • graphical glitches (random horizontal white streaks on the screen for a single frame, about 2 mm to 1 cm wide),
  • eventual display crash along with hard freeze (audio plays, but display and keyboard are gone).

The white streaks suggest memory corruption to me.

I’ve updated the firmware from 307 to 308, no change.

Here is a Gist of journalctl.

The relevant section is at the end, where the kernel starts logging errors at:

Aug 11 22:13:01 dygra kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!

I believe the other errors (clutter, chrome, etc.) are a symptom of the crash, not the cause, as in other logs I’ve seen epiphany suddenly assert that the display width is no longer > 0.
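
To check that ordering in a saved log, a minimal Python sketch along these lines could locate the first amdgpu kernel error, so that everything logged after it can be treated as a possible symptom. The function name `first_amdgpu_error` and the regular expression are illustrative assumptions; the sample line is the one quoted above.

```python
import re

# Matches kernel lines that mention amdgpu and carry the *ERROR* marker,
# like the "Waiting for fences timed out!" line quoted above.
AMDGPU_ERROR = re.compile(r"kernel: .*amdgpu.*\*ERROR\*")


def first_amdgpu_error(lines):
    """Return (index, line) of the first amdgpu kernel error, or None."""
    for index, line in enumerate(lines):
        if AMDGPU_ERROR.search(line):
            return index, line.rstrip("\n")
    return None


# Sample input: only the line quoted in this post.
sample = [
    "Aug 11 22:13:01 dygra kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] "
    "*ERROR* Waiting for fences timed out!",
]
print(first_amdgpu_error(sample))
```

To run it over a real log, the lines of a file saved from journalctl (for example the previous boot via `journalctl -b -1`) can be passed in instead of `sample`.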

What can I try to resolve this short of RMA’ing the whole thing?

Any help would be greatly appreciated.

Are you sure that laptop is an amdgpu one? Looking at the specs sheet (can’t find the “MR” variant, so I may be wrong) it’s advertised to have an nvidia gpu: 2021 ROG Zephyrus G15 GA503 | Gaming Laptops|ROG - Republic of Gamers|ROG Global

If you do have an Nvidia GPU in there, you probably want to look into configuring optimus: Nvidia - NixOS Wiki. I will say that I haven’t seen anyone do AMD CPU+Nvidia GPU in a laptop before. You may run into some dragons.

I’ve seen a couple of people report weird behavior when optimus wasn’t configured correctly, from the system choosing the wrong GPU for specific applications all the way to no display output. Yours might be the new chart topper in terms of weird issues :wink:

@TLATER I think the Ryzen 9 5900HS that’s in the laptop has integrated AMD graphics.

It has an AMD integrated.

01:00.0 VGA compatible controller: NVIDIA Corporation GA106M [GeForce RTX 3060 Mobile / Max-Q] (rev a1)
06:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 1681 (rev c8)

The specs are very confusing: the box says GA503R, but the site says that doesn’t exist.
The bottom of the device says GA503RM, which online sources say has a 3080, which just isn’t true.

Because the NV sucks power, I’m just running integrated at the moment.
Once asusctl hits mainline in nixpkgs I’ll worry about prime.
I do actually have it working at the moment in “passoff” mode and demonstrated that it works with glxgears.
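
For the prime setup, the bus IDs have to be given in the X.org `PCI:bus:device:function` form with decimal numbers, whereas lspci prints them in hexadecimal. A small Python sketch of that conversion, using the two addresses from the lspci output above (the helper name `lspci_to_xorg_bus_id` is hypothetical, and it assumes the short `bb:dd.f` address form without a PCI domain prefix):

```python
def lspci_to_xorg_bus_id(address):
    """Convert an lspci address like '01:00.0' to 'PCI:1:0:0'.

    lspci prints bus, device and function in hexadecimal, while the
    X.org BusID format expects decimal numbers.
    """
    bus, rest = address.split(":")
    device, function = rest.split(".")
    return f"PCI:{int(bus, 16)}:{int(device, 16)}:{int(function, 16)}"


print(lspci_to_xorg_bus_id("01:00.0"))  # NVIDIA dGPU -> PCI:1:0:0
print(lspci_to_xorg_bus_id("06:00.0"))  # AMD iGPU    -> PCI:6:0:0
```

The hexadecimal step only matters for values of 0a and above, but doing it unconditionally avoids a wrong ID on machines with higher bus numbers.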

There seem to be a number of posts on the ZephyrusG15 subreddit reporting similar issues.
My concern is that this isn’t a simple driver problem, but a design flaw / firmware issue.
I need to determine if this is a driver issue or an RMA.

Having booted the system afresh, I’m getting large artifacts blitting onto the screen, which just reeks of memory corruption somewhere.

I was going to try the “amdgpu-pro” driver, but there’s currently an issue blocking its use with the “latest” kernel modules. Reading AMD’s own notes on it, it doesn’t seem likely to help anyway.

My next steps are to try:

  • Running Nvidia exclusively (but without the Mux it will be DMA’ed to the AMD iGPU anyway, so it’s doubtful this will change anything).
  • Reverting to an even older kernel (which breaks Wi-Fi) to see if that works.

Is there a way to run a specific version of “amdgpu”? I can’t see anything in the docs.

After messing around with various configs, I re-installed Windows 11 and still had the issue.
Tried different AMD drivers, including the preview release, and still the issue was present.
Have RMA’ed the laptop.