Kernel null pointer dereference when booting installer image

andrewhelwer · April 15, 2023, 6:32pm

I am trying to install NixOS on my computer. Arch Linux successfully boots on this computer. When I download the latest ISO image onto a USB drive running Ventoy and try to boot from it, I get a null pointer dereference error. I am attaching two pictures of my screen because I don’t know how else to capture this log; please let me know if there is another way.

I am using the latest Plasma 64-bit Intel/AMD image. This also occurs with the latest minimal image. I can boot into an Arch ISO image from the same USB drive just fine. Thanks for any help or directions for further debugging you can provide!

My computer hardware is as follows:

Motherboard: ASUS ROG Maximus VI Impact
CPU: Intel i7-4790k
GPU: ASRock AMD Radeon RX 5700 XT 8GB
RAM: 16GB (2x8GB) G.Skill RipjawsX F3-12800CL10D-16GBXL16GB
Storage:
- Samsung 860 QVO 1TB SATA3 2.5" SSD MZ-76Q1T0B/AM
- SK Hynix Gold S31 1TB SATA3 2.5" SSD
Display: 3x LG 24UD58-B 24" 2160p 60hz IPS LED (DisplayPort)

sigprof · April 15, 2023, 8:02pm

Interesting; the crash is inside the amdgpu video driver. A similar crash log can be found in Ubuntu bug #1981943 against the linux package, “kernel null pointer dereference on resume w/ amdgp...” (that machine apparently had Radeon RX 6700/6700 XT/6800M), but there is no solution there.

I also have an AMD Radeon RX 5700 XT, but a different display (a single Iiyama XUB2792UHSU 27" 3840x2160), and did not notice any problems like that (but of course I did not actually test the install image for a long time; the installed system, however, uses the same 5.15.106 kernel at the moment).

The crashing function is apparently update_config() in drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_hdcp.c (Linux kernel stable tree, kernel/git/stable/linux.git).

Now, because I have the exact same kernel module here, I can find where the update_config+0x103 address points to:

  209d13:       49 8b 84 24 d0 04 00    mov    0x4d0(%r12),%rax
  209d1a:       00
  209d1b:       49 8b bc 24 d8 04 00    mov    0x4d8(%r12),%rdi
  209d22:       00
>>209d23:       8b 30                   mov    (%rax),%esi
  209d25:       e8 00 00 00 00          callq  209d2a <update_config+0x10a>
                        209d26: R_X86_64_PLT32  dc_link_is_hdcp14-0x4
  209d2a:       88 83 20 0f 00 00       mov    %al,0xf20(%rbx)
  209d30:       49 8b 84 24 d8 04 00    mov    0x4d8(%r12),%rax
  209d37:       00
  209d38:       0f b6 80 68 01 00 00    movzbl 0x168(%rax),%eax

The crashing instruction loads the second argument for dc_link_is_hdcp14() into %esi, which corresponds to this piece of C code:

	link->hdcp_supported_informational = dc_link_is_hdcp14(aconnector->dc_link,
			aconnector->dc_sink->sink_signal) ? 1 : 0;

And apparently aconnector->dc_sink is NULL here.
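To illustrate why a NULL dc_sink faults at exactly that instruction: the load `mov (%rax),%esi` has no displacement, so it reads from offset 0 of whatever pointer is in %rax. Here is a minimal user-space sketch of that address arithmetic. The struct names, the extra fields, and the helper function are hypothetical stand-ins, not the kernel's real definitions; the only thing taken from the disassembly is the assumption that sink_signal sits at offset 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the sink structure; the real one is much larger. */
struct sink_layout {
	int sink_signal;   /* assumed first member, matching the offset-0 load */
	int other_field;   /* placeholder for everything else */
};

/* Hypothetical stand-in for the connector; only the two pointers matter here. */
struct connector_layout {
	void *dc_link;
	struct sink_layout *dc_sink;
};

/*
 * Compute the address that reading c->dc_sink->sink_signal would touch,
 * without actually dereferencing dc_sink.
 */
uintptr_t sink_signal_load_address(const struct connector_layout *c)
{
	return (uintptr_t)c->dc_sink + offsetof(struct sink_layout, sink_signal);
}
```

With dc_sink set to NULL this evaluates to address 0, which is what a "null pointer dereference" oops reports.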

Note also that there is a NULL check above that code:

	if (aconnector->dc_sink != NULL)
		link->mode = mod_hdcp_signal_type_to_operation_mode(aconnector->dc_sink->sink_signal);

So apparently the case when aconnector->dc_sink is NULL was expected, but the check was not added in all places where it is actually required.
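For illustration, this is roughly what guarding the second access could look like, written as a standalone user-space sketch. Everything here is a simplified model: the enum, the structs, and is_hdcp14_model() are hypothetical stand-ins for the kernel types and for dc_link_is_hdcp14(), and this is not the actual upstream change.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified signal types. */
enum signal_model { SIGNAL_NONE = 0, SIGNAL_HDMI, SIGNAL_DP };

/* Hypothetical stand-ins for the sink and connector structures. */
struct sink_model { enum signal_model sink_signal; };
struct connector_model { struct sink_model *dc_sink; };

/* Stand-in for dc_link_is_hdcp14(); the real function takes a link as well. */
static bool is_hdcp14_model(enum signal_model signal)
{
	return signal == SIGNAL_HDMI || signal == SIGNAL_DP;
}

/*
 * Guarded version of the crashing expression: treat a missing sink as
 * "not supported" instead of dereferencing a NULL pointer.
 */
int hdcp_supported_informational(const struct connector_model *c)
{
	if (c->dc_sink == NULL)
		return 0;
	return is_hdcp14_model(c->dc_sink->sink_signal) ? 1 : 0;
}
```

Returning 0 for a missing sink is a design choice made for this sketch only; the real driver may handle that case differently.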

Apparently in Linux 6.0-rc1 the update_config() function was changed, and the code now has the NULL check there (some other problem with an “emulated dc_sink”, whatever that is, was fixed as well); see commit 4d31819 in gregkh/linux on GitHub, “drm/amd/display: Take emulated dc_sink into account for HDCP”. However, that commit was not backported to the 5.15.y stable series (I checked 5.15.107 and it is not there either).

As a temporary workaround, you may try disconnecting all monitors except one; if this allows the installer to start properly, you could then add something like

boot.kernelPackages = pkgs.linuxPackages_6_1;

to your system configuration, so that the installed system would use a newer kernel without this bug. Alternatively, you may try grabbing a nixos-unstable ISO image from https://channels.nixos.org/nixos-unstable (these images should use a 6.1.y kernel too, so you may be able to use them without messing with your monitors).

andrewhelwer · April 15, 2023, 9:21pm

Wow! Thanks for the incredible debug help. I hope to one day reach the ability to jump between symbols and memory offsets and kernel versions and source code. I can confirm that turning off 2/3 of my monitors fixes the issue. Will report back on whether upgrading the kernel fixes it permanently!