Dear all,
Since upgrading to NixOS 24.05, I regularly get kernel panics (System frozen, Caps Lock LED light blinking). Since the cause did not get logged, I enabled crashDump:
boot.crashDump.enable = true;
Yet, I am still not sure where to get the logs.
‘Luckily’, sometimes the system crashes when I switch to tty1, so I get to see the kernel error messages (see picture attached). Could someone help me parse them? The folder mentioned in the screenshot (/var/log/journal/2e356a0c8081417bb29463c6981bdbb5) is full of *.journal files.
Any help would be greatly appreciated!
Hardware: Thinkpad X1 Extreme Gen4
pstore should be enabled by default, backed by EFI. You should find the full logs in /var/lib/systemd/pstore/, provided systemd manages to collect them. It did work on my server a few weeks back on 24.05, so it should be functional.
Oh, and those files get populated at the next boot: systemd fetches the latest logs from the pstore and dumps them there.
Thanks a lot @MagicRB, there they are! Seems to be due to kdeconnect.
...
<6>[ 4381.810802] usb 2-2.3: USB disconnect, device number 3
<6>[ 4381.811383] r8152-cfgselector 2-2.4: USB disconnect, device number 4
<6>[ 4382.061795] usb 3-4.3.2: USB disconnect, device number 9
<6>[ 4382.101406] usb 3-4.3.5: USB disconnect, device number 10
<4>[ 4382.138338] xhci_hcd 0000:00:14.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
<6>[ 4382.138677] usb 3-4.5: USB disconnect, device number 8
<6>[ 4383.895043] .kdeconnectd-wr[1961]: segfault at c ip 00007f5f4ed251d1 sp 00007ffdc8c95500 error 4 in libQt5XcbQpa.so.5.15.11[7f5f4eced000+aa000] likely on CPU 15 (core 7, socket 0)
<6>[ 4383.895059] Code: ff ff 0f 1f 80 00 00 00 00 48 8b 7b 20 8b 54 24 44 8b 74 24 4c e8 bf 83 fc ff 48 8b 7b 20 31 d2 89 c6 e8 e2 89 fc ff 49 89 c6 <8b> 40 0c 85 c0 0f 85 6b 04 00 00 e8 ff 1d fd ff 0f b6 40 10 84 c0
<6>[ 4384.235069] wlp9s0: deauthenticating from 34:2c:c4:35:fe:8f by local choice (Reason: 3=DEAUTH_LEAVING)
<1>[ 4398.103809] BUG: kernel NULL pointer dereference, address: 0000000000000caf
<1>[ 4398.103814] #PF: supervisor read access in kernel mode
Oops#1 Part4
<1>[ 4398.103816] #PF: error_code(0x0000) - not-present page
<6>[ 4398.103817] PGD 0 P4D 0
<4>[ 4398.103819] Oops: 0000 [#1] PREEMPT SMP NOPTI
<4>[ 4398.103821] CPU: 5 PID: 49 Comm: ksoftirqd/5 Tainted: P O 6.6.40 #1-NixOS
<4>[ 4398.103823] Hardware name: LENOVO 20Y5S04T00/20Y5S04T00, BIOS N40ET34W (1.16 ) 04/08/2022
<4>[ 4398.103824] RIP: 0010:rb_first+0xf/0x30
<4>[ 4398.103829] Code: 10 c3 cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 8b 07 48 85 c0 74 14 48 89 c2 <48> 8b 40 10 48 85 c0 75 f4 48 89 d0 c3 cc cc cc cc 31 d2 eb f4 66
<4>[ 4398.103831] RSP: 0018:ffffc900002bbd88 EFLAGS: 00010206
<4>[ 4398.103833] RAX: 0000000000000c9f RBX: ffff8881557722c0 RCX: 000000000002d805
<4>[ 4398.103834] RDX: 0000000000000c9f RSI: 0000000000000000 RDI: ffff8881077abb28
<4>[ 4398.103836] RBP: ffff888155772240 R08: 0000000000000000 R09: 0000000000038d40
<4>[ 4398.103837] R10: 0000000000000000 R11: 0000000000000033 R12: 0000000000000000
<4>[ 4398.103838] R13: ffff8881077abb28 R14: 000000000000000a R15: 0000000000000000
<4>[ 4398.103839] FS: 0000000000000000(0000) GS:ffff88884f280000(0000) knlGS:0000000000000000
<4>[ 4398.103840] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 4398.103842] CR2: 0000000000000caf CR3: 0000000106826001 CR4: 0000000000f70ee0
<4>[ 4398.103843] PKRU: 55555554
<4>[ 4398.103844] Call Trace:
<4>[ 4398.103845] <TASK>
<4>[ 4398.103848] ? __die+0x23/0x70
<4>[ 4398.103852] ? page_fault_oops+0x171/0x4e0
<4>[ 4398.103855] ? __pick_eevdf+0x16c/0x180
Kdeconnect does segfault, but it really should not bring the kernel down with it; that’s a kernel bug.
Atemu
August 2, 2024, 8:11am
6
Note that what is shown (but snipped?) here is a kernel Oops, not a panic. The system should continue to function although what triggered the Oops may also have triggered a panic shortly after.
I also don’t think this has anything to do with KDE Connect: its segfault happened 15 s before the Oops.
The kernel claims that it’s tainted by an externally built proprietary module. Do you use the Nvidia module?
I’d consider it the cause until proven otherwise.
@Atemu indeed, I just had another (real) panic, even though I got rid of kde-connect (see below). And yes, I am using the nvidia kernel module/driver.
...
<6>[ 7644.895312] wlp9s0: associated
<7>[ 7644.895596] wlp9s0: Limiting TX power to 30 (30 - 0) dBm as advertised by a4:53:0e:7d:99:ee
<6>[ 8539.883099] wlp9s0: deauthenticating from a4:53:0e:7d:99:ee by local choice (Reason: 3=DEAUTH_LEAVING)
<1>[ 8540.514248] BUG: kernel NULL pointer dereference, address: 00000000000005f3
<1>[ 8540.514253] #PF: supervisor read access in kernel mode
Panic#2 Part5
<1>[ 8540.514255] #PF: error_code(0x0000) - not-present page
<6>[ 8540.514256] PGD 0 P4D 0
<4>[ 8540.514259] Oops: 0000 [#1] PREEMPT SMP NOPTI
<4>[ 8540.514261] CPU: 14 PID: 103 Comm: ksoftirqd/14 Tainted: P O 6.6.40 #1-NixOS
<4>[ 8540.514263] Hardware name: LENOVO 20Y5S04T00/20Y5S04T00, BIOS N40ET34W (1.16 ) 04/08/2022
<4>[ 8540.514264] RIP: 0010:rb_first+0xf/0x30
<4>[ 8540.514269] Code: 10 c3 cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 8b 07 48 85 c0 74 14 48 89 c2 <48> 8b 40 10 48 85 c0 75 f4 48 89 d0 c3 cc cc cc cc 31 d2 eb f4 66
<4>[ 8540.514271] RSP: 0018:ffffc9000048fd88 EFLAGS: 00010202
<4>[ 8540.514273] RAX: 00000000000005e3 RBX: ffff88813c2d8350 RCX: 0000000080800047
<4>[ 8540.514274] RDX: 00000000000005e3 RSI: 0000000000000000 RDI: ffff88811a22ff88
<4>[ 8540.514276] RBP: ffff88813c2d82d0 R08: 0000000000000000 R09: 0000000080800047
<4>[ 8540.514277] R10: ffff88813970f160 R11: 0000000000000cc8 R12: 0000000000000000
<4>[ 8540.514278] R13: ffff88811a22ff88 R14: 000000000000000a R15: 0000000000000000
<4>[ 8540.514279] FS: 0000000000000000(0000) GS:ffff88884f700000(0000) knlGS:0000000000000000
<4>[ 8540.514281] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 8540.514282] CR2: 00000000000005f3 CR3: 000000010e568004 CR4: 0000000000f70ee0
<4>[ 8540.514283] PKRU: 55555554
...
<4>[ 8540.514349] Modules linked in: vhost_net vhost vhost_iotlb tls rfcomm nft_chain_nat xt_MASQUERADE nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype overlay ccm af_packet cmac algif_hash algif_skcipher af_alg bnep nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT msr xt_conntrack xt_pkttype xt_LOG nf_log_syslog xt_tcpudp nft_compat nf_tables sch_fq_codel snd_ctl_led snd_soc_skl_hda_dsp snd_soc_intel_hda_dsp_common snd_soc_hdac_hdmi snd_sof_probes snd_soc_dmic snd_hda_codec_realtek snd_hda_codec_generic snd_sof_pci_intel_tgl snd_sof_intel_hda_common snd_soc_hdac_hda soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_generic_allocation soundwire_bus snd_soc_core snd_hda_codec_hdmi snd_compress ac97_bus snd_pcm_dmaengine iwlmvm mac80211 joydev ptp pps_core libarc4 iwlwifi uvcvideo intel_tcc_cooling hid_multitouch btusb x86_pkg_temp_thermal intel_powerclamp cmdlinepart btrtl
Panic#2 Part3
<4>[ 8540.514387] nls_iso8859_1 videobuf2_vmalloc r8153_ecm snd_hda_intel btintel coretemp cdc_ether uvc iTCO_wdt spi_nor videobuf2_memops nls_cp437 snd_intel_dspcfg usbnet crc32_pclmul btbcm polyval_clmulni videobuf2_v4l2 polyval_generic vfat mei_wdt intel_pmc_bxt btmtk snd_intel_sdw_acpi gf128mul ghash_clmulni_intel bluetooth mtd mei_hdcp mei_pxp ee1004 mousedev watchdog intel_rapl_msr videodev snd_hda_codec fat sha512_ssse3 sha256_ssse3 cfg80211 snd_hda_core sha1_ssse3 aesni_intel processor_thermal_device_pci_legacy processor_thermal_device videobuf2_common snd_hwdep crypto_simd processor_thermal_rfim cryptd ucsi_acpi processor_thermal_mbox ecdh_generic hid_generic mc ecc thinkpad_acpi snd_pcm intel_lpss_pci i2c_i801 think_lmi processor_thermal_rapl intel_lpss mei_me typec_ucsi idma64 tpm_crb intel_rapl_common nvram rapl spi_intel_pci i2c_hid_acpi ledtrig_audio r8152 snd_timer platform_profile intel_cstate intel_uncore psmouse typec tpm_tis mii firmware_attributes_class rfkill spi_intel mei snd i2c_smbus wmi_bmof
Panic#2 Part2
<4>[ 8540.514435] 8250_pci virt_dma thermal tpm_tis_core intel_soc_dts_iosf fan roles i2c_hid soundcore int3403_thermal battery ac tiny_power_button int340x_thermal_zone usbhid hid int3400_thermal intel_pmc_core acpi_tad acpi_thermal_rel acpi_pad input_leds pinctrl_tigerlake evdev button mac_hid serio_raw nvidia_drm(PO) nvidia_modeset(PO) nvidia_uvm(PO) nvidia(PO) loop xt_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter veth tun tap macvlan bridge stp llc kvm_intel kvm irqbypass fuse efi_pstore configfs nfnetlink efivarfs tpm rng_core dmi_sysfs ip_tables x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 sdhci_pci cqhci nvme sdhci xhci_pci atkbd xhci_pci_renesas thunderbolt nvme_core libps2 led_class xhci_hcd vivaldi_fmap mmc_core nvme_common t10_pi crc32c_intel rtc_cmos crc64_rocksoft crc64 crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common i8042 serio dm_mod dax i915 i2c_algo_bit drm_buddy video wmi backlight ttm intel_gtt drm_display_helper firmware_class cec
Panic#2 Part1
<4>[ 8540.514488] CR2: 00000000000005f3
<4>[ 8540.514490] ---[ end trace 0000000000000000 ]---
<4>[ 8540.962009] RIP: 0010:rb_first+0xf/0x30
<4>[ 8540.962028] Code: 10 c3 cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 8b 07 48 85 c0 74 14 48 89 c2 <48> 8b 40 10 48 85 c0 75 f4 48 89 d0 c3 cc cc cc cc 31 d2 eb f4 66
<4>[ 8540.962031] RSP: 0018:ffffc9000048fd88 EFLAGS: 00010202
<4>[ 8540.962035] RAX: 00000000000005e3 RBX: ffff88813c2d8350 RCX: 0000000080800047
<4>[ 8540.962037] RDX: 00000000000005e3 RSI: 0000000000000000 RDI: ffff88811a22ff88
<4>[ 8540.962038] RBP: ffff88813c2d82d0 R08: 0000000000000000 R09: 0000000080800047
Would you suggest installing another version of it to see whether that fixes things?
Atemu
August 2, 2024, 10:53am
8
You should try to reproduce the issue without the module, which removes any taint. Nouveau+NVK should be usable these days.
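A minimal sketch of what such a test configuration might look like on NixOS. It assumes the proprietary driver was previously enabled by listing "nvidia" in services.xserver.videoDrivers, which is the usual way but is not shown anywhere in this thread; the rest of the poster's configuration is unknown.

```nix
{ ... }:
{
  # Assumption: "nvidia" was listed here before. Dropping it means the
  # proprietary module is no longer built or loaded; "modesetting" is the
  # generic KMS-based X driver and works with the in-tree i915/nouveau modules.
  services.xserver.videoDrivers = [ "modesetting" ];

  # Any hardware.nvidia.* settings would need to be commented out for the
  # duration of the test, for example:
  # hardware.nvidia.modesetting.enable = true;
}
```

After rebuilding and rebooting, /proc/sys/kernel/tainted should read 0 if no other out-of-tree module is loaded.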
Another thing I notice is that the WiFi has deauthed right before the panic/oops twice now. Which card and driver is it? Can you repro this with the rfkill engaged?
Do the panics really not contain a call trace? Because all the ones I’ve seen so far do, and it contains the most useful information.
Possibly related to "Series 550 freezes laptop - #18 by patrick4242" (Linux, NVIDIA Developer Forums)?
Switching to the open kernel module seems to fix the issue:
{
  hardware = {
    nvidia = {
      open = true;
    };
  };
}
@Atemu here is the call trace (sorry, I did not know which information was most relevant)
<4>[ 8540.514277] R10: ffff88813970f160 R11: 0000000000000cc8 R12: 0000000000000000
<4>[ 8540.514278] R13: ffff88811a22ff88 R14: 000000000000000a R15: 0000000000000000
<4>[ 8540.514279] FS: 0000000000000000(0000) GS:ffff88884f700000(0000) knlGS:0000000000000000
<4>[ 8540.514281] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 8540.514282] CR2: 00000000000005f3 CR3: 000000010e568004 CR4: 0000000000f70ee0
<4>[ 8540.514283] PKRU: 55555554
<4>[ 8540.514284] Call Trace:
<4>[ 8540.514286] <TASK>
<4>[ 8540.514288] ? __die+0x23/0x70
<4>[ 8540.514292] ? page_fault_oops+0x171/0x4e0
<4>[ 8540.514295] ? place_entity+0x1b/0xf0
<4>[ 8540.514298] ? exc_page_fault+0x71/0x160
<4>[ 8540.514302] ? asm_exc_page_fault+0x26/0x30
Oops#1 Part3
<4>[ 8540.514307] ? rb_first+0xf/0x30
<4>[ 8540.514309] simple_xattrs_free+0x29/0x90
<4>[ 8540.514312] kernfs_free_rcu+0x2f/0x50
<4>[ 8540.514316] rcu_do_batch+0x1e6/0x7c0
<4>[ 8540.514320] rcu_core+0x1b5/0x4c0
<4>[ 8540.514323] handle_softirqs+0xe2/0x2e0
<4>[ 8540.514326] ? __pfx_smpboot_thread_fn+0x10/0x10
<4>[ 8540.514329] run_ksoftirqd+0x31/0x40
<4>[ 8540.514330] smpboot_thread_fn+0xd9/0x1d0
<4>[ 8540.514333] kthread+0xe5/0x120
<4>[ 8540.514336] ? __pfx_kthread+0x10/0x10
<4>[ 8540.514339] ret_from_fork+0x31/0x50
<4>[ 8540.514342] ? __pfx_kthread+0x10/0x10
<4>[ 8540.514344] ret_from_fork_asm+0x1b/0x30
<4>[ 8540.514348] </TASK>
I rely on CUDA, so nouveau is not really an option. I’ll try it out though if the open driver suggested by @JimJ92120 does not solve the issue.
Regarding the WiFi driver, I seem to use iwlwifi with the iwlmvm module. I don’t know what you refer to by ‘engaging rfkill’ though.
Thanks a lot for providing advice!
Atemu
August 4, 2024, 8:43pm
11
This suggests to me that the issue might lie in the VFS or kernfs (i.e. /sys/) code. It might also be a red herring though.
Is this the oops trace or the panic trace? The panic would be more interesting.
Is it always the same call trace? Collect a few and compare.
Trying nouveau is just for troubleshooting. If you cannot reproduce the issue without the Nvidia module, only Nvidia can help you and you’d have to take it to them.
Okay, so it’s a relatively recent Intel WiFi card. They should be okay, but I heard they’ve gotten a bit shoddy in recent years.
Engaging rfkill disables wireless (radio frequency) devices such as WiFi. See man rfkill.
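rfkill itself is a runtime command, so there is nothing to configure for it. If the test should survive reboots, a different and blunter technique is to keep the WiFi driver from loading at all. The sketch below is an assumption about how one might do that declaratively; it uses the iwlwifi module name reported earlier in the thread and is not something anyone here suggested.

```nix
{ ... }:
{
  # Not rfkill: this prevents the Intel WiFi driver from being auto-loaded,
  # so the card stays unused until the line is removed again.
  # Intended only as a temporary troubleshooting measure.
  boot.blacklistedKernelModules = [ "iwlwifi" ];
}
```

If the crashes still occur with the driver absent, the WiFi deauthentication messages before each oops are more likely a coincidence.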
I’ll keep an eye on it and report back! So far no crashes with the open driver, although suspend now has some issues (potentially this problem).
Moritz
futile
August 7, 2024, 2:55pm
13
I also had problems with kernel panics due to new nvidia drivers, and switching to open wasn’t possible for me because my gfx card is too old (GTX 1060). Instead I switched back to an older driver version, the 535-series in my case.
See here (driver switch at the bottom): nixos-config/hosts/nixos-home/hardware-configuration.nix at 68222bdb283adeebea40a3b7e0b3bd30dd765c6e · futile/nixos-config · GitHub
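For readers who do not want to follow the link, pinning an older driver series might look roughly like the sketch below. This is not the configuration from the linked repository. It assumes your nixpkgs revision exposes a legacy_535 attribute under nvidiaPackages, which may not be true for every release; if it is missing, the package has to be defined by hand with its version and hashes.

```nix
{ config, ... }:
{
  services.xserver.videoDrivers = [ "nvidia" ];

  # Assumption: `legacy_535` exists in the nixpkgs revision in use.
  # Selecting it replaces the default (newer) proprietary driver.
  hardware.nvidia.package =
    config.boot.kernelPackages.nvidiaPackages.legacy_535;
}
```

Taking the package from config.boot.kernelPackages keeps the driver built against the kernel that is actually booted.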