NixOS 21.05 crashed 3 times within 2 days

Hi all,

Maybe you can help me inspect my unstable system. Since yesterday my NixOS has crashed 3 times.

As far as I can tell, the crashes happened at completely random times.
In addition, the crash is immediate: the system restarts straight away, without any delay, and is back after about 30 seconds in total. Of course, everything is gone after such a restart.

Which logs can I inspect to understand better what is going on?
Any other hints and ideas?

I am using NixOS 21.05.2693.d5aadbefd65 (Okapi)
with KDE.

Maybe journalctl -b -1, which should give you the logs for the previous boot? (--list-boots to look for older logs)
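A minimal sketch of how the previous boot's log could be narrowed down. The helper name crash_hints and its keyword list are assumptions for illustration, not an exhaustive set of things to look for:

```shell
# Hypothetical helper: keep only log lines that often precede an abrupt reboot.
# The keyword list is an assumption; extend it as needed.
crash_hints() {
  grep -iE 'critical|panic|oops|hardware error|shutting down'
}

# Usage on a real system (not executed here):
#   journalctl --list-boots                        # find the index of the boot to inspect
#   journalctl -b -1 -k --no-pager | crash_hints   # kernel messages of the previous boot
```

The -k flag restricts the output to kernel messages, which is where hardware-related shutdown reasons usually show up.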

Thank you very much, I found the critical line:

Aug 31 14:20:54 gram17 kernel: thermal thermal_zone3: acpitz: critical temperature reached, shutting down

I wasn’t hearing any fan…

Does anyone know of any tools to inspect the precise temperatures and fan activity?


https://wiki.archlinux.org/title/lm_sensors#Using_sensor_data

About fan control drivers: it depends on your machine. Usually you shouldn't need to manually change when the fan starts or stops.
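Besides the sensors command from lm_sensors, the kernel exposes thermal zone readings directly under /sys/class/thermal. A small sketch for listing them, assuming the standard sysfs layout (a type and a temp file per thermal_zone directory); the function name list_zone_temps is made up for this example:

```shell
# Print "<zone> <type> <temperature in °C>" for every thermal zone.
# $1 is the sysfs directory to scan; it defaults to the standard location.
list_zone_temps() {
  root="${1:-/sys/class/thermal}"
  for zone in "$root"/thermal_zone*; do
    [ -r "$zone/temp" ] || continue
    kind=$(cat "$zone/type" 2>/dev/null || echo unknown)
    millideg=$(cat "$zone/temp")
    # The temp file reports millidegrees Celsius.
    echo "$(basename "$zone") $kind $((millideg / 1000))"
  done
}

# Usage on a real system (not executed here):
#   list_zone_temps
#   watch -n 2 sensors    # periodic view via lm_sensors, if installed
```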


I found another thread where others also report this crash for the LG Gram 17 laptop.

https://bbs.archlinux.org/viewtopic.php?id=268721

Unfortunately there is no solution yet; however, it seems to be related to RAM and sleep / hibernation mode.

I have a laptop on unstable that has also recently started failing to survive suspend about half the time. I changed the RAM because I thought that was the problem: no dice.


So far I haven’t had any crashes after doing this:

boot.kernelParams = [
  # https://bbs.archlinux.org/viewtopic.php?pid=1902231#p1902231
  "i915.enable_psr=0"
];

There are at least two monitoring utilities in nixpkgs that can display your CPU temps: bpytop and gotop.

There are several others that can show fan speed, but I haven’t used any.

Oh, and it might be worth checking whether you have powerManagement.cpuFreqGovernor set to performance; if so, change it to ondemand or powersave instead, e.g. powerManagement.cpuFreqGovernor = lib.mkDefault "powersave"; in either configuration.nix or hardware-configuration.nix.
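To see which governor is actually active at runtime, the cpufreq sysfs files can be read directly. A sketch assuming the usual /sys/devices/system/cpu layout; the function name active_governors is made up for this example:

```shell
# Print the distinct scaling governors currently in use across all CPUs.
# $1 is the CPU sysfs directory; it defaults to the standard location.
active_governors() {
  root="${1:-/sys/devices/system/cpu}"
  cat "$root"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null | sort -u
}

# Usage on a real system (not executed here):
#   active_governors
```

If this prints performance, the NixOS option mentioned above is one place to change it.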


I actually ran into this issue on my wife's laptop the other day as well. The strange thing is, I was monitoring the temps with btm and they never actually crossed the threshold as far as I could tell. I was able to manually disable the trigger at runtime by modifying an option in /sys for the thermal device. I can't remember the exact option name at the moment. I'll see if I can update this post later, after I get a chance to review my shell history on her laptop.
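To check how close a zone really is to its threshold, the current reading can be compared with the zone's critical trip point. This is only a read-only sketch assuming the standard trip_point_N_type / trip_point_N_temp sysfs files; it does not reproduce the /sys option mentioned above, which remains unnamed. The function name zone_headroom is made up for this example:

```shell
# For one thermal zone directory, print "<current °C> <critical °C>".
# Returns 1 if the zone exposes no critical trip point.
zone_headroom() {
  zone="$1"
  current=$(cat "$zone/temp")
  for type_file in "$zone"/trip_point_*_type; do
    [ -r "$type_file" ] || continue
    if [ "$(cat "$type_file")" = "critical" ]; then
      # The matching temperature file shares the trip point's prefix.
      temp_file="${type_file%_type}_temp"
      critical=$(cat "$temp_file")
      echo "$((current / 1000)) $((critical / 1000))"
      return 0
    fi
  done
  return 1
}

# Usage on a real system (not executed here):
#   zone_headroom /sys/class/thermal/thermal_zone3
```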


Unfortunately this didn't work for me. I am happy it worked for you.
Looking further.

The Arch Linux thread reports that the issue is understood and has been reported, and that a patch is already in the works:

https://lore.kernel.org/linux-pm/202109 … nel.org/T/

Antoine Tenart wrote:

What happens is this drivers uses a global variable to keep track of the tcc offset (tcc_offset_save) and uses it on resume. The issue is this variable is initialized to 0, but is only set in tcc_offset_degree_celsius_store, i.e. when the tcc offset is explicitly set by userspace. If that does not happen, the resume path will set the offset to 0 (in my case the h/w default being 3, the offset would become too low after a suspend/resume cycle).

However, there is no workaround for now.
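The mechanism described in the quote can be illustrated with a toy model. This is not the kernel code, only a sketch of the described behaviour; the variable hw_offset and the function names are made up, while tcc_offset_save and the default of 3 come from the quote:

```shell
# Toy model of the behaviour described in the quote; not the driver itself.
hw_offset=3          # hardware default from the quoted example
tcc_offset_save=0    # driver's saved value, initialised to 0

# Models tcc_offset_degree_celsius_store: userspace sets the offset explicitly,
# which is the only place the saved value is updated.
store_offset() {
  tcc_offset_save="$1"
  hw_offset="$1"
}

# Models the resume path: the saved value is written back unconditionally.
resume() {
  hw_offset="$tcc_offset_save"
}

# Without any explicit store, a suspend/resume cycle drops the offset to 0:
#   resume; echo "$hw_offset"
```

In the model, the offset only survives a resume if it was stored explicitly beforehand, which matches the quoted explanation of why the bug appears after a suspend/resume cycle.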

Actually, someone reported that switching to Linux kernel 4.19 solves the problem.

I tried setting

boot.kernelPackages = pkgs.linuxPackages_4_19;

However, with this kernel set, the system no longer starts my desktop (KDE Plasma) on reboot but stays in a plain terminal.

The problem is now fixed in the latest Linux kernel.

I am very happy that I can use sleep mode again.
