Maybe you can help me inspect my unstable system. Since yesterday, my NixOS machine has crashed three times.
From my perspective, the crashes happened at completely random times.
In addition, the crash is immediate: the system restarts directly, without any delay, and is back after about 30 seconds in total. Of course, everything is gone after such a restart.
Which logs can I inspect to understand better what is going on?
Any other hints and ideas?
I'm using NixOS 21.05.2693.d5aadbefd65 (Okapi).
There are at least two monitoring utilities in nixpkgs that can display your CPU temps, bpytop and gotop.
There are several others that can show fan speed, but I haven’t used any.
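If you'd rather not install anything, the kernel also exposes temperatures directly under /sys/class/thermal. Here is a minimal sketch that prints every thermal zone, assuming your machine exposes the usual thermal_zone*/type and thermal_zone*/temp files (values in millidegrees Celsius). The function name read_thermal_zones is just something I made up for this example.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return a list of (zone name, sensor type, temperature in degrees C)."""
    readings = []
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            sensor_type = (zone / "type").read_text().strip()
            # The kernel reports the temperature in millidegrees Celsius.
            millidegrees = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            # Some zones cannot be read or return no valid value; skip them.
            continue
        readings.append((zone.name, sensor_type, millidegrees / 1000.0))
    return readings


if __name__ == "__main__":
    for name, sensor_type, celsius in read_thermal_zones():
        print(f"{name} ({sensor_type}): {celsius:.1f} C")
```

Running it in a loop (for example with watch) while the machine is under load should show whether any zone climbs shortly before a crash.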
Oh, and it might be worth checking whether you have powerManagement.cpuFreqGovernor set to performance and, if so, changing it to ondemand or powersave instead, e.g. powerManagement.cpuFreqGovernor = lib.mkDefault "powersave"; in either configuration.nix or hardware-configuration.nix.
I actually ran into this issue on my wife's laptop the other day as well. The strange thing is, I was monitoring the temps with btm and they never actually crossed the threshold as far as I could tell. I was able to manually disable the trigger at runtime by modifying an option under /sys for the thermal device. Can't remember the exact option name atm. I'll see if I can update this post later, after I get a chance to review my shell history on her laptop.
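To compare the current temperature against the thresholds the kernel is actually using, something like the sketch below might help. It assumes the thermal zone directory contains trip_point_N_temp and trip_point_N_type files, which is the common sysfs layout but may differ on your hardware. This only reads the trip points; it is not the option I changed, and read_trip_points is a made-up name.

```python
from pathlib import Path


def read_trip_points(zone_dir):
    """Return a list of (trip type, temperature in degrees C) for one zone."""
    trips = []
    for temp_file in sorted(Path(zone_dir).glob("trip_point_*_temp")):
        # trip_point_0_temp has a sibling file trip_point_0_type.
        type_file = temp_file.with_name(temp_file.name.replace("_temp", "_type"))
        try:
            trip_type = type_file.read_text().strip()
            millidegrees = int(temp_file.read_text().strip())
        except (OSError, ValueError):
            continue
        trips.append((trip_type, millidegrees / 1000.0))
    return trips


if __name__ == "__main__":
    # Assumption: zone 0 is the interesting one; adjust the path as needed.
    for trip_type, celsius in read_trip_points("/sys/class/thermal/thermal_zone0"):
        print(f"{trip_type}: {celsius:.1f} C")
```

A "critical" trip point that sits unexpectedly low would match what I saw: a shutdown even though the reported temperature looked harmless.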
What happens is that this driver uses a global variable (tcc_offset_save) to keep track of the TCC offset and restores it on resume. The issue is that this variable is initialized to 0 and is only set in tcc_offset_degree_celsius_store, i.e. when the TCC offset is explicitly set by userspace. If that never happens, the resume path sets the offset to 0 (in my case the h/w default is 3, so the offset becomes too low after a suspend/resume cycle).
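To make the sequence easier to follow, here is a toy model of that logic in Python. It is not the kernel code, just a sketch of the behaviour described above: the class TccDriverModel and its method names are invented for illustration, and only tcc_offset_save and the role of tcc_offset_degree_celsius_store are taken from the driver.

```python
class TccDriverModel:
    """Toy model of the save/restore logic described above (not kernel code)."""

    def __init__(self, hw_default_offset):
        # Offset currently programmed in the hardware (firmware default).
        self.hw_offset = hw_default_offset
        # The driver's global starts at 0, regardless of the hardware value.
        self.tcc_offset_save = 0

    def store_offset(self, value):
        """Models tcc_offset_degree_celsius_store: userspace sets the offset."""
        self.tcc_offset_save = value
        self.hw_offset = value

    def resume(self):
        """Models the resume path: write the saved value back to hardware."""
        self.hw_offset = self.tcc_offset_save


if __name__ == "__main__":
    laptop = TccDriverModel(hw_default_offset=3)
    laptop.resume()  # userspace never wrote the offset
    print("offset after resume:", laptop.hw_offset)
```

With a hardware default of 3 and no write from userspace, the model ends up with an offset of 0 after resume. Calling store_offset(3) before suspending keeps the offset at 3 in the model, which suggests explicitly setting the offset from userspace as a possible workaround.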