I’m a long-time NixOS user and really love using it on my laptop, desktop and servers. However, I do not consider myself an expert, or anywhere near one. This is why I’m asking here, in the hope that someone can help me figure this out.
Ever since I began using NixOS on my desktop and laptop, both systems have become unresponsive under disk I/O. For example, the system is basically unusable while nixos-rebuild switch or home-manager switch is running, while Steam games are updating, or while nix-collect-garbage runs. Whenever my HDD LED stays lit for a longer period, the whole system freezes every few seconds for anywhere from one to ~30 seconds. Even the mouse pointer stops moving and sound output cuts out.
I have tried to debug this: I looked at journalctl, tweaked my mount options, and googled and read pretty much everything I could find on issues like this, but to this day I have not figured it out.
My NixOS config can be found here: https://git.sr.ht/~martinimoe/nixos-config/tree
The affected hosts are the ones with graphical user interfaces (or at least that’s where I notice it): galactica and omnissiah. I am using btrfs on these hosts, by the way (maybe that matters?).
I’d be really happy and thankful if someone here has any ideas on what could cause this problem, or on how to debug it and narrow it down.
It might be worth checking your hardware and monitoring resource usage (e.g. run htop and keep it open), at least to rule that possibility out.
I have similar symptoms on devices with few available resources, but there it’s kind of expected.
Also, before trying the noatime option, I had some reduction on the freezing when I removed read and write work queues from the encrypted drive. For this, see: dm-crypt/Specialties - ArchWiki
Additionally, this sounds a lot like thrashing. You wouldn’t really expect that to look like this on modern systems, but with a particularly full SSD nearing the end of its life and little RAM… Keeping an eye on memory use with htop or btm sounds like a great idea. Also double check your backups.
linux_xanmod with its realtime scheduling might also hide the effects of IO work a little better from humans, if this turns out to be safe to ignore.
Edit: You’re using a swapfile-on-btrfs-on-LUKS? That might be worth double checking. clamav might also be a culprit, especially coupled with little RAM and an… interesting swapfile. clamav is particularly memory hungry.
Thank you all so much for your answers! I really appreciate it <3
Let’s go through your ideas:
I will keep btm running on my next rebuild and watch memory usage
I already have noatime as a mount option
I set boot.initrd.luks.devices.<name>.bypassWorkqueues to true for my LUKS device
I disabled ClamAV
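In configuration terms, the three changes above amount to roughly the following. This is a sketch, not a copy of the linked repository: the LUKS mapping name `cryptroot` and the choice of `/` as the mount point are placeholders, and disabling both the ClamAV daemon and its updater is an assumption about which ClamAV services were enabled.

```nix
{
  # "cryptroot" is a placeholder; substitute the name of your own LUKS mapping.
  # This skips dm-crypt's read and write work queues for that device.
  boot.initrd.luks.devices."cryptroot".bypassWorkqueues = true;

  # The "/" mount point is an assumption; repeat for each affected btrfs mount.
  fileSystems."/".options = [ "noatime" ];

  # Turn off the ClamAV scanner daemon and the signature updater.
  services.clamav.daemon.enable = false;
  services.clamav.updater.enable = false;
}
```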
Now I will test some heavy I/O tasks and give feedback here.
As for the swap file: that also sounds like a really good explanation! Unfortunately, I am not sure how I can easily test this without repartitioning my disk, but I will go for it if the other ideas do not help!
If possible, also check your drive’s health and specs (read/write speed). If the drive is not “good enough”, swap I/O may suffer and not deliver the expected performance.
You also don’t need to repartition your disks, though it is still a good idea if you can confirm that swap helps performance and you want a clean install.
They’re already using a swapfile. Using a swapfile on btrfs has significant limitations, and you should really consider those before using one with this filesystem. That’s why I’m suggesting not doing that.
If I read it correctly it’s also a really tiny swapfile (1KB?), so I’m not sure it serves any useful purpose. I think it’s worth repartitioning and having a dedicated partition for this if you really want swap (it’s useful for sleep, but you need more than 1KB).
That said, just running without any swap at all (just use swapoff <filename>) could also confirm or deny this being the culprit.
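To make the no-swap test survive a reboot, or to switch to a dedicated partition later, the NixOS side could look roughly like this. It is only a sketch: the partition label `swap` is a placeholder, and option A assumes the existing swapfile entry has been removed from the config, since list definitions from several modules are merged. When reading the current entry, note that `size` on a swapfile is given in megabytes.

```nix
{
  # Option A: declare no swap at all, so nothing is activated at boot.
  swapDevices = [ ];

  # Option B (use instead of A): a dedicated swap partition.
  # The label "swap" is a placeholder for however the partition is labelled.
  # swapDevices = [ { device = "/dev/disk/by-label/swap"; } ];
}
```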
Swap does have some purposes besides serving as a dumping ground for memory pages that don’t fit (e.g. the kernel sometimes uses it to store parsed data in its binary representation so parsing can be skipped down the line without taking up memory), but its performance impact if you have sufficient memory is incredibly minor.