Hey, my system is very slow after booting up, and with some uptime it actually gets way faster/more responsive. I’m not using any DE, and even logging in straight to the console is quite slow, as is starting my window manager (Hyprland). Then starting Firefox or kitty takes several seconds.
No idea where to look.
It’s not CPU usage. Also, after an app starts, it’s snappy and smooth. It’s just the spawning of new processes that is super slow. (Just now I booted and Firefox took 15 s to start.)
What is even weirder is that if I reboot several times, it’s actually way faster right after the reboot. I don’t use this computer that often, and I would swear that when I get back to it after a few days it’s slower than if I boot it after an hour (which seems crazy). It’s on ZFS, and I was thinking that maybe the scrubbing which starts right after boot, after the computer has been off for some time, has something to do with it, but the I/O load seems pretty insignificant.
Have you checked your storage system for errors? Do you see any indication of IO problems in the journal?
I’m relatively new to using ZFS, but I do not recall seeing it do scrubbing right after boot; is that something configurable? (My scrub seems to happen every Sunday when I first use the system.)
You also may be seeing some catch-up of ZFS snapshots if you have that enabled. This should be apparent in the journal.
In fact, I’d expect you to see a lot of activity in the journal (journalctl -f) when you encounter the slowdowns. That might even exacerbate the problem.
Performance would improve after running for a while as the ZFS cache gets populated. But it’d also improve if whatever is hogging IO just goes away.
Do you have swap active? If so, can you turn it off? Do you have enough RAM to use ZFS? (I think it wants about 1 GB of RAM per TB of storage.)
I expect there are top-like tools to monitor filesystem IO, but I do not know them (yet). Hopefully you will find some useful clues in the journal.
Yes, if you’re running a zfs scrub it can slow down the system. Scrub is supposed to use low-priority IO and pause in the face of other IO contention, but it can still have a significant impact. Perhaps especially right after boot, when there may still be lots of processes starting up (and the things you’re starting, like firefox, are starting for the first time with cold caches).
The multiple-reboot thing is hard to explain normally, but it makes sense with a zfs scrub as the culprit (or at least a contributor):
I assume you’re using the services.zfs.autoScrub config to run regular scrubs.
This job includes a Persistent = "yes" config option that makes systemd run it on startup (or resume from suspend) if the timer would have elapsed in the interim. That’s why you see it if the computer was not on for some time.
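For reference, here’s a minimal sketch of that configuration (assuming the stock NixOS module; the generated timer unit name `zfs-scrub` is my recollection and worth verifying with `systemctl list-timers`):

```nix
{
  # Enables a periodic scrub via a systemd timer. The generated timer
  # carries Persistent = "yes", so a run missed while the machine was
  # off fires shortly after the next boot.
  services.zfs.autoScrub = {
    enable = true;
    interval = "Sun, 02:00";  # the module's default schedule
  };
}
```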
The scrub takes some time. If you reboot again quickly, before it finishes, the pool will resume scrubbing when imported on the next boot. (If the systemd job fires again too, it will fail, because the pool is already scrubbing.)
But after the time taken by a few reboots, and the accompanying slowness, the scrub has likely finished, and later boots will not be impacted.
You might also be using the related service to trim the pool. By default, the timers of these are somewhat unlikely to collide, at least with small pools, but they are both persistent and will collide on a bootup, potentially adding more IO contention.
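The trim counterpart looks like this (a sketch; I believe `services.zfs.trim` is enabled by default on NixOS, so you may be running it without having asked for it):

```nix
{
  # Scheduled TRIM of the pool. Also persistent, so after a gap in
  # uptime it can fire on the same boot as the catch-up scrub.
  services.zfs.trim = {
    enable = true;
    interval = "weekly";
  };
}
```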
Here are some commands you can use to investigate and confirm:
`zpool iostat -lv 5` will show you detailed IO stats and transaction wait times for the pool at 5 s intervals, in particular the time spent in various kinds of wait, including scrub and trim operations. Looking not only at the scrub wait, but at the impact on the other wait times when a scrub is or is not running, or when app startup is or is not slow, may be quite informative.
`zpool wait -t scrub <poolname> 60` can be handy to run in a shell window, to monitor when the scrub is done and compare against the rest of the system’s responsiveness (but beware of confirmation bias).
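A quick one-off check is also worth having in your pocket (`<poolname>` is a placeholder for your pool):

```shell
# Shows "scrub in progress" with scanned/issued totals and an estimated
# completion time, or "scrub repaired ..." once it has finished.
zpool status <poolname>
```

Running that right after a slow boot, before anything else, should settle whether a scrub is actually active at the time you feel the slowdown.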
As for recommendations:
if you are using it, don’t use the scheduled trim service; instead set the autotrim pool property to continually trim blocks in the background as they’re freed. The main reason not to do this is for systems with periods of high peak load that want to defer the work entirely until later, and that are running all the time (i.e., servers). Trim is also only really relevant for SSDs; you didn’t say, but you will feel the impact of a scrub much more severely on an HDD pool. You’ll also hear it, to the point where, if you were on an HDD, I suspect you probably wouldn’t need to ask the question.
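Enabling continuous trim is a one-time pool property change (run as root; `<poolname>` is a placeholder):

```shell
# Trim freed blocks continuously in the background instead of in a
# scheduled batch. The property is stored in the pool and persists
# across reboots and imports.
zpool set autotrim=on <poolname>

# Confirm the setting took:
zpool get autotrim <poolname>
```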
Unfortunately the persistent setting for these services isn’t configurable from the standard NixOS options, so it’s hard to avoid the thundering herd on boot/resume. You could change the timer to run when the system is more likely to be active, but of course that makes you more likely to feel the impact, even if the impact is smaller once the system is running in a steady state.
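That said, even without a dedicated option you can usually override the generated timer units directly. This is a sketch, and it assumes the units are named `zfs-scrub` and `zfs-trim` (check with `systemctl list-timers` before relying on it):

```nix
{ lib, ... }:
{
  # Drop the catch-up behaviour: a run missed while the machine was off
  # no longer fires at the next boot, it just waits for the next
  # scheduled OnCalendar time.
  systemd.timers.zfs-scrub.timerConfig.Persistent = lib.mkForce false;
  systemd.timers.zfs-trim.timerConfig.Persistent = lib.mkForce false;
}
```

The tradeoff is that on a machine that is rarely on at the scheduled time, scrubs may simply never happen, so only do this if you’re also adjusting the schedule or triggering scrubs manually.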
You can certainly try moving the timer or disabling the service for a while and just observing what difference it makes, and then plan further from there.
Depending on the kind of system it is, you could consider setting a BIOS wake timer to coincide with the scrub schedule, and let it get it out of the way as intended.
The default for the scrub timer is "Sun, 02:00", which interacts badly (though mostly harmlessly) with DST changes; consider changing it to avoid 2 am.
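For example (a sketch; the `interval` option takes a systemd calendar expression, and the time here is just an arbitrary choice outside the DST window):

```nix
{
  # Shift the weekly scrub off 2am so a DST transition can't skip or
  # double-trigger the calendar event.
  services.zfs.autoScrub.interval = "Sun, 04:00";
}
```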
Thanks both for suggestions! I didn’t have the computer at hand past few days, will try to look at journal and ZFS activity and hopefully will find out something.
Hi, I am experiencing a very similar problem across multiple NixOS devices. I am on unstable and I do not use ZFS. Because this issue occurred at the same time on both my machines, I think I can rule out a hardware issue. I thought maybe it was an issue with GTK, since some other applications work fine, but I did not find any relevant output.