I'm not sure I'd blame ZFS for this. I don't see anything in the logs you've provided to indicate that ZFS is involved in the path that leads to the mmu fault. Do you run this system with ECC? If not, can you run a memtest? Have you tried booting a live NixOS image?
mmu fault? I must be missing something, because I don’t see anything about the mmu in those screenshots. What I see is Attempted to kill init!, which happens when PID 1 tries to exit. It’s very odd considering NixOS’s stage-1-init.sh should only ever exit if you have boot.panic_on_fail in your kernel params (and even then you should see the error message of the command that failed).
You might try boot.initrd.systemd.enable = true;, since that switches to an initrd that IMO is much more robust to failures and makes it easier to diagnose them. You can set boot.initrd.systemd.emergencyAccess to a hashed password so that, when the initrd fails, you can enter that password to get an emergency shell; or you can add rd.systemd.debug_shell to the kernel params to get a debug shell on tty9. Then you can start looking at what failed with commands like systemctl status --failed and journalctl.
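A rough sketch of what that looks like in configuration.nix (the $6$… hash is only a placeholder; generate a real one with e.g. mkpasswd -m sha-512):

# switch to the systemd-based stage 1
boot.initrd.systemd.enable = true;
# either allow passwordless emergency access ...
# boot.initrd.systemd.emergencyAccess = true;
# ... or protect the emergency shell with a hashed password
boot.initrd.systemd.emergencyAccess = "$6$<placeholder-hash>";
# and/or get a debug shell on tty9 while the initrd is running
boot.kernelParams = [ "rd.systemd.debug_shell" ];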
I won’t pretend to understand memory management, but I think they are looking at the [ 328.565168] ? handle_mm_fault+0x1bd/0x2c0 line. Not sure if this actually implicates an MMU/hardware issue vs. some other software issue with paging or some such.
Skimming the first hit on Google for this (Page Tables — The Linux Kernel documentation) suggests multiple routes call this code path? Some snippets that other people who also don’t understand memory management might nevertheless find interesting and/or worth learning more about at some point:
…There are several reasons why the MMU can’t find certain translations…
…When these conditions happen, the MMU triggers page faults…
…Additionally, page faults may be also caused by code bugs or by maliciously crafted addresses that the CPU is instructed to access…
… Whatever the routes, all architectures end up to the invocation of handle_mm_fault() which, in turn…
mmu fault? I must be missing something, because I don’t see anything about the mmu in those screenshots. What I see is Attempted to kill init!, which happens when PID 1 tries to exit. It’s very odd considering NixOS’s stage-1-init.sh should only ever exit if you have boot.panic_on_fail in your kernel params (and even then you should see the error message of the command that failed).
Yeah, what people pointed out above me is what I meant. (I was AFK, had a fun weekend, not sarcasm.) My first thought was that our init went to do some syscall (can't tell which one), then in the kernel it faulted, which:
Taints the kernel
Kills init
As such it looks as if init just exited, but it's because the userspace thread was killed in the kernel by an MMU fault.
Not 100% positive on this one, but that's how I'm reading it.
ECC should have caught any HW faults then. Hm, it could theoretically be that ZFS is screwing up something in kernel space which then causes an MMU fault, but with KASLR I find it highly unlikely that it would always manifest in the same way. Is the error you get always the same? Can you reliably reproduce it? Have you by any chance disabled KASLR?
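(If you're not sure: something like grep -o nokaslr /proc/cmdline on the running system should show whether KASLR was explicitly turned off via the kernel command line; no output means it wasn't.)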
When chrooted, I was getting "permission denied" for every command I tried, including shell builtins.
After rebooting I saw this:
[root@nixos:~]# zfs list -o name,exec
NAME            EXEC
root_pool       on
root_pool/home  on
root_pool/nix   off
root_pool/root  on
root_pool/var   on
Somehow my root pool's nix dataset was set to exec=off.
I don't know how that happened.
I don't know how my old derivation was working.
Unfortunately, the garbage collector deleted that revision.
Would there be a way to check whether the nix store is executable and give a more intelligent error?
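For anyone else who ends up here: assuming the same dataset names as in the listing above, the property can be flipped back and verified with

zfs set exec=on root_pool/nix
zfs get exec root_pool/nix

(followed by a remount/reboot if the change doesn't take effect immediately).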
Another data point: I apparently also had exec=off and also ended up with a kernel panic. Other than that, we usually have different setups nowadays. Thankfully I know Richie and was able to call them quickly to figure it out since they don't have , but I don't remember setting exec=off. Something interesting: I tried to run zpool history to find out what the original setting was, and it doesn't appear to have the create statement for the dataset, but it does for the pool. Not sure if zed was having an issue or if this is expected after some time.
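(For reference, the invocation was roughly zpool history root_pool, which is supposed to list every administrative command run against the pool, including dataset creation.)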