Nix Raspberry Pi 4 images don't work on 4K monitor

I’ve spent days flashing images onto my Raspberry Pi 4 (1 GB) and I end up with one of two outcomes:

  1. an error saying the block size doesn’t match and requires an fsck (from a fresh ‘dd’ this is just bizarre).
  2. after going through the full NixOS stage 1 / 2 boot process, just as I should be getting a login shell, I get a black screen, the display says no input, and the Pi flashes the green LED twice (which apparently doesn’t correspond to any status code).

I’ve tried:
nixos-sd-image-22.05.1437.e8d47977286-aarch64-linux.img.zst (boots up to black screen)
nixos-sd-image-new-kernel-22.05.1437.e8d47977286-aarch64-linux.img.zst (gives FSCK error)
nixos-sd-image-21.11.335883.7adc9c14ec7-aarch64-linux.img (lost the link) (gives FSCK error)

I’ve flashed the raspbian image on and updated the bootloader and firmware.
No change.

Trying to build my own packages (by following 1 2 ) results in:

$ nix-build '<nixpkgs/nixos>' -A config.system.build.sdImage --argstr system aarch64-linux -I nixos-config=./pi.nix 
error: attribute 'sdImage' in selection path 'config.system.build.sdImage' not found
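
One common cause of this error is a configuration that does not import an SD image module, since that module is what defines system.build.sdImage. A hedged sketch, assuming pi.nix lacks such an import and that the nixpkgs in use ships the module at nixos/modules/installer/sd-card/sd-image-aarch64.nix (the file name sd-image-example.nix is made up for illustration):

```shell
# Write a minimal example configuration that pulls in the SD image module.
cat > sd-image-example.nix <<'EOF'
{ ... }:
{
  imports = [
    # Assumed module path; this module defines config.system.build.sdImage.
    <nixpkgs/nixos/modules/installer/sd-card/sd-image-aarch64.nix>
  ];
}
EOF

# Then build as before, pointing nixos-config at the new file:
#   nix-build '<nixpkgs/nixos>' -A config.system.build.sdImage \
#     --argstr system aarch64-linux -I nixos-config=./sd-image-example.nix
```

If the attribute is still missing with this file, the cause is something else, such as a nixpkgs channel where the module lives at a different path.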

I’m at my wits’ end.
I’m going to put Raspbian on; I’ve burnt so much time trying to get nix/nixops/etc. working.

I wanted to raise this since it seems no one is discussing it, or at least not having the same issues I am.

Are you certain the SDcard is not damaged? What dd command are you using? Does fsck complain on the machine writing the SDcard?

dd if=<image> of=/dev/sdd bs=4M conv=fsync

The SD card is fine; I’ve put Raspbian on it and it’s running with no problems.

running fsck on the sdcard immediately after dd’ing the image across (no first boot):

$ fsck.fat /dev/sdd2
fsck.fat 4.2 (2021-01-31)
Logical sector size is zero.

$ fsck.fat /dev/sdd1
fsck.fat 4.2 (2021-01-31)
There are differences between boot sector and its backup.
This is mostly harmless. Differences: (offset:original/backup)
  65:01/00
1) Copy original to backup
2) Copy backup to original
3) No action
[123?q]? 3
FATs differ - using first FAT.
Orphaned long file name part "adau1977-adc.dtbo"
1) Delete
2) Leave it
[12?q]? 2
/overlays/Ç
  Bad short file name (Ç).
1) Drop file
2) Rename file
3) Auto-rename
4) Keep it
[1234?q]? 4
/overlaysÇ
  Bad short file name Ç).
1) Drop file
2) Rename file
3) Auto-rename
4) Keep it
[1234?q]? 4
Reclaimed 1159 unused clusters (593408 bytes).
Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.
1) Remove dirty bit
2) No action
[12?q]? 2
Free cluster summary wrong (454536 vs. really 455695)
1) Correct
2) Don't correct
[12?q]? 2

*** Filesystem was changed ***
The changes have not yet been written, you can still choose to leave the
filesystem unmodified:
1) Write changes
2) Leave filesystem unchanged
[12?q]? 2
/dev/sdd1: 42 files, 60495/516190 clusters

I just tried it on a smaller monitor and it now boots to a terminal prompt.

I had been using my 4k TV as it’s the only one spare.
It seems the image has issues when it detects the higher resolution and tries to transition across.

Created an issue with nixos-hardware

Do you see any interesting failures in dmesg or journalctl -b related to the 4K failure?
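
A small sketch of how the relevant lines could be pulled out of those logs. The helper name filter_display_errors and the keyword list are assumptions chosen for illustration, not an exhaustive filter:

```shell
# Hypothetical helper: keep only log lines that mention the display
# stack (vc4, HDMI, DRM) or a kernel oops. Reads log text on stdin.
filter_display_errors() {
  grep -Ei 'vc4|hdmi|drm|null pointer|oops'
}

# Typical use on the Pi (may need root):
#   dmesg | filter_display_errors
#   journalctl -b -k --no-pager | filter_display_errors

# Demonstration on made-up sample lines: only the second one is kept.
printf '%s\n' \
  'eth0: link up' \
  'vc4-drm gpu: example display message' \
  | filter_display_errors
```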

The fsck problem is invalid. The second partition is type ext4, which will fail when checked with fsck.fat. fsck succeeds for both partitions on both image files:

eric@farm:~/tmp$ fdisk -l nixos-sd-image-22.05.1437.e8d47977286-aarch64-linux.img
Disk nixos-sd-image-22.05.1437.e8d47977286-aarch64-linux.img: 3.05 GiB, 3274612736 bytes, 6395728 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x2178694e

Device                                                   Boot Start     End Sectors Size Id Type
nixos-sd-image-22.05.1437.e8d47977286-aarch64-linux.img1      16384   77823   61440  30M  b W95 FAT32
nixos-sd-image-22.05.1437.e8d47977286-aarch64-linux.img2 *    77824 6395727 6317904   3G 83 Linux

eric@farm:~/tmp$ fdisk -l nixos-sd-image-new-22.05.1437.e8d47977286-aarch64-linux.img
Disk nixos-sd-image-new-22.05.1437.e8d47977286-aarch64-linux.img: 3.16 GiB, 3398295552 bytes, 6637296 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x2178694e

Device                                                       Boot Start     End Sectors  Size Id Type
nixos-sd-image-new-22.05.1437.e8d47977286-aarch64-linux.img1      16384   77823   61440   30M  b W95 FAT32
nixos-sd-image-new-22.05.1437.e8d47977286-aarch64-linux.img2 *    77824 6637295 6559472  3.1G 83 Linux

eric@farm:~/tmp$ sudo losetup -ro 8192K /dev/loop0 nixos-sd-image-22.05.1437.e8d47977286-aarch64-linux.img
eric@farm:~/tmp$ sudo fsck /dev/loop0
fsck from util-linux 2.37.4
fsck.fat 4.2 (2021-01-31)
/dev/loop0: 26 files, 11498/15321 clusters

eric@farm:~/tmp$ sudo losetup -ro 8192K /dev/loop1 nixos-sd-image-new-22.05.1437.e8d47977286-aarch64-linux.img
eric@farm:~/tmp$ sudo fsck /dev/loop1
fsck from util-linux 2.37.4
fsck.fat 4.2 (2021-01-31)
/dev/loop1: 26 files, 11498/15321 clusters

eric@farm:~/tmp$ sudo losetup -ro 38M /dev/loop2 nixos-sd-image-22.05.1437.e8d47977286-aarch64-linux.img
eric@farm:~/tmp$ sudo fsck -n /dev/loop2
fsck from util-linux 2.37.4
e2fsck 1.46.5 (30-Dec-2021)
NIXOS_SD: clean, 117933/197600 files, 616192/789738 blocks

eric@farm:~/tmp$ sudo losetup -ro 38M /dev/loop3 nixos-sd-image-new-22.05.1437.e8d47977286-aarch64-linux.img
eric@farm:~/tmp$ sudo fsck -n /dev/loop3
fsck from util-linux 2.37.4
e2fsck 1.46.5 (30-Dec-2021)
NIXOS_SD: clean, 118903/205200 files, 642823/819200 blocks
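
The losetup offsets above come from the partition start sectors that fdisk reports, multiplied by the 512-byte sector size. A minimal sketch of that arithmetic (the helper name sector_offset is made up for illustration):

```shell
# Hypothetical helper: convert a partition's start sector (from `fdisk -l`)
# into the byte offset that `losetup -o` expects.
sector_offset() {
  start_sector=$1
  sector_size=${2:-512}   # fdisk reported 512-byte sectors for these images
  echo $(( start_sector * sector_size ))
}

sector_offset 16384   # FAT32 partition: prints 8388608 (8192K)
sector_offset 77824   # ext4 partition: prints 39845888 (38M)
```

As an alternative to manual offsets, losetup also has a partition-scan option (-P) that asks the kernel to expose each partition of the image as its own loop device.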

Hopefully journal contains clues to decipher the failures for 4K and the GUI.

With the TV unplugged, I get a clean boot.
Then when I plug in the TV, BOOM, null pointer deref.

[  138.458344] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000348
[  138.467154] Mem abort info:
[  138.469956]   ESR = 0x96000006
[  138.473004]   EC = 0x25: DABT (current EL), IL = 32 bits
[  138.478318]   SET = 0, FnV = 0
[  138.481367]   EA = 0, S1PTW = 0
[  138.484516]   FSC = 0x06: level 2 translation fault
[  138.489401] Data abort info:
[  138.492277]   ISV = 0, ISS = 0x00000006
[  138.496107]   CM = 0, WnR = 0
[  138.499070] user pgtable: 4k pages, 48-bit VAs, pgdp=000000004a520000
[  138.505509] [0000000000000348] pgd=080000004a51c003, p4d=080000004a51c003, pud=080000004a519003, pmd=0000000000000000
[  138.516129] Internal error: Oops: 96000006 [#1] SMP
[  138.521006] Modules linked in: bcm2835_v4l2(C) bcm2835_mmal_vchiq(C) videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev snd_soc_hdmi_codec raspberrypi_cpufreq mc hci_uart snd_bcm2835(C) btqca btsdio btbcm bluetooth brcmfmac brcmutil ip6_tables cfg80211 xt_conntrack nf_conntrack clk_raspberrypi ecdh_generic nf_defrag_ipv6 nf_defrag_ipv4 crct10dif_ce raspberrypi_hwmon broadcom bcm_phy_lib reset_raspberrypi rfkill ecc 8021q genet garp mrp iproc_rng200 i2c_bcm2835 rng_core pwm_bcm2835 xt_tcpudp bcm2711_thermal mdio_bcm_unimac vchiq(C) nvmem_rmem ip6t_rpfilter uio_pdrv_genirq uio ipt_rpfilter xt_pkttype nft_compat nft_counter nf_tables libcrc32c sch_fq_codel nfnetlink zfs(PO) zunicode(PO) zzstd(O) zlua(O) zcommon(PO) znvpair(PO) zavl(PO) icp(PO) spl(O) tap macvlan bridge stp llc fuse ip_tables x_tables xhci_pci xhci_pci_renesas vc4 cec drm_kms_helper drm pcie_brcmstb dm_mod
[  138.600287] CPU: 3 PID: 830 Comm: irq/53-vc4 hdmi Tainted: P         C O      5.15.50 #1-NixOS
[  138.608895] Hardware name: Raspberry Pi 4 Model B Rev 1.2 (DT)
[  138.614720] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  138.621676] pc : vc4_hdmi_enable_scrambling+0x44/0x268 [vc4]
[  138.627358] lr : vc4_hdmi_connector_detect+0x68/0x250 [vc4]
[  138.632942] sp : ffff800008e93980
[  138.636247] x29: ffff800008e93980 x28: 0000000000000010 x27: 0000000000001e00
[  138.643380] x26: 0000000000000002 x25: 0000000000000000 x24: 0000000000000000
[  138.650512] x23: 0000000000000002 x22: ffffd2ae6fe3ac30 x21: 0000000000000001
[  138.657643] x20: ffff322708567080 x19: ffff3227085674f8 x18: 0000000000000000
[  138.664773] x17: 58b0805a70f23000 x16: ffffd2aec012cdd0 x15: 1d0100000183010f
[  138.671904] x14: 06e3ff00e20c0000 x13: 06e3ff00e20c0000 x12: 0fe401c305e30380
[  138.679035] x11: 091300963e10102d x10: 00000000000000b1 x9 : ffffd2ae6fe28b58
[  138.686166] x8 : 000000000000000d x7 : 0000000000000100 x6 : 0000000000000000
[  138.693297] x5 : 0000000000000000 x4 : ffff32273fbdb280 x3 : 00000000000018af
[  138.700428] x2 : 00000000000003e8 x1 : 000000001443fd00 x0 : 0000000000000000
[  138.707560] Call trace:
[  138.709999]  vc4_hdmi_enable_scrambling+0x44/0x268 [vc4]
[  138.715324]  vc4_hdmi_connector_detect+0x68/0x250 [vc4]
[  138.720558]  drm_helper_probe_detect+0xb4/0xe0 [drm_kms_helper]
[  138.726540]  drm_helper_probe_single_connector_modes+0x5f8/0x768 [drm_kms_helper]
[  138.734055]  drm_client_modeset_probe+0x240/0x10c8 [drm]
[  138.739477]  __drm_fb_helper_initial_config_and_unlock+0x50/0x520 [drm_kms_helper]
[  138.747082]  drm_fb_helper_hotplug_event.part.0+0xd8/0xe8 [drm_kms_helper]
[  138.753988]  drm_fbdev_client_hotplug+0x44/0x1c0 [drm_kms_helper]
[  138.760113]  drm_client_dev_hotplug+0x88/0xd8 [drm]
[  138.765065]  drm_kms_helper_hotplug_event+0x3c/0x50 [drm_kms_helper]
[  138.771452]  vc4_hdmi_hpd_irq_thread+0x30/0x40 [vc4]
[  138.776428]  irq_thread_fn+0x34/0xa8
[  138.779999]  irq_thread+0x154/0x2d0
[  138.783480]  kthread+0x128/0x138
[  138.786701]  ret_from_fork+0x10/0x20
[  138.790274] Code: f9402a60 52807d02 529fa001 72a28861 (f941a400) 
[  138.796361] ---[ end trace c7bf4b187d61f972 ]---
[  138.801029] genirq: exiting task "irq/53-vc4 hdmi" (830) is an active IRQ thread (irq 53)

Once this occurs, sudo reboot no longer seems to reboot the system.
Specifically, the ssh session hangs and the system never becomes responsive to ssh again, which makes me think it’s unable to finish shutting down.

I captured another set of logs that are from a boot-up with the TV plugged in, but in standby.
The issue still occurs in this situation, which suggests to me that there is still HDMI communication going on even in standby.

Full dmesg / journalctl logs are in a GitHub gist.

This is turning into a fun rabbit hole. :) Do you intend to run ZFS on your RPi? (Exploring ZFS on RPi is on my personal project list.) You might have to disable ZFS to try newer kernels.

The crash log looks like the kernel is evaluating the newly connected video hardware. The NULL dereference clearly is a kernel bug. All this could be fixed in a newer kernel. Can you try building the image from nixos-unstable or with boot.kernelPackages = pkgs.linuxPackages_latest; to get kernel 5.18?
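
A minimal sketch of that kernel override, written as a separate module file. The file name kernel-latest.nix is hypothetical, and lib.mkForce is used on the assumption that another imported module may already set boot.kernelPackages:

```shell
# Write a small NixOS module that selects the newest kernel package set.
cat > kernel-latest.nix <<'EOF'
{ lib, pkgs, ... }:
{
  # Replace whatever kernel other modules selected with the newest
  # package set available in the nixpkgs being used.
  boot.kernelPackages = lib.mkForce pkgs.linuxPackages_latest;
}
EOF

# Add ./kernel-latest.nix to the `imports` list of the main configuration,
# then rebuild the image or the running system.
```

On a system that already boots, running nixos-rebuild boot and then rebooting should switch kernels without reflashing the SD card.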

I particularly like this boot message

[ 10.773827] vchiq: module is from the staging directory, the quality is unknown, you have been warned.

I found this informative Stack Exchange article: What is /dev/vchiq in Raspberry Pi?

I am not familiar with how modern RPi gets GPU firmware updates. If it is not part of the SoC firmware (loaded from /boot), then checking for newer GPU firmware may be fruitful.

The same crash occurs at 13.583324 when booting with the TV in standby, at the point where the kernel switches from its internal video frame buffer to using the RPi graphics.

Other than trying different kernels, I’m not sure what more to do with this. You might look at what kernel version works for Raspbian or Ubuntu.

My guess is that something related to the hdmi is not initialized properly. Possibly there are RPi boot config parameters which could help. Comparing contents of /boot/firmware/config.txt (and files it includes) between working and non-working OS images might reveal more to try. I’d like to know if you find anything useful.
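
One way to do that comparison, sketched under the assumption that both config.txt files have been copied somewhere readable. The helper names and the example paths are hypothetical:

```shell
# Hypothetical helper: drop comment lines and blank lines so that only
# effective settings (and section filters such as [pi4]) remain.
normalize_config() {
  grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1"
}

# Hypothetical helper: unified diff of two normalized config files.
# Returns diff's exit status (0 = same settings, 1 = differences).
compare_configs() {
  a=$(mktemp) && b=$(mktemp) || return 2
  normalize_config "$1" > "$a"
  normalize_config "$2" > "$b"
  diff -u "$a" "$b"
  status=$?
  rm -f "$a" "$b"
  return "$status"
}

# Example call with made-up paths:
#   compare_configs raspbian/config.txt nixos/config.txt
```

Lines prefixed with + or - in the output are settings present in only one of the two images, which narrows down what to experiment with.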

This unrelated GitHub discussion suggests there are many things to tweak for the RPi hdmi connection.

I’m not going to run ZFS, no.

Looks like the kernels are:
raspbian: 5.15.30
NixOS 22.05: 5.15.32.

I would assume that the raspbian image has more pi related patches installed.

There is mention in the change notes above of “Remove 4kp60 option from Raspberry Pi Configuration”, but I can see nothing that screams “bug fix for 4K”.

Re: using latest kernel.
I’m using the RPi4 Nix configs from nixos-hardware.
Specifically:

  imports = [
    "${builtins.fetchGit { url = "https://github.com/NixOS/nixos-hardware.git"; }}/raspberry-pi/4"
    ...
  ];

This appears to be an alias for linuxKernel.kernels.linux_rpi4.

I’ve gotten a response on the GH issue stating that that particular kernel module has had some updates.

There doesn’t seem to be a “latest” for this, and I suspect using vanilla latest would cause other issues.
I’m not versed enough to switch nixops over to unstable without doing the same for my host system. Nor have I had much luck applying kernel patches in the past.

NixOS used to build images using linuxPackages_rpi4, but they were dropped in favor of the ones using the mainline kernel. For what it’s worth (which isn’t much), I don’t encounter this issue using linuxPackages_rpi4, but I’m also not using any graphical programs (I just have it connected to my monitor in case I need to troubleshoot it).

This change is a very strong candidate for fixing the failure we’ve been discussing. Unfortunately it does not appear until kernel 5.19, so that could be a while.

I would try using older kernels, hoping that one predates the introduction of the bug in vc4_hdmi_enable_scrambling(). Maybe experiment with linuxKernel.kernels.linux_5_10, linuxKernel.kernels.linux_5_4 and linuxKernel.kernels.linux_4 in your NixOS config. You also might be able to try 5.18 this way.

With all the changes to that area of the code, it’s entirely possible one of these older kernels will work for you while you wait for 5.19 to land in NixOS.

Also, if you have not yet tried this, run the build where you get the blank screen and let it sit for 20 minutes to see if it begins working. Some of the 4k display problems worked this way. (Although they did not involve a bad dereference, so this may not work for you, but the cost of trying it is low.)

run the build where you get the blank screen and let it sit for 20 minutes to see if it begins working

I’ve let it sit for a long time, no dice.

I would try using older kernels, hoping that one predates introducing the bug.

I’ll see if I get time to try. I’m going to target this as “not for my TV use-case”.

Thanks for your insight and help in identifying the problem =)

I get the same symptoms, except it doesn’t seem to work with a smaller screen either.

Is there a way to test with a different kernel without having to rebuild the whole image?