Hibernate e820 issue, don't know how to continue investigating

TLDR: My system fails to resume from hibernation because the e820 memory map changes between boots and the kernel forbids resuming from hibernation with inconsistent e820 memory maps. Other distros can hibernate and resume from hibernation with the same setup, and I don’t know what they’re doing to get around the e820 memory map issue.

The issue

I can hibernate, but about half the time resuming from hibernation fails, with a message like this in my journald logs:

PM: Loading and decompressing image data (XXX pages)...
Hibernate inconsistent memory map detected!
PM: hibernation: Image mismatch: architecture specific data
PM: hibernation: Read XXX kbytes in XXX seconds (XXX MB/s)
PM: Error -1 resuming
PM: hibernation: Failed to load image, recovering.
PM: hibernation: Basic memory bitmaps freed
OOM killer enabled.
Restarting tasks ... done.
PM: hibernation: resume failed (-1)

This happens on multiple machines with different NixOS configs, all of them using full disk encryption and ZFS.
Looking further into the journald logs, I can see that the e820 memory map changes between boots: if I collect all log lines that mention “e820” from the boot before and the boot after the failed resume attempt and diff them, I see a few slight differences, like this:

@@ -11,10 +11,10 @@ BIOS-e820: [mem 0x0000000076970000-0x0000000076970fff] reserved
 BIOS-e820: [mem 0x0000000076971000-0x0000000076971fff] usable
 BIOS-e820: [mem 0x0000000076972000-0x00000000803fffff] reserved
 BIOS-e820: [mem 0x00000000ff010000-0x00000000ff04ffff] reserved
-BIOS-e820: [mem 0x0000000100000000-0x000000087def1fff] usable
-BIOS-e820: [mem 0x000000087def2000-0x000000087def2fff] reserved
-BIOS-e820: [mem 0x000000087def3000-0x000000087def4fff] ACPI data
-BIOS-e820: [mem 0x000000087def5000-0x000000087dfebfff] usable
+BIOS-e820: [mem 0x0000000100000000-0x000000087dee3fff] usable
+BIOS-e820: [mem 0x000000087dee4000-0x000000087dee4fff] reserved
+BIOS-e820: [mem 0x000000087dee5000-0x000000087dee6fff] ACPI data
+BIOS-e820: [mem 0x000000087dee7000-0x000000087dfebfff] usable
 BIOS-e820: [mem 0x000000087dfec000-0x000000087dfecfff] ACPI data
 BIOS-e820: [mem 0x000000087dfed000-0x000000087efe4fff] usable
 BIOS-e820: [mem 0x000000087efe5000-0x000000087efe6fff] reserved
@@ -45,8 +45,8 @@ BIOS-e820: [mem 0x000000087f7e5000-0x000000087f7e6fff] reserved
 BIOS-e820: [mem 0x000000087f7e7000-0x000000087f7f5fff] usable
 BIOS-e820: [mem 0x000000087f7f6000-0x000000087f7f7fff] reserved
 BIOS-e820: [mem 0x000000087f7f8000-0x000000087fbfffff] usable
-e820: update [mem 0x87def5018-0x87df04e57] usable ==> usable
-e820: update [mem 0x87def5018-0x87df04e57] usable ==> usable
+e820: update [mem 0x87c5c5018-0x87c5d4e57] usable ==> usable
+e820: update [mem 0x87c5c5018-0x87c5d4e57] usable ==> usable
 e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
 e820: remove [mem 0x000a0000-0x000fffff] usable
 e820: update [mem 0x87df4e000-0x87df9cfff] usable ==> reserved
@@ -55,8 +55,8 @@ e820: reserve RAM buffer [mem 0x76955000-0x77ffffff]
 e820: reserve RAM buffer [mem 0x7696b000-0x77ffffff]
 e820: reserve RAM buffer [mem 0x76970000-0x77ffffff]
 e820: reserve RAM buffer [mem 0x76972000-0x77ffffff]
-e820: reserve RAM buffer [mem 0x87def2000-0x87fffffff]
-e820: reserve RAM buffer [mem 0x87def5018-0x87fffffff]
+e820: reserve RAM buffer [mem 0x87c5c5018-0x87fffffff]
+e820: reserve RAM buffer [mem 0x87dee4000-0x87fffffff]
 e820: reserve RAM buffer [mem 0x87df4e000-0x87fffffff]
 e820: reserve RAM buffer [mem 0x87dfec000-0x87fffffff]
 e820: reserve RAM buffer [mem 0x87efe5000-0x87fffffff]
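
For anyone who wants to produce the same kind of diff, here is a minimal Python sketch of the comparison described above. It assumes the kernel messages of both boots are still in the journal; the function names (kernel_log, e820_lines, diff_e820) are made up for this example.

```python
import difflib
import subprocess


def kernel_log(boot: str) -> str:
    """Return the kernel messages of one boot, e.g. boot="-1" for the previous boot."""
    result = subprocess.run(
        ["journalctl", "-k", f"--boot={boot}", "-o", "cat", "--no-pager"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def e820_lines(log_text: str) -> list[str]:
    """Keep only the log lines that mention e820."""
    return [line for line in log_text.splitlines() if "e820" in line]


def diff_e820(before: str, after: str) -> list[str]:
    """Unified diff of the e820-related lines of two kernel logs."""
    return list(difflib.unified_diff(
        e820_lines(before), e820_lines(after),
        fromfile="before", tofile="after", lineterm="",
    ))
```

Something like print("\n".join(diff_e820(kernel_log("-2"), kernel_log("-1")))) should then print the differences between the two most recent completed boots; the boot offsets need to be adjusted to whichever boots surround the failed resume.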

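From the description of the patch “PM / hibernate: Verify the consistent of e820 memory map by md5 digest”, I understand that the kernel aborts the resume with exactly this error message when the e820 memory map seen at resume time does not match the one recorded in the hibernation image. I therefore believe that this is what is happening here.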
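
To illustrate why even a small shift in the map is fatal for such a check, here is a rough Python sketch of the idea behind a digest comparison. It is only an illustration: the packing format and the helper name e820_digest are my own assumptions, and the kernel's real in-memory table layout and hashing code are not reproduced here. The entries are taken from the diff above (type 1 = usable, 2 = reserved, 3 = ACPI data).

```python
import hashlib
import struct

# (start, size, type) triples; type 1 = usable RAM, 2 = reserved, 3 = ACPI data
Entry = tuple[int, int, int]


def e820_digest(entries: list[Entry]) -> str:
    """Hash a memory map; any changed boundary yields a different digest."""
    h = hashlib.md5()
    for start, size, kind in entries:
        # Illustrative packing only, not the kernel's struct layout.
        h.update(struct.pack("<QQI", start, size, kind))
    return h.hexdigest()


# The three entries that moved between the two boots in the diff above.
boot_a = [
    (0x100000000, 0x87DEF2000 - 0x100000000, 1),
    (0x87DEF2000, 0x1000, 2),
    (0x87DEF3000, 0x2000, 3),
]
boot_b = [
    (0x100000000, 0x87DEE4000 - 0x100000000, 1),
    (0x87DEE4000, 0x1000, 2),
    (0x87DEE5000, 0x2000, 3),
]

# True when the two maps are byte-for-byte identical under this packing.
maps_match = e820_digest(boot_a) == e820_digest(boot_b)
```

Here maps_match ends up False: the reserved and ACPI data regions moved by only 0xe000 bytes, but a digest comparison has no notion of "close enough".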

What happens on other distros I’m running: On the same machine I can run Debian, and have never observed an issue resuming from hibernation. I have another machine (different brand) running Arch, and also never ran into this issue. On these distros (and on NixOS when resuming happens to succeed), whenever I resume from hibernation, journald treats it as the same boot as the one before hibernation, and doesn’t show any extra e820 related log lines.

However, on other distros and other machines, whenever I reboot, I can see that the e820 map changes slightly in the same fashion. So it appears that those other distros are doing something to prevent the issue I’m running into, but because of the lack of boot-related logging on successful resumptions from hibernation, I don’t know what it is.

Conceptual questions

Instead of just digging deeper or trying more things, I would like to understand things. (And having the answers written up in the same place would help people running into related issues in the future.)

  • Is the e820 memory map supposed to change between boots? It seems to be normal, but I don’t know why it should happen. (I’ve heard of kernel address space layout randomization (KASLR), but I don’t see why that should affect the e820 memory map, because as far as I understand it, the map is provided by the firmware to the OS.)
  • If the e820 memory map does change between boots, what are other distros doing to resume from hibernation anyway? Are they somehow making the e820 maps match when resuming from hibernation, or are they somehow making resuming from hibernation possible even with different e820 maps?
  • Is there a standard way to get the log output of the restore kernel to learn more about what it does when resuming from hibernation?
  • Slide 25 of “The e820 trap of Linux kernel hibernation” talks about the “platform” vs “shutdown” hibernate modes. My understanding is that “platform” uses platform-provided features to make resuming from hibernation easier or faster. But I don’t see why “platform” would ever be necessary; shouldn’t saving the current state, powering off, and loading that state when resuming just work reliably all the time? (On my machines, “platform” exists, but I have not observed differences between “platform” and “shutdown” behavior.)
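
Regarding the “platform” vs “shutdown” question: the available hibernate modes and the currently selected one can be read from /sys/power/disk, where the active mode is shown in square brackets. A small Python sketch for checking this on each machine follows; the helper names are made up for this example.

```python
from pathlib import Path
from typing import Optional


def parse_disk_modes(text: str) -> tuple[Optional[str], list[str]]:
    """Split the contents of /sys/power/disk into (active mode, all modes)."""
    current = None
    modes = []
    for token in text.split():
        # The active mode is wrapped in brackets, e.g. "[platform]".
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            current = token
        modes.append(token)
    return current, modes


def read_disk_modes(path: str = "/sys/power/disk") -> tuple[Optional[str], list[str]]:
    """Read and parse the hibernate modes offered by the running kernel."""
    return parse_disk_modes(Path(path).read_text())
```

Calling read_disk_modes() on each machine and distro would at least show whether they all default to the same mode, which is one cheap difference to rule out.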

Resources I’ve found, and why they don’t answer my questions

This is what I’ve found so far. If you know of any other relevant resources, please let me know!

However, on other distros and other machines, whenever I reboot, I can see that the e820 map changes slightly in the same fashion. So it appears that those other distros are doing something to prevent the issue I’m running into, but because of the lack of boot-related logging on successful resumptions from hibernation, I don’t know what it is.

I’m not sure. I just checked the kernel used by Arch Linux and couldn’t find any patches related to this. The only difference, if any, must be in the configuration, so I diffed the NixOS kernel config (linuxPackages_latest.kernel.configfile) against the Arch config, but nothing jumps out.