Pre-RFC: Verification of the ISO

ShalokShalom · March 10, 2024, 1:07pm

@ElvishJerricco brought up the idea that we could dm-verity the squashfs on the ISO, so that we are able to inform the user when an ISO has been badly written.

Besides the occasional broken download, I have personally already enjoyed rarer cases, such as a broken USB stick, an interrupted write to the stick, and (particularly lovely) a broken USB port.

All these issues can be caught by implementing a built-in verification process.
I think Fedora already has something like that in place.

I would suggest making the default boot entry run this check automatically, and eventually writing a permanent marker to the stick that confirms its validity, to speed up subsequent boots.

RaitoBezarius · March 10, 2024, 4:43pm

Awesome idea. Just to be clear, I think this does not warrant going through the tortuous RFC process (unless you want to!). I would encourage you to just send a PR and implement it.

(Maybe someone should measure the boot time impact or build time impact, but I don’t believe it’s a concern.)

Atemu · March 10, 2024, 7:38pm

As a general rule of thumb, RFCs are only necessary for setting community standards or applying widely-scoped changes with significant trade-offs where community involvement and consensus is required.

An uncontroversial improvement over the status quo in one technical detail in one isolated component does not require any such consensus nor involvement.

For the ISO, dm-verity is an appropriate tool.
For other purposes, I would like to plug “fs-verity support” (Issue #4917 in NixOS/nix on GitHub) here.

ShalokShalom · March 10, 2024, 8:00pm

Yeah, I wasn’t sure where to put this, as I wanted to look for community input first.

Thanks a lot!

Atemu · March 10, 2024, 9:32pm

Indeed, “RFC” stands for “request for comments”.

It’s easy to forget that it has a meaning besides the one we have assigned to that term over the years. We really ought to find a new term for the original meaning ^^’

nikstur · March 10, 2024, 9:52pm

Sounds like a great idea. You could use the generator I’ve written for the Nix Store: nikstur/nix-store-veritysetup-generator on GitHub, a systemd unit generator for a verity-protected Nix Store.

Note, however, that dm-verity doesn’t normally verify the whole block device upfront but instead only checks each block on access (you generally want this for performance reasons!). Failures are pretty violent, but the root cause is not necessarily any more evident to the user. Indeed, dm-verity errors just look like read failures at the fs level. In other words: dm-verity ensures integrity and authenticity but not necessarily great error reporting.

For good error reporting, you could write a systemd service that runs in the initrd, checks the Nix Store squashfs image against a hash embedded somewhere in the ISO before the image is mounted as a loop device, and reports if it doesn’t match.
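A minimal sketch of the upfront check described above, written in Python purely for illustration (a real initrd service would more likely be a small shell script or compiled helper). The function names, the choice of SHA-256, and the idea that the expected hash is handed in as a hex string are all assumptions, not part of any existing NixOS tooling.

```python
import hashlib
from pathlib import Path

# Read in 1 MiB chunks so a multi-gigabyte image never has to fit in RAM.
CHUNK_SIZE = 1024 * 1024


def sha256_of_file(path: Path) -> str:
    """Return the hex SHA-256 digest of the whole file at `path`."""
    digest = hashlib.sha256()
    with path.open("rb") as image:
        for chunk in iter(lambda: image.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_is_intact(image_path: Path, expected_hex: str) -> bool:
    """Compare the image's digest with the (hypothetical) embedded hash."""
    return sha256_of_file(image_path) == expected_hex.strip().lower()
```

A service built around this would call `image_is_intact` once before mounting and print a clear message if it returns `False`. The cost is that the entire image is read once per boot, which is exactly the work dm-verity avoids by checking lazily.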

ShalokShalom · March 11, 2024, 8:30am

Please forgive me, but why do I need great error reporting if I only care about the ISO not being damaged?

I don’t really understand your comment, but this is probably mostly due to me not having worked with dm-verity.

Atemu · March 11, 2024, 9:20am

The way dm-verity works is that, whenever a block is read, it is individually verified to be correct. If the content of that block matches the expected content, the read happens as it usually does beyond that point. Userspace is none the wiser of this happening.

If it does not match, however, it is as if the underlying device didn’t return data at all, and the read returns a generic read error to whoever initiated it.
It’s up to that party (usually userspace software) to interpret that error, and that’s the critical part for UX.
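To make the behaviour concrete, here is a toy model in Python. It is not how dm-verity is implemented: the real thing lives in the kernel and uses a hash tree, whereas this sketch assumes a flat list of per-block SHA-256 hashes and a hypothetical `VerifiedDevice` class. It only illustrates the two points above: verification happens per block at read time, and a mismatch surfaces as a plain I/O error.

```python
import errno
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this toy model


def build_hash_table(data: bytes) -> list[bytes]:
    """Hash every block of a known-good image (done once, at build time)."""
    return [
        hashlib.sha256(data[offset:offset + BLOCK_SIZE]).digest()
        for offset in range(0, len(data), BLOCK_SIZE)
    ]


class VerifiedDevice:
    """Hypothetical stand-in for a verity-protected block device."""

    def __init__(self, data: bytes, hashes: list[bytes]) -> None:
        self._data = data
        self._hashes = hashes

    def read_block(self, index: int) -> bytes:
        start = index * BLOCK_SIZE
        block = self._data[start:start + BLOCK_SIZE]
        # Only the block being read is checked; nothing is verified upfront.
        if hashlib.sha256(block).digest() != self._hashes[index]:
            # The caller sees a generic read error, not "integrity failure".
            raise OSError(errno.EIO, "Input/output error")
        return block
```

Reading an intact block returns the data as usual, while reading a corrupted one raises an `OSError` with `errno.EIO` that is indistinguishable, from the caller's point of view, from a failing disk.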

So let’s assume we have dm-verity in place and the device the ISO is on is bad. Systemd tries to start some critical application (e.g. the desktop environment), but a read error happens in the executable because of corruption detected by dm-verity, and that causes the program to crash.

How does the user get informed about this fact?

In all likelihood, this particular case would materialise in the symptom of a blank screen or the “cursor of death” which would not be helpful at all.

I don’t know what the best way to handle this is, but here’s my idea:

You’d need some sort of mechanism that listens for read and/or dm-verity errors involving the device we’re booted off of. When one of those happens, it’d stop the boot process and bring the user into an emergency shell where they’re presented with a clear error message. The message would inform the user about the situation they’re in and recommend some basic troubleshooting steps such as using a different port, different USB stick, re-flashing, re-downloading etc. Perhaps it could also just let the user flush caches and try again as that sometimes fixes things. It should also let the user disable this mechanism and ignore the errors (useful if the affected parts are non-critical).
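A rough sketch of the detection and messaging half of that idea, in Python for illustration only. Everything here is an assumption: the two regular expressions are loose guesses at what kernel log lines for dm-verity corruption and block I/O errors contain, the function names are hypothetical, and how the lines are obtained (for example by following the kernel log) and how the boot is actually halted are left out.

```python
import re
from typing import Iterable, Optional

# Assumed, deliberately loose patterns; real kernel messages may differ.
VERITY_PATTERN = re.compile(r"verity.*corrupted", re.IGNORECASE)
IO_ERROR_PATTERN = re.compile(r"I/O error", re.IGNORECASE)

TROUBLESHOOTING = (
    "try a different USB port",
    "try a different USB stick",
    "re-flash the image",
    "re-download the image",
)


def first_integrity_error(log_lines: Iterable[str], boot_device: str) -> Optional[str]:
    """Return the first log line that looks like a boot-media problem."""
    for line in log_lines:
        if VERITY_PATTERN.search(line):
            return line
        # Plain I/O errors only count if they mention the boot device.
        if IO_ERROR_PATTERN.search(line) and boot_device in line:
            return line
    return None


def emergency_message(offending_line: str) -> str:
    """Build the text shown to the user in the emergency shell."""
    steps = "\n".join(f"  - {step}" for step in TROUBLESHOOTING)
    return (
        "The installation medium appears to be damaged or unreadable.\n"
        f"Kernel reported: {offending_line.strip()}\n"
        "Suggested steps:\n"
        f"{steps}\n"
        "You can also retry, or ignore this check and continue booting."
    )
```

Keeping detection as a pure function over log lines makes the matching rules easy to test and adjust once the actual kernel messages have been confirmed.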