Troubled OS Times

Probably will cause fatalities etc, so not funny.

Definitely no laughing matter.

It’s interesting to think about how the NixOS approach can help protect against such problems.

I don’t think it can prevent it entirely: any system powerful enough to update itself is probably powerful enough to break itself, so there’s no substitute for disaster recovery measures that could rebuild any machine from backups if needed.

Still, the generations system allows easier recovery from anything that breaks due to a nixos-rebuild. While machines will likely have some state in /var and similar, that state should be unlikely to influence the parts of the system that are needed to perform administrative tasks. Would there be any way to verify this is indeed the case? Or would this be another reason to go ‘full impermanence’?

I think the whole MS ecosystem is antediluvian and defective. Events such as these will surely just enhance the focus on extremely stable systems like NixOS.

(I am certainly not laughing either, other than the involuntary laugh when you see people using MS for mission critical infrastructure)

Nix/NixOS, or a methodology like it, is definitely a prerequisite for what you outlined. Without something like it, you’re just treating the consumer’s computers as testing grounds.

The more I read this 2002 paper, the more it reads like a prophecy. Why Order Matters: Turing Equivalence in Automated Systems Administration

Due to modern society’s reliance on computers, it is unethical (and just plain bad business practice) for an operating system vendor to release untested operating systems without at least noting them as such. Better system vendors undertake a rigorous and exhaustive series of unit, system, regression, application, stress, and performance testing on each build before release, knowing full well that no amount of testing is ever enough (8.9). They do this in their own labs; it would make little sense to plan to do this testing on customers’ production machines.

Still, the generations system allows easier recovery from anything that breaks due to a nixos-rebuild.

There are conflicting requirements here though. Being able to boot into an earlier generation with a known-outdated EDR package can be seen as a security risk, so it would be impossible on a sufficiently locked down system.

It depends on your desired level of lockdown, of course. There’s plenty of situations where “users that have access to the boot console are trusted to switch to older versions only when needed” (or some variation thereof) is reasonable.

The scenario where you don’t want that is also interesting to consider, though, of course. Perhaps in that case you could automatically throw away old generations only after you’ve successfully rebooted into the updated one? Of course that comes with the downside of reboots, but that could be a reasonable trade-off.
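To make that pruning policy concrete, here is a minimal sketch in Python. It is not how NixOS garbage collection actually works; the function name and the idea of passing generation numbers in as plain integers are assumptions for illustration. It also treats "the booted generation is the newest one" as a stand-in for "successfully rebooted", which a real setup would want to back with a health check.

```python
from typing import List


def generations_to_prune(all_generations: List[int], booted: int) -> List[int]:
    """Return the generations that may be deleted under the policy
    'only discard old generations once the newest one has been booted'.

    all_generations: generation numbers currently on the system (hypothetical input).
    booted: the generation the machine is currently running.
    """
    if not all_generations:
        return []

    latest = max(all_generations)

    # Still running an older generation: keep everything, because the
    # newest generation has not yet proven that it boots.
    if booted != latest:
        return []

    # The newest generation is running, so older ones are no longer
    # needed as a fallback and can be removed.
    return sorted(g for g in all_generations if g != latest)
```

With generations 1, 2 and 3 present, nothing is pruned while the machine is still running generation 2; once it has rebooted into 3, generations 1 and 2 become candidates for deletion.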

To handle the locked down cases, there could be some kind of flag which disables certain things unless the most up-to-date revision is being used. I think that in most cases, “can’t boot” is far worse than “no write access to prod”.
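A rough sketch of what such a flag could look like, in Python. Everything here is hypothetical: the capability names, the function, and the rule itself are assumptions meant to illustrate the trade-off, not an existing NixOS feature.

```python
from enum import Enum, auto
from typing import FrozenSet


class Capability(Enum):
    # Hypothetical capabilities a locked-down machine might gate.
    BOOT = auto()
    LOCAL_ADMIN = auto()
    PROD_WRITE = auto()


def permitted_capabilities(booted: int, latest: int) -> FrozenSet[Capability]:
    """Decide what a machine may do based on the generation it booted.

    The machine always boots and stays administrable, so it can be
    repaired. Sensitive access is only granted on the newest generation.
    """
    allowed = {Capability.BOOT, Capability.LOCAL_ADMIN}
    if booted == latest:
        # Only the most up-to-date revision gets write access to prod.
        allowed.add(Capability.PROD_WRITE)
    return frozenset(allowed)
```

The point of the design is that rolling back degrades the machine to "can boot, cannot touch prod" instead of leaving it unbootable.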

But I don’t think that most people actually need to be locked down to such a degree. Unless you’re ok with being occasionally down without recourse, you have to eventually trust the end user a little bit. I don’t know what the contents of the CrowdStrike update were, but they can’t have been that important relative to the pain caused by their side effect.

Yeah, I also think the solution to this kind of thing lies in being able to verify pre-deployment. Sure, state can still cause issues, but if it is state at least it won’t bring down all your customers simultaneously.

Without NixOS there is simply no way to test with sufficient integration to catch these things. I can just imagine how nightmarish trying to test against a proprietary OS with practically random updates is compared to what we have. Even if you get special pre-warning from Microsoft because you’re a big player, you have no control over what your customers ultimately run.

You’d basically need to do fuzzing to assert that. Not impossible, but quite costly.

It makes me want a filesystem that knows which file was created by which package so that we could mount an overlay on /var which is missing the files created by whichever packages are currently under suspicion.
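As a thought experiment, the selection step for such an overlay could look like the Python sketch below. No filesystem I know of records this provenance today, so the `provenance` mapping (path to creating package) is purely an assumed input, and the function name is made up for illustration.

```python
from typing import Dict, Iterable, List


def paths_to_mask(provenance: Dict[str, str], suspects: Iterable[str]) -> List[str]:
    """Return the paths that an overlay on /var should hide.

    provenance: hypothetical mapping from file path to the name of the
                package that created it.
    suspects:   package names currently under suspicion.
    """
    suspect_set = set(suspects)
    # Hide every file whose creating package is under suspicion.
    return sorted(path for path, pkg in provenance.items() if pkg in suspect_set)


# Example with made-up data:
example = {
    "/var/lib/edr/channel-291.sys": "edr-agent",
    "/var/lib/postgresql/data/PG_VERSION": "postgresql",
    "/var/cache/edr/state.db": "edr-agent",
}
masked = paths_to_mask(example, ["edr-agent"])
```

In this example `masked` contains only the two files attributed to the hypothetical `edr-agent` package, leaving the PostgreSQL state visible.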