I have yet to understand when the nixos branch updates… and why it doesn’t.
On nixos-18.09 we’re still at Linux 4.14.76, and there has not been a new commit since October 17. Linux 4.14.78 was released October 20, 8 days ago. Is it supposed to be like this, or is the machinery broken? I don’t see anyone talking about it. Is this normal?
How can I tell for myself what is happening and why the channel isn’t moving forward? It’s far from the first time that nixos has appeared stuck to me, so I’d like to know what I need to know to be able to tell. When I look at Hydra, I don’t understand it well enough to tell why the current old commit is the current commit and not another, later one.
Sometimes it’s a hardware failure, other times it is some commit that has broken the tests.
If you need a recent change on your system, you can use the nixos-18.09-small channel. It updates much faster, but as a consequence you are more likely to have to build things yourself. If you really want the latest and greatest, you can set NIX_PATH to the last commit.
Indeed I’d like not to compile Linux as it takes many hours on my machine. But in addition, isn’t the small-channel also less tested, and that’s why it’s updated while the tested channel is dead? I don’t think using a nixos-channel with fewer or no tests is a solution.
That howoldis page is pretty neat, but even it seems to have issues. It says nixos-18.09 was updated 1 day ago, but that’s not correct, as following the links from that page shows.
How long should the nixos-channel be allowed to be broken, blocking all software updates, including security patches? In my opinion whatever commit breaks nixos ought to be rolled back within hours at most so that updates may continue.
It’s actually correct, in a way. The channel was apparently updated around a day ago to an old version, probably going backwards at that moment.
Each channel is supposed to be at the last commit that has passed tests and has finished all builds. Commits breaking significant stuff are supposed to be rolled back fast, that’s true. (That’s to be done by humans.)
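To make that rule concrete, here is a minimal sketch in Python. It is only an illustration: the `Evaluation` record, its fields, and the `channel_head` function are hypothetical names, not Hydra’s actual data model or API, and the commit IDs and dates are made up.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class Evaluation:
    """One evaluation of a release branch (hypothetical model, not Hydra's schema)."""
    commit: str
    committed_at: datetime
    builds_finished: bool
    tests_passed: bool


def channel_head(evals: Sequence[Evaluation]) -> Optional[Evaluation]:
    """Return the newest evaluation that finished all builds and passed tests."""
    good = [e for e in evals if e.builds_finished and e.tests_passed]
    # default=None covers the case where nothing has passed yet.
    return max(good, key=lambda e: e.committed_at, default=None)


# Illustrative data only: one good commit followed by two that did not pass.
evals = [
    Evaluation("aaa111", datetime(2018, 10, 17), True, True),
    Evaluation("bbb222", datetime(2018, 10, 20), True, False),   # a test failed
    Evaluation("ccc333", datetime(2018, 10, 24), False, False),  # builds not finished
]

head = channel_head(evals)
print(head.commit if head else "no passing evaluation")
```

With this sample data the channel stays at the oldest commit, because neither of the newer ones has passed. Under this rule, newer commits on the branch do not move the channel until one of them finishes its builds and passes its tests.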
I want to preface this saying that I very much appreciate the work done in every form, including maintenance, toolmaking, answering questions and all that, because I can tell I may seem negative and whiny, but it’s only because I care… well that and frustration rooted in a lack of deep understanding of the NixOS infrastructure… something that a user shouldn’t have to feel any need to possess, but I digress. If I didn’t care I would just be using Debian instead of typing this.
Then that’s the worst kind of correct. I don’t feel it answers the question “How up to date are NixOS channels?” well, and going backwards is the opposite of being “updated”; at best it was only “changed”.
Delivery of updates being blocked for days must be considered very major breakage. Why isn’t whatever broke NixOS rolled back already so we can keep moving forward? Lack of human time/numbers/motivation, tools too difficult to use, not enough automation, something else? Is it up to users to find an expert with free time to poke?
There are problems here that prevent NixOS from being delivered, and this is no isolated incident. So what could the causes be, and what should change to make it better?
howoldis is a simple unofficial tool. Many nix* devs find it useful as it is, and improvements are welcome. The problems with going backwards should hopefully be resolved now.
A few days of delay isn’t considered a major problem by me; I suspect pushing beyond such frequency might lack motivation from most of the involved people. Most changes need a few days just to get reviews, etc. Mass rebuilds may take even a couple weeks to go through rebuilds and stabilization.
I can’t see any really “low-hanging fruit” to get it faster. Yes, more manpower will help a lot, but that seems difficult to get. The farm infrastructure isn’t in perfect state – apparently some machines exhaust some resources from time to time, creating occasional job failures; perhaps the setting might be tuned better, darwin infra is being overhauled these weeks, etc.
And even that is not low-hanging fruit: we have bottlenecks in documentation, review throughput, onboarding processes, tooling (and tooling-for-tooling)… And we lack a clear consensus on the long-term roadmap for some of these problems. So even finding a hundred people eager to contribute a lot, starting immediately, would not be enough to solve any of these problems quickly.
I guess nobody expects people to stay on call to immediately fix failing builds and tests at 4 a.m.
In this particular case, though, it looks like the hydra build issues remained unnoticed for over a week. On Sunday I reported on IRC that no tested release-18.09 channel had been released for 2 weeks by then, and it seemed like I was the first to report this.
In the meantime, several fixes have relied on their backports to the nixpkgs release-18.09 branch landing in NixOS quickly.
So as long as security updates are provided via the normal stable channel, at least build-blocking failures need to be detected within a day or so. Not having security fixes for 2 weeks is a serious issue.
This is especially important for enterprises that require timely security updates. It’s going to be hard to be taken seriously without that. We need to fix our workflow and tooling to make this happen.
When we’re talking about this… my understanding is that security-critical servers are meant to follow a -small channel, most likely a stable version of it. Those can update very fast even for huge rebuilds, and I don’t remember any long lags there. (We had some backwards-updates there as well, but that should be solved now.)
It would probably be good to actually announce that the nixos-18.09 channel may delay critical security updates, and to suggest always using nixos-18.09-small if you care about security.
So what is the point of the nixos-18.09 channel then if it cannot be used for regular systems due to intermittent delays in security updates?
Update: It’s not just for security-critical servers. I’ve just made the switch on my two desktop systems, due to the critical Firefox 63 update being delayed.
Well, it’s all best-effort so far. My participation (for example) is all in my free time.
If you’re touchy about security, IMO in the past several months the main problem isn’t really the speed of channel updates but the fact that many CVEs just don’t get fixed in our branches for a very long time (presumably they are usually less important, but I don’t really know ATM).
I didn’t intend to imply that I or someone else could demand other people’s work. If my post sounded like that I’m sorry, it was not intended. I’m just trying to understand what causes these notable delays in some security updates.
I wasn’t aware of that. I assumed that most security updates moved to the stable version rather quickly, as it’s usually just a cherry-pick.
So currently I see two issues with NixOS and security:
1. The security roundups are lacking manpower, which may cause delays in backporting or cherry-picking the updates to the release branches.
This can probably be solved with some more people who look at the security roundups and help with them (checking CVEs, providing patches, cherry-picking, backporting, …).
2. Channels may not update for some time in some cases due to random build failures.
This is rather opaque to me:
Let’s assume I’m waiting for a security update to make its way through a channel. It has been merged in release-18.09. Usually after a couple of hours hydra should pick up the change and start building.
Now there are multiple things that could go wrong, some (like build failures with a compiler error) will be easy to figure out, others will just be “Exit code 1, but I’m not giving you any useful error message”, or even “Timed out after 10 hours”.
It’s easy to reproduce the builds locally assuming I’ve got a similar system type, thanks to the generated build script.
Now let’s assume the build works perfectly fine on my local system, suggesting an issue with one of the build systems. What can I do then? Should I contact someone?
Shouting in IRC with its high volume rarely yields a usable response to this and just repeating “could someone look at this failed build on hydra, please?” every hour is probably not very useful.
Should I open a topic on discourse? Or would it make sense to have a single topic that the relevant people can watch?
How can I support the people that actually need to do something here?
As far as I understand this, both the nixos-unstable and nixos-18.09 channels are currently blocked by a problem with systemd. @andir is currently working on this, but perhaps he could use some help. See the discussion on IRC in #nixos-security and https://github.com/NixOS/systemd/pull/24