Can we please stop breaking stuff willy-nilly?

bendlas · July 9, 2024, 2:31pm

Lots to catch up on, after the weekend …

Can you elaborate on your usecase for using unstable, while still wanting the stability? Then perhaps we can come up with a reasonable solution here.

As I said above: I’m helping flush out issues early, before they hit the next stable. The reasonable solution is to recognize that our stable users will also be hit by the same breakage that we are, only delayed, and that there is little qualitative difference between the channels, except for maturity.

Just going “why don’t you just use stable then?” completely misses that point.

@raboof

Please do not see this as “taking the L”

Why not? I started a fight. I thought it was about stability of our user experience. Turns out it was also about the tone we find acceptable. I don’t have a problem with changing my approach and “losing” some of the sass, if it means that the community wins in civility. I only ask that we still take this issue seriously, even if I roll off the gas.

the particular case that triggered this thread seemed fairly reasonably considered, though?

Hard disagree. As detailed above, it was the most egregious case I’ve seen in a long time:

It removed working options for security reasons, even though they could be configured safely
It introduced a submodule in its place, in a way that prevented both proper error message and any type of backwards compatibility, even as an afterthought [1]
It seems to have done this in a paradoxical way, where
- we’re done after migrating internal users, because external users should expect to be broken on unstable
- we’re so concerned about our external users’ security, even after having migrated all internal users, that we can’t let them have their toys any more, not even with a warning. Do we realize that this way, we’re disincentivizing users from upgrading, which is even worse for security?

Perhaps ‘declarative’ was the wrong word to use, but did you really not understand what I was writing here?

I can see your hands, waving, but I actually don’t know precisely what your point is here. Is it “because we can detect a certain class of breakage (that which produce eval errors) earlier, it’s more acceptable to break our users”?

postponing your update

That goes exactly to the heart of the issue: Why is the NixOS security team deciding for me, that I can’t get my latest CVE fixes, before having reviewed all uses of fcgiwrap in my systems? In the name of fixing a local privilege escalation?

FFS! Local root exploits are a dime a dozen. Those should never hold up more important security fixes!

The maintenance burden […] all those should keep working […] do they share code

I agree that those aren’t easy problems. I’ll stand by my point that if we just stopped breaking things, then the maintenance burden will be a lot less than what should be expected, judging based on other systems.

I take some issue with your implication that anybody who disagrees with you must just not be very experienced.

Oh, plenty more experienced people will disagree with me on any number of things. Just not about stability, because people who don’t value that tend to burn out and change career. [2]

Also, I am taking issue with your implication that I’m in any way looking down at or thinking less of somebody inexperienced. If anything, I want to help them avoid some of the mistakes I’ve seen and made myself.

@polygon

If that is your end goal, I’d immediately stop maintaining.

With respect, but issuing all-or-nothing threats like this makes me question if your current level of involvement is at a healthy place. If you need someone to talk to, my DMs are open.

I have scaled back my own involvement with NixOS at various points, where I noticed that I cared “too much” for it to be healthy, and now that I’m feeling better about it, I’m back. Hello. Most of my old stuff still works and that’s how I like it.

@srd424

would it also be possible / worth introducing a separate set of ‘backcompat’ modules translating old options to new? This way they’d be out of the main code base, and could be maintained by those with more interest in long-term backward compatibility?

This is a sensible approach IMO, also because it would allow overburdened maintainers to have a place to move what they’d consider “legacy” to, while still keeping half an eye on avoiding unnecessary breakage.

@b-m-f

If security concerns can arise, we would need some mechanism to warn of this when the configuration is being evaluated, no? Any ideas how to solve this in code @bendlas?

We could handle that the same way as we handle e.g. outdated electron or node versions: By blacklisting, while allowing users to whitelist the known-insecure thing back.

There is also a NixOS option for warnings.
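To make that a bit more concrete, here is a minimal sketch of what such an opt-in could look like at the module level. Everything named `exampleLegacy`, including the `acceptKnownInsecure` switch, is hypothetical and only there for illustration; `assertions` and `warnings` are the existing NixOS options, and for packages the existing escape hatch is `nixpkgs.config.permittedInsecurePackages`.

```nix
# Sketch only: "exampleLegacy" and "acceptKnownInsecure" are made-up names.
{ config, lib, ... }:
let
  cfg = config.services.exampleLegacy;
in
{
  options.services.exampleLegacy = {
    enable = lib.mkEnableOption "the hypothetical legacy service";

    # The "whitelist" switch: the user has to opt back in explicitly.
    acceptKnownInsecure = lib.mkOption {
      type = lib.types.bool;
      default = false;
      description = "Keep using the known-insecure legacy setup.";
    };
  };

  config = lib.mkIf cfg.enable {
    # "Blacklist": evaluation fails until the user opts in.
    assertions = [
      {
        assertion = cfg.acceptKnownInsecure;
        message = ''
          services.exampleLegacy is known to be insecure. Set
          services.exampleLegacy.acceptKnownInsecure = true to keep using it.
        '';
      }
    ];

    # Once opted in, evaluation succeeds but still prints a reminder.
    warnings = lib.optional cfg.acceptKnownInsecure
      "services.exampleLegacy is insecure and only kept for compatibility.";
  };
}
```

Note that the assertion is a hard stop at evaluation time, so it would also stop unattended upgrades; dropping the `assertions` part and keeping only the warning would be the gentler variant.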

[1] btw, I think implementing services.<service>.<instance> instead of services.<service>.instances.<instance> has repeatedly caused issues. I think that could be something to put in an RFC, to always leave room for top-level options in services.

[2] yes, of course, there are nuances, especially in a project as big as NixOS, which is why I’m careful to qualify with stuff like “unless absolutely necessary” and “where is the line”. Also, I believe @raboof when they say they have extensive experience providing compatibility, not all of it good, but given how fast people seem to be with defending unnecessary breakage, I don’t think we’re erring on that side right now.

waffle8946 · July 9, 2024, 3:57pm

I didn’t say that and would prefer that you don’t put words in my mouth.
Of course I missed the point, which is why I asked for clarification on your usecase in the first place.

In any case, if we’re discussing being the “vanguard” then I would say that the goal should be to catch broken-by-design modules. The benefit to running stable in this case would be, you don’t get hit with such modules because they were fixed prior to the release. I would not advocate eliminating breakage that improves modules - as long as the breakage is adequately documented in the release notes and via adequate warnings/errors.

PS:

If you select some text in another’s comment and click “Quote” it will provide a blockquote that can link back to the relevant comment. Makes it easier to follow the conversation.

polygon · July 9, 2024, 4:45pm

Guess a good ad-hominem always trumps actually answering the argument?

mmarx · July 9, 2024, 4:57pm

Also, if “just stopp[ed] breaking things” is not all-or-nothing, I don’t know what is.

bendlas · July 9, 2024, 7:18pm

That’s super good to know, thank you very much!

I did answer a common sentiment ITT in the context of answering your question, but I didn’t mean to make it sound like you said it …

But would you agree that there are levels of improvement, that are below a threshold, where it’s worth breaking consumers, and should therefore only be done in a compatible way?

How would you approach defining that threshold, to keep the slope from becoming too slippery i.e. without resorting to the 'ole “I’ll know it when I see it”?

How do you like my definition of “Only break if there is just no way around it, without compromising other functionality”?

I guess I did see your bringing up your involvement as maintainer, as an invitation to discuss your involvement as a maintainer. But I can also see now how my wording may have been too close for comfort, so my apologies for that.

I would ask you to also refrain from wordings like “You worry so much about …”, because that makes it easier for me as well, to not “go there”.

I think what I’m trying to say is: If you must hypothetically quit because of a hypothetical community commitment to stability, then you’d hypothetically have to do what you’d have to do, it’s all volunteer work after all. I’d hypothetically much rather keep you involved, though, or hypothetically even increase your involvement, so …

… I’d like to find out, what you’d need for that, when faced with an increased (or even absolute) requirement to keep your packages and modules stable, as a maintainer.

I didn’t answer the argument you’re referring to, because I still think it’s asked and answered. But let me reiterate the points where you put question marks:

Depends on what the fix looks like. If it means replacing module or package names with incompatible successors, then yes. Make a new thing.

That the new thing has all the fixes and works better and users will love it, especially when you warn them about the possibility/necessity to upgrade, while they can rely on the old thing remaining there.

With nixos, this is possible. That’s the huge innovation. That’s the reason we’re patching RPATHs. The worst case will always remain that it needs to fully instantiate an ancient version in a container, from git history, with the closure bloat that involves, which could just be another warning. That could be the final resting place, where old modules and packages go to die. An overlay that knows about the final commit, where something still works. But that’s just one possibility to handle this.

Whatever we do, I think we should fully commit to a deprecation/warning cycle of at least one release, hopefully two for important functionality.

Yes, that means that introducing attrsOf submodule in place of what used to be named options will not be acceptable.

No, a maintainer would maintain old stuff indefinitely, because it’s policy.
Also, as long as nobody breaks the old stuff, it’s basically free.

Yes, we would need support structures, for maintainers to say: “I don’t want to deal with another gcc update” and just kick something out of their maintenance responsibility.

But maybe instead of deleting the thing, there is a standard method to pin its inputs to various degrees (up to and including a historic nixpkgs revision) and anybody needing the old thing would be expected to keep up with the required emulation degree / closure sizes.

BTW, please also note how we can even only start having this conversation without having to bring up VMs, because we can 99.9% rely on the linux kernel not breaking us.

Can I ask you the same question, I asked @waffle8946: How do you like my definition of “Only break if there is just no way around it, without compromising other functionality”?

I mean, point taken. I’m offering an extreme and contrarian viewpoint here. But we have a zero-tolerance policy on various forms of communication within the community, so why not explore a zero-tolerance policy for breakage as well?

How do you feel about my concession to “break when absolutely necessary”?

bendlas · July 9, 2024, 8:23pm

Interesting. Was that with the help of NixOS? Can you share some of your experience / estimation of how much leverage you get in the advantage / cost relation, by basing such an environment on Nix, with its referential transparency and other related properties?

waffle8946 · July 10, 2024, 12:41am

I guess I just find it difficult to agree with the premise that there’s unnecessary breakage that needs to be prevented. Sure, I’ve had my gripes with breakage, like the pipewire configPackages fiasco, but again that was only a major issue because people didn’t know how to migrate to the new option - i.e. it was a (within-module) documentation issue - hence my stipulation that “breakage is adequately documented”. I don’t see a need to avoid the breakage entirely; if we made a mistake with the interface originally or didn’t address some usecase, let’s just fix it.

If the community sees it more like you do, then that should be reflected in a policy developed via RFC, and my individual opinion wouldn’t matter in that case.

polygon · July 10, 2024, 9:29am

You can discuss my involvement without speculating about the state of my mental health and insinuating that my answers are what they are because said state is rather poor. I’d call this borderline abusive behavior. That being said, apology accepted.

You want maintainers to do more work so users have to do less work. You don’t accept any policy where old stuff is eventually forced out so everyone has to switch - at least I assume that because your argument “And what happens in 6 months when unstable becomes the new stable, and old stable gets abandoned?” can be made for any deprecation policy. These sound like the unenjoyable parts of a job. So the answer is probably money.

This seems a bit naive to me. If one is the type of user that wants to set up a thing until it works and then, ideally, never touch again, with a policy like that in place, these users will probably never switch. Because why, it works, and others do the unpaid, unsatisfying work of keeping it in this state (or maybe the ones left after this policy is introduced actually enjoy it, what do I know). Whereas if you are the type of user usually excited about new things, this whole “we break things occasionally because we only now realized that our design was kind of a dead-end” is a non-issue to start with.

Go on, we are getting to the important topics now, what do you have in mind here? Can I just assign maintenance of old module versions to @bendlas? Or more realistically a “keep deprecated stuff running” task force?

And according to various sources, the majority of regular Kernel developers are paid for their contributions. I wonder why. That being said, changes to the userspace-facing Kernel interfaces carry a significantly larger downstream update burden than spending half an hour every six months to change a few lines in a NixOS configuration file.

raboof · July 10, 2024, 10:06am

That was without NixOS. I don’t think leveraging Nix would be much help for avoiding the desire for binary compatibility there: this was in the Scala ecosystem, where it’s common to have ‘deep’ trees of transitive dependencies managed in independent repositories.

This is a contrast with Nix modules, where there’s typically just one or two layers of modules depending on each other across repositories, reducing the urgency of compatibility for module options.

nh2 · July 10, 2024, 3:17pm

I posted what I believe the best approach for such cases is:

github.com/NixOS/nixpkgs

Comment by nh2 to nixos/fcgiwrap: refactor to fix permissions

NixOS:master ← pacien:nixos-fcgiwrap-isolation

My opinions on how this should be done, as an industrial user (meaning running servers that people depend on):

* The writeup https://github.com/NixOS/nixpkgs/pull/318599#issuecomment-2211515713 by @pacien is great and exactly what's needed to make good decisions.
* The `security` tag is correct here. This was a big Local Privilege Escalation bug.
* This should _not_ be backported to stable if it requires manual user interaction:
  * Lots of people run stable with [`system.autoUpgrade`](https://search.nixos.org/options?channel=24.05&show=system.autoUpgrade.enable&from=0&size=50&sort=relevance&type=packages&query=autoUpgrade) to get 0-manual-involvement security updates (great value!). They rely on servers staying online, and getting automatic fixes for critical security bugs, such as the recent [OpenSSH Remote Code Execution vulnerability](https://discourse.nixos.org/t/security-advisory-openssh-cve-2024-6387-regresshion-update-your-servers-asap/48220/21).
  * Thus **stable changes should never break `autoUpgrade`**. Introducing e.g. a `throw` error during evaluation (e.g. telling the user that they should change their NixOS options) would stop automatic updates, preventing systems from automatically getting much more important fixes, like the OpenSSH fix.
  * Making a change stable that just turns the services in question _off_ would be wrong. Lots of servers are set up such that Local Privilege Escalation does not matter / is a very small threat. The services staying online is important.

  Stable should backport straightforward fixes; when re-architecting is needed to fix wrong security architecture (as happens here), that should be in `unstable` and be part of the next release.
* What this PR does (multiple independent fcgiwrap instances) is the correct technical solution.
* This PR is acceptable for `nixos-unstable`. But I think it would have been even better to:
  * Do the fix as `services.fcgiwraps`, keep `services.fcgiwrap` unchanged. (As suggested by @bendlas.)
    * This allows trivial backporting of this change to stable without breakage.
    * Also reduces damage and allows users to switch back if we find that the fix is not quite correct.
    * Lets stable users choose whether fixing a Local Privilege Escalation is something they need to fix now or with the next release.
  * Immediately wrap an eval warning around `services.fcgiwrap`, land that in unstable and stable. This will make all users that interactively upgrade NixOS notice and upgrade fast, while not breaking `autoUpgrade`.
  * Make a security announcement (as proposed above, [here](https://discourse.nixos.org/c/announcements/security/56)).
  * On unstable, add a warning that `services.fcgiwrap` is insecure and unmaintained, and will be removed soon. Remove the old module shortly before the release of NixOS 24.11, or remove it for NixOS 25.05 (removing it later would allow users to upgrade NixOS independently from switching to the new module architecture as a minor added convenience).

  Doing that seems to have no drawbacks.
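The "new name next to the old one, plus an eval warning" idea from the comment above could look roughly like the following module. This is a sketch under assumptions, not the code of the actual PR: `services.example` stands in for the old single-instance module, `services.examples` for the new multi-instance one, and all option names inside them are invented for the illustration.

```nix
# Sketch only: "example"/"examples" are stand-ins, not real NixOS modules.
{ config, lib, ... }:
{
  options.services = {
    # Old interface: left exactly as it was, so existing configs keep evaluating.
    example.enable = lib.mkEnableOption "the old single-instance service";

    # New interface under a new name, so it can be backported without breakage.
    examples = lib.mkOption {
      type = lib.types.attrsOf (lib.types.submodule {
        options.user = lib.mkOption {
          type = lib.types.str;
          description = "User this instance runs as.";
        };
      });
      default = { };
      description = "Independent instances of the service.";
    };
  };

  # A warning is printed at evaluation time but does not abort it, so
  # unattended upgrades keep working while interactive users get the hint.
  config.warnings = lib.optional config.services.example.enable
    "services.example is insecure and will be removed; please migrate to services.examples.";
}
```

A `throw` or a failing assertion in the same place would stop evaluation instead, which is exactly what the comment argues against for stable.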

bendlas · July 12, 2024, 12:04am

I’d be down to be part of something like that.

Besides this answer, I’ve put the rest of what I typed for you into a gist, because I noticed that it took up the majority of my post again, with comparatively little new information: gist:c3ae238ef7c23c9a31aa04ebc36ba23e · GitHub

Feel free to comment or also quote me on it, if you feel that there’s anything worth responding to in this thread.

Ha! I think Scala actually illustrates my point perfectly: Look at the graveyard of Lift 1.x applications, that were dependency-locked to a degree where they just had to be rewritten. Seems like they have learned their lesson: The Scala 3 compatibility story - VirtusLab

Can I ask how you go from the (probably traumatic?) experience of maintaining compatibility layers in such an environment, to “if anything we’re being too conservative with breaking unstable”?

Clojure OTOH committed to not-breaking with 1.0, and as a result I can still run code from 15 years ago in the most recent version [3]. Not meaning to brag here, I’m fully aware how this is much easier to pull off with a dynamically typed language, then again, look at how python3 is going. I’m sure we’ll be ready to drop python2 any minute now …

see also angular vs react. I swear, once you see it, you can’t unsee it

[3] @polygon and yet, people still update their dependencies regularly, which is usually effortless, because library authors tend to respect the “dont-break-make-a-new-thing” rule.

/rant

Well … if I could have just a single new rule, it would be: “Avoid re-using a namespace with regular options for an attrsOf submodule.”

attrsOf submodule is one of the more brittle parts of the module system and it should almost always have a dedicated key.
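A small sketch of what "a dedicated key" means in option declarations. The service name `example` and its options are hypothetical; the point is only the shape, with instances living under `instances` so that service-wide options can never collide with an instance name.

```nix
# Sketch only: "services.example" and its options are made up.
{ lib, ... }:
let
  instanceModule = lib.types.submodule {
    options.socketPath = lib.mkOption {
      type = lib.types.str;
      description = "Socket this instance listens on.";
    };
  };
in
{
  options.services.example = {
    # Regular, service-wide option at the top level of the namespace.
    logLevel = lib.mkOption {
      type = lib.types.enum [ "info" "debug" ];
      default = "info";
      description = "Log level shared by all instances.";
    };

    # Instances get their own key instead of taking over the whole namespace.
    instances = lib.mkOption {
      type = lib.types.attrsOf instanceModule;
      default = { };
      description = "Named instances of the service.";
    };
  };
}
```

A user would then write `services.example.instances.web.socketPath = "/run/example-web.sock";`, and adding another top-level option later does not change the meaning of any existing instance name.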

Thank you very much for taking the effort to collect that information. I think it reflects current consensus very well.

And thanks to everyone, who’s helping to mitigate the damage from my hothead outburst.

snow40479 · July 13, 2024, 12:27pm

Same. I only use unstable because of a rolling distro. I DO NOT want breakage unless absolutely necessary and unavoidable.
Unfortunately, people in this community don’t share the same mindset.

I’ve heard this kind of saying many many times in the community. Ultimately, it’s because people have different definitions and expectations for unstable. Without a clearly defined policy and commitment to backward compatibility, I can only expect this to happen again and again. One group of people complaining “if anything we’re being too conservative with breaking unstable” and another complaining about “breaking stuff willy-nilly”.

Below is another perfect recent example of what I call “breaking stuff willy-nilly”, while a certain group of people think “we’re being too conservative”.

github.com/NixOS/nixpkgs

24.11 rename codeName to Vicuna

NixOS:master ← sg-qwt:master

opened 11:36AM - 09 Jun 24 UTC

sg-qwt

+1 -1

1. We need to keep code name simple and universal, not every culture know how to type ñ on their keyboard
2. Avoid unnecessary and potential trouble caused by unicode https://github.com/NixOS/nixpkgs/issues/315574
3. According to wiki, vicuña is same as vicuna
> The vicuña or vicuna is one of the two wild South American camelids, which live in the high alpine areas of the Andes, the other being the guanaco, which lives at lower elevations. [Wikipedia](https://en.wikipedia.org/wiki/Vicu%C3%B1a)

I cannot appreciate more how Clojure stands strong by its commitment to non-breaking.

@bendlas Do you think an RFC can be drafted in that regard? Unless a policy is defined and a clear consensus is reached in the community, I can only imagine the same thing will happen again and again.

dschrempf · July 13, 2024, 9:49pm

@jjpe @snow40479 I am also totally with you. I actually asked about nixos-stable a while ago: “Why does NixOS not have a rolling release system?” The discussion may be of interest to you!

L-as · July 13, 2024, 10:00pm

The whole design of NixOS needs to be redone honestly. This isn’t trivial because of how intertwined everything is.

jjpe · July 13, 2024, 10:16pm

Not just the technical design, where it’s easy to see warts all over the place. The organization behind NixOS also needs a from-scratch rebuild, with more human-centric values, if the way some of my recent posts on this very discourse have been handled is anything to go by.
Not that that should be surprising: the organization (insofar there actually is a coordinated organization at all) makes the software. That is to say, to tackle the technical problems, the root of the problems must be tackled.

That’s not to say I think the entirety of the org is rotten though, I view it as more like a couple of poisonous apples in an otherwise fairly beautiful garden. But those apples are not just poisonous, they’re emitting noxious fumes and need to be removed lest the entire garden go the same direction. Especially people actually creating and merging PRs are most probably fine; they just need a better base to work from and with.
Of course I’m just one voice.

@dschrempf interesting read!

nim65s · July 15, 2024, 9:14am

Yes, that’s exactly why we are currently doing exactly that. If you want to talk about this, you can follow instructions in Zulip for governance discussions.

bendlas · July 18, 2024, 11:27pm

That should be possible, I think. Though probably not easy, because:

nixpkgs gets incoming breakage all the time, so the RFC would have to be in large part about how to manage incoming breakage in a backwards-compatible way
nixos is a massive project to begin with, so finding and addressing all needs and wants towards such a management system, will probably require a few passes
there is also breakage, that you want to incur and fix. e.g. downgrading a database version
- for this, I’ve actually written down an attempt at an RFC, when postgres 15 (in this case even an upgrade) broke our existing databases: nixos/state: initialize requirements document by bendlas · Pull Request #267365 · NixOS/nixpkgs · GitHub

Maybe the reasonable thing would be to start with the back-stop, a la “make historical packages and modules available from git history …” and then propose a procedure to maintain compatibility levels (vm, container, original, buildinput-rebased) in a separate thing?

Interesting, thanks for the pointer! The idea for instantiating git history comes up there as well.

I honestly get the sentiment, and in a way I agree: If somebody has a good idea for it, let’s do it, let’s reorganize everything. BUT: The existing organisation under nixpkgs/nixos stays in place, and as we port modules into the new structure, we’re leaving “forwarder” modules behind.

I will also say: Given the scope that nixos manages to pull off, I think its design is pretty clean.

jjpe · July 20, 2024, 8:48am

It could be a 1-2 combo:

Create a new project, with a from-scratch technical design that keeps the good stuff, but learns from the mistakes made in the past. This could make on-boarding new users easier, and ideally would also reduce the maintenance burden where possible, which would help those seeking to give back to the community.
That same project could then also adopt a more equitable governance structure (think RFC 175 and initiatives like it), complete with checks and balances. That would be a perfect moment because it’s increasingly clear such a change is unwelcome by the powers that be in the nix community as it exists today, to the detriment of the community as a whole.

Meanwhile, nixpkgs/nixos are left as-is, meaning such an initiative wouldn’t break anyone’s existing setups.

phaer · July 20, 2024, 10:37am

Without getting too much into the details of the governance & community culture debates here, as I did so elsewhere:

A good option to keep breaking changes to services you care about to a minimum is to contribute a NixOS vm test for it. Those are run before nixos-unstable is updated (after nixpkgs-unstable, which only runs package tests).

If changes then break the tests, they need to be handled before they land on your machines.
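For anyone who hasn't written one yet, a minimal sketch of such a test, assuming a nixpkgs recent enough to provide `pkgs.testers.runNixOSTest`. The test name is made up and nginx is only a stand-in; the idea is to set the options you actually depend on in `nodes.machine`, so that a change to the option interface already fails when the test is evaluated.

```nix
# Sketch only: swap nginx for the service and options you rely on.
{ pkgs ? import <nixpkgs> { } }:

pkgs.testers.runNixOSTest {
  name = "example-service-starts"; # hypothetical test name

  nodes.machine = { ... }: {
    # If these options get renamed or removed, the test stops evaluating.
    services.nginx.enable = true;
  };

  # Python test script, run by the NixOS test driver against the VM.
  testScript = ''
    machine.wait_for_unit("nginx.service")
  '';
}
```

Saved as e.g. `test.nix`, this can be tried locally with `nix-build test.nix`; a test contributed to nixpkgs itself would live under `nixos/tests` and follow the conventions there.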