20.09 Release Retrospective

Why?

As mentioned in https://github.com/NixOS/nixpkgs/issues/101789 , @worldofpeace and I would like to hold a retrospective on the release to help improve our processes. Currently the release process is done “in a vacuum”, with very little input from the community about how they did or didn’t succeed in contributing to or using the release.

What?

We will hold a 1-hour meeting at a to-be-determined date and time to facilitate questions and discuss what didn’t work well and how to improve. The issues will likely be time-boxed (5 minutes of initial discussion, with 2-minute extensions) to keep the discussion moving, so we don’t end up talking about one topic the entire time. (The exact protocol is subject to change; I have just found time-boxing to be much more effective in real life.)

Topics should be categorized into 3 sections:

  • What went well?
  • What do you feel neutral about?
  • What did you feel negative about?

I will aggregate topics from this thread. If I can find a decent tool, participants will be able to vote for the 3-5 topics (depending on the number of topics) that they would most like to discuss, and we will then proceed with time-boxed discussion. If we don’t discuss a particular topic you were interested in, this doesn’t mean it wasn’t important; we just didn’t get around to it in this meeting. If we find these meetings to be super-productive, then we may hold additional meetings.

The desired outcome is to have some “action items” which can improve the next NixOS release.

When / Where?

We will use https://www.when2meet.com/?10270057-1BAil to determine the time and date, ideally sometime next week (8 Nov - 15 Nov). Since both release managers are in US time zones, I scoped the hours to something sane for US individuals.

Platform is TBD, most likely Jitsi or a similar video conferencing tool.

Who?

Anyone who attempted to contribute to, contributed to, uses, or has valuable feedback on the 20.09 release.

Policy for this Discourse thread

Please feel free to add anything to the following topics in a post:

  • What went well?
  • What do you feel neutral about?
  • What did you feel negative about?

You can also reply to other people’s posts; however, if a particular topic warrants its own thread, it should probably be moved there. I would like to keep this thread focused mainly on discussion topics for the retrospective meeting.

cc marketing team @garbas @samueldr @davidak (sorry if I forgot some people)


To kick off discussion, I’ll start:

  • What went well?
    • ZHF was pretty successful in stabilizing packages
      • Lots of contributors
    • New website, looks great
  • What do you feel neutral about?
    • Desire to move back the release date to YY.05 / YY.11
      • I hope this helps deliver a better Plasma / GNOME experience, but not completely sure
    • Split release manager responsibilities to a more granular level
      • Maybe designate people for ecosystems, and create a rough schedule for the branch-off, ZHF, release dates, so people can better prepare for release events
      • I want to avoid making this “feel like a job”
  • What did you feel negative about?
    • Release managers had difficulty doing Hydra-related tasks
      • Usual Hydra points of contact (@grahamc and @edolstra) were unavailable
        • Graham had a child, edolstra was on vacation. Life happens, but we should have a repeatable process.
      • RMs didn’t have sufficient permissions to create release channels or release jobsets
    • Hydra processes are currently undocumented: “get in contact with infrastructure team”
    • People using the release schedule as a motivator to cram several hundred staging commits in right before branch-off
      • Many of these PRs were stagnant for months. Instead of having a steady trickle of staging breakages into master, we front-loaded the risk right before ZHF
      • This also pushes back the ZHF date because we need a successful Hydra evaluation before we can create the ZHF issue
        • ZHF was pushed back over a weekend to allow for the eval to mostly complete
    • Shipping an “unpolished” Plasma / GNOME experience
    • Personally don’t use Plasma, but spent many hours fixing Plasma-related issues.
    • Expecting @ttuegel to fix all the Plasma issues
      • Low bus factor
      • Don’t want him to get burned out
      • He did a lot of great work which I appreciate immensely, but the odds were heavily stacked against him
    • DocBook.
      • Release notes, pain
      • Reading DocBook diffs, more pain
      • Detracts from additional documentation contributions

Here is stuff from my perspective:

  • What went well?

    I’m always surprised by how many people step up when the release requires their help.

  • What do you feel neutral about?

    • The work we did on nixos.org and search.nixos.org required 2 pull requests to bump the NixOS version. Ideally you wouldn’t have to do anything beyond marking 20.09 as active for status.nixos.org.

    • The website could do a much better job of communicating what is happening with the release (e.g. “ZHF is in progress, help here”) and how to help. It is true that this portion of the website (news/announcements/blog) is under reconstruction, but I hope this can be improved for the next release as well.

    • I really like that the “how to release on time” discussion has started. It might take us some time to figure it out, but I welcome the effort to improve on this front.

  • What did you feel negative about?

    I don’t have any negative things to say.


What went well?

  • Release status was communicated well, especially when problems pushed back the release.

What do you feel neutral about?

  • After the branch off I was not able to find the 20.09 manuals. The manual/*/stable/ path pointed to 20.03, the manual/*/unstable/ path pointed to 21.03.

What did you feel negative about?

  • During the ZHF phase, I wanted to try to contribute, but I was overwhelmed by the mass of entries and scared away by the Hydra interface. Filtering and searching the entries was not intuitive, and I also had problems figuring out whether things failed because packages did not build properly or because of failed module tests. On top of that, when I came back a day later I was not able to find the link again, and finding the correct subpage starting from https://hydra.nixos.org/ was not possible for me.

Regarding your neutral “split release manager responsibilities”, @worldofpeace and I had a little chat about this in #nixos-dev a while back: https://logs.nix.samueldr.com/nixos-dev/2020-09-23#1600891051-1600892536 (interspersed with some other conversation).

The split I propose there is adding a “release engineering” team that handles the unblocking of blockers and reviewing and merging of PRs, while the RM team focuses first on triaging issues (e.g. designating blockers) and second on those same responsibilities. The RM team would essentially become a subset of the releng team whose most important duty is to triage. I feel like this would help the burnout that RMs sometimes (if not always) experience: most people defer to them for merging to the release branch (at least, before release), so they have to both triage blockers and review and merge PRs (amongst other misc. tasks).

NB: I’ve never actually interacted with a “releng” team, so I don’t know if there is a better name for what this new team would do. I chose it because it sounded cool.


Regarding the negative “RMs had difficulty doing Hydra related tasks” – after appointment as an RM, someone on the infra team should give these people the permissions necessary for their job. I think this ties into the following point of “Hydra processes are currently undocumented” in that once those processes are decently documented, if the RMs have the permissions, there shouldn’t need to be any infra intervention.


Regarding the DocBook negative point: RFC 72 was merged recently. After someone (or a group of people) does the initial conversion legwork (easier said than done), there should be a large decrease in pain when contributing to documentation.


Another negative:

  • Current release conventions only create a stabilizing staging-YY.MM branch
    • this is inconvenient as there’s no good way to “batch” large changes while another stabilization is in flight
    • see https://github.com/NixOS/nixpkgs/pull/102662 for more context
    • I would like to see the staging + staging-next paradigm reflected in the release
    • Short-term solution was to create a project, https://github.com/NixOS/nixpkgs/projects/35 , and just manually merge the PRs so that the Hydra jobset will hopefully only evaluate after all the PRs have been merged

I wasn’t heavily involved, just tried to help some with ZHF so that’s all I can comment on.

On the positive side:

I thought it was quite cool to be able to have such a direct impact on helping the release with relative ease with ZHF.

On the more negative side:

Once or twice I already had a PR out that fixed a package, got (valid) feedback on it, and took a day to address that feedback. In the meantime, someone else opened a newer, less complete PR for the same package, and a different reviewer merged it as-is, basically instantly. I then had to fix the conflicts caused by the PR that flew through review. My suggestion would be to continue working towards more consistent reviews.

Also on the topic of reviews, although this is not at all specific to the release: the PR template suggests running nixpkgs-review wip, which rarely seemed to fail. The reviewer would then run nixpkgs-review pr xxxxxx, which to my surprise does a lot more validation, so it could turn up failures that the wip command did not. Since learning that, I have started running the pr command on my own PRs. Is there a way to get the pr-like experience before creating the PR? If so, maybe switch to that in the PR template?

Fixing the Python packages that were broken was a bit confusing. From what I can tell a big batch of packages just had their versions bumped at some point prior (some scripted thing it seems) without seeing if they work or making any changes to dependencies, then that was merged. Maybe that’s totally “the process” but not knowing that going in threw me off and gave me pause when trying to fix builds, since my assumption was that the builds as-is were working at some point. If this is the desired process (blindly version bump everything, merge it in, then see what breaks) maybe document that or call it out in the ZHF instructions?


I think nixpkgs-review wip only tests uncommitted changes - I’ve had luck with nixpkgs-review rev HEAD before I push to my PR branch
Edit: meant to reply to austin


I’m not sure we should talk about this one; I think we said it was okay that we didn’t do that. We have a staging-next for master because it rebuilds so much. On release staging the influx of changes is much smaller, so a single staging branch has been sufficient, like on master back then.

The problem is that merging PRs which target staging-20.09 lands them on a branch that will start an evaluation and affects current stabilization efforts. This is bad because if you merge a few PRs in different evaluation periods, we will use a lot more Hydra resources than we should; and if the branch is being stabilized, then it’s annoying to iterate on a branch which may get large rebuild commits.

For non-x86-linux, Hydra resources are already spread very thin: https://hydra.nixos.org/queue_summary

Tentative time will be Tuesday, Nov 10 at 12:00 PST / 15:00 EST / 20:00 GMT
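For anyone outside the US, the tentative time can be double-checked with a short Python sketch. It assumes fixed standard-time offsets (PST = UTC-8, EST = UTC-5, CET = UTC+1), which hold in November; the CET entry and the helper name `local_times` are illustrative additions, not part of the announcement.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed standard-time offsets (no daylight saving in November).
ZONES = {
    "PST": timezone(timedelta(hours=-8), "PST"),
    "EST": timezone(timedelta(hours=-5), "EST"),
    "GMT": timezone.utc,
    "CET": timezone(timedelta(hours=1), "CET"),  # illustrative extra zone
}

# Tentative meeting start from the post: Tuesday, Nov 10 2020, 20:00 GMT.
MEETING_UTC = datetime(2020, 11, 10, 20, 0, tzinfo=timezone.utc)


def local_times(moment):
    """Return {zone label: 'YYYY-MM-DD HH:MM'} for a timezone-aware datetime."""
    return {
        label: moment.astimezone(tz).strftime("%Y-%m-%d %H:%M")
        for label, tz in ZONES.items()
    }


if __name__ == "__main__":
    for label, text in local_times(MEETING_UTC).items():
        print(f"{label}: {text}")
```

Fixed offsets keep the sketch free of any time zone database dependency; for dates where daylight saving applies, a real time zone library would be the safer choice.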

I created a Google Calendar invite for the time previously mentioned. Anyone is free to join. @worldofpeace may not be able to join, but I’m expecting @garbas to be present.

I’m going to conduct the meeting using Pivotal Labs retrospective techniques. The best reference I could find is https://tanzu.vmware.com/content/blog/how-to-run-a-really-good-retrospective (although it is light on details).

I will come prepared with a pre-made list of items from this thread. However, people joining are welcome to add additional items. Then the meeting will go as follows:

  • Introductions
  • Brief description of the items, and why they are relevant
  • Everyone has 5 votes, which can be spread across topics or stacked on a single topic (up to 5 votes total)
  • Begin time-boxed discussion on the topics, starting with the most popular topic, and working down
    • 5 min initial discussion
    • 2 min extensions, depending on majority vote.
  • If an “action item” comes out of discussion, then I will record it.
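The voting and time-boxing steps above could be sketched roughly as follows in Python. This is only an illustration of the described protocol, not a tool used for the meeting: the function names, the alphabetical tie-breaking, and the "strict majority" reading of the extension vote are all assumptions.

```python
from collections import Counter

VOTES_PER_PERSON = 5      # from the protocol: everyone has 5 votes
INITIAL_MINUTES = 5       # initial discussion per topic
EXTENSION_MINUTES = 2     # length of each extension


def rank_topics(ballots):
    """Order topics by total votes, most popular first.

    `ballots` maps a participant to a list of topic names; repeating a
    topic stacks several votes on it. Ties are broken alphabetically
    (an assumption, the protocol does not say).
    """
    totals = Counter()
    for person, votes in ballots.items():
        if len(votes) > VOTES_PER_PERSON:
            raise ValueError(
                f"{person} cast {len(votes)} votes; limit is {VOTES_PER_PERSON}"
            )
        totals.update(votes)
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


def time_box_minutes(extension_yes_votes, participants):
    """Total minutes spent on one topic.

    `extension_yes_votes` lists the number of "yes" votes for each
    requested extension, in order. Discussion stops at the first
    request that does not get a strict majority (assumed rule).
    """
    minutes = INITIAL_MINUTES
    for yes in extension_yes_votes:
        if yes * 2 <= participants:
            break
        minutes += EXTENSION_MINUTES
    return minutes


if __name__ == "__main__":
    ballots = {
        "alice": ["hydra", "hydra", "docbook"],
        "bob": ["docbook", "staging", "hydra"],
    }
    for topic, votes in rank_topics(ballots):
        print(f"{votes} vote(s): {topic}")
    print(time_box_minutes([4, 3, 1], participants=5), "minutes")
```

Participants do not have to use all 5 votes; the sketch only rejects ballots that exceed the limit.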

I’m used to doing these in person with a team. Not sure how this will translate to a video conference. If it’s just @garbas and me, it will probably be pretty informal, but representative of the process described above.

It seems as if your link just opens my calendar without any clues about your scheduled event. If I try to open it in an incognito tab, Google asks me to log in first.

Sadly, I cannot really find “the time previously mentioned” (except for a side note about it being biased towards American time zones).

Can you perhaps use a date tag?

[date=2020-11-10 time=00:00:00 timezone="Europe/Paris"] → [date=2020-11-10 time=01:00:00 timezone="Europe/Paris"]

meeting room: https://meet.google.com/dhs-nxbm-ygg
2020-11-10T20:00:00Z → 2020-11-10T21:30:00Z

g cal invite (same as above):


Rescheduling, as neither @worldofpeace nor @garbas was able to make it.

New date:
2020-11-13T21:00:00Z → 2020-11-13T22:30:00Z
Room: https://meet.google.com/yzv-ynuw-fjb

A negative:

  • I think ZHF could have been a lot shorter if email notifications from Hydra were enabled, and judging by the number of reactions on this comment, I am not the only one with this opinion (22 :+1:s at the time of writing; by comparison, the OP has 30 :tada:s)

Having some kind of notification even outside of ZHF would be beneficial. The ideal time to fix something is when it’s first broken on master.

I would also like some type of notification of breakages.


That’s what I meant! If breakages were fixed continuously as soon as they happen, then we wouldn’t have as many at release time, and we wouldn’t need to backport the fixes.


I agree. Though it would be nice to have some filtering. Some of the packages that I maintain or update regularly fail on Hydra with illegal instruction because the CPUs in some of the Hydra machines are too old :frowning: .
