Fixing the staging/staging-next workflow

Creating this topic to move the discussion around the staging and staging-next workflow into its own thread, to avoid cluttering Marketing Team: Can we present Nix/NixOS better? with off-topic noise.

Problem statement

Our current staging and staging-next workflow is ill-equipped to detect fundamental problems with our package set.

Example

https://github.com/NixOS/nixpkgs/issues/96197: anything to do with building a stage1 boot environment was broken.

Discussion topic

What are some realistic ways in which we can prevent such breakages from occurring on a “mainline” branch (e.g. staging, staging-next, master)?

Additional context

Testing changes on staging, or on a PR targeting staging, is usually very painful because a large number of packages must be rebuilt in order to vet the changes.
Although staging-next has a Hydra jobset dedicated to it, by the time a change is in staging-next it is usually coupled with 50-500+ other changes, which makes it difficult to determine which change caused a regression.

Impact

In the case of https://github.com/NixOS/nixpkgs/issues/96197, this prevented all nixosTests from providing any useful validation. Also, the time it took to land the fix caused the branch-off date of the 20.09 release to be pushed back by several days.


We need a smaller jobset for staging (https://github.com/NixOS/nixpkgs/pull/43618), and we should have a tested page for it (as we have for channels), showing the most important jobs that need to pass.


There has been some discussion going on adjacent to #sig:sig-workflow-automation on how to implement a merge train with automatic bisection based on bors + adjacent tooling.
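
To make the idea concrete, here is a rough sketch of how a merge train with automatic bisection could behave. It is not how bors is actually implemented; `merge_train` and `passes` are hypothetical names, and `passes` stands in for whatever CI or Hydra evaluation would decide whether a given set of changes is healthy. It also assumes failures are deterministic, which real builds do not always guarantee.

```python
from typing import Callable, List, Sequence, Tuple


def merge_train(
    base: List[str],
    batch: Sequence[str],
    passes: Callable[[List[str]], bool],
) -> Tuple[List[str], List[str]]:
    """Return (merged, rejected) for a batch of changes queued on top of base.

    The whole batch is tested at once. If it fails, it is split in half and
    each half is retried, so a breaking change ends up isolated on its own.
    """
    if not batch:
        return list(base), []

    candidate = list(base) + list(batch)
    if passes(candidate):
        # The whole batch is healthy: merge it in one go.
        return candidate, []

    if len(batch) == 1:
        # A single failing change cannot be split further: reject it.
        return list(base), list(batch)

    mid = len(batch) // 2
    # Retry the first half on the original base, then the second half on
    # top of whatever survived from the first half.
    merged, rejected_left = merge_train(base, batch[:mid], passes)
    merged, rejected_right = merge_train(merged, batch[mid:], passes)
    return merged, rejected_left + rejected_right


# Hypothetical queue of eight changes where "pr-5" breaks the build.
def example_check(changes: List[str]) -> bool:
    return "pr-5" not in changes


queue = [f"pr-{i}" for i in range(1, 9)]
merged, rejected = merge_train([], queue, example_check)
print("merged:", merged)
print("rejected:", rejected)
```

The point of this shape is that a regression gets pinned to a specific change before it reaches the mainline branch, rather than being discovered later among hundreds of unrelated changes. The cost is extra builds whenever a batch fails.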

There is a draft RFC that lays the groundwork for marking broken packages as broken = true and proactively coordinating a remediation period with downstream maintainers.