Kubernetes broken on master, a recurring event

azazel75 · June 16, 2019, 1:42pm

Hi guys, I’m a bit frustrated with the current state of Kubernetes
on master. I tried it today and the tests don’t complete
successfully. While I thank you all very much for your effort in
bringing Kubernetes to NixOS, and while I understand the need to
update the packages and the modules to improve its support, I’m
wondering why this state of things recurs so often.

So I wanted to ask you:

- Am I the only one with non-working tests? If you try them right
  now, do they work for you?
- Why don’t the Hydra builds fail on them? (I must admit I have some
  problems understanding what’s happening on it.)
- Can we coordinate the effort using branches so that the master branch
  is kept functional? I see that some people (like me) submit PRs while
  others commit directly. Is it insane to ask them to run the
  tests before pushing?
- If we need it (and I think we do), can we agree on some common
  goal, either by using this forum and creating a Kubernetes
  subsection in it or by using some other kind of tool?

Thanks for reading and again thank you all for your effort.

arianvp · June 16, 2019, 5:57pm

Correct me if I’m wrong, but I think this is because we are only running the nixos-unstable-small tests on hydra?

I don’t see a tests jobset for the unstable channel, only for the unstable-small channel.

https://hydra.nixos.org/project/nixos

And the unstable-small channel only seems to run a very small subset of the NixOS tests (see the Hydra jobset nixos:unstable-small).

I think (but am not sure) this is because running all the tests for all commits on master is just too expensive.
This is why we fork off twice a year (03 and 09) and run all tests to create a release.

I would suggest staying on the stable channels for this kind of stuff, honestly. That’s what they are there for. That said, I’d prefer to see fewer broken things on master too.

azazel75 · June 16, 2019, 6:15pm

I don’t think they run for every commit, not even for every cumulative push to master.
But when a PR is opened, the bot runs the tests on it… or doesn’t it?

There’s a new version of Kubernetes every 4 months, and while it isn’t always necessary to use the latest release, sometimes it is. I think the main purpose of having tests is to avoid commits that break things. If the tests are too expensive, let’s agree on a policy where the committer runs the tests (only the Kubernetes tests) before pushing.

Feature branches and PRs help greatly in keeping master free of breakage and in giving visibility to one’s work.

jtojnar · June 16, 2019, 6:16pm

nixos-unstable is https://hydra.nixos.org/job/nixos/trunk-combined/tested, see https://howoldis.herokuapp.com/ and also this papercut:

jtojnar · June 16, 2019, 6:37pm

What Hydra does is that it builds these tests periodically on master:

NixOS/nixpkgs/blob/413a59e8ccb2188b7f7e27961d51a661a746b6f8/nixos/release-combined.nix#L41-L144


      
          tested = pkgs.lib.hydraJob (pkgs.releaseTools.aggregate {
            name = "nixos-${nixos.channel.version}";
            meta = {
              description = "Release-critical builds for the NixOS channel";
              maintainers = with pkgs.lib.maintainers; [ eelco fpletz ];
            };
            constituents =
              let
                # Except for the given systems, return the system-specific constituent
                except = systems: x: map (system: x.${system}) (pkgs.lib.subtractLists systems supportedSystems);
                all = x: except [] x;
              in [
                nixos.channel
                (all nixos.dummy)
                (all nixos.manual)
          
                nixos.iso_graphical.x86_64-linux or []
                nixos.iso_minimal.aarch64-linux or []
                nixos.iso_minimal.i686-linux or []
                nixos.iso_minimal.x86_64-linux or []

(The excerpt is truncated here; see the linked file for the rest.)

When all of them succeed, it advances the nixos-unstable channel to the commit the jobset was run on.

A similar thing is done for the other channels, with different sets of tests:

nixos-unstable-small

nixos-xx.yy channels run on the release-xx.yy branch instead of master.

@GrahamcOfBorg currently does not run any NixOS tests unless asked in a comment. See also “Build passthru.tests automatically along with the package” (Issue #368, NixOS/ofborg on GitHub).

The Kubernetes tests are not part of any of these jobsets, so channels advance even when those tests are broken. We do not add non-critical software to the jobsets, in order not to block progress, especially when the software is not properly maintained in nixpkgs (as is evident from the frequently broken tests).

People are expected to run the tests before pushing a commit or opening a pull request, but sometimes an unrelated change breaks them and the breakage is not discovered for a while, especially for less-used software.
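For the Kubernetes tests, running them locally before pushing could look roughly like the sketch below. It assumes a nixpkgs checkout and the nixos/tests/kubernetes entry point from the tree; the file name k8s-tests.nix is hypothetical. The exact nesting of the test attributes is not assumed, which is why the derivations are collected generically rather than named one by one.

```nix
# k8s-tests.nix (hypothetical file name), placed at the root of a nixpkgs checkout.
# Build everything it returns with: nix-build k8s-tests.nix
{ system ? builtins.currentSystem }:

let
  # Only the library functions are needed here, not the whole package set.
  lib = import ./lib;

  # Attribute set of Kubernetes tests from nixos/tests/kubernetes/default.nix.
  tests = import ./nixos/tests/kubernetes { inherit system; };
in
  # Walk the (possibly nested) attribute set and return every
  # test derivation found in it as a flat list.
  lib.collect lib.isDerivation tests
```

The tests boot virtual machines, so expect them to take a while.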

TLDR: various parts of nixpkgs have different levels of support and we do not want to block everything for less supported software most people do not care about.

azazel75 · June 16, 2019, 8:37pm

Is this an agreed-upon policy? It seems like vaporware… Can’t we do better, with little effort (I’m talking to all those who push changes to the Kubernetes stuff)? The upstream project (Kubernetes) alone is big enough, and it takes some effort to get to know it to some level. If, on top of that, we have to worry about a frequently broken configuration codebase, it becomes very frustrating, and it’s sad, in a way…

azazel75 · June 16, 2019, 9:00pm

Thanks @jtojnar for looking this up and making it clear.

jtojnar · June 16, 2019, 9:15pm

See Nixpkgs Reference Manual. It is also mentioned in the PR template.

Looking at the last few Kubernetes changes, the tests were run:

kubernetes: 1.13.3 -> 1.13.4 by johanot · Pull Request #56524 · NixOS/nixpkgs · GitHub
kubernetes: 1.14.1 -> 1.14.2 by r-ryantm · Pull Request #62456 · NixOS/nixpkgs · GitHub
kubernetes: 1.14.2 -> 1.14.3 (CVE-2019-11245) by johanot · Pull Request #62816 · NixOS/nixpkgs · GitHub

It may be big upstream, but not many nixpkgs users seem to care about it, so it receives a corresponding amount of attention. @johanot seems to be the only one who cares about it and they seem to be doing a good job.

One possible improvement would be adding passthru.tests (example) so that the @r-ryantm bot, which often updates the package, can run the tests automatically and skip the update when the tests fail.
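What such a passthru.tests attribute might look like is sketched below. This is not the actual nixpkgs Kubernetes expression: it attaches the attribute from the outside with overrideAttrs so that the snippet stands on its own, and it assumes that pkgs.nixosTests.kubernetes resolves to the Kubernetes NixOS tests in the nixpkgs revision being used. Inside nixpkgs the attribute would presumably be set directly in the package definition instead.

```nix
# Sketch only. Assumes <nixpkgs> points at a revision where
# pkgs.nixosTests.kubernetes exists (an assumption, not verified here).
let
  pkgs = import <nixpkgs> { };
in
pkgs.kubernetes.overrideAttrs (old: {
  # passthru attributes are attached to the resulting package
  # but are not passed to the build itself.
  passthru = (old.passthru or { }) // {
    # Expose the NixOS tests next to the package so tooling can find them.
    tests = { inherit (pkgs.nixosTests) kubernetes; };
  };
})
```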

Another possible improvement could be fixing the e2e test, to increase confidence that an update did not break anything:

NixOS/nixpkgs/blob/3b1c943d573818a125a06a71e1eb2a82565fad26/nixos/tests/kubernetes/default.nix#L4


      
          { system ? builtins.currentSystem }:
          {
            dns = import ./dns.nix { inherit system; };
            # e2e = import ./e2e.nix { inherit system; };  # TODO: make it pass
            # the following test(s) can be removed when e2e is working:
            rbac = import ./rbac.nix { inherit system; };
          }

Yeah, channels need to find a compromise between frequent advances with some software occasionally broken, and everything working but with long stabilization periods and very infrequent releases. I feel like we have found the optimum for the popular software, as maintainers usually fix the bugs very fast, but for less popular software, which has fewer maintainers, the mean time to repair will unfortunately be longer.

azazel75 · June 16, 2019, 9:30pm

As I asked in my first post: have you tried running them yourself? It’s possible that it’s an issue with my system…

jtojnar · June 16, 2019, 9:59pm

I did not try to run them, since Kubernetes is huge and I have some networking issues at the moment.

DuncanHills · July 16, 2019, 2:58am

Hi, I’m new here. I’m wondering what the bottleneck is to running more tests, more often. How long would it take to build and test every package in nixpkgs? If that is “too long,” how long would it take to build all the packages that cross some popularity threshold? Certainly a set of the most popular packages could be built and tested upon every commit, or at least daily.

After some basic research it seems that these builds are generally run on a collection of around 75 machines at TU Delft. I’m unfamiliar with the commit frequency of nixpkgs but nothing about this strikes me as untenable, especially with cached build artifacts and/or an incremental build/test strategy.

Perhaps some parts of what I’m imagining are underdeveloped. If so, what are they, and could I help?