Nixpkgs pull request: ask for instructions to test

Thanks to Nix’s great design, many people can contribute to nixpkgs… the problem is that Nix is a victim of its own popularity: many pull requests take ages to be reviewed and merged, to the point that this thread was created.

I have the feeling that this process could be sped up with a simple addition: we should semi-automatically give instructions on how to test a PR. This way:

  • testing most PRs is just a simple copy/paste
  • it is easier to see which PRs are easy to review and which need more effort

Some people and scripts already do this well. For instance, nixpkgs-update provides nice instructions, as seen here. But most people don’t (I just picked random PRs): even simple PRs that just introduce a new package, like here, do not provide any instructions, even though they are trivial to test with nixpkgs-review… and we can’t really blame anyone, as we don’t provide any template to make it easy. For more complex PRs, like here, it’s not clear how simple testing the PR actually is… adding instructions would clarify this point.

Of course, it’s hard to automatically generate instructions for every PR, but at least adding a ‘how to test this PR’ section to the template, pre-filled with some generic code, would help a lot and would force the author of the PR to think about it (maybe GitHub provides a way to automatically insert the URL of the current PR, to make it even easier to integrate with nixpkgs-review? Otherwise we could use a bot that edits the original message.).

What do you think?


Yes, fully agreed.

Would you mind opening a PR to adapt the PR template? A concrete proposal is always easier to evaluate than questions or opinions, and much more likely to produce a tangible outcome.

A minimally invasive change would be to provide a section at the end:

# How to test this change?

<!-- provide instructions to reviewers -->
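To pre-fill it with some generic code, as suggested above, the comment could even carry a default nixpkgs-review invocation (just a sketch of possible wording; the command shown is one common way to run it):

# How to test this change?

<!--
  For most package changes, reviewers can run:
    nix-shell -p nixpkgs-review --run "nixpkgs-review pr <PR number>"
  Please add anything specific to this PR (programs to start, services to check, …).
-->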

The PR thread is an abomination, and in my opinion quite pointless: a PR should be opened (or put out of draft mode) if and only if it is ready for review. Picking PRs for review should not be based on recency, but on some measure of importance (as proposed here: Improving Nix developer experience). Following up on changes should follow a well-defined process that can already be managed entirely through GitHub via notifications.

The PR templates are also way too noisy and would benefit from some refactoring. But that’s just my two cents of ranting about problems I will probably never get around to addressing myself.

Sure, I can give it a try this evening. Does anyone know if we have a procedure to automatically replace, say, {{CURRENT_PR_URL}} with the URL of the current PR? This URL is needed for nixpkgs-review but does not exist yet when you create the PR. Maybe I can use GitHub Actions for that; I need to check whether it has enough rights to edit a PR.

For a while it was quite trendy to execute the PR testing instructions that the PR template gives (running nixpkgs-review en masse on open PRs); some people even had a bot running that did this automatically on PRs. Quite frankly this was a disaster and led to a lot of pointless noise.

This is because testing whether things still build is a bit error-prone (for large changes you usually need to check for false negatives as well), and it is not the only condition for deciding whether something is mergeable. I would argue that code review and a judgement of the potential impact must come first, and that build testing should then be done to verify one’s assumptions; the problem, of course, is that not everyone is necessarily familiar enough with the matter to determine this.

We have automatic testing infrastructure, i.e. ofborg, which I’d argue is much more important. By filling passthru.tests with appropriate cases and reverse dependencies, it can also give a decent indication of impact. The big advantage of this is trust and, crucially, platform coverage. Whether something works on x86_64-linux is not a hard question for most people to answer, so semi-automatic testing of changes mostly only contributes to answering that question.


Okay, agreed. I skimmed a couple of recent pull requests, and none of them has tests. I suppose this is because writing tests for Nixpkgs and NixOS feels kind of hard and inscrutable. We could address this by adding better documentation and providing guidance, e.g. tutorials. Just thinking out loud; I’m not actually going to do this any time soon, but it’s something to keep in mind.

Also, NixOS tests have a ton of overhead from booting the heavy VM. Not the best debugging experience either, even though tests are even more finicky than builds… And hacky non-NixOS-based solutions for GUI testing have been, and probably will keep being, rejected for duplicating functionality.

This might get better after Nix PR 3600 gets merged (it is mainly about not using the nixbldN users, but as a minor side effect it allows running NixOS-in-a-container inside a Nix build with much, much better «boot» times).

For non-leaf packages, just inheriting some reverse deps into passthru.tests (like harfbuzz does) is already really valuable. It doesn’t need to be anything fancy.

passthru.tests can also contain simple derivations, which usually gets you quite far.
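For example, a rough sketch (the package names here are made up; runCommand and the finalAttrs pattern are the usual nixpkgs idioms):

# hypothetical package expression; somelib and someconsumer do not exist
{ stdenv, runCommand, someconsumer }:

stdenv.mkDerivation (finalAttrs: {
  pname = "somelib";
  version = "1.0";
  # src, buildInputs, … omitted

  passthru.tests = {
    # rebuild a (made-up) reverse dependency so ofborg reports obvious breakage
    inherit someconsumer;

    # a trivial smoke test as a plain derivation
    starts = runCommand "somelib-starts" { } ''
      ${finalAttrs.finalPackage}/bin/somelib-tool --version > $out
    '';
  };
})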


Sorry for taking so long to answer.

@sternenseemann: Thanks for your comment.

I feel you are actually addressing another (complementary) question: I understand that automatic testing may be imperfect and noisy, but that is not really the point of the current proposal, as the instructions are still meant for humans (a perfect automatic testing solution might even be better… but would certainly require much more work and infrastructure). In my case, however, ofborg at least once helped me realize that a build was broken on macOS, so it was not completely useless.

Your comment that we need to test other things is exactly the point of this post: a reviewer (or even a bot, but that’s another question) should know what to test. Should one just try to compile and start the program? (If yes, how?) Are there dependencies that must also be tested? Is it supposed to be a “simple” PR with no expected nasty side effects, or a huge PR that must be battle-tested?…

That being said, the question of automatic testing is interesting:

  • first, it allows simple PRs (e.g. bumping the version of a package without further dependencies) to be tested and validated quickly
  • it helps (but does not replace) reviewers of major PRs in avoiding breakages

Moreover, in 70% of cases, testing a program basically means just running it to see if it starts (at least that’s what I would do to test a PR that has no further instructions). I’m sure the code to test this is easily reusable and could be packed into a library as easy to use as

runBasicTests {
  gui = true;
  programToRun = "chromium";
  checkIfTextOnScreen = "New tab";
}

and in my opinion the overhead of the VM is small compared to the time needed to compile the program. I just tried: booting a VM without a GUI took me 10 s (while compilation would take ~2 min), and booting a VM with a GUI and starting chromium took me 27 seconds (try compiling chromium… it will take far more than 27 seconds).
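Something fairly close to the runBasicTests sketch above can already be written with the existing NixOS test framework; here is a rough sketch (the OCR-based wait_for_text call is real but slow and fragile, and a real GUI test would also need a display manager with an auto-login session, so take this purely as an illustration of the idea):

nixosTest {
  name = "chromium-starts";
  enableOCR = true;  # required for wait_for_text

  nodes.machine = { pkgs, ... }: {
    services.xserver.enable = true;
    environment.systemPackages = [ pkgs.chromium ];
    # a graphical session with auto-login would also be needed here
  };

  testScript = ''
    machine.wait_for_x()
    machine.execute("chromium >&2 &")
    machine.wait_for_text("New tab")
  '';
}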

That being said, if VMs add too much overhead (which is certainly true for tiny libraries), it may be interesting to add another option like passthru.advancedTests so that people can explicitly run these tests when needed.


I like the output of nixpkgs-update. In combination with proper passthru.tests, including reverse dependencies, it can automate and speed things up a lot.

I think this would give us a false sense of security. When GUI programs are broken and don’t work, they often still open fine and display some text, but beyond that things are obviously broken, even to people seeing the program for the first time.

This is already covered by meta.mainProgram.
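That is, something like the following in the package definition (chromium just mirrors the sketch above; helpers like lib.getExe read this attribute to find the default executable):

meta.mainProgram = "chromium";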

You are wrong on that. VMs currently have a very big overhead compared to plain derivations, mainly because they boot a full system (everything after the bootloader stage) with systemd and a minimal set of services. If we want to use them more widely and for simpler tests, we need a lightweight variant.

Booting without a GUI might work for simpler programs, but anything more complex might require additional services or a desktop environment. Currently it is also not that easy to debug failed tests, and I wouldn’t want to invest too much time only to find out later that the Breeze theme was missing.

Hmm, maybe; in my experience, broken GUI programs would often just segfault, and I don’t think I ever saw a program that started properly while being broken. But that does not mean it never happens, and I may be wrong. And again, this is not meant to replace reviewers; it’s just an additional guard against unexpected breakages on large changes (I don’t think reviewers would want to compile and run all 200 reverse dependencies of an updated library “just in case”, and I doubt they do it right now… having a false sense of security is maybe better than having no idea at all whether packages would break or not).

Well, I just gave numbers above showing that starting a VM with a GUI and chromium takes about 30 seconds. That is certainly not small compared to starting an empty derivation, but derivations are never empty. If you compile a program like chromium locally, you will see that it takes much, much longer than 30 seconds, so the testing time is still negligible (or at least comparable) compared to the compilation time.

That being said, it would be great to provide lighter solutions, but I would still love these alternatives to be as reproducible as possible (or at least to provide various options to trade reproducibility for efficiency), while staying simple enough (if you spend an hour figuring out that Docker doesn’t set up user rights correctly, then saving the 30 s the VM takes to start wasn’t worth it) and secure enough (running arbitrary code on the main computer without a virtual machine can be scary; at least QEMU provides some sort of isolation). I guess one could imagine starting the VM just once instead of once per tested program.

If people don’t want to debug a broken test but can still see that the program runs fine, we could add an option to locally mark a test as broken. And anyway, all VM tests should be optional; only people wanting to test changes would decide to run them.

I have regularly seen Qt and GTK apps that didn’t find some fonts, icons, or individual glyphs, but they ran fine apart from that.

I had this especially with GTK apps on Plasma and Qt apps on GNOME.

PR submitted: Add instructions for reviewer in PR template by tobiasBora · Pull Request #242733 · NixOS/nixpkgs · GitHub!

@NobbZ can’t these kinds of errors be detected by reading the app’s error messages, for instance?

If they emit messages, sure. Though the way those messages are emitted is not necessarily uniform, and it’s not always an error.

It could just be that they didn’t find the first choice of font and then fell back to another, which is expected and just how fonts (should) work.