LLM/VLM for NixOS VM test screenshot analysis

I’m packaging some visual GUI software, such as MeshLab and CloudCompare.

For those, I’m generally adding NixOS VM tests to automatically ensure that the GUI works; this makes it easier to detect breakage introduced by PRs.

However, it is not fully automated yet: somebody has to look at the screenshot.

Is there a small, local, deterministic VLM in nixpkgs that I could use in the VM tests to analyse the screenshot, so that I can give it the screenshot and ask “does the program show a cube with colours red+green+blue”?

Has anybody got experience with this approach?


AI models, oversimplifying, are impure math functions. And that is by design. That makes them non-deterministic and “creative” (for lack of a better word), which is in stark contrast to nixpkgs’ aim.

Additionally, an answer made of tokens hallucinated by an LLM/VLM should not be used to determine whether your test passed or failed. Instead, in your case, I’d rather take a screenshot of the GUI app in full screen mode and check the ratio of pixels of one colour to pixels of another.
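A minimal sketch of such a colour-ratio check, assuming the screenshot has already been decoded into a list of (R, G, B) tuples. The function names and the `margin` value are made up for illustration; they are not part of any NixOS tooling.

```python
from typing import Dict, Iterable, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel 0..255


def dominant_channel(pixel: Pixel, margin: int = 50) -> str:
    """Return 'red', 'green' or 'blue' if that channel clearly dominates, else 'other'."""
    r, g, b = pixel
    if r > g + margin and r > b + margin:
        return "red"
    if g > r + margin and g > b + margin:
        return "green"
    if b > r + margin and b > g + margin:
        return "blue"
    return "other"


def colour_fractions(pixels: Iterable[Pixel]) -> Dict[str, float]:
    """Fraction of pixels per dominant colour; an empty input gives all zeros."""
    counts = {"red": 0, "green": 0, "blue": 0, "other": 0}
    total = 0
    for pixel in pixels:
        counts[dominant_channel(pixel)] += 1
        total += 1
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: count / total for name, count in counts.items()}
```

A test could then assert, for example, that each of red, green and blue covers at least some small fraction of the screen.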

Or better yet, have a reference image of what the output should look like. Compare the reference with the current screenshot and pick a threshold for a heuristic match/mismatch.
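As a rough sketch of the reference comparison, assuming both images are equally sized lists of (R, G, B) tuples: the metric (mean absolute difference) and the default threshold of 5.0 are arbitrary illustrative choices, not recommendations.

```python
def mean_abs_diff(reference, current):
    """Mean absolute per-channel difference between two pixel lists, in 0..255."""
    if len(reference) != len(current):
        raise ValueError("images differ in size")
    if not reference:
        return 0.0
    total = sum(
        abs(a - b)
        for ref_pixel, cur_pixel in zip(reference, current)
        for a, b in zip(ref_pixel, cur_pixel)
    )
    return total / (3 * len(reference))


def matches_reference(reference, current, threshold=5.0):
    """Heuristic match: the average channel difference stays at or below the threshold."""
    return mean_abs_diff(reference, current) <= threshold
```

The threshold would have to be tuned per application, since font rendering and anti-aliasing alone can shift pixel values slightly.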


That makes them non-deterministic

This is incorrect (or, as you say, oversimplifying). They can be just as deterministic as other software (it depends on whether you use true random number generators or seeded PRNGs, rely on thread scheduling or use deterministic reductions, and so on).
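To illustrate both points with plain Python (nothing here is specific to any inference engine): a seeded PRNG reproduces the same sequence on every run, while floating-point addition is not associative, so a reduction whose order depends on thread scheduling can change the result.

```python
import random


def seeded_samples(seed, n=5):
    """Draw n pseudo-random floats from a PRNG with a fixed seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


def sum_in_order(values):
    """Sum floats strictly in the given order."""
    total = 0.0
    for value in values:
        total += value
    return total


values = [0.1, 0.2, 0.3]
forward = sum_in_order(values)             # (0.1 + 0.2) + 0.3
backward = sum_in_order(reversed(values))  # (0.3 + 0.2) + 0.1

# Same seed, same sequence; same order, same sum; different order, different sum.
print(seeded_samples(42) == seeded_samples(42))
print(forward == sum_in_order(values))
print(forward == backward)
```

Fixing the seed and fixing the reduction order (for example by running single-threaded) removes both sources of variation.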

Of course, only a deterministic model + engine combination makes sense for automated tests (though in fact many current Nix packages and tests are plenty nondeterministic; we try to fix those as much as possible).

For example, a llama.cpp developer points out: “you can already get bit-for-bit identical results”

Finally, NixOS VM tests have already been using ML models for over 5 years:

The test driver’s get_screen_text() function uses tesseract for OCR, which since version 4 uses neural networks:
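A small sketch of how that OCR path can be used from a test script. Only `get_screen_text()` is taken from the sentence above; the helper name `screen_matches` and the example pattern are hypothetical, and `machine` stands for the machine object that the test driver provides.

```python
import re


def screen_matches(machine, pattern: str) -> bool:
    """OCR the current screen via the test driver and search the text for a regex."""
    text = machine.get_screen_text()
    return re.search(pattern, text) is not None


# Inside a testScript this might be used as:
#   assert screen_matches(machine, r"Scans:\s*1")
```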

take a screenshot of the GUI app in full screen mode and check for the ratio of one coloured pixels against another

This is unreliable in the face of changes to the GUI. For example, the tested program’s author might slightly change the GUI, breaking this ratio test.

reference image

Even more fragile in the face of changes, needing frequent human correction. The idea is to reduce that.


I’m packaging some visual GUI software, such as MeshLab and CloudCompare.

First off, gawbless. Thank you for attempting this.

re: LLMs…

You can use something that does perceptual difference or structural similarity without having to go full LLM. Maybe one of these might help?


Looking back, I don’t sound like I wanted to. Sorry about that. My intention was to point out that yes, while LLMs/VLMs can be deterministic, they’re not designed to be. Some are, some aren’t. I wouldn’t rely on the models to be honest.

perceptual difference or structural similarity

These are image perception tools though, designed to compare almost-exact images (e.g. finding small pixel differences, or ignoring them).

I want more: For it to be robust against the software reordering buttons, changing default widths of panels, changing the default display axis so the 3D object may now be looked at from a completely different angle, and so on. These things would all change the images very significantly.

Being robust against that would reduce human review/update effort significantly, compared to tests breaking on small pixel differences, needing code updates (which the plain screenshots of today don’t need).


I did some testing here and had good results with gemma-4-E2B-it-Q4_0.gguf which is small enough to run comfortably in VM tests.

First, I improved the CPU inference performance of llama-cpp on NixOS by 13x by enabling runtime AVX2 dispatch:

This makes it fast enough to run single-threaded (for determinism) in a NixOS VM test.

I experimented with these models:

  • SmolVLM-500M-Instruct-Q8_0.gguf
    • Too dumb. While it’s good at detecting cats, it cannot describe GUIs well at all, hallucinates a ton, claiming that the screenshot of e57inspector is a property listing.
    • Especially bad is that a white empty image is an adversarial case for it, where it hallucinates a man. Apparently hallucination on plain inputs is well known for such older models. This is obviously a problem for NixOS VM tests where the failure to launch a program might create entirely white/black screenshots.
  • SmolVLM2-2.2B-Instruct-f16.gguf
    • Already better. Can figure that the GUI of e57inspector is a computer program.
    • Can OCR the random text on it almost correctly.
    • Still hallucinates that a fully white image is a screenshot of the New York Times website.
  • gemma-4-E2B-it-Q4_0.gguf
    • Much better. Describes the GUI of e57inspector in large detail, figuring out based on the button icons that it must be a 3D-related program.
    • OCR still not fully correct.
    • Correctly points out the white image as entirely white. The only small mistake is the claim “There are some vertical black lines visible on the far left and far right edges”; I have not checked if this might be caused by incorrect cropping by llama-cpp (it wouldn’t be the first time some convolution fills out-of-bounds pixels with black 0 bytes).
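Since entirely white or black screenshots are the problematic input for the smaller models, one option is to reject them before the model is asked at all. This is only a sketch of that idea, not something from the experiments above; the function name and the `tolerance` value are hypothetical, and the screenshot is assumed to be a list of (R, G, B) tuples.

```python
def is_blank(pixels, tolerance=2):
    """True if every channel of every pixel is within `tolerance` of the first pixel."""
    pixels = list(pixels)
    if not pixels:
        return True  # treat an empty image as blank
    first = pixels[0]
    return all(
        abs(channel - reference) <= tolerance
        for pixel in pixels
        for channel, reference in zip(pixel, first)
    )
```

A test could fail early with a clear message when `is_blank` returns True, instead of passing a uniform image to the VLM.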

Overall, the small Gemma 4 model seems quite adequate for use in VM tests.

Also worth pointing out that NixOS VM tests already have a feature to OCR text from screenshots; likely, VLMs will be much better at the same task.

Single-threaded, such a Gemma invocation takes 28 seconds on my CPU; this is acceptable compared to the overall cost of VM tests.

Multi-threaded is much faster, e.g. 6 seconds on 8 cores. But I have not checked yet if llama-cpp’s implementation is deterministic when multithreaded.

Example

Prompt:

Describe the given screenshot.
Next, answer this question and explain your answer:
Does the screenshot show a software application that has successfully loaded a thing called ColouredCubeFloat with some 3D scans in it?
Subsequently, output additionally exactly YES or NO for that question.

Screenshot:

Output:

The screenshot displays the interface of a software application, likely related to 3D scanning or visualization, named “e5Inspector”. The interface shows a file browser/viewer on the left and a main content area on the right.

On the left panel, there is a file structure showing a folder named “Data 3D” containing a file named “c1d7wzk6pf65kdw6hkkh2damgwff” and an “Images 2D” section.

The main content area shows a summary or details for a specific item. The title displayed is “c1d7wzk6pf65kdw6hkkh2danmwwff”, and below it, it states “ColouredCubeFloat”. Below this, it indicates “Scans: 1” and “Images: 0”.

The question asks whether the screenshot shows a software application that has successfully loaded a thing called ColouredCubeFloat with some 3D scans in it. Based on the text “Scans: 1” and the title “ColouredCubeFloat”, it suggests that the object or data named “ColouredCubeFloat” has been loaded, and it has 1 scan associated with it.

YES

Despite the model being quite small AND quantized from 16 to 4 bits, it’s already quite good at describing UIs.
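To turn such an answer into a test result, the final verdict has to be extracted strictly. A minimal sketch, assuming the verdict is the last non-empty line of the output as in the example above; the function name `parse_verdict` is hypothetical.

```python
def parse_verdict(output: str) -> bool:
    """Return True for YES and False for NO on the last non-empty line; raise otherwise."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty model output")
    last = lines[-1]
    if last == "YES":
        return True
    if last == "NO":
        return False
    raise ValueError(f"no YES/NO verdict on the last line: {last!r}")
```

Raising on anything other than an exact YES or NO means that a rambling or truncated answer fails the test instead of being silently interpreted.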


Next, I tackled determinism.

I found that llama-cpp wasn’t deterministic when given --image despite --temp 0 --threads 1.
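For reference, an invocation with those flags could be assembled roughly as follows. Only `--image`, `--temp 0` and `--threads 1` are taken from this post; the binary path and the `-m`, `--mmproj` and `-p` options are assumptions about the multimodal CLI in use and should be checked against its `--help` output.

```python
def build_vlm_command(binary, model, mmproj, image, prompt):
    """Build the argv list for a single-threaded, temperature-0 VLM run."""
    return [
        str(binary),               # assumption: path to the multimodal CLI
        "-m", str(model),          # assumption: model file option
        "--mmproj", str(mmproj),   # assumption: vision projector option
        "--image", str(image),     # from the post
        "-p", prompt,              # assumption: prompt option
        "--temp", "0",             # from the post: greedy decoding
        "--threads", "1",          # from the post: single-threaded
    ]
```

The resulting list can be passed to `subprocess.run` without a shell, which avoids quoting problems with multi-line prompts.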

So I fixed it, by fixing the underlying issue in OpenBLAS.

Issues and PRs I made for that:

With that, inference seems to be bit-deterministic.
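One simple way to check this kind of claim is to run the same inference several times and compare hashes of the raw output. The sketch below assumes a caller-supplied `run` function that returns the output as bytes; the function names and the repetition count are illustrative.

```python
import hashlib


def output_digests(run, repetitions=3):
    """Run the callable several times and collect the distinct SHA-256 digests."""
    return {hashlib.sha256(run()).hexdigest() for _ in range(repetitions)}


def is_bit_deterministic(run, repetitions=3):
    """True if every repetition produced byte-identical output."""
    return len(output_digests(run, repetitions)) == 1
```

Here `run` could wrap `subprocess.run(cmd, capture_output=True, check=True).stdout`. Identical digests only show agreement for the tested machine and number of repetitions; they do not prove determinism across different CPUs or builds.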


I have cleaned this approach up now, and made a PR for cloudcompare and e57inspector:


I’ve mentioned your PR in Full Time Nix | Bump 2026-05-18 hoping it will receive a review.
