Invalidating a successful build

Where I work there are many long-term bugs that could be addressed by the ability to cancel the result of a successful build and force it to rebuild. The reasons basically all boil down to non-deterministic failures, which are especially pervasive for us because we use custom hardware that has bugs all the way down to the silicon. But this is a problem nix itself has, and it compromises by not caching failed builds, on the assumption that most flaky results are false failures. The other reason is that we have multi-machine coordinated builds, which should either succeed or fail as a unit, and it’s easier to cancel a partial success after a single failure than to get everyone to succeed or fail consistently.

So I’ve been thinking about how to do this, and thought I’d run some ideas by people here, in case others have thought of this too. This is part of “Developing a system that replaces nix remote build”, which BTW has been in production for a while now, and I’m exploring ways to import/export from a monorepo (so if someone knows about that, advice appreciated, but on the other thread!).

So, the nix cache and its protocol have (so far as I know) no support for removing a cache entry. Supporting it would need some kind of serial number on out paths, and clients would need a quick way to check the validity of what they have. Or something; none of it exists, so it’s all theoretical.

Lacking output invalidation, the other option is input invalidation. However, drvs are famously created by arbitrary uncontrolled nix expressions, and (so far as I know) you cannot just load a bunch of drvs, modify them (say, increment a serial), and then get the whole graph rehashed and recreated. Say we have a central poison.nix which is downloaded from a central place on each build, and loaded to inject serial numbers into drvs as they are created. But how should its entries be addressed? The ideal way would be by the drv itself, because we want to invalidate the results of an individual drv, so e.g. poison.nix would contain a set keyed by $hash-name.drv… but that would presumably require special support from the derivation primitive itself. Otherwise, it has to be something like { name = 3; } and mkDerivation has something like serial = poison.${name} or 0. But the name is the wrong way to address an invalid result: it forces everyone building that thing to rebuild, even if theirs is not the problematic drv.
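To make the difference between the two addressing schemes concrete, here is a minimal Python sketch. It is a toy model, not nix: the two tables stand in for what poison.nix might contain, and the drv paths, names, and helper functions are all made up for illustration.

```python
# Hypothetical poison tables; in the scheme above these would live in poison.nix.
poison_by_name = {"firmware": 3}           # like { firmware = 3; }
poison_by_drv = {"aaaa-firmware.drv": 3}   # keyed by an individual (made-up) drv path

def serial_by_name(drv_path, name):
    """Serial lookup keyed on the derivation name, defaulting to 0."""
    return poison_by_name.get(name, 0)

def serial_by_drv(drv_path, name):
    """Serial lookup keyed on the drv path itself, defaulting to 0."""
    return poison_by_drv.get(drv_path, 0)

# Two different drvs that happen to share a name; only the first had a bad build.
drvs = [("aaaa-firmware.drv", "firmware"), ("bbbb-firmware.drv", "firmware")]

# A nonzero serial changes the drv, so it forces a rebuild.
rebuilt_by_name = [p for p, n in drvs if serial_by_name(p, n) != 0]
rebuilt_by_drv = [p for p, n in drvs if serial_by_drv(p, n) != 0]
```

With name addressing both drvs get a nonzero serial and rebuild, while drv addressing only touches the one that was actually bad.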

So much for that. My other thought was that since I do builds with buildDerivation, which takes a serialized drv rather than a file on disk, maybe I actually can modify them. There’s a comment in store-api.hh which says “drvPath This is used to deduplicate worker goals so it is imperative that is correct. That said, it doesn’t literally need to be store path that would be calculated from writing this derivation to the store”. I take this to mean I can modify the drv contents and send any old thing as drvPath as long as it’s unique, so I could just hash my new contents. Further, since buildDerivation only builds a single thing, I’m not sure what deduplication this is referring to. Anyway, I can keep a drv->serial DB, and inject serials into drvs. Once a drv is modified, I have to recalculate its outs, so I need the hash anyway. It’s unclear whether it would have to exactly match the very complicated algorithm in prim_derivationStrict or whether “uniquely determined by drv content” is enough. Then I would need to go through the downstream drvs in topo order and update their inputSrcs. This is complicated by the fact that they’re BasicDerivations, so inputDrvs have already been converted to out paths, and I would need the original drv path to look up the modified outs. But it’s doable.
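Here is a rough Python sketch of that rehash pass, just to check that the idea hangs together. It is a toy model under loud assumptions: `toy_out_path` is a made-up stand-in hash and not the real algorithm from prim_derivationStrict, the drv records are plain dicts rather than BasicDerivations, and I assume a side table that still knows each drv's original input drv paths.

```python
import hashlib
from graphlib import TopologicalSorter  # Python 3.9+

def toy_out_path(name, builder, input_paths, serial):
    """Toy stand-in for output-path computation. NOT nix's real algorithm:
    it only guarantees the path is uniquely determined by the contents."""
    h = hashlib.sha256()
    for part in [name, builder, str(serial), *sorted(input_paths)]:
        h.update(part.encode())
        h.update(b"\0")
    return f"/toy/store/{h.hexdigest()[:32]}-{name}"

def rehash(drvs, serials):
    """Recompute every out path after injecting serials.

    drvs:    {drv_path: {"name", "builder", "input_drvs": [drv_path, ...]}}
             (must be closed: every input drv is also a key)
    serials: {drv_path: int}, the drv->serial DB; missing means 0
    Returns  {drv_path: out_path}
    """
    # Map each drv to its dependencies so that inputs are visited first.
    graph = {path: set(d["input_drvs"]) for path, d in drvs.items()}
    outs = {}
    for path in TopologicalSorter(graph).static_order():
        d = drvs[path]
        # Look up the (possibly modified) outs via the original drv paths.
        inputs = [outs[i] for i in d["input_drvs"]]
        outs[path] = toy_out_path(d["name"], d["builder"], inputs,
                                  serials.get(path, 0))
    return outs

# Chain a <- b <- c, plus an unrelated d. All paths are hypothetical.
drvs = {
    "a.drv": {"name": "a", "builder": "sh", "input_drvs": []},
    "b.drv": {"name": "b", "builder": "sh", "input_drvs": ["a.drv"]},
    "c.drv": {"name": "c", "builder": "sh", "input_drvs": ["b.drv"]},
    "d.drv": {"name": "d", "builder": "sh", "input_drvs": []},
}
before = rehash(drvs, {})
after = rehash(drvs, {"b.drv": 1})  # invalidate only b's result
changed = sorted(p for p in drvs if before[p] != after[p])
```

Bumping the serial for b changes the out paths of b and everything downstream of it (c), while a and d keep theirs, which is the invalidation behaviour I'm after. Whether the real store accepts out paths computed this way is exactly the open question.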

Would this even work? It seems plausible, but is presumably untrodden ground for nix, which never does this internally. Has anyone considered the general problem? I assume no one has considered this particular solution, because the invalidation would only work on builds done through my system: a plain local build won’t know about any of it and will simply download the bad result.