Provenance metadata for trained models?

A significant part of the operative logic in some of today’s software is implemented by applying pre-trained models, and we can expect this to become more common. This raises the question of whether such models could also become a concern with regard to transparency or supply-chain security requirements.

Replacing such models is not easy even when it is known how they were trained, since the amount of data and computing power used to train them is often far beyond what is practically available. Furthermore, the training platforms may not be designed to allow reproducible training.

Still, maybe it would be appropriate to have a way to mark models included in our packages, similar to what is already possible with meta.sourceProvenance. That way, users could at least make an informed decision about which (possibly non-transparently created) models run on their machines.
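
As a rough sketch, such a marker could mirror the existing sourceProvenance mechanism. To be clear, modelProvenance and the value names below are purely hypothetical and not defined in nixpkgs today; the package name and URL are made up:

```nix
{ lib, stdenv, fetchurl }:

stdenv.mkDerivation {
  pname = "example-ocr";
  version = "1.0";

  # Hypothetical package that ships a pre-trained model with its source.
  src = fetchurl {
    url = "https://example.org/example-ocr-1.0.tar.gz";
    hash = lib.fakeHash;
  };

  meta = {
    description = "Example package bundling a pre-trained model";
    # Existing mechanism for marking binary vs. source provenance:
    sourceProvenance = with lib.sourceTypes; [ fromSource ];
    # Hypothetical analogue for bundled models; attribute and values are
    # placeholders and would need to be added to nixpkgs' meta checks:
    modelProvenance = [ "pretrainedWeights" "trainingDataUndisclosed" ];
  };
}
```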

What do you think?

Nice, but I wonder whether other fronts carry the same information, i.e. when we download a binary and only ELF-patch it.
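
For that case the existing mechanism already applies; a minimal example of how a package that only patches a downloaded binary can be annotated today (package name and URL are made up):

```nix
{ lib, stdenv, fetchurl, autoPatchelfHook }:

stdenv.mkDerivation {
  pname = "example-prebuilt-tool";
  version = "2.3";

  # Upstream only publishes a binary release; we merely patch the ELF
  # interpreter and rpath so it runs on NixOS.
  src = fetchurl {
    url = "https://example.org/example-prebuilt-tool-2.3-linux-x86_64.tar.gz";
    hash = lib.fakeHash;
  };

  nativeBuildInputs = [ autoPatchelfHook ];

  meta.sourceProvenance = with lib.sourceTypes; [ binaryNativeCode ];
}
```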

And also, how do we deal with inheritance of this information? I.e. modelA → modelB, where A is downloaded from somewhere, has some bias, and is used to create B; then B might inherit this bias.
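
One way such inheritance could be expressed is to carry the upstream model's provenance forward when a derived model is packaged. A hypothetical sketch, assuming a passthru attribute for provenance (again, modelProvenance and all names here are made up; the "fine-tuning" step is just a placeholder copy):

```nix
{ lib, stdenv, fetchurl }:

let
  # Hypothetical base model A, downloaded from some third party.
  modelA = stdenv.mkDerivation {
    pname = "model-a";
    version = "1.0";
    src = fetchurl {
      url = "https://example.org/model-a.bin";
      hash = lib.fakeHash;
    };
    dontUnpack = true;
    installPhase = "install -Dm644 $src $out/model-a.bin";
    # Hypothetical attribute describing the provenance of the weights.
    passthru.modelProvenance = [ "pretrainedWeights" "trainingDataUndisclosed" ];
  };
in
# Hypothetical model B, derived from A.
stdenv.mkDerivation {
  pname = "model-b";
  version = "1.0";
  dontUnpack = true;
  # Placeholder: a real derivation would run a fine-tuning step here.
  installPhase = "install -Dm644 ${modelA}/model-a.bin $out/model-b.bin";
  # Inherit A's provenance so B's consumers see the full chain,
  # including any bias carried over from A.
  passthru.modelProvenance = modelA.modelProvenance ++ [ "fineTunedLocally" ];
}
```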