Code attribution policy

piegames · August 11, 2024, 8:23am

In recent months, I’ve heard several allegations of people copying foreign code into Nix projects with no attribution or with improper attribution. Part of the conflict is disagreement about what exactly constitutes proper attribution. I therefore propose to establish some agreed-upon rules and put them somewhere in the contributor guidelines. Some ideas of mine, for a start:

- Source of truth for attribution must be the git history; everywhere else (GitHub etc.) is auxiliary.
- Attribution includes references to the original code, the original author(s), and, if applicable, issue and PR references.
- Full cherry-picks should preserve the original commit metadata to the extent that is possible.

Note that software licences don’t specify much of anything here; this is purely about a social contract. There is little to no legal obligation, and I have no interest in talking about licenses and legal stuff here.

jade · August 11, 2024, 9:12am

IMO copying verbatim more than three lines of code from another change not originally intended for a project, especially without talking to the authors, should get at least a link back to the original source and often a Co-authored-by.

This is not as much of a worry in terms of code review changes getting squashed, as long as the reviewer doesn’t substantially change the thing and doesn’t ask for credit (which is the most common case).

I think a lot of what feels bad is when someone goes and picks a change, doesn’t communicate anything, and then doesn’t write anything in the commit message about having done so.

In summary what I want if someone is picking my change is:

- If it’s essentially verbatim, leave the author tag alone and maybe add yourself as Co-authored-by depending on how much you changed (e.g. maybe you added build system changes for make or something that doesn’t exist in Lix).
  - This is slightly different for me in the case of completely unrelated projects where it’s not a change in response to a change, but simply taking code as it is.

    This may be because one particular function isn’t quite right or because of wanting to avoid a gratuitous dependency.

    An example of a case like this is where I copied an entire function from Python. I checked that the license is compatible, added a copyright header to the file as required by their license, and added a comment in the source about where the function came from in Python. The point of this is documenting that this is foreign code, not written by me, and that it might change in the future, so it should be possible to find it again if it ever has bugs.
- If a substantial amount is used, add Co-authored-by, but not if it’s rewritten to the point of indistinguishability.
- If basically nothing is used, still consider linking to the other implementation’s change if you were aware of it while writing your change (obviously if you just didn’t know, that’s fine). It makes commit history a lot more useful.
- Regardless of the above, please link to the place you got the commit from and the PR/context. A commit hash is very helpful, but if it’s cross-repo so it won’t resolve, consider adding a web link for the commits/CLs/PRs contained within. This is very helpful for figuring out the context of a change when debugging it later, and is context that’s really hard to build up later.
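As a rough illustration of the points above, here is a minimal Python sketch that assembles a commit message carrying that context. The function name `build_commit_message`, its parameters, and the "Original change:" label are hypothetical and only for illustration. The `Co-authored-by:` trailer and the "(cherry picked from commit ...)" line follow the conventions used by GitHub and by `git cherry-pick -x` respectively.

```python
def build_commit_message(subject, body, source_url=None,
                         picked_from=None, co_authors=()):
    """Assemble a commit message that records where a change came from.

    All names here are illustrative, not part of any existing tool.
    """
    lines = [subject, "", body.strip()]

    # Context block: a web link that still resolves across repositories,
    # plus the original hash in the format `git cherry-pick -x` appends.
    context = []
    if source_url:
        context.append(f"Original change: {source_url}")
    if picked_from:
        context.append(f"(cherry picked from commit {picked_from})")
    if context:
        lines += [""] + context

    # Trailers go in the final paragraph so that tooling can find them.
    trailers = [f"Co-authored-by: {name} <{email}>"
                for name, email in co_authors]
    if trailers:
        lines += [""] + trailers

    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_commit_message(
        "Fix path normalisation",
        "Ported from another implementation with minor adjustments.",
        source_url="https://example.org/changes/1",  # placeholder URL
        picked_from="abc123",                        # placeholder hash
        co_authors=[("Jane Doe", "jane@example.org")],
    ))
```

When picking a commit with git itself, leaving the author field untouched and adding the context to the message gets the same result without any helper.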

If it’s relevant, it can be nice to link to the other implementation’s change even if the change was completely rewritten, since this helps keep track of cross project context.

This isn’t about hard and fast rules but rather about a culture of credit and helping people use the code base years into the future, even when there are forks. I am not super bothered if my code is taken, but removing the context is shooting yourself in the foot and is unkind to both the original author and your future self.

Again, these aren’t hard and fast rules, and nobody is perfect. The case that makes people uncomfortable is when there’s no way to dig through either commit history or GitHub PRs (I would discourage putting stuff just in the PR message or comments; it’s a big extra step to find it) to find where the original code came from.

Longer commit messages are great, especially if they add useful information about what everyone was thinking that led to that commit being made.

I will note that including explicit acknowledgement in the source is practically an exceptional case. For most changes, including all of those that have led to my friends feeling their work has been taken without credit, the only thing that needed more care was the commit message. The case where this differs in my view is vendoring library functions wholesale, because then the reader may wonder why the code style doesn’t match, for instance, or the functions may need updating.

Also, I acknowledge GitHub does an awful job of surfacing commit messages to reviewers and certainly makes fixing them as a reviewer a pain, requiring the git command line (it’s one click in Gerrit, and the commit message is front and center for reviewers; if you’ve read Lix commit messages you’ll see the cultural impact of this difference). So a certain amount of this is a shadow cast by our tools.

Anyhow, when in doubt, link to the code you looked at. It will help your future self and other maintainers a lot. People are a lot more ok with trying and not getting it quite right than not trying.

Personally I would not be mad at someone taking pretty much any of my contributions and resubmitting them, as long as there is some credit and ideally context, no matter how that credit is precisely arranged, no matter what the author field says. It’s most important to do your best. The problem is when there’s simply no such context in the commit metadata or the code (if appropriate).

roberth · August 11, 2024, 11:34am

I feel like that should all be common sense, but we’re accepting contributions from people with a variety of backgrounds, so I’m proposing to add it to the Nix contributing guide for a start.
Here’s my PR; feel free to add suggestions.
I’d like to keep it somewhat concise. Perhaps we could have a community wide page to elaborate on the details and provide more context?

Atemu · August 11, 2024, 1:52pm

I do not believe that to be the case. They could be more explicit but they do in fact specify some rules here.

Excerpt from the LGPL 2.1 which is the license for Nix and Lix:

1. You may copy and distribute verbatim copies of the Library’s
complete source code as you receive it, in any medium, provided that
you conspicuously and appropriately publish on each copy an
appropriate copyright notice and disclaimer of warranty; keep intact
all the notices that refer to this License and to the absence of any
warranty; and distribute a copy of this License along with the
Library.

…

2. You may modify your copy or copies of the Library or any portion
of it, thus forming a work based on the Library, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:

…

b) You must cause the files modified to carry prominent notices
stating that you changed the files and the date of any change.

(Emphasis mine.)

IANAL but a git commit does, in effect, cause modified files to “carry a prominent notice” that states precisely who changed it at what time in the form of the git commit metadata. Unless there is some other copyright notice (i.e. a manually written one in the file), one could interpret this to mean that the metadata contained in the git commits must be preserved. (Though not necessarily as git commit metadata, you could also put the metadata into the file.)

An interpretation of the LGPL could therefore be that integrating licensed code without also “conspicuously and appropriately publish[ing]” the “copyright notices” (git commit author + date) would constitute a violation of the license and, in the case of the LGPL, would therefore cause the violator to lose all rights that the license would otherwise grant.

Including LGPL’d code someone else wrote without proper attribution would therefore already be a copyright violation IMV; no other social contract is necessary.

emily · August 11, 2024, 2:13pm

Usually they do explicitly say that every derivative work must retain the copyright notice and licence text in the derivative work. Of course, everyone flagrantly violates this all the time.

Linking REUSE here as it’s relevant and has some level of adoption in the Nix ecosystem, although it’s probably not feasible to make Nixpkgs comply with it at this point (and it’d be controversial).

waffle8946 · August 11, 2024, 2:24pm

That’s what RFCs are for; I don’t think potential controversy should dissuade improvement. It can be done incrementally like the other big changes (by-name, nixfmt). Whether to adopt REUSE specifically would be a good topic to settle in said RFC.

emily · August 11, 2024, 2:26pm

Over the past few years we’ve learned that the RFC process doesn’t work for sufficiently controversial changes. (In other words, the RFC process doesn’t work.)

But also I really don’t think people would go for headers listing dozens of contributors at the top of every highly‐trafficked source file.

waffle8946 · August 11, 2024, 2:31pm

We have nixfmt, why does it not work? And everybody has their own half-baked opinion on formatting. Not saying it’s an easy process, or that some of the responses were even appropriate, but it got done.

That’s reasonable, but maybe that’s a problem for automation, or maybe we find a different approach. I’m not tied to any one solution to the problem, but it’s good to know that the approach exists and why we may not use it.

piegames · August 11, 2024, 2:40pm

Again, licenses are a legal solution to a social problem. The license forbidding something won’t help me much unless I’m ready to sponsor a lawyer to enforce it. (Also, the concept of copyright is basically dead at this point.)

And REUSE doesn’t really scale well; Nixfmt once used it, but it was removed because it was just impractical to wield.

pbsds · August 11, 2024, 3:58pm

Something very actionable we could add as a policy:

- If you copy a stale PR to do more work on it, please make sure the git history still credits the original author.
- If you see a forked PR with improper attribution, please help out by calling it out.

waffle8946 · August 11, 2024, 4:16pm

That’s reasonable, however:

How do we make this easier to identify?

pbsds · August 11, 2024, 5:32pm

Sometimes they link to the forked PR from the original, but this is not a given. It mostly comes down to luck whether you’re equipped to spot it.

zimward · August 11, 2024, 7:10pm

Setting up a GitHub Action that compares the diffs of unmerged PRs to look for verbatim copies could point out potential violations within the same repo.
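A minimal sketch of what such a comparison could look like, in Python, assuming the two PRs are available as unified diff text. The names `added_lines`, `overlap_ratio`, and `MIN_LENGTH` are hypothetical, and the exact-match heuristic is only one possible approach; a real check would need to handle renames, reformatting, and noise, and would still need a human to judge the result.

```python
# Lines shorter than this (closing braces, "else", ...) are ignored,
# since they match by coincidence all the time. Arbitrary threshold.
MIN_LENGTH = 8


def added_lines(diff_text):
    """Return the set of non-trivial lines added by a unified diff."""
    result = set()
    for line in diff_text.splitlines():
        # "+++ path" is the file header, not an added line.
        if line.startswith("+") and not line.startswith("+++ "):
            content = line[1:].strip()
            if len(content) >= MIN_LENGTH:
                result.add(content)
    return result


def overlap_ratio(candidate_diff, original_diff):
    """Fraction of the candidate's added lines that also appear,
    verbatim, among the lines added by the original diff."""
    candidate = added_lines(candidate_diff)
    original = added_lines(original_diff)
    if not candidate:
        return 0.0
    return len(candidate & original) / len(candidate)


if __name__ == "__main__":
    original = "--- a/f\n+++ b/f\n@@ -1 +1,2 @@\n+def frobnicate(x):\n+    return x + 1\n"
    candidate = "--- a/g\n+++ b/g\n@@ -1 +1,3 @@\n+def frobnicate(x):\n+    return x * 2 + 1\n+}\n"
    print(overlap_ratio(candidate, original))
```

A high ratio would only be a hint that the new PR deserves a second look for missing attribution, not proof of copying.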

SergeK · August 11, 2024, 10:25pm

Just bouncing ideas around: we could literally extend the PR template with a “Related work, references, acknowledgements” section…

EDIT(2024-08-12): P.S. I appreciate the emphasis that we’re discussing the social issue, not the legal one

Atemu · August 12, 2024, 8:45am

zimward · August 12, 2024, 3:01pm

A DCO makes sense to prevent a situation where nixpkgs (or the NixOS Foundation) would be sued due to a rogue PR (which is probably not a question of “if” but “when” given enough time). But it won’t solve the social issue of properly attributing contributions that can be legally distributed without stating the author (like forking a PR in nixpkgs? I may need to read the MIT license again).

Plagiarism is not only an issue in academia, and it might harm the morale of some contributors in the long term.