
Removal request & notice: permissive licensing might often still be unsuitable(!) for training set inclusion #160

@ell1e


I'd just like you to know that code under permissive licenses with attribution requirements is possibly unsuitable for training set inclusion. I'm bringing this to your attention not as a lawyer, but as a maintainer. Ask your own counsel. However, an attribution requirement usually means derivatives must retain attribution to the original author. LLMs are apparently well known to occasionally spit out verbatim or near-verbatim derivatives of their training data, but without satisfying attribution requirements, which suggests this practice could be illegal.

I therefore request that you at the very least process opt-out requests retroactively for pre-existing data sets to fix this. However, just to stress this again, I'm not a lawyer and this isn't legal advice. But at least from the outside, this looks troubling.

For example, it appears you included repositories of mine that have attribution requirements:

(Screenshot: Screenshot_20240404_155403)

I don't understand how StarCoder could possibly satisfy them.
