Label/Detect Identical Content

Definition: For some features, duplicate data suggests misuse.

Intervention Flavors:

Content Analysis History

Reversible:

Easily Tested + Abandoned

Suitability:

Contextual

Technical Difficulty:

Straightforward

Some platform features ("original features") are intended solely to facilitate collaboration, communication, or the sharing of original work. In such a space, the spontaneous, independent creation of identical files, messages, or images is extremely improbable once the media is large enough (>1KB). Duplicates can therefore be a leading signal of abuse, since they indicate that users are using the feature for something other than original creation.
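As a rough illustration of the idea above, the following minimal sketch flags exact duplicates by hashing any upload larger than 1KB and remembering the first upload seen for each digest. The class name `DuplicateIndex`, the upload ids, and the in-memory dictionary are assumptions made for illustration; a real platform would likely use a persistent store and its own identifiers.

```python
import hashlib

MIN_SIZE_BYTES = 1024  # the ">1KB" threshold mentioned in the text


class DuplicateIndex:
    """Hypothetical in-memory index of the first upload seen per content digest."""

    def __init__(self):
        self._first_seen = {}  # SHA-256 hex digest -> id of the first upload

    def check(self, upload_id, content):
        """Return the id of an earlier identical upload, or None if there is none."""
        if len(content) <= MIN_SIZE_BYTES:
            # Small payloads can plausibly collide by coincidence, so skip them.
            return None
        digest = hashlib.sha256(content).hexdigest()
        original = self._first_seen.get(digest)
        if original is None:
            self._first_seen[digest] = upload_id
            return None
        return original


if __name__ == "__main__":
    index = DuplicateIndex()
    payload = b"example media bytes " * 100  # 2000 bytes, above the threshold
    print(index.check("post-1", payload))  # first copy: None
    print(index.check("post-2", payload))  # duplicate of post-1
```

A match here is only a signal for further investigation, not a verdict; what to do with it is a policy choice.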

Some platforms have gone so far as to prevent the creation of duplicate images or videos altogether, to encourage engagement with the original work on the platform. Outright prevention is probably not suitable as a rule for most platforms, but duplicate detection is likely suitable for most as a leading signal of abuse and a jumping-off point for further investigation. Additionally, a determination that a piece of content violates policy can easily be applied to its duplicates, removing the need for constant reporting, rediscovery, or reclassification of harmful content.
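Reusing a policy determination for duplicates could look something like the sketch below, which keys each verdict by a content digest so that a later identical upload inherits it without a new report. The `VerdictStore` class and the verdict strings are hypothetical and not drawn from any particular platform.

```python
import hashlib


def content_digest(content):
    """SHA-256 hex digest used as the lookup key for identical content."""
    return hashlib.sha256(content).hexdigest()


class VerdictStore:
    """Hypothetical store mapping content digests to moderation verdicts."""

    def __init__(self):
        self._verdicts = {}  # digest -> verdict label

    def record(self, content, verdict):
        """Save the outcome of a review for this exact content."""
        self._verdicts[content_digest(content)] = verdict

    def lookup(self, content):
        """Return the stored verdict for identical content, or None if unreviewed."""
        return self._verdicts.get(content_digest(content))


if __name__ == "__main__":
    store = VerdictStore()
    reviewed = b"previously reviewed media bytes " * 50
    store.record(reviewed, "violates_policy")

    # A later byte-identical upload inherits the earlier determination.
    print(store.lookup(reviewed))
    # Content that has never been reviewed has no verdict yet.
    print(store.lookup(b"brand new content"))
```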

Duplicates can also be tied into strategies for provenance. In original features, it is typically useful for users to know that the content they are viewing is non-original. Adding a button that flags non-original content (and links to its source) would give users context about where the content came from.

While exact, byte-for-byte duplicate detection is trivially easy to circumvent, a wide array of literature and software exists for more robust duplicate detection, particularly in the domains of copyright enforcement and CSAM detection. Repurposing these techniques to deliver features for user insight is a straightforward extension.
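To show why byte-for-byte matching is easy to evade and what a more tolerant comparison might look like, here is a small sketch for text that uses character shingles and Jaccard similarity. This is a deliberately simple stand-in, not the perceptual hashing used in copyright or CSAM systems; the function names, the shingle size of 5, and the 0.8 threshold are all illustrative assumptions.

```python
def shingles(text, k=5):
    """Return the set of k-character substrings of the normalized text."""
    # Lowercase and collapse whitespace so trivial edits do not change the result.
    normalized = " ".join(text.lower().split())
    if len(normalized) < k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(x, y, threshold=0.8):
    """True when the two texts share at least `threshold` of their shingles."""
    return jaccard(shingles(x), shingles(y)) >= threshold


if __name__ == "__main__":
    original = "The quick brown fox jumps over the lazy dog near the river bank"
    evasive = original + "!"  # one added character defeats an exact hash match
    print(original == evasive)                   # False
    print(is_near_duplicate(original, evasive))  # True
```

Images and video need different features (for example, perceptual hashes), but the shape of the comparison, a similarity score checked against a tunable threshold, is the same.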
