
Remove CSAM, if present #71

Open
ahundt opened this issue Dec 20, 2023 · 5 comments

Comments


ahundt commented Dec 20, 2023

A recent report definitively found CSAM in LAION-5B, and that dataset has been taken down until the problem can be solved. The DataComp dataset is much larger. Please let us know what steps you have taken and/or plan to take to address this issue responsibly. Thanks!

https://www.404media.co/laion-datasets-removed-stanford-csam-child-abuse/

Edit: Ali Alkhatib also makes a good point that, should dataset changes be needed, they might need to be mixed in with other simultaneous data changes so that an old version cannot simply be diffed against a new version to locate the harmful material, among other best practices.

https://x.com/_alialkhatib/status/1737484384914092156?s=46

@ludwigschmidt

Thank you for the suggestion for improving DataComp. The cited study uses one of LAION’s NSFW classifiers to find CSAM content in LAION-5B. Unlike LAION-5B, we removed NSFW content when assembling DataComp, so to the best of our knowledge, the CSAM images in question are not in DataComp. We will review this issue in more depth and welcome specific suggestions for removing content from DataComp. For additional information, please see Section 3.2, Appendix E, and Appendix G of the DataComp paper, which describe our safety measures in more detail.


ahundt commented Jan 20, 2024

Thank you for your reply. I appreciate your attention to my concerns. However, I would note that my name already appears in the acknowledgements on page 10 of your paper, because I previously read and shared several concerns about the design, construction, collection, and publication approach for this dataset with another member of your team. While those concerns were noted, to the best of my knowledge they have not been addressed in practice, which would require actions like those described in the papers I reference below.

Regarding CSAM, the 404 Media article makes the very high risk explicit. I would appreciate a substantive response to the items in this issue, since I was asking what you have done now beyond what is outlined in the paper.

Simply multiplying your own reported error rates by the scale of your dataset yields very large numbers of potentially problematic images. Multiple papers by Birhane et al., as well as the work of the Stanford group that verified the CSAM in LAION, include substantially more comprehensive evaluation steps, which, according to your paper, have not been completed for DataComp.

Here is Dr. Birhane’s Google Scholar page with the relevant papers and methods:

  1. Multimodal Datasets
  2. Data-swamps
  3. LAION’s den
  4. Large image datasets

Here is the page with the Stanford group’s work detecting CSAM.

The Stable Bias paper is also likely to be relevant:
https://arxiv.org/abs/2303.11408

I would appreciate it if this matter were taken seriously and acted upon with at least as much care and attention as the authors of the papers above have shown. The reasons detailed in the 404 Media article make the risks, the motivation for addressing them, and the impacts crystal clear.

Thank you for your time and consideration.


ahundt commented Aug 27, 2024

@Lwantstostophim If you're in the United States, you need to contact the FBI: https://www.fbi.gov/contact-us

If you're in another country where it is safe to do so, report to the equivalent authorities there.
