14% of SHA256 hashes not matching #67
The hashes were computed with img2dataset.
Did you also use img2dataset for verification?
On Wed, Oct 4, 2023 at 5:41 PM pfischer-nvidia wrote:
Introduction
We downloaded the Datacomp 1B set
<https://huggingface.co/datasets/mlfoundations/datacomp_1b>.
For verification, we only kept an image if the SHA256 checksum of its
bytes matched the corresponding entry in the metadata you provide.
Problem Statement
Hundreds of millions of images were discarded due to hash mismatches.
Here's one example. Let's look at entry 21:
https://huggingface.co/datasets/mlfoundations/datacomp_1b/viewer/default/train?row=21
- UID: 38f76e4b1b4a77ca66a62b453da17912
- text: Cable Manager, Horizontal, Recessed Flat ...
- url: https://images.eanixter.com/viewex/PR108844V6.JPG
- sha256: 0e77fada7a972fc2fa2ad8991701968ea0953d03f799787c4d27f072bdfa4164
[image: PR108844V6]
<https://user-images.githubusercontent.com/126014612/272631001-8dec2e12-ba27-463f-b9c0-ae995412702e.JPG>
If you download the image and compute the hash, it will be this:
$ curl -s https://images.eanixter.com/viewex/PR108844V6.JPG | sha256sum
6f392bcd683005e33356a2c003fb8378c7ff445d0b54b3429c84bd112299d273
One might think the image was modified slightly (e.g. its header or some
pixels). However, checking the web archive version from 2019 yields the
same hash:
$ curl -s https://web.archive.org/web/20191127193532if_/https://images.eanixter.com/viewex/PR108844V6.JPG | sha256sum
6f392bcd683005e33356a2c003fb8378c7ff445d0b54b3429c84bd112299d273
(You can view the web archive capture here:
<https://web.archive.org/web/20191127193532/https://images.eanixter.com/viewex/PR108844V6.JPG>.)
Mitigation
We would like to better understand how the hashes were computed. It seems
the code that was used for that is not published.
Potentially, we could build a workaround by computing the hashes in the
same way you did.
Ultimately, we think fixing the hashes in the metadata will be the best
solution.
|
We have our own downloading framework. But all we do really is |
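A minimal sketch of this kind of byte-level check, assuming the image has already been downloaded into memory. The function names are hypothetical and this is not the code of either download framework; it only illustrates hashing the raw, undecoded response bytes, which is what `curl ... | sha256sum` does in the original post.

```python
import hashlib


def sha256_hex(image_bytes: bytes) -> str:
    """Return the hex SHA256 digest of the raw, undecoded bytes."""
    return hashlib.sha256(image_bytes).hexdigest()


def matches_metadata(image_bytes: bytes, expected_sha256: str) -> bool:
    """Compare the digest of the downloaded bytes with the metadata entry."""
    # Normalise the metadata value so stray whitespace or case does not matter.
    return sha256_hex(image_bytes) == expected_sha256.strip().lower()
```

Because the digest covers every byte of the file, any change to headers, EXIF data, or compression settings yields a different hash even when the picture looks identical.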
Just checked. img2dataset does the same thing. I downloaded some images with img2dataset, and I'm getting the same hashes as with our code and they don't match the given metadata. I ran it like The result for the image above is:
Can you provide the version (commit ID) and commandline of img2dataset that you used? |
@gabrielilharco should be able to say who ran this hash computation and what commit was used |
We are blocked by this issue, @gabrielilharco and @rom1504. Please help soon by checking how this was done. |
Is the problem only for DC-1B or do you have the problem downloading CommonPool as well? |
We haven't downloaded CommonPool with hash checking. Can you please explain the thought behind your question? What is the conclusion if it's different / the same? Given the above example, you can just compute the hash yourself (independent of our download) and you will see it doesn't match. |
Yeah the original hash was computed during commonpool download and then passed around to the dc1b parquet, maybe there was a bug in that data processing pipeline...
|
Also tagging @GeorgiosSmyrnis |
Has anybody been able to look into this? |
hello we are looking into this right now! So the download code that computes the hash is right here: https://github.com/rom1504/img2dataset/blob/main/img2dataset/downloader.py#L203-L318C6 We are checking our internal pool to see what the computed hash was. We have 3-4 working hypotheses that we are working on resolving:
We are exploring all of these and hopefully in the next day or so we can come to a conclusion. What happens if you just skip the 14% for now? would that unblock you? |
Yes, we're able to move forward with initial experimentation with the 14% missing. We are eagerly awaiting your findings, though. Your hypotheses 2-4 are interesting; would that suggest that we'd also run into hash mismatches coming from CommonPool-12.8B? |
yeah I believe so, my money is on 2. What we can do is just look at our internal copy of one of the hash mismatched images and see if either the exif data or the actual data is truncated. |
Here is a preliminary investigation on incomplete streams. It seems that this does not explain the hash mismatch, at least not at clean byte boundaries. I will now try to track down a mismatched image in our tar files; hopefully a visual inspection will be useful. Sorry about this and thanks for raising the issue!
import urllib.request
import io
import hashlib
import os
from tqdm import tqdm

# Hash obtained when downloading the image today.
current_hash = "6f392bcd683005e33356a2c003fb8378c7ff445d0b54b3429c84bd112299d273"
# Hash stored in the DataComp metadata.
datacomp_hash = "0e77fada7a972fc2fa2ad8991701968ea0953d03f799787c4d27f072bdfa4164"
url = "https://images.eanixter.com/viewex/PR108844V6.JPG"
img_stream = None
user_agent_string = (
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0"
)
request = urllib.request.Request(
    url, data=None, headers={"User-Agent": user_agent_string}
)
# Download the full image into an in-memory stream.
with urllib.request.urlopen(request, timeout=20) as r:
    img_stream = io.BytesIO(r.read())
# Determine the total size in bytes, then rewind.
img_stream.seek(0, os.SEEK_END)
max_bytes = img_stream.tell()
img_stream.seek(0)
# Hash every contiguous byte range (length i, starting at offset j) and
# check whether any of them reproduces the metadata hash.
for i in tqdm(range(max_bytes + 1)):
    for j in range(max_bytes):
        img_stream.seek(j)
        computed_hash = hashlib.sha256(img_stream.read(i)).hexdigest()
        if computed_hash == datacomp_hash:
            print("hit the datacomp hash")  # does not happen
        if computed_hash == current_hash:
            print("hit the current hash!")  # happens
|
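The scan above hashes every contiguous byte range, which is quadratic in the number of ranges. If only truncated downloads are of interest (the stream was cut off after some prefix), a cheaper check is possible with incremental hashing. The sketch below is an assumption-laden alternative, not the code that was run in this thread; `find_truncated_prefix` is a hypothetical name. It relies on the fact that a `hashlib` object can report a digest and then keep accepting updates.

```python
import hashlib
from typing import Optional


def find_truncated_prefix(data: bytes, target_hex: str) -> Optional[int]:
    """Return n such that sha256(data[:n]) == target_hex, or None if no prefix matches."""
    running = hashlib.sha256()
    # The empty prefix is a valid (if unlikely) truncation point.
    if running.hexdigest() == target_hex:
        return 0
    for n in range(1, len(data) + 1):
        # Feed one more byte; hexdigest() does not reset the running state.
        running.update(data[n - 1:n])
        if running.hexdigest() == target_hex:
            return n
    return None
```

This covers only the "download stopped early" hypothesis; it says nothing about bytes dropped from the middle or the start of the file, which the brute-force scan does cover.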
So you may already have done this, but I scanned the xlarge pool metadata, looking for that URL, and this is what I found:
+----------------------------------+---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+-----------------+---------------------------+---------------------------+-------------+------------------------------------------------------------------+
| uid | url | text | original_width | original_height | clip_b32_similarity_score | clip_l14_similarity_score | face_bboxes | sha256 |
+----------------------------------+---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+-----------------+---------------------------+---------------------------+-------------+------------------------------------------------------------------+
| 38f76e4b1b4a77ca66a62b453da17912 | https://images.eanixter.com/viewex/PR108844V6.JPG | "Cable Manager Horizontal Recessed Flat 3-Ring Rack Mount 1RU 19"" Width x 4.8"" Depth x 1.72"" Height 16 Gauge Steel Powder Coated Black With 3"" Metal D-Ring" | 250 | 250 | 0.33520508 | 0.28076172 | [] | 0e77fada7a972fc2fa2ad8991701968ea0953d03f799787c4d27f072bdfa4164 |
+----------------------------------+---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+-----------------+---------------------------+---------------------------+-------------+------------------------------------------------------------------+
So this URL only occurs once in the data, and it's the same as in the metadata for DataComp-1B.
I then searched the metadata for the hash 6f392bcd683005e33356a2c003fb8378c7ff445d0b54b3429c84bd112299d273, which is what we get when we download the image, and it never occurs in the metadata.
|
To understand what is going on, I think it would be interesting to try to
redownload the samples with unexpected hashes a few times (maybe redownload
1M of them 3 different times, waiting a few hours/days in between) and then
compare the hashes.
It is possible these websites do not have stable hosting and provide
different content over time.
|
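The repeated-download experiment suggested above could be sketched roughly as follows. This is only an illustration: `hash_stability` and the caller-supplied `fetch` function are hypothetical, and the waiting period between attempts (hours or days in the suggestion) is left to the caller.

```python
import hashlib
from collections import Counter
from typing import Callable


def hash_stability(fetch: Callable[[str], bytes], url: str, attempts: int = 3) -> Counter:
    """Fetch `url` several times and count how often each SHA256 digest appears."""
    digests: Counter = Counter()
    for _ in range(attempts):
        # `fetch` is assumed to return the raw response body as bytes.
        digests[hashlib.sha256(fetch(url)).hexdigest()] += 1
    return digests
```

A URL whose counter holds a single digest served stable bytes across the attempts; more than one digest means the host returned different content for the same URL.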
Overall I believe (as I mentioned when we introduced this hash verification
feature) that the correct way to do this would instead be some form of
approximate hashing. A poor way to do this, for example, would be to compare
CLIP embeddings of images and check the similarity. (A better way may be to
use a dedicated model for image deduplication.)
I think it would make more sense, as what we are trying to collect here is
a dataset providing the same semantic information as the one initially
collected, rather than the exact same bytes.
|
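To make the idea of approximate hashing concrete, here is a small sketch of an "average hash", a simple perceptual hash. It is a swapped-in illustration, not the CLIP-embedding comparison or the dedicated deduplication model mentioned above. It assumes the image has already been decoded and downscaled to a small grayscale grid (for example 8x8) by an image library that is not shown; both function names are hypothetical.

```python
from typing import List, Sequence


def average_hash(gray: Sequence[Sequence[float]]) -> List[int]:
    """One bit per pixel: 1 if the pixel is brighter than the image mean."""
    pixels = [value for row in gray for value in row]
    mean = sum(pixels) / len(pixels)
    return [1 if value > mean else 0 for value in pixels]


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length bit lists differ."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return sum(x != y for x, y in zip(a, b))
```

Two downloads of the same picture with slightly different compression should give a small Hamming distance, whereas a byte-level SHA256 comparison would simply report a mismatch.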
Do you still have a local copy of this image? Can you verify that you get the hash in the dataset metadata? Or is that impossible due to preprocessing changing the image bytes in your stored copy? |
We do, but we don't have an efficient way of finding it since we don't have a mapping from uid to which shard the datapoint is in. Do you happen to have any example like this in CommonPool small? |
We were able to find some images where the sha256 we have on file doesn't match the one if we re-download the image, we're currently investigating further |
Here's an example of an image where there is a mismatch. They are very similar visually, but there's a small difference. They are both jpgs, and with the same size and properties (
The original image url is http://www.dhresource.com/260x260s/f2-albu-g5-M01-35-B5-rBVaJFip2JWAWPjrAAG7KGesIUY669.jpg/wholesale-new-casual-leather-men-bag-small.jpg |
This could be because of CDNs and how aggressive we were when downloading. I can't think of a way out of it that doesn't involve re-downloading the entire pool, which is prohibitively expensive for us at the moment. The safest thing to do if there is concern about poison attacks is to only trust the hashes that match and ignore all else. Unfortunately that will mean throwing away some data that is probably good. Alternately, if you're less worried about attacks but still want to guarantee the integrity of the image-text pair, you could look at our metadata features (e.g. CLIP features) to check if they are roughly the same |
Thanks for investigating this. I think what we're going to do is download the mismatched 14%, place it in "quarantine", and then, as you said, compare CLIP scores, bringing in those within some threshold. If we go down this path, would you be interested in us sharing the results? For example, we could create new parquets for those images that failed the hash check but passed the CLIP check, along with the new hashes. |
Yes, absolutely, that would be amazing! |
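The quarantine plan discussed above could look roughly like the sketch below. Everything here is an assumption for illustration: the `triage` function, its labels, and the `tolerance` default are hypothetical, and no threshold value was agreed on in this thread. The stored score is assumed to come from a metadata column such as `clip_l14_similarity_score`, and the recomputed score from running the same CLIP model on the redownloaded image and its caption (not shown).

```python
from typing import Optional


def triage(
    sha_matches: bool,
    stored_clip_score: float,
    recomputed_clip_score: Optional[float],
    tolerance: float = 0.02,  # placeholder value, not taken from the thread
) -> str:
    """Decide what to do with one downloaded sample."""
    if sha_matches:
        return "accept"
    if recomputed_clip_score is None:
        # Hash mismatch and no CLIP score yet: hold the sample back.
        return "quarantine"
    if abs(recomputed_clip_score - stored_clip_score) <= tolerance:
        # Bytes differ, but the image-text similarity is roughly unchanged.
        return "accept_with_new_hash"
    return "reject"
```

As noted in the thread, this only guards the integrity of the image-text pair; if poisoning attacks are a concern, the safer choice is to keep only samples whose hashes match.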
@gabrielilharco: I think your finding (small pixel differences) does not fully explain the issue. Please see my original post above, where the image was the exact same image already back in 2019. How can we explain that? |
I think it would be more useful to do an analysis over a few thousand (or
maybe millions of) samples rather than on a single image. It's likely the
causes of a changed hash are varied and some of them will be expected (i.e.
no bug).
|
Sure, I can create a list. But for each sample there should be some explanation why the hash is different and I think we haven't found an explanation for the initial example. |
@pfischer-nvidia my best guess is because of CDNs, in a way I don't fully understand yet. But in general, my understanding is that there is no guarantee that you'll always get the same image when querying an url. The fact that we were doing so many queries in parallel when downloading might have played a role too. Visually inspecting some images where there's a mismatch, the images we have locally look very similar to the ones we download again. A few other data points:
|
Ok I understand that images on the internet change over time (even only slightly). |
That's not the kind of change I'm talking about. CDNs can dynamically optimize images in an attempt to deliver faster loading times. Even minor dynamic changes such as compression can lead to different hashes. Another example:
This image also hasn't changed since 2019 according to the web archive https://web.archive.org/web/20230000000000*/https://thumbnailer.mixcloud.com/unsafe/60x60/tmp/7/9/0/f/c08c-b512-40fd-812c-539f2d6c7c00 The image we have on file differs slightly from the one we redownload, but they look visually very similar. Here's a comparison (our image first). |
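The general point that re-encoding changes a byte-level hash can be illustrated with the standard library alone. In the sketch below, `zlib` is only a stand-in for an image codec, and `payload` is made-up data standing in for decoded pixels. Real CDN optimisation of JPEGs is typically lossy, so this does not reproduce what happened to any image in this thread; it only shows that identical content encoded with different settings produces different bytes and therefore different SHA256 digests.

```python
import hashlib
import zlib

payload = bytes(range(256)) * 64  # made-up stand-in for decoded pixel data

fast = zlib.compress(payload, 1)   # quick, lighter compression
small = zlib.compress(payload, 9)  # slower, stronger compression

# Both encodings decode to exactly the same content...
same_content = zlib.decompress(fast) == zlib.decompress(small)
# ...but the encoded bytes, and hence their hashes, differ.
same_hash = hashlib.sha256(fast).digest() == hashlib.sha256(small).digest()

print(same_content, same_hash)
```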
The three Aceito Murda images (the first attached, the second attached, the Internet Archive one) produce three different hashes for me ( Are there other examples like the image in row 21, where the current download and the Internet Archive match, but the |
@dpaleka I manually looked at 100 images where there was a hash mismatch between our metadata and the redownloaded image and couldn't find anything like that. I put a copy of a few of our shards here if you want to investigate further: https://drive.google.com/file/d/1898MDL_fXOYPjIzNYTt_B6nH0nZqRiTh/view?usp=sharing |