[Feature] a transform to perform file level de-dupe (exact) #870
Comments
@sujee What is the file hash, and how is it calculated? Is it based on the file name, size, and other metadata, without considering the content? Or does it treat the file content as a stream of bytes and hash that? It is still not clear to me how you get around reading the file to calculate the hash. Can you please elaborate on what the file hash formula is?
The hash is based on the file content, treated as a stream of bytes. Something like this:

    import hashlib

    def file_hash(file_path):
        # Read the file in binary mode and hash it chunk by chunk,
        # so large files never have to fit in memory at once.
        with open(file_path, 'rb') as file:
            sha1_hash = hashlib.sha1()
            chunk_size = 4096  # Adjust this value based on your needs
            while True:
                chunk = file.read(chunk_size)
                if not chunk:
                    break
                sha1_hash.update(chunk)
        return sha1_hash.hexdigest()
@sujee How would you treat zip and tar files?
For the first version, I plan to treat zip/tar files as ONE file. So if there are duplicate zip/tar files, the dupes will be eliminated. I won't 'look inside' the archives. In the next version, I can add functionality to extract the archive content and perform de-dupe on all of it.
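To make the first-version behaviour concrete, here is a minimal sketch of exact file-level de-dupe: every file under a folder is hashed by its raw bytes, and all but one file per hash is reported as a duplicate. Archives are hashed like any other file and are not unpacked. The names content_hash and find_exact_duplicates are hypothetical and used only for illustration; this is not the transform's actual API, and keeping the alphabetically first path per hash is an arbitrary choice.

```python
import hashlib
import os
from collections import defaultdict


def content_hash(path, chunk_size=4096):
    """SHA-1 of the file's raw bytes; archives are hashed as-is, not unpacked."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns empty bytes (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root_dir):
    """Return (unique_files, duplicate_files) for every file under root_dir."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_hash[content_hash(path)].append(path)

    unique, duplicates = [], []
    for paths in by_hash.values():
        paths.sort()
        unique.append(paths[0])       # keep one representative per hash
        duplicates.extend(paths[1:])  # the rest are exact byte-for-byte copies
    return sorted(unique), sorted(duplicates)
```

Calling find_exact_duplicates("some_folder") returns two sorted lists of paths. Two zip files with identical bytes land in the same group, while two archives that contain the same files but were packed differently do not.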
Search before asking
Component
Transforms/universal/ededup
Feature
Current process for dedupe is
This has some drawbacks
What I propose is
Functionality
Are you willing to submit a PR?