Cache hashes on filesystem #87
Thanks for the bug report. I'm not surprised the CLI has limited performance. The goal is to make it work before making it fast. Once the project gets more stable and better tested, I guess we can invest some time making it fast.
I just tried to deduplicate a mailbox with about 300,000 e-mails, so the question is not whether to make it work or to make it fast: currently it is neither fast, nor does it work for really large mailboxes. Is at least 3 GB of RAM for 300,000 e-mails, i.e. more than 10 KB per e-mail, a reasonable size? The numbers above even indicate 100 KB per e-mail.
@stweil To be fair, swap exists, so this does indeed become a question of optimization, as @kdeldycke correctly pointed out.
FWIW, I was able to run this on a 10-year-old thin client (!) with 2 GB of RAM by adding 5 to 10 GB of swap on an external USB HDD to analyze a maildir account with about 150,000 mails. This really is only a question of convenience: just add temporary swap space and drink copious amounts of coffee while waiting, for now ;-)
The thing is, this project was always a hack for single users with small data. The fact that people are now using it across several mail sources and much bigger boxes is evidence it fills real user needs, but a wider audience playing with it also exposes its weaknesses. We need some more developers to tackle that issue, and a file-system cache is indeed a good feature to have. Let's bootstrap that by discussing the implementation. I propose to simply persist the hash <=> email UID map into a local SQLite database. That way we don't need to re-invent yet another file format, and it's easy to do thanks to the `sqlite3` module in Python's standard library.
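A minimal sketch of what that cache could look like, using only the standard-library `sqlite3` module. The table layout and function names here are illustrative assumptions, not mdedup's actual API:

```python
import os
import sqlite3


def open_cache(path: str) -> sqlite3.Connection:
    """Open (or create) the on-disk hash cache."""
    os.makedirs(os.path.dirname(os.path.abspath(path)), exist_ok=True)
    conn = sqlite3.connect(path)
    # One row per message: the mail's UID and its pre-computed hash.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hashes (uid TEXT PRIMARY KEY, hash TEXT NOT NULL)"
    )
    return conn


def get_hash(conn: sqlite3.Connection, uid: str):
    """Return the cached hash for a UID, or None on a cache miss."""
    row = conn.execute("SELECT hash FROM hashes WHERE uid = ?", (uid,)).fetchone()
    return row[0] if row else None


def store_hash(conn: sqlite3.Connection, uid: str, digest: str) -> None:
    """Record a freshly computed hash so later runs can skip re-hashing."""
    conn.execute(
        "INSERT OR REPLACE INTO hashes (uid, hash) VALUES (?, ?)", (uid, digest)
    )
    conn.commit()
```

The point of SQLite here is that the map lives on disk instead of in RAM, so memory usage stays roughly flat regardless of mailbox size, and the database doubles as a persistent cache across runs.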
When hashing about 20,000 to 30,000 mails, mdedup used up 2 GB of RAM. I think it would be a good idea to offload storing the hashes to a temporary file, one that could potentially be reused on a subsequent run.
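To illustrate the temporary-file idea, here is a rough sketch using the standard-library `shelve` module. The cache path, the key scheme, and the SHA-256 choice are assumptions for the example, not anything mdedup currently does:

```python
import hashlib
import shelve


def hash_message(raw_bytes: bytes) -> str:
    """Stable content hash for one message."""
    return hashlib.sha256(raw_bytes).hexdigest()


def cached_hash(cache: shelve.Shelf, key: str, raw_bytes: bytes) -> str:
    """Return the cached hash for `key`, computing and storing it on a miss."""
    if key not in cache:
        cache[key] = hash_message(raw_bytes)
    return cache[key]


# The shelve file persists between runs, so a second pass over the same
# mailbox skips re-hashing messages it has already seen.
with shelve.open("/tmp/mdedup-hash-cache") as cache:
    digest = cached_hash(cache, "maildir/cur/12345", b"raw message bytes")
```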