Full page archives from timed-out worker jobs don't get cleaned up properly, causing duplicates and large space usage #742
Comments
Have you enabled video download? This is what happened to me…
As @ctschach mentioned, this does look like video downloads being enabled. If you go inside the large folder, you can list its contents by size to see which asset file is the culprit.
If you want to know which bookmark the asset belongs to, you can run a query against the SQLite database. A rough sketch of both steps is below.
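Since the exact command and query didn't survive above, here is a minimal sketch of that lookup. The paths, the `db.db` file name, and the `assets` table with its `id`/`bookmarkId` columns are assumptions about the Hoarder data layout, so adjust them to your install:

```sh
# Inside the Hoarder data directory: list asset files by size to find the culprit.
# (Assumes full page archives and videos live under DATA_DIR/assets/<asset-id>/.)
du -ah assets | sort -h | tail -n 20

# Map an asset id back to its bookmark via the SQLite database.
# (Assumes the database is db.db and the assets table has id/bookmarkId columns.)
sqlite3 db.db "SELECT bookmarkId FROM assets WHERE id = '<asset-id>';"
```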
No, that was my first thought too, but I checked all the YouTube bookmarks and the "video" tab in the dropdown is greyed out for every one of them.
How did you end up with that many? 😅 If you don't care about the bookmark, the easiest fix would be to remove it and re-add it.
I have no idea... Haha. I just exported the entire Hoarder data to JSON, nuked the container, made a new one, and then imported everything back in. I'll keep a lookout and see what happens.
Alright, I've tested enough and come to the conclusion that one specific webpage is causing the problems. It seems to time out during archiving, which causes an endless loop as the worker retries. The webpage I have bookmarked is https://www.interaction-design.org/literature/topics/visual-hierarchy if you want to test it.
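If you want to see the retry loop for yourself, watching the worker logs while the crawl runs is usually enough. A minimal sketch, assuming a docker compose setup; the service name (`workers` here) is an assumption and depends on your compose file:

```sh
# Follow the worker logs and watch the crawling job time out and get re-queued.
# Replace "workers" with the service name from your docker-compose.yml.
docker compose logs -f workers 2>&1 | grep -iE "timeout|timed out|crawl"
```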
OK so I guess the reason is two-fold:

- That page takes longer than the crawler job timeout, so the full page archive keeps timing out and the worker keeps retrying it.
- The archives produced by the timed-out attempts don't get cleaned up, so every retry leaves another copy behind, which is where the duplicates and the disk usage come from.

Increasing the crawler timeout should work around it in the meantime; a sketch of the setting is below.
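For reference, this is roughly what that workaround looks like in a docker compose setup. The `CRAWLER_JOB_TIMEOUT_SEC` variable name is my assumption based on the Hoarder configuration docs, so double-check it for your version:

```sh
# .env file used by the Hoarder docker compose stack.
# Assumed variable name; check the configuration docs for your version.
# Raises the crawler / full page archive job timeout to 300 seconds.
CRAWLER_JOB_TIMEOUT_SEC=300
```

After changing it, recreate the containers with `docker compose up -d` so the workers pick up the new value.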
I increased the timeout to 300 and it's working well now. Thanks for the solution!
It not only creates duplicate assets but also orphan tags: the inference task can run multiple times because of a timeout, and different tags may be generated each time. Sometimes the first inference run creates new tags, but the second run won't reuse them, leaving orphan tags behind. I think that in case of a timeout, completed tasks should not be retried: it is costly to rerun inference and takes time to rerun the full page archive. I believe a retry is worthwhile for HTTP errors, but not for a timeout. It might be worth making this a parameter for anyone who wants to keep retrying slow websites.
Describe the Bug
I'm currently running Hoarder in Docker, and today I found it's taking up 40 GB even though I only have 119 bookmarks, which seems very unusual. I ran ncdu and the results are attached below.
How do I find out which bookmarks are taking up 1GB of space?
Steps to Reproduce
Expected Behaviour
Lower disk usage.
Screenshots or Additional Context
Device Details
No response
Exact Hoarder Version
v0.19.0