
Worker timeouts during full page archives don't get cleaned properly causing duplicates and large space usage #742

Closed
maya329 opened this issue Dec 19, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@maya329

maya329 commented Dec 19, 2024

Describe the Bug

I'm currently running Hoarder within Docker, and today I checked and found it's taking up 40GB while I only have 119 bookmarks; the disk usage seems very unusual. I ran ncdu and the results are attached below.

How do I find out which bookmarks are taking up 1GB of space?

Steps to Reproduce

  1. Run ncdu on the system and browse into the data folder
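For reference, a quick non-interactive alternative to ncdu for surfacing the largest directories (a minimal sketch; ./data is an assumed mount point, adjust the path and depth to your setup):

# list the 20 largest directories up to two levels deep under the data folder
du -h -d 2 ./data | sort -hr | head -20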

Expected Behaviour

Significantly less disk usage

Screenshots or Additional Context

[screenshot: ncdu output for the Hoarder data folder]

Device Details

No response

Exact Hoarder Version

v0.19.0

@ctschach

Have you enabled video download? This is what happened to me….

@MohamedBassem added the "question" label Dec 20, 2024
@MohamedBassem
Collaborator

As @ctschach mentioned, this does look like video downloads being enabled.

If you go inside the large folder and run cat metadata.json, it should tell you the type of the asset.

And if you want to know which asset this is, you can go to:

https://<addr>/api/assets/<UUID>

If you want to know which bookmark this asset belongs to, you can run the following query against the SQLite database:

select bookmarkId from assets where id = '<UUID>';
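Putting those steps together, a minimal shell sketch (the paths are assumptions: the data directory mounted at ./data and the SQLite file named db.db; adjust both to your deployment, and run sqlite3 wherever it is available, on the host or inside the container):

# 1. find the largest asset folders under the data directory
du -h -d 2 ./data | sort -hr | head -10
# 2. check what type of asset a large folder holds
cat ./data/<path-to-large-folder>/metadata.json
# 3. map that asset's UUID back to its bookmark
sqlite3 ./data/db.db "select bookmarkId from assets where id = '<UUID>';"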

@maya329
Author

maya329 commented Dec 20, 2024

Have you enabled video download? This is what happened to me….

No, this was my first thought too, but I checked all the YouTube bookmarks and the "video" option in the dropdown is greyed out for all of them.

@maya329
Author

maya329 commented Dec 20, 2024

Seems like I have a ton of full page archives... Is there a quick way to clear these and only keep 1?

[screenshot: the bookmark's asset list showing many full page archives]

@MohamedBassem
Collaborator

How did you end up with that many? 😅 If you don't care about the bookmark, the easiest fix is to remove it and re-add it.


@maya329
Author

maya329 commented Dec 20, 2024

I have no idea... Haha. I just exported the entire Hoarder data to JSON, nuked the container, created a new one, and imported everything back in. I'll keep a lookout and see what happens.

@maya329
Author

maya329 commented Dec 20, 2024

Alright, I've tested enough and come to the conclusion that it's that specific webpage causing the problems. It seems to be timing out during archiving, creating an endless loop as the worker tries again:

[screenshot: worker logs showing the archiving job timing out and retrying]

The webpage I have bookmarked is https://www.interaction-design.org/literature/topics/visual-hierarchy, if you want to test it.

@MohamedBassem added the "bug" label and removed the "question" label Dec 20, 2024
@kamtschatka
Contributor

OK, so I guess the reason is two-fold:

  • We are not properly handling worker timeouts, which causes the full page archive to be scheduled again and again, adding to this bookmark's list of assets.
  • You have not set CRAWLER_JOB_TIMEOUT_SEC high enough to give the crawler a chance to finish in time. Can you try increasing it (see the example below this list) and check whether the problem persists, so we can confirm?
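For example, with a docker-compose deployment that reads a .env file (an assumption about your setup; 300 is just a starting value, taken from the result reported below), the override would look like this, followed by recreating the containers so the new value is picked up:

# .env next to docker-compose.yml
CRAWLER_JOB_TIMEOUT_SEC=300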

@maya329
Author

maya329 commented Dec 20, 2024

I increased the timeout to 300 seconds and it's working well now. Thanks for the solution!

@maya329 closed this as completed Dec 21, 2024
@MohamedBassem reopened this Jan 2, 2025
@MohamedBassem changed the title "Hoarder taking up 40GB with just 119 bookmarks" → "Timeouts during full page archives don't get cleaned properly causing duplicates and large space usage" Jan 2, 2025
@MohamedBassem changed the title "Timeouts during full page archives don't get cleaned properly causing duplicates and large space usage" → "Worker timeouts during full page archives don't get cleaned properly causing duplicates and large space usage" Jan 2, 2025
@debackerl mentioned this issue Jan 13, 2025
@debackerl

This not only creates duplicate assets but also orphan tags: the inference task can run multiple times because of a timeout, and different tags could be generated each time. Sometimes the 1st inference task creates new tags, but the 2nd inference run does not reuse them, generating orphan tags.

I think that in case of a timeout, already-completed tasks should not be retried. Rerunning inference has a cost, and rerunning the full page archive takes time.

I believe a retry is worthwhile for HTTP errors, but not for a timeout. It might be worth making this a parameter in case someone wants to keep retrying slow websites.
