
Worker timeouts during full page archives don't get cleaned properly causing duplicates and large space usage #742

Closed
maya329 opened this issue Dec 19, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@maya329

maya329 commented Dec 19, 2024

Describe the Bug

I'm currently running Hoarder within Docker, and today I checked and found it's taking up 40GB while I only have 119 bookmarks; the disk usage seems very unusual. I ran ncdu and the results are attached below.

How do I find out which bookmarks are taking up 1GB of space?

Steps to Reproduce

  1. Run ncdu on the system and browse into the data folder
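For reference, a quick non-interactive alternative to ncdu for surfacing the largest directories (a minimal sketch; ./data is an assumed mount point, adjust the path and depth to your setup):

# list the 20 largest directories up to two levels deep under the data folder
du -h -d 2 ./data | sort -hr | head -20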

Expected Behaviour

Significantly less disk usage

Screenshots or Additional Context

[screenshot: ncdu output for the Hoarder data folder]

Device Details

No response

Exact Hoarder Version

v0.19.0

@ctschach

Have you enabled video download? This is what happened to me….

@MohamedBassem added the "question" label Dec 20, 2024
@MohamedBassem
Collaborator

As @ctschach mentioned, this does look like video downloads being enabled.

If you go inside the large folder and run cat metadata.json, it should tell you the type of the asset.

And if you want to know which asset this is, you can go to:

https://<addr>/api/assets/<UUID>

If you want to know which bookmark this asset belongs to, you can run the following query against the SQLite database:

select bookmarkId from assets where id = '<UUID>';
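Putting those steps together, a minimal shell sketch (the paths are assumptions: the data directory mounted at ./data and the SQLite file named db.db; adjust both to your deployment, and run sqlite3 wherever it is available, on the host or inside the container):

# 1. find the largest asset folders under the data directory
du -h -d 2 ./data | sort -hr | head -10
# 2. check what type of asset a large folder holds
cat ./data/<path-to-large-folder>/metadata.json
# 3. map that asset's UUID back to its bookmark
sqlite3 ./data/db.db "select bookmarkId from assets where id = '<UUID>';"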

@maya329
Author

maya329 commented Dec 20, 2024

Have you enabled video download? This is what happened to me….

No, this was my first thought too, but I checked all the YouTube bookmarks and the "video" option in the dropdown is greyed out for all of them.

@maya329
Author

maya329 commented Dec 20, 2024

Seems like I have a ton of full page archives... Is there a quick way to clear these and only keep 1?

[screenshot: the bookmark's asset list showing many full page archives]

@MohamedBassem
Collaborator

How did you end up with that many? 😅 If you don't care about the bookmark, the easiest fix is to remove it and re-add it.


@maya329
Author

maya329 commented Dec 20, 2024

I have no idea... Haha. I just exported the entire Hoarder data to JSON, nuked the container, created a new one, and imported everything back in. I'll keep a lookout and see what happens.

@maya329
Author

maya329 commented Dec 20, 2024

Alright, I've tested enough and come to the conclusion that it's that specific webpage causing the problems. It seems to be timing out during archiving, creating an endless loop as the worker tries again:

[screenshot: worker logs showing the archiving job timing out and retrying]

The webpage I have bookmarked is https://www.interaction-design.org/literature/topics/visual-hierarchy, if you want to test it.

@MohamedBassem added the "bug" label and removed the "question" label Dec 20, 2024
@kamtschatka
Contributor

OK, so I guess the reason is two-fold:

  • We are not properly handling worker timeouts, which causes the full page archive to be scheduled again and again, adding to this bookmark's list of assets.
  • You have not set CRAWLER_JOB_TIMEOUT_SEC high enough to give the crawler a chance to finish in time. Can you try increasing it (see the example below this list) and check whether the problem persists, so we can confirm?
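For example, with a docker-compose deployment that reads a .env file (an assumption about your setup; 300 is just a starting value, taken from the result reported below), the override would look like this, followed by recreating the containers so the new value is picked up:

# .env next to docker-compose.yml
CRAWLER_JOB_TIMEOUT_SEC=300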

@maya329
Author

maya329 commented Dec 20, 2024

I increased the timeout to 300 seconds and it's working well now. Thanks for the solution!

@maya329 closed this as completed Dec 21, 2024
@MohamedBassem reopened this Jan 2, 2025
@MohamedBassem changed the title "Hoarder taking up 40GB with just 119 bookmarks" → "Timeouts during full page archives don't get cleaned properly causing duplicates and large space usage" Jan 2, 2025
@MohamedBassem changed the title "Timeouts during full page archives don't get cleaned properly causing duplicates and large space usage" → "Worker timeouts during full page archives don't get cleaned properly causing duplicates and large space usage" Jan 2, 2025
@debackerl mentioned this issue Jan 13, 2025
@debackerl

This not only creates duplicate assets but also orphan tags: the inference task can run multiple times because of a timeout, and different tags could be generated each time. Sometimes the 1st inference task creates new tags, but the 2nd inference run does not reuse them, generating orphan tags.

I think that in case of a timeout, already-completed tasks should not be retried. Rerunning inference has a cost, and rerunning the full page archive takes time.

I believe a retry is worthwhile for HTTP errors, but not for a timeout. It might be worth making this a parameter in case someone wants to keep retrying slow websites.
