
[Feature Request] --abort ignoring type of extractor being used #1399

Closed
sourmilk01 opened this issue Mar 22, 2021 · 10 comments

sourmilk01 commented Mar 22, 2021

I've noticed that using --abort with a site that uses different image hosts (such as reddit, with reddit, imgur, gfycat, and redgifs content posted) causes the --abort feature to get interrupted before it hits n if it switches to a different extractor (e.g. --abort 5 and 4 repeated reddit posts are skipped, but then a repeated imgur post gets skipped and it resets).

I haven't tested it yet, but I suspect that even if 5 posts of the same type that isn't the parent extractor (like imgur posts on a subreddit URL) are skipped, --abort won't apply because they aren't reddit-extracted posts.

Is there a way to have --abort ignore which type of extractor is being used, and if not, could that feature be added?
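For concreteness, the kind of invocation being described (the subreddit URL is just a placeholder):

```
gallery-dl --abort 5 "https://www.reddit.com/r/EXAMPLE/"
```

With mixed reddit/imgur/gfycat posts, the 5-skip streak is apparently tracked per extractor rather than across the run as a whole.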

mikf (Owner) commented Mar 22, 2021

> --abort 5 and 4 repeated reddit posts are skipped, but then a repeated imgur post gets skipped and it resets

It doesn't completely reset, i.e. the number for skipped reddit posts is still at 4 and the next one will trigger --abort, but it uses a different/new "skipped" count for each URL.

> I haven't tested it yet, but I suspect that even if 5 posts of the same type that isn't the parent extractor (like imgur posts on a subreddit URL) are skipped, --abort won't apply because they aren't reddit-extracted posts.

Exactly. It will stop for the current site, e.g. imgur, but will continue with reddit regardless. So you'd need a "global" --abort that also counts any files from child-extractors?

sourmilk01 (Author)

> Exactly. It will stop for the current site, e.g. imgur, but will continue with reddit regardless. So you'd need a "global" --abort that also counts any files from child-extractors?

That was my thought. I turned off parent-metadata to test, and it appears that when queuing a reddit URL, --abort n only counts reddit-hosted images and ignores imgur images toward the n count. That said, I saw it skip several dozen already-downloaded imgur files even though n was set to 5.

I'm not sure how you would implement that; would you add a new "global" variant of --abort, or would you change the original --abort to count all child extractors as opposed to only the parent extractor?
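To make the design question concrete, here is a purely illustrative sketch of the two options; `SkipCounter` and `Job` are invented names and do not correspond to gallery-dl's actual internals:

```python
# Hypothetical sketch -- these classes are invented for illustration
# and do not reflect gallery-dl's real job/extractor code.

class SkipCounter:
    """Tracks consecutive skipped files against a limit."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0

    def on_skip(self):
        """Record a skip; return True once the limit is reached."""
        self.count += 1
        return self.count >= self.limit

    def on_download(self):
        """Any successful download resets the streak."""
        self.count = 0


class Job:
    def __init__(self, url, limit=5, counter=None):
        self.url = url
        # Per-extractor behavior: every job gets its own counter.
        # "Global" behavior: a child reuses the parent's counter,
        # so imgur/gfycat skips count toward the same limit.
        self.counter = counter if counter is not None else SkipCounter(limit)

    def spawn_child(self, url, share_counter=True):
        shared = self.counter if share_counter else None
        return Job(url, limit=self.counter.limit, counter=shared)
```

With a shared counter, 3 skips on reddit plus 2 on imgur would trip a limit of 5; with separate counters, each host starts its streak from zero.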

sourmilk01 (Author) commented Mar 26, 2021

I forgot to mention, my main reason for requesting this was related to the imgur rate-limit issues I previously asked about (#1386).

By far, imgur has the worst rate-limiting of any site I've seen (1,250 requests per hour; 12,500 per day; and if the daily limit is hit 5 times in a month, your IP gets blocked for the rest of the month).

I've found that when scraping a subreddit or reddit user page that is mostly imgur links, the cap is hit fairly quickly; even when files are already downloaded, --abort fails to stop the run, so it keeps skipping (and consuming requests) until it reaches the hourly cap.

sourmilk01 (Author)

@mikf I've managed to mitigate my imgur rate-limit issues with a shoddy workaround (manually identifying and setting aside subreddits and users that were imgur-heavy).

I still have scrape-speed issues with gfycat/redgifs; some subreddits almost exclusively use media from those sites, so they essentially never abort and have to parse the whole ~1,000 available posts before moving on to the next URL.

Any idea on when this type of --abort could be implemented? If it would take too much time to set up for every extractor, would it be easier to just set it for imgur/gfycat/redgifs (specifically for reddit)?

razielgn (Contributor)

This issue also comes up with behance, for example, when using a profile (which contains multiple projects) as input: the skip counter resets on every project, as they are handled as different jobs.
Is it reasonable to implement a global skip counter, or is there a different way to handle this?

mikf (Owner) commented Jun 5, 2021

@sourmilk01 I think 7ab8374 combined with c693db5 and dfe1e09 solves your problem.

  • parent-skip to share the skip counter between parent and child
    (e.g. skipping 3 on reddit and 2 on imgur would count as 5 skipped files)
  • skip: terminate (or -T/--terminate) to let the stop signal bubble up from child to parent
    (e.g. reaching 5 skipped files on imgur would also stop the parent reddit extractor)
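
Putting those together on the command line should look roughly like this (a sketch based on the option names above; the URL is a placeholder, and the same settings can also go in the config file):

```
gallery-dl -o parent-skip=true --terminate 5 "https://www.reddit.com/r/EXAMPLE/"
```

That is, skips on reddit and its imgur/gfycat children accumulate in one shared count, and hitting 5 stops both the child and its reddit parent.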

sourmilk01 (Author)

@mikf Wow! Thank you so much!

I just tested it myself, and parent-skip and --terminate are working on reddit as intended; I think my scrape time is down to a fifth, maybe even a tenth, of what it was before.

@razielgn You should try testing it on behance.

razielgn (Contributor) commented Jun 9, 2021

Works great, thank you @mikf!

Hrxn (Contributor) commented Jun 11, 2021

@sourmilk01 Is there any specific reason for not using the archive file option here?

mikf (Owner) commented Jun 12, 2021

@Hrxn the problem here isn't detecting an already downloaded file, but gallery-dl's action when finding one in combination with parent and child extractors, e.g. Reddit and Imgur. Any skipped download on one Imgur URL didn't propagate to its parent or other children and didn't count towards the overall "skip limit". Hitting said "skip limit" on an Imgur URL also wasn't able to halt the download for its Reddit parent, only itself.
