
[Feature Request] --abort ignoring type of extractor being used #1399

Closed
sourmilk01 opened this issue Mar 22, 2021 · 10 comments

sourmilk01 commented Mar 22, 2021

I've noticed that using --abort with a site that uses different image hosts (such as reddit, with reddit, imgur, gfycat, and redgifs content posted) causes the --abort feature to get interrupted before it hits n if it switches to a different extractor (e.g. --abort 5 and 4 repeated reddit posts are skipped, but then a repeated imgur post gets skipped and it resets).

I haven't tested it yet, but I suspect that even if 5 posts of the same type that isn't the parent extractor (like imgur posts on a subreddit URL) are skipped, --abort won't apply because they aren't reddit-extracted posts.

Is there a way to have --abort ignore which type of extractor is being used, and if not, could that feature be added?
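For concreteness, the kind of invocation being described (the subreddit URL is just a placeholder):

```
gallery-dl --abort 5 "https://www.reddit.com/r/EXAMPLE/"
```

With mixed reddit/imgur/gfycat posts, the 5-skip streak is apparently tracked per extractor rather than across the run as a whole.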

mikf (Owner) commented Mar 22, 2021

> --abort 5 and 4 repeated reddit posts are skipped, but then a repeated imgur post gets skipped and it resets

It doesn't completely reset, i.e. the number for skipped reddit posts is still at 4 and the next one will trigger --abort, but it uses a different/new "skipped" count for each URL.

> I haven't tested it yet, but I suspect that even if 5 posts of the same type that isn't the parent extractor (like imgur posts on a subreddit URL) are skipped, --abort won't apply because they aren't reddit-extracted posts.

Exactly. It will stop for the current site, e.g. imgur, but will continue with reddit regardless. So you'd need a "global" --abort that also counts any files from child-extractors?

sourmilk01 (Author)

> Exactly. It will stop for the current site, e.g. imgur, but will continue with reddit regardless. So you'd need a "global" --abort that also counts any files from child-extractors?

That was my thought. I turned off parent-metadata to test, and it appears that when queuing a reddit URL, --abort n only counts reddit-hosted images and ignores imgur images toward the n count. That said, I saw it skip several dozen already-downloaded imgur files even though n was set to 5.

I'm not sure how you would implement that; would you add a new "global" variant of --abort, or would you change the original --abort to count all child extractors as opposed to only the parent extractor?
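To make the design question concrete, here is a purely illustrative sketch of the two options; `SkipCounter` and `Job` are invented names and do not correspond to gallery-dl's actual internals:

```python
# Hypothetical sketch -- these classes are invented for illustration
# and do not reflect gallery-dl's real job/extractor code.

class SkipCounter:
    """Tracks consecutive skipped files against a limit."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0

    def on_skip(self):
        """Record a skip; return True once the limit is reached."""
        self.count += 1
        return self.count >= self.limit

    def on_download(self):
        """Any successful download resets the streak."""
        self.count = 0


class Job:
    def __init__(self, url, limit=5, counter=None):
        self.url = url
        # Per-extractor behavior: every job gets its own counter.
        # "Global" behavior: a child reuses the parent's counter,
        # so imgur/gfycat skips count toward the same limit.
        self.counter = counter if counter is not None else SkipCounter(limit)

    def spawn_child(self, url, share_counter=True):
        shared = self.counter if share_counter else None
        return Job(url, limit=self.counter.limit, counter=shared)
```

With a shared counter, 3 skips on reddit plus 2 on imgur would trip a limit of 5; with separate counters, each host starts its streak from zero.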

sourmilk01 (Author) commented Mar 26, 2021

I forgot to mention, my main reason for requesting this was related to the imgur rate-limit issues I previously asked about (#1386).

By far, imgur has the worst rate-limiting of any site I've seen (1,250 requests per hour; 12,500 per day; and if the daily limit is hit 5 times in a month, your IP gets blocked for the rest of the month).

I've found that when scraping a subreddit or reddit user page that is mostly imgur links, the cap is hit fairly quickly; even when files are already downloaded, --abort fails to stop the run, so it keeps skipping (and consuming requests) until it reaches the hourly cap.

sourmilk01 (Author)

@mikf I've managed to mitigate my imgur rate-limit issues with a shoddy workaround (manually identifying and setting aside subreddits and users that were imgur-heavy).

I still have scrape-speed issues with gfycat/redgifs; some subreddits almost exclusively use media from those sites, so they essentially never abort and have to parse the whole ~1,000 available posts before moving on to the next URL.

Any idea on when this type of --abort could be implemented? If it would take too much time to set up for every extractor, would it be easier to just set it for imgur/gfycat/redgifs (specifically for reddit)?

razielgn (Contributor)

This issue also comes up with behance, for example, when using a profile (which contains multiple projects) as input: the skip counter resets on every project, as they are handled as different jobs.
Is it reasonable to implement a global skip counter, or is there a different way to handle this?

mikf (Owner) commented Jun 5, 2021

@sourmilk01 I think 7ab8374 combined with c693db5 and dfe1e09 solves your problem.

  • parent-skip to share the skip counter between parent and child
    (e.g. skipping 3 on reddit and 2 on imgur would count as 5 skipped files)
  • skip: terminate (or -T/--terminate) to let the stop signal bubble up from child to parent
    (e.g. reaching 5 skipped files on imgur would also stop the parent reddit extractor)
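
Putting those together on the command line should look roughly like this (a sketch based on the option names above; the URL is a placeholder, and the same settings can also go in the config file):

```
gallery-dl -o parent-skip=true --terminate 5 "https://www.reddit.com/r/EXAMPLE/"
```

That is, skips on reddit and its imgur/gfycat children accumulate in one shared count, and hitting 5 stops both the child and its reddit parent.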

sourmilk01 (Author)

@mikf Wow! Thank you so much!

I just tested it myself, and parent-skip and --terminate are working on reddit as intended; I think my scrape time is down to a fifth, maybe even a tenth, of what it was before.

@razielgn You should try testing it on behance.

razielgn (Contributor) commented Jun 9, 2021

Works great, thank you @mikf!

Hrxn (Contributor) commented Jun 11, 2021

@sourmilk01 Is there any specific reason for not using the archive file option here?

mikf (Owner) commented Jun 12, 2021

@Hrxn the problem here isn't detecting an already downloaded file, but gallery-dl's action when finding one in combination with parent and child extractors, e.g. Reddit and Imgur. Any skipped download on one Imgur URL didn't propagate to its parent or other children and didn't count towards the overall "skip limit". Hitting said "skip limit" on an Imgur URL also wasn't able to halt the download for its Reddit parent, only itself.
