What happens if I run with abort: 3 after a download for a page was interrupted, and more than 3 new uploads have been posted? #1283
it will abort after 3 skipped files, no matter what
Interesting. Can I make it so it doesn't? I have quite a few pages whose downloads were interrupted in the past due to various errors (both caused by me and out of my control), and I want those to be fully downloaded eventually. But I don't want to run with …
You can use … Depending on the site, it might also be possible to quickly skip over a large chunk of already downloaded files without hammering the site with (useless) HTTP requests by using …
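A minimal sketch of how the "abort after N skips" behaviour described above works conceptually; `fetch_pages`, `already_downloaded` and `download` are hypothetical placeholders, not gallery-dl's actual internals:

```python
# Conceptual sketch of "skip = abort:N" behaviour (not gallery-dl's actual code).
def download_with_abort(fetch_pages, already_downloaded, download, abort_after=3):
    """Walk the gallery newest-first, stop after `abort_after` consecutive skips."""
    consecutive_skips = 0
    for page in fetch_pages():              # pages are listed newest-first
        for url in page:
            if already_downloaded(url):
                consecutive_skips += 1
                if consecutive_skips >= abort_after:
                    return                  # abort: assume everything older is present
            else:
                consecutive_skips = 0       # skips appear to be counted consecutively
                download(url)
```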
Oh sorry, by …
The problem is I have multiple ones and don't know which ones they are. Is there a way to reliably find them? Also, something I keep wondering is what is the difference between …
No, not that I can think of. The only way is most likely going to be running gallery-dl with all URLs again.
@github-account1111 you could do as i do: do 2 downloads. Basically, select the single gallery where you had the error and run that with a different config file that has "skip": true. Aaand slowly go through the list, a few at a time (a rough sketch of this workflow is below).

The only other way to do this is to run a simulation that only downloads the page galleries and then check which ones have undownloaded stuff. But if you are worried about bans, that is also unadvisable. Also i don't remember if gallery-dl has a function for that.

In general though, it's better to not hoard a ton of galleries, because you then get 2 things: "fear of missing out" and also "sensory overstimulation". The first is the nagging feeling you're missing something. The second is the kind of stuff where art just does not work anymore because you've waded through a ton of it.
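A rough sketch of that "second config" workflow, assuming the extractor-level `skip` option and the `-c/--config` flag (names from memory, double-check against the gallery-dl docs; the URLs and paths below are made up):

```python
# Sketch: run gallery-dl on a hand-picked list of "broken" galleries with a
# one-off config that never aborts, so missing files get filled in.
import json
import subprocess
import tempfile

broken_galleries = [
    "https://www.artstation.com/some_artist",   # hypothetical examples
    "https://www.pixiv.net/en/users/000000",
]

# skip: true means: check every file, skip ones already on disk, never abort the run.
override = {"extractor": {"skip": True}}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(override, f)
    config_path = f.name

for url in broken_galleries:
    # -c / --config loads an additional configuration file
    subprocess.run(["gallery-dl", "-c", config_path, url], check=False)
```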
@mikf Just reran all the URLs with …
@Butterfly-Dragon We're talking hundreds of URLs here, so doing it that way might take a couple of months and a ton of manual labor. This is mostly for archival purposes. I might not ever see most of those downloads. But stuff gets deleted very frequently, and I used to get very upset in the past when, for instance, I visited one of my YouTube playlists only to discover a third of it was gone. If anything, this takes care of FOMO, because I now know that since the script runs periodically, I don't have to visit any of those pages anymore.
Wait... so you are going through hundreds of urls to download (at least) thousands of videos?!?! Uhm. Okay? I don't see the purpose of archiving stuff you will not have the physical time to appreciate. 😅 But sorry if i intruded.
I mean, I'm not. The script is. That's the whole point haha. If I just wanted to download a couple of pages, I'd have probably just done it manually instead of figuring out how a new CLI tool works.
Photos and videos. It's a mix of websites using different formats. E.g. artstation only has pics, youtube only has vids, and instagram has a mix of both.
There is a chance I will. Just like in the example with youtube playlists, if something is deleted from the Internet, I will still have access to it. That's the whole point of archival (check out the Wayback Machine, for instance). Storage is cheap nowadays, so I don't see why not.
Your amusement is justified though. It can sound pretty weird. There's not necessarily a rationale behind every single part of this. Like I said, this is to an extent psychological in that it's my way of coping with FOMO. If it weren't for this, I imagine I'd spend a lot more time browsing those websites than I currently do (which is fairly infrequently).
i ... honestly use gallery-dl because my connection sucks and this way i have to download far less stuff and it gets done quicker 😅
@mikf, can we download posts in reverse order? First get all the links (memory-costly without …), then download them starting from the oldest. Is this feature implemented or planned?
how is that any different than running without abort or terminate? |
Downloads are happening from recent to older. Am I right?
I mean, not LISTING from older to newer; keep listing from newer to older, but cache the list until we hit the abort/terminate condition, and THEN reverse it and download. I should have been clearer about this, thank you.
If you do not abort nor terminate, you keep downloading, and since you need to check the URL anyway, it changes nothing; it is just slower.

The only solution i see to the problem above is that the SQL file lists the stored URLs and also lists a previous state of the gallery and the last known download state of the gallery. If the last known download state does not list as having reached the previous known state (because the program was abruptly terminated) and the old download state lists the gallery as having been fully downloaded, then it does not keep trying to download past the last known download. Otherwise it could keep downloading because of "anomalous interruption". If the gallery was never fully downloaded and a known state of the gallery as "fully downloaded" is not reached (with 3 overlaps because "abort:3"), then you keep downloading, never aborting nor terminating.

This means the SQL needs to record which images belong to which gallery, and this is not done everywhere AFAIK. It might require a full re-check (but not redownload) of everything so that it writes which image belongs to whom and which gallery is complete at which images downloaded, and then write all new image downloads with that same protocol (a rough sketch of this bookkeeping is below). This would make it possible to not have "gaps" in case of a computer going down mid-download of a gallery (something that happens quite often *sigh*).

Reverse image download would only make sense for galleries like "Tapas comics" which add the new chapters at the end of the gallery. But, again, it is just faster to use "skip = true" (which is the default, or "do not abort nor terminate") for those edge cases.
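A rough sketch of the per-gallery bookkeeping idea described above, using sqlite; this is not gallery-dl's actual archive schema, and the table and column names are made up purely for illustration:

```python
# Hypothetical per-gallery completion tracking (NOT gallery-dl's real archive format).
import sqlite3

con = sqlite3.connect("archive-sketch.sqlite3")
con.executescript("""
CREATE TABLE IF NOT EXISTS entries (
    gallery  TEXT,               -- which gallery/artist an image belongs to
    image_id INTEGER,            -- site-side id or running index of the image
    PRIMARY KEY (gallery, image_id)
);
CREATE TABLE IF NOT EXISTS gallery_state (
    gallery       TEXT PRIMARY KEY,
    last_complete INTEGER        -- newest id up to which the gallery is known gap-free
);
""")

def record_download(gallery, image_id):
    """Remember that a single image was downloaded."""
    con.execute("INSERT OR IGNORE INTO entries VALUES (?, ?)", (gallery, image_id))
    con.commit()

def mark_complete(gallery, newest_id):
    """After a run that finished cleanly, record the gallery as gap-free up to newest_id."""
    con.execute("INSERT OR REPLACE INTO gallery_state VALUES (?, ?)", (gallery, newest_id))
    con.commit()
```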
Again, correct me if I'm wrong, but: my sample command line is … Each time it lists everything only to find that nothing has to be downloaded. If I would add …, is this right?
Is this correct? If so, then why do you think inversion of the download order won't help? It fetches 250 links (200 new + 50 existing), starts downloading from 199 back to 0, disconnects at 100; then the next time it would list 150 and continue roughly from 99 to 0. What am I missing here?
Because that is the same as writing down that one artist (or checking which artist it was working on, in the log or in the files sorted by "most recent"), starting the download with that gallery, and telling it to not abort/terminate. Your system not only checks ALL the URLs in that gallery but also does it twice: first in one way, then the opposite way as it downloads. And it still solves nothing, because what if the gallery is 250 images, you have "abort 50", and you downloaded both the first 51 and the last 51 images? You are still missing 148 images. Telling it to not terminate downloads on that one gallery is just the easiest way to do it without rewriting how SQLs are handled and how the abort/terminate is handled.
It should LIST from the start "as now", stop when it has seen enough already-downloaded files ("as now"), but then start fetching them from oldest to newest. Basically, the current algo is roughly this: fetch a page, download whatever is new on it, count consecutive skips, and abort once the skip limit is hit.
This can be seen as an optimized version of this one algo: fetch ALL pages and collect ALL links first, drop the ones already downloaded, then download the rest.
So what I am proposing is reversing the list before the last step! Currently we parse and download sequentially; I suggest parsing everything we are GOING to download first, and then downloading it in reversed order. The only serious issue I can imagine is some kind of link expiration, e.g. for the very first one, because it would be downloaded potentially long after it was fetched.
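A minimal sketch of the proposed two-phase behaviour (list newest-first with the usual abort counter, then download the collected links oldest-first); `fetch_pages`, `already_downloaded` and `download` are hypothetical placeholders, not gallery-dl internals:

```python
# Sketch of "list forward, download in reverse" (not an existing gallery-dl feature).
def reverse_download(fetch_pages, already_downloaded, download, abort_after=3):
    pending = []
    consecutive_skips = 0
    aborted = False

    # Phase 1: list newest-first, exactly as today, and stop on the abort condition.
    for page in fetch_pages():                  # pages listed newest-first
        for url in page:
            if already_downloaded(url):
                consecutive_skips += 1
                if consecutive_skips >= abort_after:
                    aborted = True
                    break
            else:
                consecutive_skips = 0
                pending.append(url)
        if aborted:
            break

    # Phase 2: download the collected links oldest-first, so an interrupted run
    # never leaves a gap behind the newest file already on disk.
    for url in reversed(pending):
        download(url)
```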
Oh, "first 51 + last 51" is impossible if using terminate + reverse correctly, since reverse would never make a gap. |
You still have not answered how that is different from "do not abort/terminate", aside from being slower. You keep focusing on the minutiae as if they mattered. If you do not "abort/terminate", then you download everything you did not download, leaving no gaps. But it requires a single pass rather than multiple, because a gallery of even as few as 250 images usually requires 4 main pages which you have to download anyway; but ALSO you are asking the thing to ignore the first few because there was an anomalous termination or whatever, which means you have gaps. What if you have multiple gaps? What if you have... basically you want something which is an edge case at best, and i can see it working only on galleries which add new images at the very end of a gallery. And even then, just setting that gallery to not abort/terminate is faster than what you are proposing, even when you have to download no new images.
What are you talking about? Say a gallery has 3 pages, 5 images per page, listed newest first:

Page 1: image15, image14, image13, image12, image11; Page 2: image10 down to image6; Page 3: image5 down to image1.

For example, I have abort=3. The first pass will fetch all 3 pages and download all 15 images, since nothing is downloaded yet.

Imagine there are 4 new images. Now it looks like this:

Page 1: image19, image18, image17, image16, image15; and so on.

It will fetch page 1, downloading 19, 18, 17, 16 but skipping 15; then it will fetch page 2 and skip 14, 13, then abort because 3 were skipped already. So far so good.

Now imagine 6 new images:

Page 1: image25, image24, image23, image22, image21; and so on.

It fetches page 1, downloading all of 25, 24, 23, 22, 21, then fetches page 2, but assume the internet disconnects before the download of image20 starts, ruining the job. The user restarts later and here is what happens: it fetches page 1, skips 25, 24, 23 as already downloaded and aborts! Never fetching page 2 and failing to grab image20 forever.

Now, my algo from the start (3 pages, 15 images) will do this: list everything, then download image1 through image15, oldest first.

First update (4 pages, 19 images): fetch page 1 (one skip: image15), fetch page 2 until the third skip, stop listing, then download 16, 17, 18, 19 in that order.

Second update (5 pages, 25 images): fetch page 1 (all new), fetch page 2 until the third skip, stop listing, then download 20, 21, 22, 23, 24, 25 in that order. No matter on which file the internet disconnects, there won't be a "gap" inside the sequence; it always grows backwards monotonically.

If I would not use abort, then the last case would be: the same downloads, but only after fetching every page of the gallery first.

My method won't cause unnecessary page fetches (which is a huge problem with artists that have thousands of images) and at the same time guarantees that it never misses anything if you always enable it.
okay. now validate by showing how any of those scenarios is better than just fetching all in speed, resources and/or efficiency. |
I can run the script fetching 200+ artists each day, and at best it would make just 200 requests instead of 200 times the number of pages (per each artist accordingly), which could be huge for some artists. I'm already hitting the pixiv flood limit 3 times per "just skip normally" run, for example. Don't tell me to increase timeouts, don't tell me to run the script more rarely, and don't tell me to raise the abort value. All of those are WORKAROUNDS, while reversing the download sequence is the solution!
The high abort value is a trade-off between "fetching too many pages unnecessarily" and "a high chance to miss something somewhere". Also, the rarer the script runs are, the bigger the abort value should be, meaning that if you want to keep it low (within 1-2 pages), you would have to schedule the whole job to run more often, so that no more than 2 pages of new content would appear for any artist, to be 100% safe in all cases.
You are clearly under intense stress and FOMO. FOMO is bad and you should get treatment for it. I download 1500+ artists daily and i rarely see more than 20 images being added per day, except for the AI """artists""". Your screams of "i will lose something somewhere" are an obvious sign of FOMO turning into useless panic. This is not dismissal, it is concern. Get help.

That said: for pixiv i suggest the dedicated downloader https://github.com/Nandaka/PixivUtil2/releases which prevents a lot of your problems, as gallery-dl is a generic downloader and falls short of dedicated ones. Pixiv artists do have the tendency of adding stuff like 15 images all at once like a "manga"; that utility allows you to scour those pages unimpeded and check all the artists. From pixiv alone i have 300+ artists, and at worst all i ever got was reduced download speed with that utility, when rebuilding an archive.
So you need to either have low abort, or reverse downloading. |
Reverse downloading is just worse forward downloading without abort. And yes, i keep abort at 5 at most, except for TAPAS webcomics, which add the new stuff at the very end, so i set those specifically to download without skipping.
It may or may not be worse, considering what is faster in every particular case: downloading all needed pages and then all needed images, or taking pages one by one with images in between. When a lot of content is added, the abort value would be respected anyway, and the total speed will be the same. We are not wasting anything: we download the same pages and the same images, just saving this limited set of images from oldest to newest.
With skip enabled you just retrieve the URL to check whether it was downloaded, which you need to do anyway. The normal way just downloads anything missing immediately, rather than building the image gallery tree first and then downloading what's missing. In most cases building a full image tree is impossible due to 429 blocks.
You don't need ALL the links; you will get only the missing ones, and then the abort hits. Are you still not getting the point?
You need all of the links to know which to discard. There is no "future reading" + "i will not need this" feature that lets you discard stuff you do not need before checking what it is to tell you that you do not need it. Reading backwards still requires you to get all the links on a page to go to the end to see if there are more pages.
Currently: …
I presume you don't understand how reverse downloading will "fill the gap" if it can't know about it? If you already made a gap somehow (by not using reverse downloading, for example), then the only way to fix it is to run without abort. And it knows perfectly well which ones exactly, just as abort does currently.
So. "Forward" downloading (without abort/terminate)
If you see a (few) artist(s) with a (/some) broken download(s), you just write down that artist(s) and do a special forward download just for them. That is literally it. It's not like it happens constantly; it can only happen once every time you do a full remap of an artist's site. If you left the PC doing something, you know when it restarted, at which point you just write down which artist's folder is the last one by telling it to sort folders by "modified", and you tell it to check that artist specifically. Otherwise it's handled by infinite retries or other systems already in place.
What is "mapping the site"? |
Each run of "downloading without abort/terminate" will fetch ALL pages of the artist, no matter if it really had new stuff or older gaps.
On the other hand, using abort/terminate will fix all of the above EXCEPT for the possibility of leaving gaps!

What you suggested is a workaround: "if gallery-dl failed, then do a full run" (or estimate where the gap is and change the abort value, etc.). If by "mapping the site" you meant "store the list of links before starting the actual download", then yes, but this stage is cheap in terms of resource usage (especially if the abort value is low).

If @mikf were to say "I cannot do this due to complexity for now", I could perfectly understand.
Oh, I have another fair idea! How about a new argument, like … Meaning, you set it to a period roughly "since your previous run", for example 604800 (60x60x24x7) is one week, and so any file that was updated no more than a week ago would NOT be taken into account for abort.

This way, anything you might have been downloading recently will still be re-fetched anyway (but not re-downloaded, just as now), even if nothing changed. For artists that had no new works for months, this will do nothing, since everything would count for abort (which you can set pretty low now). But for those who posted, e.g., 100 new pictures yesterday, you will fetch those pages again, even if your abort value is much smaller. The download would stop anyway, after fetching older pages that trigger the abort normally.

This will effectively prevent gaps as long as you keep re-running the job in case of errors (or just to be extra sure), because even if a previous download fails, the next one will retry that exact file, since nothing from the newly downloaded files counts toward abort. How does that sound?
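Purely as an illustration (this is not an existing gallery-dl option), a sketch of what that rule could look like; `mtime_of` and the other helpers are hypothetical placeholders:

```python
# Sketch of the proposed rule: skips of files written within the last `grace_seconds`
# do not count toward the abort limit, so a recently interrupted run gets re-checked.
import time

def download_with_graceful_abort(fetch_pages, already_downloaded, download,
                                 mtime_of, abort_after=3, grace_seconds=604800):
    consecutive_skips = 0
    now = time.time()
    for page in fetch_pages():                      # newest-first listing, as today
        for url in page:
            if already_downloaded(url):
                if now - mtime_of(url) > grace_seconds:
                    consecutive_skips += 1          # old file: counts toward abort
                    if consecutive_skips >= abort_after:
                        return
                # recently written file: skipped, but does NOT count toward abort
            else:
                consecutive_skips = 0
                download(url)
```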
Open a new case for this; this one was closed, so it is probably unfollowed by anybody else. But yeah, that's a more sensible way of dealing with a crashed/rebooted PC in the middle of a download.
Oh, there is a thread: #5255 |
Wait, there is also something tackling this with the archive: fd734b9
Hmm, it didn't help: even with …, I think this is mainly because the files are there, and gallery-dl relies on their existence and not on the archive. UPD: …
Wow, this is better: …
Will it download only the new uploads since the interruption, or will it also finish downloading the older posts that were not downloaded previously due to the interruption? Is that accounted for in the archive?