
Speedup? #5

Open

gjaekel opened this issue Nov 25, 2022 · 3 comments

Comments

@gjaekel

gjaekel commented Nov 25, 2022

Dear Alexander,
first, a big thank you to you and the other contributors mentioned in the README.

To speed up archiving a whole year, I wonder whether it would be possible to trigger the generation of (all or some number of) the documents in a first step, and to start the download of all of them in a second step.

This would avoid busy-waiting for the generation time at every single document.

I'll start a PoC on this now.
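
Roughly, the idea could look like this (just a sketch, not based on the actual download.sh internals; issue_urls, the curl calls and the sleep time are placeholders):

#!/usr/bin/env bash
# Sketch of the two-pass idea (placeholder URLs and timings).
issue_urls=( "$@" )   # placeholder: the issue URLs of the year to archive

# Pass 1: send one request per issue to trigger the server-side generation.
for url in "${issue_urls[@]}"; do
    curl -s -o /dev/null "${url}"
done

# Wait once for the backend instead of busy-waiting per document.
sleep 600

# Pass 2: download the now-prepared PDFs.
for url in "${issue_urls[@]}"; do
    curl -s -O "${url}"
done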

@gjaekel
Author

gjaekel commented Nov 25, 2022

By temporarily inserting a continue 2 before L120 ...

dl_for_heise/download.sh

Lines 117 to 121 in 2e59241

else
    # If the header says it is not a pdf, we will try again.
    echo "${logp}${try} Server did not serve a valid pdf (instead ${content_type})."
    sleepbar ${wait_between_downloads}
fi

... I made a POC-version that just will trigger the backend to prepare the documents.
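
For reference, the temporary change looks roughly like this (my own rendering of the hack, not a committed patch; it assumes the snippet sits inside the nested retry loop, so continue 2 jumps to the next document of the outer loop):

else
    # If the header says it is not a pdf, we will try again.
    echo "${logp}${try} Server did not serve a valid pdf (instead ${content_type})."
    # PoC hack: skip the retry/wait entirely and go on with the next document;
    # the failed request has already triggered the server-side generation.
    continue 2
    sleepbar ${wait_between_downloads}
fi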

I ran this to prepare five issues and started the unmodified version immediately afterwards, but this was too fast: it seems I ran into a DoS protection because of too many requests at a time, since I got an HTTP 500. But after about half an hour I tried again, and that run downloaded the five issues one by one without any delay.

@gjaekel
Author

gjaekel commented Nov 25, 2022

Another approach: I just set max_tries_per_download=1 (L9) and ran the script multiple times.

  1. The first run triggers the preparation. This takes about a minute for a year of issues.
  2. The second run was able to get most of the issues without any waiting time. It failed at issues 22 and 26; maybe this run started just a little too early.
  3. The third run was able to get the remaining issues.

Maybe a good approach would be to just "pull out" the download retries from the innermost loop into an outermost one.
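
One way to sketch that (a hypothetical wrapper, assuming download.sh signals remaining failures via its exit code, which it may not do yet):

#!/usr/bin/env bash
# Hypothetical outer retry loop around whole script runs,
# with max_tries_per_download=1 set inside download.sh.
outer_tries=3
for pass in $(seq 1 "${outer_tries}"); do
    echo "Pass ${pass}/${outer_tries}"
    ./download.sh && break   # stop as soon as a pass finishes without failures
    sleep 1800               # give the backend time to finish the remaining PDFs
done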

@AlexanderMelde
Owner

Hi Guido, you're welcome! Thank you for documenting your experiments.
Indeed, I followed a very similar approach and just ran the script twice: a first run with no repetitions or wait times to only "trigger" the server-side PDF generation, and a second run to finally download the files. That worked well and especially fast for most PDFs; however, it wasn't really reliable (e.g. due to the DDoS protection). For this script, I decided to keep it "safe, but slow", with a high number of repetitions and long wait times, to ensure everything is downloaded, e.g. over night.
If you want a quicker run, with the downside of having to monitor the progress manually, everyone should feel free to adapt the parameters (as we two did :) ).
Maybe we could introduce some kind of config file, or example sets of parameters to include in the README file 😄
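
For example (just an illustration with made-up values; only the variable names max_tries_per_download and wait_between_downloads come from the script and this thread):

# "safe, but slow" – unattended run, e.g. over night
max_tries_per_download=20
wait_between_downloads=60

# "fast, but needs monitoring" – trigger generation in a first pass, re-run later
max_tries_per_download=1
wait_between_downloads=5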
