
Speedup? #5

Open

gjaekel opened this issue Nov 25, 2022 · 3 comments

Comments

@gjaekel

gjaekel commented Nov 25, 2022

Dear Alexander,
first, a big thank you to you and the other contributors mentioned in the README.

To speed up archiving a whole year, I wonder whether it would be possible to trigger the generation of (all or some number of) the documents in a first step, and to start the download of all of them in a second step.

This would avoid busy-waiting for the generation time at every single document.

I'll start a PoC on this now.
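
Roughly, the idea could look like this (just a sketch, not based on the actual download.sh internals; issue_urls, the curl calls and the sleep time are placeholders):

#!/usr/bin/env bash
# Sketch of the two-pass idea (placeholder URLs and timings).
issue_urls=( "$@" )   # placeholder: the issue URLs of the year to archive

# Pass 1: send one request per issue to trigger the server-side generation.
for url in "${issue_urls[@]}"; do
    curl -s -o /dev/null "${url}"
done

# Wait once for the backend instead of busy-waiting per document.
sleep 600

# Pass 2: download the now-prepared PDFs.
for url in "${issue_urls[@]}"; do
    curl -s -O "${url}"
done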

@gjaekel
Author

gjaekel commented Nov 25, 2022

By temporarily inserting a continue 2 before L120 ...

dl_for_heise/download.sh

Lines 117 to 121 in 2e59241

else
    # If the header says it is not a pdf, we will try again.
    echo "${logp}${try} Server did not serve a valid pdf (instead ${content_type})."
    sleepbar ${wait_between_downloads}
fi

... I made a POC-version that just will trigger the backend to prepare the documents.
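
For reference, the temporary change looks roughly like this (my own rendering of the hack, not a committed patch; it assumes the snippet sits inside the nested retry loop, so continue 2 jumps to the next document of the outer loop):

else
    # If the header says it is not a pdf, we will try again.
    echo "${logp}${try} Server did not serve a valid pdf (instead ${content_type})."
    # PoC hack: skip the retry/wait entirely and go on with the next document;
    # the failed request has already triggered the server-side generation.
    continue 2
    sleepbar ${wait_between_downloads}
fi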

I ran this to prepare five issues and started the unmodified version immediately afterwards, but this was too fast: it seems I ran into a DoS protection because of too many requests at a time, since I got an HTTP 500. But after about half an hour I tried again, and that run downloaded the five issues one by one without any delay.

@gjaekel
Author

gjaekel commented Nov 25, 2022

Another approach: I just set max_tries_per_download=1 (L9) and ran the script multiple times.

  1. The first run triggers the preparation. This takes about a minute for a year of issues.
  2. The second run was able to get most of the issues without any waiting time. It failed at issues 22 and 26; maybe this run started just a little too early.
  3. The third run was able to get the remaining issues.

Maybe a good approach would be to just "pull out" the download retries from the innermost loop into an outermost one.
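
One way to sketch that (a hypothetical wrapper, assuming download.sh signals remaining failures via its exit code, which it may not do yet):

#!/usr/bin/env bash
# Hypothetical outer retry loop around whole script runs,
# with max_tries_per_download=1 set inside download.sh.
outer_tries=3
for pass in $(seq 1 "${outer_tries}"); do
    echo "Pass ${pass}/${outer_tries}"
    ./download.sh && break   # stop as soon as a pass finishes without failures
    sleep 1800               # give the backend time to finish the remaining PDFs
done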

@AlexanderMelde
Owner

Hi Guido, you're welcome! Thank you for documenting your experiments.
Indeed, I followed a very similar approach and just ran the script twice: a first run with no repetitions or wait times to only "trigger" the server-side PDF generation, and a second run to finally download the files. That worked well and especially fast for most PDFs; however, it wasn't really reliable (e.g. due to the DDoS protection). For this script, I decided to keep it "safe, but slow", with a high number of repetitions and long wait times, to ensure everything is downloaded, e.g. over night.
If you want a quicker run, with the downside of having to monitor the progress manually, everyone should feel free to adapt the parameters (as we two did :) ).
Maybe we could introduce some kind of config file, or example sets of parameters to include in the README file 😄
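
For example (just an illustration with made-up values; only the variable names max_tries_per_download and wait_between_downloads come from the script and this thread):

# "safe, but slow" – unattended run, e.g. over night
max_tries_per_download=20
wait_between_downloads=60

# "fast, but needs monitoring" – trigger generation in a first pass, re-run later
max_tries_per_download=1
wait_between_downloads=5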
