Multi-threaded web scraper to download all the tutorials from www.learncpp.com and convert them to PDF files concurrently.
Please support learncpp.com here: https://www.learncpp.com/about/
Get the image
docker pull amalrajan/learncpp-download:latest
And run the container
docker run --rm --name=learncpp-download --mount type=bind,destination=/app/learncpp,source=/home/amalr/temp/downloads amalrajan/learncpp-download
Replace /home/amalr/temp/downloads with a local path on your system where you want the files downloaded.
- Python 3.10.12
- wkhtmltopdf
  - Debian based: sudo apt install wkhtmltopdf
  - macOS: brew install Caskroom/cask/wkhtmltopdf
  - Windows: choco install wkhtmltopdf
(or simply download it the old-fashioned way). I wouldn't recommend using Windows, as the fonts come out a bit weird. Unless, of course, you have a thing for weird stuff.
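Before running the scraper, you can sanity-check that wkhtmltopdf is actually on your PATH. A minimal sketch (the `has_binary` helper is illustrative, not part of this repository):

```python
import shutil


def has_binary(name: str) -> bool:
    """Return True if `name` resolves to an executable on PATH."""
    return shutil.which(name) is not None


# Warn up front rather than failing mid-crawl during PDF conversion.
if not has_binary("wkhtmltopdf"):
    print("wkhtmltopdf not found on PATH; PDF conversion will fail")
```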
Clone the repository
git clone https://github.com/amalrajan/learncpp-download.git
Install Python dependencies
cd learncpp-download
pip install -r requirements.txt
Run the script
scrapy crawl learncpp
You'll find the downloaded files inside the learncpp directory under the repository root.
Go to settings.py and set DOWNLOAD_DELAY to a higher value. The default is 0; try setting it to 0.2.
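For reference, the relevant line in settings.py would look something like this (a sketch; Scrapy interprets DOWNLOAD_DELAY as the number of seconds to wait between consecutive requests to the same site):

```python
# settings.py (excerpt)
# Scrapy waits this many seconds between requests to the same domain.
# 0 means no throttling; 0.2 adds a 200 ms pause per request.
DOWNLOAD_DELAY = 0.2
```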
That's the way it is. You can, however, reduce the concurrency factor in learncpp.py:
self.executor = ThreadPoolExecutor(
    max_workers=192
)  # Limit to 192 concurrent PDF conversions
Change max_workers to a lower value. The default is 192.
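As a sketch, a more conservative pool might look like the following (the worker count of 16 is just an example; the right value depends on your CPU and memory, not a project recommendation):

```python
from concurrent.futures import ThreadPoolExecutor

# A gentler concurrency limit than the default 192.
executor = ThreadPoolExecutor(
    max_workers=16  # e.g. cap PDF conversions at 16 at a time
)

# Usage sketch: submit work to the pool and collect results.
results = list(executor.map(lambda n: n * n, range(5)))
executor.shutdown()
```

Fewer workers means slower overall conversion but a much smaller memory footprint, since each in-flight wkhtmltopdf conversion holds its page in memory.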
Feel free to open a new issue here: https://github.com/amalrajan/learncpp-download/issues. Don't forget to attach those console logs.