LearnCPP Downloader

Multi-threaded web scraper to download all the tutorials from www.learncpp.com and convert them to PDF files concurrently.

Support ❤️

Please support here: https://www.learncpp.com/about/

Execution

Docker

Get the image

docker pull amalrajan/learncpp-download:latest

And run the container

docker run --rm --name=learncpp-download --mount type=bind,destination=/app/learncpp,source=/home/amalr/temp/downloads amalrajan/learncpp-download

Replace /home/amalr/temp/downloads with the local path where you want the files downloaded.
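
If you'd rather not edit the path by hand, the command can be assembled from a directory of your choice. This is only a sketch: the helper name build_docker_command and the example directory ~/learncpp-pdfs are made up for illustration, while the image name, container name, and mount destination are copied from the command above. It resolves the directory to an absolute path, since Docker generally expects one for bind mounts, and prints the command rather than running it.

```python
import shlex
from pathlib import Path

IMAGE = "amalrajan/learncpp-download"  # image name taken from the README


def build_docker_command(download_dir: str) -> list[str]:
    """Return the `docker run` argument list for a given host directory.

    Hypothetical helper, not part of the repository.
    """
    # Expand "~" and resolve to an absolute path for the bind mount source.
    source = Path(download_dir).expanduser().resolve()
    return [
        "docker", "run", "--rm",
        "--name=learncpp-download",
        "--mount", f"type=bind,destination=/app/learncpp,source={source}",
        IMAGE,
    ]


if __name__ == "__main__":
    # Print the command so it can be inspected before it is run.
    print(shlex.join(build_docker_command("~/learncpp-pdfs")))
```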

Local

Install these dependencies

  1. Python 3.10.12

  2. wkhtmltopdf

  • Debian-based: sudo apt install wkhtmltopdf
  • macOS: brew install Caskroom/cask/wkhtmltopdf
  • Windows: choco install wkhtmltopdf (or simply download it the old-fashioned way). I wouldn't recommend using Windows, as the fonts are a bit weird. Unless, of course, you have a thing for weird stuff.
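
Before running the scraper, a quick check along these lines can confirm both dependencies are in place. It is a minimal sketch rather than part of the repository: the function name missing_prerequisites is made up, and accepting any Python 3.10 or newer (instead of exactly 3.10.12) is an assumption.

```python
import shutil
import sys


def missing_prerequisites() -> list[str]:
    """Return a human-readable list of prerequisites that look missing."""
    problems = []
    # The README names Python 3.10.12; accepting any 3.10+ is an assumption.
    if sys.version_info < (3, 10):
        problems.append(f"Python 3.10+ expected, found {sys.version.split()[0]}")
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("wkhtmltopdf") is None:
        problems.append("wkhtmltopdf not found on PATH")
    return problems


if __name__ == "__main__":
    issues = missing_prerequisites()
    print("\n".join(issues) if issues else "All prerequisites found.")
```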

Run it

Clone the repository

git clone https://github.com/amalrajan/learncpp-download.git

Install Python dependencies

cd learncpp-download
pip install -r requirements.txt

Run the script

scrapy crawl learncpp 

You'll find the downloaded files inside the learncpp directory under the repository root.

FAQ

I'm getting rate limit errors. What should I do?

Open settings.py and set DOWNLOAD_DELAY (the delay between requests, in seconds) to a higher value. The default is 0. Try setting it to 0.2.
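
In settings.py this is a single assignment. The fragment below uses the 0.2 suggested above; the right value for your connection may differ, so treat it as a starting point.

```python
# settings.py (fragment)

# Seconds Scrapy waits between consecutive requests to the same site.
# The README gives 0 as this project's default.
DOWNLOAD_DELAY = 0.2
```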

This script is using 100% CPU. What's wrong?

That's expected, since up to 192 PDF conversions run concurrently. You can, however, reduce the concurrency in learncpp.py:

self.executor = ThreadPoolExecutor(
    max_workers=192
)  # Limit to 192 concurrent PDF conversions

Change max_workers to a lower value. The default is 192.
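
One way to pick a lower value is to tie it to the number of CPU cores instead of hard-coding it. The sketch below is standalone and makes several assumptions: the sizing rule (two workers per core, capped at 32) is arbitrary, and convert and run_conversions are stand-ins, not functions from learncpp.py.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sizing rule: two workers per CPU core, capped at 32.
# The project itself hard-codes 192.
MAX_WORKERS = min(32, 2 * (os.cpu_count() or 1))


def convert(page: int) -> str:
    """Stand-in for one PDF conversion; the real work lives in learncpp.py."""
    return f"page-{page}.pdf"


def run_conversions(pages, max_workers=MAX_WORKERS):
    """Run convert() over pages using at most max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the input.
        return list(executor.map(convert, pages))


if __name__ == "__main__":
    print(MAX_WORKERS, run_conversions(range(3)))
```

In learncpp.py itself, the equivalent change is just the number passed as max_workers.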

Don't see what you are looking for?

Feel free to open a new issue here: https://github.com/amalrajan/learncpp-download/issues. Don't forget to attach those console logs.

License

The MIT License