Website scraping scripts for the FerryWave project
You will need Python 3.9.6 or later to run these scripts.
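If you are unsure which interpreter is active, a quick check along these lines can help. This is only an illustrative sketch: the names MIN_VERSION and python_is_supported are made up for this example and are not part of the project's scripts.

```python
import sys

# Minimum version stated in this README.
MIN_VERSION = (3, 9, 6)


def python_is_supported(version=sys.version_info):
    """Return True if the given version tuple is at least MIN_VERSION."""
    # Compare only (major, minor, micro); sys.version_info carries extra fields.
    return tuple(version[:3]) >= MIN_VERSION


if __name__ == "__main__":
    if not python_is_supported():
        sys.exit("Python 3.9.6 or later is required")
    print("Python version OK")
```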
The /venv folder contains a Python environment with the necessary libraries. Before running the scripts, activate this environment with the source command:
source /venv/bin/activate
If the environment does not work for you, the file requirements.txt lists all the necessary Python packages. You can install them on your machine with pip:
pip install -r requirements.txt
You need to export the MongoDB password for the 'python-user' account as an environment variable in your shell. You can do it like this:
export MONGODB_PASSWORD=<password>
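For orientation, a script could pick up the exported variable roughly as sketched below. The function name build_mongo_uri and the host localhost:27017 are placeholders chosen for this example; the project's actual connection code and server address may differ.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(env=os.environ, host="localhost:27017"):
    """Build a MongoDB connection URI for the 'python-user' account.

    The default host is a placeholder, not the project's real server.
    """
    password = env.get("MONGODB_PASSWORD")
    if not password:
        raise RuntimeError(
            "MONGODB_PASSWORD is not set; export it before running the scripts"
        )
    # Percent-encode the credentials so characters such as '@' or ':'
    # in the password do not break the URI.
    user = quote_plus("python-user")
    return f"mongodb://{user}:{quote_plus(password)}@{host}/"
```

Failing early with a clear message when the variable is missing is usually easier to debug than an authentication error from the database later on.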
Tesseract is an optical character recognition (OCR) engine that is used to scrape some of the timetable data for the FerryWave website. You will need to install Tesseract and Tesseract OCR on your machine in order to run scraping for all the websites.
Follow the official documentation for installation steps for your OS (choose version 5):
After the installation you will have to provide the path to the Tesseract executable file. Change the appropriate line in
settings.py
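If you do not know where the executable was installed, the standard library can look it up on your PATH. The helper below is a sketch with a hypothetical name (find_tesseract); it only prints the path, which you would then copy into the appropriate line of settings.py yourself.

```python
import shutil
from typing import Optional


def find_tesseract(candidate: Optional[str] = None) -> Optional[str]:
    """Return the path to the Tesseract executable, or None if not found."""
    # shutil.which accepts a bare command name or a full path and returns
    # None when no matching executable exists.
    return shutil.which(candidate or "tesseract")


if __name__ == "__main__":
    path = find_tesseract()
    print(path or "Tesseract not found on PATH; set the path manually")
```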
Running update_database.py
will run the scraping for all defined website destinations, clear the database, and push the new data into it.
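The flow described above can be pictured with the simplified sketch below. It is not the contents of update_database.py: InMemoryCollection, the scrapers mapping, and the row format are all assumptions made for illustration, with an in-memory list standing in for the MongoDB collection.

```python
from typing import Callable, Dict, List


class InMemoryCollection:
    """Hypothetical stand-in for a database collection."""

    def __init__(self) -> None:
        self.docs: List[dict] = []

    def clear(self) -> None:
        self.docs.clear()

    def insert_many(self, docs: List[dict]) -> None:
        self.docs.extend(docs)


def update_database(
    scrapers: Dict[str, Callable[[], List[dict]]],
    collection: InMemoryCollection,
) -> int:
    """Scrape every destination, clear the collection, push the new rows."""
    fresh: List[dict] = []
    # Step 1: run the scraper for each defined destination.
    for destination, scrape in scrapers.items():
        for row in scrape():
            fresh.append({"destination": destination, **row})
    # Step 2: clear the old data only after all scraping has finished.
    collection.clear()
    # Step 3: push the new data.
    if fresh:
        collection.insert_many(fresh)
    return len(fresh)
```

In this sketch the scraping completes before anything is deleted, so an exception raised by a scraper leaves the existing data untouched.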