JakubWronskiUG/ferry-python

ferry-python

Website scraping scripts for FerryWave project

Requirements

Python

You will need Python 3.9.6 or later to run this set of scripts.
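The version requirement above can be checked before running anything. This is a minimal sketch, not part of the repository; the helper name `python_is_supported` is hypothetical.

```python
import sys

# Minimum version stated in this README.
MIN_PYTHON = (3, 9, 6)


def python_is_supported(version_info=sys.version_info):
    """Return True if the given version meets the README's minimum."""
    # Compare only (major, minor, micro); tuples compare element by element.
    return tuple(version_info[:3]) >= MIN_PYTHON


print("supported" if python_is_supported() else "Python 3.9.6 or later is required")
```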

The venv folder contains a Python environment with the necessary libraries. Before running the scripts, activate it from the repository root with a source command: source venv/bin/activate

If the environment does not work for you, the file requirements.txt lists all the necessary Python packages. You can install them on your machine with pip: pip install -r requirements.txt

Credentials

You need to export the MongoDB password for the 'python-user' account into the local environment. You can do it like this: export MONGODB_PASSWORD=<password>
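As an illustration of how a script might pick up the exported password, here is a minimal sketch using only the standard library. The function name `build_mongodb_uri` and the host `db.example.com` are assumptions made for this example; the repository's actual connection code and host are not shown in this README.

```python
import os
from urllib.parse import quote_plus


def build_mongodb_uri(host, user="python-user", env=os.environ):
    """Build a MongoDB connection URI from the MONGODB_PASSWORD variable.

    `host` is a placeholder supplied by the caller; the real host is not
    documented here.
    """
    password = env.get("MONGODB_PASSWORD")
    if not password:
        raise RuntimeError(
            "MONGODB_PASSWORD is not set; export it before running the scripts"
        )
    # Percent-encode the credentials so characters such as '@' or '/'
    # do not break the URI.
    return f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}/"
```

Failing early with a clear message when the variable is missing tends to be easier to debug than an authentication error raised later by the database driver.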

Tesseract

Tesseract is an optical character recognition (OCR) engine that is used to scrape some of the timetabling data for the FerryWave website. You will need to install Tesseract OCR on your machine in order to run scraping for all the websites.

Follow the official documentation for installation steps for your OS (choose version 5):

  • https://github.com/tesseract-ocr/tesseract#installing-tesseract
  • Don't forget to install at least one full language package as well (preferably English): https://tesseract-ocr.github.io/tessdoc/Installation.html
After the installation you will have to provide the path to the Tesseract executable file. Change the appropriate line in settings.py
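To find the path to put into settings.py, something like the following sketch may help. The helper `resolve_tesseract_path` is hypothetical and not part of the repository, and the name of the setting in settings.py is not given in this README, so it is left out here.

```python
import shutil
from pathlib import Path


def resolve_tesseract_path(configured=None):
    """Return a path to the Tesseract executable, or None if none is found.

    `configured` stands for whatever path is currently written in
    settings.py (an assumption; the real setting name is not documented).
    """
    # Prefer the explicitly configured path when it points at a real file.
    if configured and Path(configured).is_file():
        return str(configured)
    # Otherwise fall back to searching the directories listed in PATH.
    return shutil.which("tesseract")


print(resolve_tesseract_path() or "Tesseract not found on PATH")
```

If this prints a path, that value is a reasonable candidate for the line in settings.py.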

Running the scraper

Running update_database.py scrapes all defined website destinations, clears the database, and pushes the new data into it.
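The scrape, clear, and push sequence described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of update_database.py: the `scrapers` mapping and the list-like `database` are hypothetical stand-ins for the project's real scrapers and MongoDB collection.

```python
def update_database(scrapers, database):
    """Scrape every destination, then replace the stored records.

    `scrapers` maps a destination name to a zero-argument callable that
    returns a list of dict records; `database` is any list-like store.
    Both are hypothetical stand-ins for this sketch.
    """
    records = []
    for destination, scrape in scrapers.items():
        for record in scrape():
            records.append({"destination": destination, **record})
    # Clear only after all scraping has finished, so an exception in a
    # scraper leaves the previously stored data untouched.
    database.clear()
    database.extend(records)
    return len(records)
```

Collecting all records before clearing is a choice made for this sketch; the README states the order of the steps but not how the real script handles a scraper failure.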
