Web scraper to harvest data items marked up following the Bioschemas specifications. This project is based on Scrapy, a Python framework for crawling web resources.
You will need pip to install the script requirements; here you will find documentation about installing pip for your OS. The safest way to install the requirements without affecting any other Python project is to use virtualenv. You will also need to install Scrapy; the installation steps are available here. Finally, you will need a running Elasticsearch instance to store the crawled records.
```
git clone https://github.com/BioSchemas/bioschemas-scraper.git
cd bioschemas-scraper
virtualenv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
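Before launching a crawl, it can help to confirm that your Elasticsearch instance is reachable. Below is a minimal check, assuming a local instance on the default port and that the elasticsearch Python client is available in your virtualenv (check requirements.txt):

```python
from elasticsearch import Elasticsearch

# Assumes a local instance on the default port; adjust host and port to
# match the values you configure in bioschemas_scraper/settings.py.
es = Elasticsearch(["http://localhost:9200"])
print(es.ping())  # True if the instance answers, False otherwise
```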
When you are done running the scraper, deactivate your virtual environment:
```
deactivate
```
To point the scraper at your Elasticsearch instance, edit the last lines of the file bioschemas_scraper/settings.py (a sketch of this block follows below).

This scraper is set to crawl the TeSS Events website by default. If you want to write a new spider for a different website, take a look at bioschemas_scraper/spiders/bioschemas_spider_xml.py; a minimal spider skeleton is also sketched below.

If you want to add additional processing to the crawled records, check the pipelines defined in bioschemas_scraper/pipelines. For now there is only one pipeline: it takes every crawled Bioschemas object and validates it against the Bioschemas Event specification, available as a JSON Schema file at bioschemas_scraper/utils/schemas/Event.json. The validation logic lives in bioschemas_scraper/utils/validators.py.
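As an illustration, the Elasticsearch block at the bottom of bioschemas_scraper/settings.py might look like the following; the setting names shown here are hypothetical, so match whatever names the file actually uses:

```python
# Hypothetical setting names for illustration only: edit the names that
# actually appear at the bottom of bioschemas_scraper/settings.py.
ELASTICSEARCH_HOST = "localhost"
ELASTICSEARCH_PORT = 9200
ELASTICSEARCH_INDEX = "bioschemas"
```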
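Here is a minimal sketch of a new spider, assuming the target site embeds its Bioschemas markup as JSON-LD; the class name and URL are placeholders, and the real reference implementation to follow is bioschemas_spider_xml.py:

```python
import scrapy

class ExampleEventsSpider(scrapy.Spider):
    # Placeholder name and start URL: adapt both to your target site.
    name = "example_events"
    start_urls = ["https://example.org/events"]

    def parse(self, response):
        # Collect embedded JSON-LD blocks, which commonly carry
        # Bioschemas markup, and yield them as items for the pipelines.
        blocks = response.xpath('//script[@type="application/ld+json"]/text()').extract()
        for block in blocks:
            yield {"raw_jsonld": block}
```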
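The validation step itself can be as small as a call to the jsonschema package. This is a sketch of the idea, not the actual code in utils/validators.py:

```python
import json

from jsonschema import ValidationError, validate

def validate_event(item, schema_path="bioschemas_scraper/utils/schemas/Event.json"):
    """Validate a crawled item against the Bioschemas Event JSON Schema."""
    with open(schema_path) as f:
        schema = json.load(f)
    try:
        validate(instance=item, schema=schema)
        return True
    except ValidationError:
        return False
```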
In the root of the repo run:
```
scrapy crawl <spider_name>
```

Note that scrapy crawl expects the spider's name (the name attribute defined inside the spider class, e.g. in bioschemas_scraper/spiders/bioschemas_spider_xml.py), not a URL; the default spider targets https://tess.elixir-europe.org/events.
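Once the crawl finishes, you can check that records landed in Elasticsearch. Below is a sketch assuming an older elasticsearch-py client that accepts a body argument; the index name is a placeholder, so use the one configured in settings.py:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed local instance

# "bioschemas" is a placeholder index name: use the one from settings.py.
result = es.search(index="bioschemas", body={"query": {"match_all": {}}})
print(result["hits"]["total"])
```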
- Microdata