
English (US) | Português (BR)

Querido Diário

Within the Querido Diário ecosystem, this repository is responsible for scraping the publishing sites of official gazettes.

Find out more about the technologies and history of the project on the Querido Diário website.

Summary

- How to contribute
- Development Environment
- How to run
- Troubleshooting
- Support
- Thanks
- Open Knowledge Brazil
- License

How to contribute


Thank you for considering contributing to Querido Diário! 🎉

You can find how to do it at CONTRIBUTING-en-US.md!

Also, check the Querido Diário documentation for further guidance.

Development Environment

You need Python (3.0+) and the Scrapy framework installed.
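If you want to confirm your Python version before continuing (a quick check; `python3` is assumed to be on your PATH, as in the setup commands below):

```sh
python3 --version
```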

The commands below set it up on a Linux operating system. They create a virtual Python environment, install the requirements listed in requirements-dev.txt, and install the code standardization tool pre-commit.

python3 -m venv .venv
source .venv/bin/activate
pip install -r data_collection/requirements-dev.txt
pre-commit install
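To confirm the setup worked, you can ask each tool for its version (a minimal sanity check; `scrapy version` and `pre-commit --version` are the standard version commands of those tools):

```sh
scrapy version
pre-commit --version
```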

Configuration on other operating systems is available at "how to setup the development environment", including more details for those who want to contribute to the repository.

How to run

To try running a scraper already integrated into the project, or to test one you are developing, follow these steps:

1. If you haven't already done so, activate the virtual environment in the /querido-diario directory:

   source .venv/bin/activate

2. Go to the data_collection directory:

   cd data_collection

3. Check the list of available scrapers:

   scrapy list

4. Run a listed scraper:

   scrapy crawl <scraper_name> # example: scrapy crawl ba_acajutiba

5. The official gazettes collected from scraping will be saved in the data_collection/data folder.

6. When executing step 4, the scraper will collect all official gazettes from that municipality's publishing site since its first digital edition. For smaller runs, pass arguments in the run command (see the combined example after this list):

   - start_date=YYYY-MM-DD: sets the collection start date.

     scrapy crawl <scraper_name> -a start_date=<YYYY-MM-DD>

   - end_date=YYYY-MM-DD: sets the collection end date. If omitted, it defaults to the date of execution.

     scrapy crawl <scraper_name> -a end_date=<YYYY-MM-DD>
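Both arguments can be combined to collect a specific period. The sketch below uses the ba_acajutiba scraper from the example above; the date range is only an illustration:

```sh
scrapy crawl ba_acajutiba -a start_date=2022-01-01 -a end_date=2022-01-31
```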

Troubleshooting

Check out the troubleshooting file to resolve the most common issues with the project's environment setup.

Support

Join our community server on Discord for exchanges about projects, questions, requests for help with contributions, and conversation about civic innovation in general.

Thanks

This project is maintained by Open Knowledge Brazil and made possible thanks to technical communities, the Ambassadors of Civic Innovation, volunteers, and financial donors, as well as partner universities and supporting and funding companies.

Meet those who support Querido Diário.

Open Knowledge Brazil


Open Knowledge Brazil is a non-profit civil society organization whose mission is to use and develop civic tools, projects, public policy analysis, and data journalism to promote free knowledge in various fields of society.

All work produced by OKBR is openly and freely available.

License

Code licensed under the MIT License.