The model takes in information about a given inspection under the Clean Air Act (CAA) and outputs the probability that the inspection will reveal a violation. The current output for all pollution sources regulated under the CAA is available through this web application, and comprehensive documentation for the project is available through the web application's documentation page. The model was created as part of my capstone project for The Data Incubator.
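At its core, this is a binary classification problem: each past inspection is labeled by whether it revealed a violation, and the positive-class probability is the published risk score. Below is a minimal sketch of that interface with random stand-in data; the real features come from the EPA data described below, and the actual estimator may differ.

```python
# Minimal sketch of the model's interface (not the actual pipeline):
# a binary classifier whose positive-class probability is the
# violation risk. Features and labels here are random stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((500, 2))         # stand-in features for inspections
y = (X[:, 1] > 0.7).astype(int)  # stand-in labels: 1 = violation found

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# The published output: probability that an inspection reveals a violation.
violation_prob = model.predict_proba(X)[:, 1]
```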
- `./scripts/execute_retraining.sh`: Script that runs the model retraining and pushes the result to Heroku (see the orchestration sketch after this list).
- `./scripts/prepare_to_retrain.py`: Script that prepares the directory for model retraining (mostly by preparing the log file).
- `./notebooks/download_data.ipynb`: Notebook that downloads all data for the model.
- `./scripts/download_data.py`: Script version of `download_data.ipynb`.
- `./notebooks/link_inspections_violations.ipynb`: Notebook that links inspections and violations.
- `./scripts/link_inspections_violations.py`: Script version of `link_inspections_violations.ipynb`.
- `./notebooks/prepare_nei_data.ipynb`: Notebook that prepares National Emissions Inventory (NEI) data for use in the model.
- `./scripts/prepare_nei_data.py`: Script version of `prepare_nei_data.ipynb`.
- `./notebooks/build_evaluate_model.ipynb`: Notebook that creates the training/test datasets, then trains and evaluates the model.
- `./scripts/build_evaluate_model.py`: Script version of `build_evaluate_model.ipynb`.
- `./scripts/tests.py`: Script that contains some unit tests.
- `./scripts/external_variables.py`: Contains a few variables that the system needs in order to run. A duplicate can be found at `./notebooks/external_variables.py`.
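The retraining pipeline presumably runs these scripts in sequence. Here is a hypothetical Python equivalent of that orchestration; the actual logic lives in `execute_retraining.sh`, and the exact ordering below is an assumption inferred from the file descriptions above.

```python
# Hypothetical sketch of the retraining pipeline's ordering, inferred
# from the file descriptions above; the real orchestration lives in
# ./scripts/execute_retraining.sh.
import subprocess

PIPELINE = [
    "scripts/prepare_to_retrain.py",
    "scripts/download_data.py",
    "scripts/link_inspections_violations.py",
    "scripts/prepare_nei_data.py",
    "scripts/build_evaluate_model.py",
    "scripts/tests.py",
]

for script in PIPELINE:
    # check=True stops the pipeline as soon as any step fails.
    subprocess.run(["python", script], check=True)
```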
Dependencies:

- Python 3
- pandas
- numpy
- scipy
- scikit-learn
- imbalanced-learn
- joblib
- requests
- matplotlib
- bokeh
With Anaconda:

- Install Anaconda.
- Create a new Python 3 environment (`myenv` can be whatever you want): `conda create -n myenv python=3`
- Enter the new environment: `source activate myenv`
- Install dependencies (a quick import check to verify the installation is shown below): `conda install -c conda-forge pandas numpy scipy scikit-learn imbalanced-learn joblib requests matplotlib bokeh`
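This check is not part of the original setup steps, but a quick way to confirm the environment works is to import every dependency inside the activated environment:

```python
# Quick sanity check: every dependency imports and reports its version.
import pandas, numpy, scipy, sklearn, imblearn, joblib, requests
import matplotlib, bokeh

for mod in (pandas, numpy, scipy, sklearn, imblearn, joblib,
            requests, matplotlib, bokeh):
    print(mod.__name__, mod.__version__)
```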
To retrain the model:

- Change into the repository's main directory on your local machine.
- Run `./scripts/execute_retraining.sh` to train the model.
- If you wish to deploy the model, set up your own Heroku application to do so. If you do not wish to deploy the model, comment out the last few lines of `execute_retraining.sh`.
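If you skip deployment and keep the retrained model locally, you can load and query the saved estimator. A hypothetical sketch, assuming the pipeline persists a scikit-learn-style estimator with joblib (a listed dependency); the artifact name and feature matrix below are placeholders:

```python
# Hypothetical: load a locally retrained model and score new inspections.
# Assumes a scikit-learn-style estimator persisted with joblib;
# "model.joblib" is a placeholder filename and X_new is a random stand-in.
import joblib
import numpy as np

model = joblib.load("model.joblib")
X_new = np.random.default_rng(1).random((5, model.n_features_in_))
print(model.predict_proba(X_new)[:, 1])  # violation probabilities
```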
The tests run automatically during the data download/model retraining process. If a test fails, an `AssertionError` is written to the log file and the retrained model is not uploaded to Heroku. (An illustrative sketch of these checks follows the list below.)

- TEST 1: Check the filenames of all the downloaded data. This ensures that the code is able to locate the files it needs to access, and that no serious changes occur in the EPA data without our knowing. These tests are executed in `download_data.py`.
- TEST 2: Check the column names in all of the downloaded data. This ensures that the input data has a format the model is able to handle. These tests are executed in `download_data.py`.
- TEST 3: Check the percentage of violations that are linked to inspections. If this drops below 60%, the test fails. When I developed the model, the percentage hovered around 75%. This test is executed in `link_inspections_violations.py`.
- TEST 4: Check the model's performance against a baseline. An error is raised if the model does not meet the standard. This test is executed in `tests.py`.
- TEST 5: Check that the data file for the web app has the expected format: it must contain all the columns the web app needs and predictions for at least 150,000 facilities (it should have predictions for ~190,000). This test is executed in `tests.py`.
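As an illustration of the assertion style these checks use (function and column names below are hypothetical; the real checks live in `download_data.py`, `link_inspections_violations.py`, and `tests.py`):

```python
# Illustrative sketches of two of the checks above; names are hypothetical.
import pandas as pd

def check_linkage_rate(violations: pd.DataFrame, threshold: float = 0.60) -> None:
    # TEST 3: fail if too few violations are linked to an inspection.
    linked = violations["inspection_id"].notna().mean()
    assert linked >= threshold, f"Only {linked:.0%} of violations linked"

def check_webapp_file(predictions: pd.DataFrame) -> None:
    # TEST 5: required columns present and enough facilities covered.
    required = {"facility_id", "violation_probability"}
    assert required <= set(predictions.columns), "missing required columns"
    assert len(predictions) >= 150_000, f"only {len(predictions)} facilities"
```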
- Lucien Simpfendoerfer
This project is licensed under the MIT License; see the `LICENSE.md` file for details.
- Inspiration came from a recent paper by Hino et al. (2018).