The model takes in information about a given inspection under the Clean Air Act (CAA) and outputs the probability that the inspection will reveal a violation. The current output for all pollution sources regulated under the CAA is available through this web application, and comprehensive documentation for the project is available through the web application's documentation page. The model was created as part of my capstone project for The Data Incubator.
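At its core, this is a binary classification problem: each past inspection is labeled by whether it revealed a violation, and the positive-class probability is the published risk score. Below is a minimal sketch of that interface with random stand-in data; the real features come from the EPA data described below, and the actual estimator may differ.

```python
# Minimal sketch of the model's interface (not the actual pipeline):
# a binary classifier whose positive-class probability is the
# violation risk. Features and labels here are random stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((500, 2))         # stand-in features for inspections
y = (X[:, 1] > 0.7).astype(int)  # stand-in labels: 1 = violation found

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# The published output: probability that an inspection reveals a violation.
violation_prob = model.predict_proba(X)[:, 1]
```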
- `./scripts/execute_retraining.sh`: Script that runs the model retraining and pushes the result to Heroku (see the orchestration sketch after this list).
- `./scripts/prepare_to_retrain.py`: Script that prepares the directory for model retraining (mostly by preparing the log file).
- `./notebooks/download_data.ipynb`: Notebook that downloads all data for the model.
- `./scripts/download_data.py`: Script version of `download_data.ipynb`.
- `./notebooks/link_inspections_violations.ipynb`: Notebook that links inspections and violations.
- `./scripts/link_inspections_violations.py`: Script version of `link_inspections_violations.ipynb`.
- `./notebooks/prepare_nei_data.ipynb`: Notebook that prepares National Emissions Inventory (NEI) data for use in the model.
- `./scripts/prepare_nei_data.py`: Script version of `prepare_nei_data.ipynb`.
- `./notebooks/build_evaluate_model.ipynb`: Notebook that creates the training/test datasets, then trains and evaluates the model.
- `./scripts/build_evaluate_model.py`: Script version of `build_evaluate_model.ipynb`.
- `./scripts/tests.py`: Script that contains some unit tests.
- `./scripts/external_variables.py`: Contains a few variables that the system needs in order to run. A duplicate can be found at `./notebooks/external_variables.py`.
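The retraining pipeline presumably runs these scripts in sequence. Here is a hypothetical Python equivalent of that orchestration; the actual logic lives in `execute_retraining.sh`, and the exact ordering below is an assumption inferred from the file descriptions above.

```python
# Hypothetical sketch of the retraining pipeline's ordering, inferred
# from the file descriptions above; the real orchestration lives in
# ./scripts/execute_retraining.sh.
import subprocess

PIPELINE = [
    "scripts/prepare_to_retrain.py",
    "scripts/download_data.py",
    "scripts/link_inspections_violations.py",
    "scripts/prepare_nei_data.py",
    "scripts/build_evaluate_model.py",
    "scripts/tests.py",
]

for script in PIPELINE:
    # check=True stops the pipeline as soon as any step fails.
    subprocess.run(["python", script], check=True)
```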
Dependencies:

- Python 3
- pandas
- numpy
- scipy
- scikit-learn
- imbalanced-learn
- joblib
- requests
- matplotlib
- bokeh
With Anaconda:

- Install Anaconda.
- Create a new Python 3 environment (`myenv` can be whatever you want): `conda create -n myenv python=3`
- Enter the new environment: `source activate myenv`
- Install dependencies (a quick import check to verify the installation is shown below): `conda install -c conda-forge pandas numpy scipy scikit-learn imbalanced-learn joblib requests matplotlib bokeh`
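This check is not part of the original setup steps, but a quick way to confirm the environment works is to import every dependency inside the activated environment:

```python
# Quick sanity check: every dependency imports and reports its version.
import pandas, numpy, scipy, sklearn, imblearn, joblib, requests
import matplotlib, bokeh

for mod in (pandas, numpy, scipy, sklearn, imblearn, joblib,
            requests, matplotlib, bokeh):
    print(mod.__name__, mod.__version__)
```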
To retrain the model:

- Change into the repository's main directory on your local machine.
- Run `./scripts/execute_retraining.sh` to train the model.
- If you wish to deploy the model, set up your own Heroku application to do so. If you do not wish to deploy the model, comment out the last few lines of `execute_retraining.sh`.
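If you skip deployment and keep the retrained model locally, you can load and query the saved estimator. A hypothetical sketch, assuming the pipeline persists a scikit-learn-style estimator with joblib (a listed dependency); the artifact name and feature matrix below are placeholders:

```python
# Hypothetical: load a locally retrained model and score new inspections.
# Assumes a scikit-learn-style estimator persisted with joblib;
# "model.joblib" is a placeholder filename and X_new is a random stand-in.
import joblib
import numpy as np

model = joblib.load("model.joblib")
X_new = np.random.default_rng(1).random((5, model.n_features_in_))
print(model.predict_proba(X_new)[:, 1])  # violation probabilities
```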
The tests run automatically during the data download/model retraining process. If a test fails, an `AssertionError` is written to the log file and the retrained model is not uploaded to Heroku. (An illustrative sketch of these checks follows the list below.)

- TEST 1: Check the filenames of all the downloaded data. This ensures that the code is able to locate the files it needs to access, and that no serious changes occur in the EPA data without our knowing. These tests are executed in `download_data.py`.
- TEST 2: Check the column names in all of the downloaded data. This ensures that the input data has a format the model is able to handle. These tests are executed in `download_data.py`.
- TEST 3: Check the percentage of violations that are linked to inspections. If this drops below 60%, the test fails. When I developed the model, the percentage hovered around 75%. This test is executed in `link_inspections_violations.py`.
- TEST 4: Check the model's performance against a baseline. An error is raised if the model does not meet the standard. This test is executed in `tests.py`.
- TEST 5: Check that the data file for the web app has the expected format: it must contain all the columns the web app needs and predictions for at least 150,000 facilities (it should have predictions for ~190,000). This test is executed in `tests.py`.
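As an illustration of the assertion style these checks use (function and column names below are hypothetical; the real checks live in `download_data.py`, `link_inspections_violations.py`, and `tests.py`):

```python
# Illustrative sketches of two of the checks above; names are hypothetical.
import pandas as pd

def check_linkage_rate(violations: pd.DataFrame, threshold: float = 0.60) -> None:
    # TEST 3: fail if too few violations are linked to an inspection.
    linked = violations["inspection_id"].notna().mean()
    assert linked >= threshold, f"Only {linked:.0%} of violations linked"

def check_webapp_file(predictions: pd.DataFrame) -> None:
    # TEST 5: required columns present and enough facilities covered.
    required = {"facility_id", "violation_probability"}
    assert required <= set(predictions.columns), "missing required columns"
    assert len(predictions) >= 150_000, f"only {len(predictions)} facilities"
```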
- Lucien Simpfendoerfer
This project is licensed under the MIT License; see the `LICENSE.md` file for details.
- Inspiration came from a recent paper by Hino et al. (2018).