EPA Air Violations

The model takes in information about a given inspection under the Clean Air Act (CAA) and outputs the probability that the inspection will reveal a violation. Current predictions for all pollution sources regulated under the CAA are available through this web application. Comprehensive documentation for the project is available through the web application's documentation page. The model was created as part of my capstone project for The Data Incubator.
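For context, here is a minimal sketch of how a trained scikit-learn model like this one could be loaded and queried for a failure probability. The file name trained_model.joblib and the feature names are hypothetical placeholders, not paths or columns taken from this repository.

import joblib
import pandas as pd

# Load a previously trained scikit-learn pipeline (hypothetical file name).
model = joblib.load("trained_model.joblib")

# One hypothetical inspection record; the real feature names are defined in build_evaluate_model.
inspection = pd.DataFrame([{
    "state": "NY",
    "inspection_type": "full_compliance_evaluation",
    "days_since_last_violation": 420,
    "facility_emissions_tons": 12.5,
}])

# predict_proba returns [P(no violation), P(violation)] for each row.
prob_violation = model.predict_proba(inspection)[0, 1]
print(f"Estimated probability of a violation: {prob_violation:.2f}")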

Getting Started

What's in this repository?

  • ./scripts/execute_retraining.sh: Script that orchestrates model retraining and pushes the result to Heroku (a sketch of the retraining sequence follows this list).
  • ./scripts/prepare_to_retrain.py: Script that prepares the directory for model retraining (mostly prepares the log file).
  • ./notebooks/download_data.ipynb: Notebook that downloads all data for the model.
  • ./scripts/download_data.py: Script version of download_data.ipynb.
  • ./notebooks/link_inspections_violations.ipynb: Notebook that links inspections and violations.
  • ./scripts/link_inspections_violations.py: Script version of link_inspections_violations.ipynb.
  • ./notebooks/prepare_nei_data.ipynb: Notebook that prepares National Emissions Inventories data for use in the model.
  • ./scripts/prepare_nei_data.py: Script version of prepare_nei_data.ipynb.
  • ./notebooks/build_evaluate_model.ipynb: Notebook that creates the training/test datasets, and then trains and evaluates the model.
  • ./scripts/build_evaluate_model.py: Script version of build_evaluate_model.ipynb.
  • ./scripts/tests.py: Script that contains some unit tests.
  • ./scripts/external_variables.py: Contains a few variables that the system needs in order to run. A duplicate can be found at ./notebooks/external_variables.py.
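The retraining sequence driven by execute_retraining.sh presumably runs these scripts in roughly the order sketched below. This ordering is an assumption inferred from the file descriptions above, not a copy of the actual shell script.

import subprocess

# Assumed order of the retraining pipeline, inferred from the script descriptions.
steps = [
    "scripts/prepare_to_retrain.py",           # prepare the directory and log file
    "scripts/download_data.py",                # download EPA data (runs TESTS 1 and 2)
    "scripts/link_inspections_violations.py",  # link inspections to violations (TEST 3)
    "scripts/prepare_nei_data.py",             # prepare National Emissions Inventories data
    "scripts/build_evaluate_model.py",         # build datasets, train and evaluate the model
    "scripts/tests.py",                        # final checks (TESTS 4 and 5)
]

for step in steps:
    subprocess.run(["python", step], check=True)  # stop the pipeline if any step fails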

Prerequisites

  • Python 3
  • pandas
  • numpy
  • scipy
  • scikit-learn
  • imbalanced-learn
  • joblib
  • requests
  • matplotlib
  • bokeh

Installing and running

With anaconda:

  1. Install anaconda.
  2. Create a new Python 3 environment (myenv can be whatever you want):
conda create -n myenv python=3
  3. Enter the new environment:
source activate myenv
  4. Install the dependencies:
conda install -c conda-forge pandas numpy scipy scikit-learn imbalanced-learn joblib requests matplotlib bokeh
  5. Change into the repository's main directory on your local machine.
  6. Run ./scripts/execute_retraining.sh to train the model.
  7. If you wish to deploy the model, set up your own Heroku application to do so. If you do not wish to deploy it, comment out the last few lines of execute_retraining.sh.

Running the tests

The tests run automatically during the data download/model retraining process. If a test fails, an AssertionError is written to the log file and the retrained model is not uploaded to Heroku.
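In code, that assert-then-log pattern could look roughly like the sketch below. The log file name and the example check are hypothetical illustrations; the real log file is set up by prepare_to_retrain.py, and the real checks live in the scripts listed under the test descriptions.

import logging
import sys

# Hypothetical log file name; the real one is prepared by prepare_to_retrain.py.
logging.basicConfig(filename="retraining.log", level=logging.INFO)

def run_all_checks():
    # Stand-in for the real tests: each check raises AssertionError on failure.
    linked_fraction = 0.75  # hypothetical value; TEST 3 requires at least 0.60
    assert linked_fraction >= 0.60, f"Only {linked_fraction:.0%} of violations were linked"

try:
    run_all_checks()
    logging.info("All tests passed; the retrained model can be pushed to Heroku.")
except AssertionError as err:
    logging.error("Test failed: %s", err)
    sys.exit(1)  # a nonzero exit lets execute_retraining.sh skip the Heroku upload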

Test descriptions

  1. TEST 1: Check the filenames of all the downloaded data. This ensures that the code is able to locate the files it needs, and that no serious changes to the EPA data go unnoticed. These tests are executed in download_data.py.
  2. TEST 2: Check column names in all of the downloaded data. This ensures that the input data has a format that the model is able to deal with. These tests are executed in download_data.py.
  3. TEST 3: Check the percentage of violations that are linked to inspections. If this drops below 60%, the test fails. When I developed the model, the percentage was hovering around 75%. This test is executed in link_inspections_violations.py.
  4. TEST 4: Check the model's performance against a baseline. An error is raised if the model does not meet the standards. This test is executed in tests.py.
  5. TEST 5: Check that the data file for the web app is well formed: it must contain all the columns the web app needs and predictions for at least 150,000 facilities (it should have predictions for ~190,000). This test is executed in tests.py; a sketch of this kind of check follows this list.
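As an illustration of the kind of check TEST 5 describes, the sketch below verifies required columns and a minimum row count. The file name and column names are hypothetical; only the 150,000-facility threshold comes from the description above.

import pandas as pd

REQUIRED_COLUMNS = {"registry_id", "facility_name", "predicted_probability"}  # hypothetical names
MIN_FACILITIES = 150_000  # threshold from TEST 5

def check_webapp_file(path):
    df = pd.read_csv(path)
    missing = REQUIRED_COLUMNS - set(df.columns)
    assert not missing, f"Web-app data file is missing columns: {sorted(missing)}"
    assert len(df) >= MIN_FACILITIES, f"Only {len(df)} facilities have predictions"

check_webapp_file("webapp_predictions.csv")  # hypothetical file name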

Authors

  • Lucien Simpfendoerfer

License

This project is licensed under the MIT License - see the LICENSE.md file for details.

Acknowledgments
