CAPP 30254 1 (Spring 2020) Machine Learning for Public Policy Final Project

christi-liongson/covid_decarceration

COVID-19 Decarceration and Public Health

CAPP 30254 Machine Learning Final Project

Overview

The problem

According to data collected by the New York Times, many of the largest outbreaks of the coronavirus are in carceral sites - correctional institutions, prison systems and jails. The close quarters make it impossible for inmates and staff alike to follow physical distancing guidelines. Advocates across the country have rallied around a decarceration campaign, pressuring attorneys general and sheriffs’ offices to release prisoners for the sake of public health. At the same time, skeptics have raised questions about public safety and crime rates.

The question(s):

Can we predict the death rate or rate of infection in the prison population without decarceration? Can we predict recidivism rates, with respect to violent crime, if people are decarcerated? And what qualitative analysis can we contribute to this urgent conversation?

Installation:

To install required packages, run the following command in your command-line interface:

pip install -r requirements.txt

To view the Jupyter notebooks for data preprocessing and analysis, run the following commands from the main project folder. This opens Jupyter in your browser:

cd analysis
jupyter notebook

Directory:

Additional information on the files in each subfolder is listed below by policy area.

  • data: Raw and processed data (Public Safety and Public Health)
  • exploratory_analysis: Exploratory analysis on data (Public Safety and Public Health)
  • files: Database configuration and analysis (Public Safety)
  • analysis: Data cleaning, preprocessing, and analysis (Public Health)

North Carolina's Department of Public Safety (NCDPS) releases "all public information on all NC Department of Public Safety offenders convicted since 1972." Before running the following modules or notebooks, download all tables and store them as CSVs. (Note: this will require around 5 GB of storage). Run ./ncdoc_parallel.sh to store the data in the preprocessed/ directory. For more information, see ncdoc_data project by jtwalsh0.

Files (Public Safety)

  • config.py: contains CSV locations and constants e.g., seed, CSV names, etc.
  • main.py: builds, populates, and queries a SQLite3 database by calling the following modules
    • create_db.py: establishes a connection and creates tables
    • populate_db.py: inserts records into database tables
  • build_dataset.py: queries tables in database, constructs flags and additional features, and outputs a CSV. Also contains functions to prepare data for and conduct analysis. Calls the following modules:
    • query_db.py: executes SQL queries on the database
    • pipeline.py: contains functions to perform imputation, one-hot encoding, etc.
  • classification.py: runs classification models and outputs precision-recall curves and the most important features. Also contains a function to predict on active sentences using the best model.
  • query_and_build.ipynb: calls functions in build_dataset.py to output datasets as CSVs.
  • models_1994.ipynb and models_2008.ipynb: find the best model(s) and return evaluation metrics for data starting at 1994 and data starting at 2008, respectively
  • predict_active.ipynb: applies all of our models on all of our datasets (1994 and 2008, different target outcomes).
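
The build, populate, and query steps described above can be illustrated with a minimal sketch using Python's standard sqlite3 module. This is not the project's code: the `sentences` table, its columns, and the sample rows are hypothetical, and the function names only mirror the module names listed above.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sentences (
    offender_id TEXT NOT NULL,
    offense     TEXT,
    start_date  TEXT,
    end_date    TEXT
);
"""


def create_db(path=":memory:"):
    """Open a connection and create the (hypothetical) table."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def populate_db(conn, rows):
    """Insert an iterable of 4-tuples with a parameterized statement."""
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO sentences VALUES (?, ?, ?, ?)", rows)


def query_db(conn, sql, params=()):
    """Run a read query and return all rows as a list of tuples."""
    return conn.execute(sql, params).fetchall()


conn = create_db()
populate_db(conn, [
    ("A1", "LARCENY", "2001-03-01", "2002-03-01"),
    ("A1", "ASSAULT", "2003-06-15", "2005-06-15"),
    ("B2", "FRAUD", "2010-01-10", "2011-01-10"),
])
counts = query_db(
    conn,
    "SELECT offender_id, COUNT(*) FROM sentences "
    "GROUP BY offender_id ORDER BY offender_id",
)
```

Passing a file path instead of `:memory:` would persist the database to disk. With around 5 GB of source CSVs, batching the inserts would likely matter more than it does in this toy example.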

Data (Public Safety)

  • coding_offenses.xls: categorizes offense labels from the NCDPS by extent of harm on a scale from 1 to 5, where 1 is least likely and 5 is most likely to cause harm
  • dataset_main_active3.csv: pre-processed output from build_all() in build_dataset, where recidivism is defined as reincarceration within three years of release. The file is too large to be pushed to GitHub, but it can be recreated using the information above
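
To make the recidivism definition above concrete, here is a minimal sketch that flags a sentence when the same person's next sentence begins within three years of release. It is an illustration under stated assumptions, not the logic in build_dataset.py: the input format, the function names, and the choice to treat the three-year boundary as inclusive are all hypothetical.

```python
from datetime import date


def add_years(d, years):
    """Return the date `years` later, mapping Feb 29 to Feb 28 if needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def flag_recidivism(sentences, window_years=3):
    """Flag each sentence followed by reincarceration within the window.

    `sentences` is a list of (start, end) date tuples for ONE person
    (a hypothetical input format). Returns (start, end, recidivated)
    tuples sorted by start date.
    """
    ordered = sorted(sentences)
    flagged = []
    for i, (start, end) in enumerate(ordered):
        next_start = ordered[i + 1][0] if i + 1 < len(ordered) else None
        recidivated = (
            next_start is not None
            and end <= next_start <= add_years(end, window_years)
        )
        flagged.append((start, end, recidivated))
    return flagged


history = [
    (date(2001, 3, 1), date(2002, 3, 1)),
    (date(2003, 6, 15), date(2005, 6, 15)),
    (date(2010, 1, 10), date(2011, 1, 10)),
]
flags = [recidivated for _, _, recidivated in flag_recidivism(history)]
```

In this sketch the last sentence is always flagged False. A real dataset would also need to handle releases that have not yet had three full years of follow-up, which this example ignores.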

Analysis (Public Health)

  • clean_data.py: functions to transform data for machine learning. Functions include one-hot encoding and normalizing data.
  • prison_conditions_wrangle.py: functions to clean and wrangle data from the UCLA COVID in Prisons dataset and the Bureau of Justice Statistics
  • build_prison_conditions_df.py: functions to build dataframes on prison capacity, prison population numbers, COVID-19 related social distancing policies in prisons, and mitigation policies to address the adverse effects of isolation on prisoners.
  • ph_analysis.py: functions to run a series of machine learning models on the COVID in Prisons dataset. Functions include temporally splitting the data, running a temporal cross-validation grid search to tune hyperparameters, training and testing several models, and selecting and evaluating the best predictors of COVID-19 cases in prisons.
  • prison_data_processing.ipynb: a Jupyter notebook walking through the process of building the COVID in Prisons dataset and running the machine learning analysis.
  • ph_plotting.py: functions to plot related public health data and cross-validation
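
The temporal splitting mentioned above can be sketched as an expanding-window scheme, in which every test window lies strictly after its training window so that no future information leaks into training. This is only one common way to split time-indexed data and is not necessarily what ph_analysis.py does; the function names, the `as_of_date` key, and the dates are assumptions for illustration.

```python
from datetime import date, timedelta


def temporal_splits(start, end, test_days, n_splits):
    """Build expanding-window splits over the half-open range [start, end).

    Returns (train_start, train_end, test_start, test_end) tuples with
    exclusive end dates. Training always begins at `start`; each test
    window immediately follows its training window.
    """
    test_len = timedelta(days=test_days)
    splits = []
    for k in range(n_splits, 0, -1):
        test_end = end - (k - 1) * test_len
        test_start = test_end - test_len
        if test_start <= start:
            continue  # not enough history to train on
        splits.append((start, test_start, test_start, test_end))
    return splits


def split_rows(rows, split, date_key="as_of_date"):
    """Partition dict rows into train and test lists for one split."""
    train_start, train_end, test_start, test_end = split
    train = [r for r in rows if train_start <= r[date_key] < train_end]
    test = [r for r in rows if test_start <= r[date_key] < test_end]
    return train, test


# Hypothetical daily observations covering six weeks.
first_day = date(2020, 4, 1)
rows = [{"as_of_date": first_day + timedelta(days=i), "cases": i}
        for i in range(42)]

splits = temporal_splits(first_day, date(2020, 5, 13), test_days=7, n_splits=3)
train, test = split_rows(rows, splits[0])
```

A temporal cross-validation grid search would then fit each hyperparameter combination on every `train` set and score it on the matching `test` set, averaging the scores across splits.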

Data (Public Health)

Team:

Authors

We also want to acknowledge and thank the course staff of CAPP 30254 (Nick Feamster, Felipe Alamos, Tammy Glazer, Alec Macmillen, Erika Tyagi, and Jonathan Tan) for their feedback and encouragement.
