CAPP-30254 Final Project Repository
This research was conducted by Claire Herdeman, Alena Stern, and Justin Cohler for the Machine Learning and Public Policy course in the Masters of Science in Computational Analysis and Public Policy program at the University of Chicago. This research uses data from The Eviction Lab at Princeton University, a project directed by Matthew Desmond and designed by Ashley Gromis, Lavar Edmonds, James Hendrickson, Katie Krywokulski, Lillian Leung, and Adam Porton. The Eviction Lab is funded by the JPB, Gates, and Ford Foundations as well as the Chan Zuckerberg Initiative. More information is found at evictionlab.org. The authors owe a debt of gratitude to The Eviction Lab for providing the inspiration and data for this project.
Our repository is structured with the following directories:
src/
: contains code for this project, the key files are as follows:db_client.py
: contains database client utilities (including connecting, writing, copying, reading, and disconnecting from the database)db_init.py
: populates the database from raw data sources, generates features, and formats data for analysisml_utils.py
: contains machine learning utilities and functions to run analysisresults/
: contains the csv outputs of the overview of models, feature importances for each run, precision-recall graphs, and tree visualizations
data_resources/
: contains data README file and detailed methodology report from The Eviction Labresources/
: contains non-python resources used for classification and model executiontests/
: contains tests
The authors persisted extensive feature generation in an AWS RDS instance, along with the data provided from The Eviction Lab dataset. For access to the database, please contact the GitHub repository owner.
To connect to the database
-
Fill in the template in src/set_env.sh with your database credentials and host details.
-
Ensure the following environment variables are set (e.g.
EXPORT DB_HOST=dev.yourenv.com
):DB_HOST
DB_PORT
DB_USER
DB_PASSWORD
-
Alternatively, in
resources/
, create a file calledsecrets.json
with the following format:{ "DB_USER": "<USERNAME>", "DB_PASSWORD": "<PASSWORD>", "DB_HOST": "<HOST NAME>", "DB_PORT": "<HOST PORT>" }
We assume the use of Python 3.5+ for this project.
- Run
pip install -r requirements.txt
in the top level of the repository - Change directories into the
src
directory (cd src
) - Run Phases 1, 2, and 3 of analysis described in the project description document via
python ml_utils.py
- This will execute three different modeling phases on different combinations of classifiers, parameters, features, and output labels
- By default, the database cursor is set to only pull data from Cook County, IL to enable more speedy testing of the code.
- Run the
Census_Data_Cleaning
ipython notebook in the data directory to clean the raw census data for import - Run
db_init.py
to populate the database with the raw data and generate features - NOTE: the raw data files should be in the
src/data/raw
directory on your local machine
- Run
ml_utils.py
to perform our three phases of analysis and outputs a .csv file with the results from each phase- Because Linear SVC produces a different output than the other classifers, we chose to run this classifier separately in
ml_utils_svc.py
- Because Linear SVC produces a different output than the other classifers, we chose to run this classifier separately in
- Note: for phase 3, the script also produces feature importance .csv files, precision recall graph .png files, and decision tree .png files. You will need to create
src/results/csv
andsrc/results/images
directories on your local machine to capture this output. - Run the
Phase I Analysis
,Phase II Analysis
, andPhase III Analysis
ipython notebooks to replicate the model evaluation and feature importance analysis conducted by the authors at each stage
- To assess our models for bias in Phase III, the authors used the Data Science for Social Good Aequitas Bias Assessment and Audit Toolkit (http://aequitas.dssg.io/). To replicate this analysis, visit the url, click "Get Started" and upload the bias csv files to the
src/results/csv
repository.