Lee-Or Bentovim, Katherine Dumais, Andrew Dunn, Kathryn Link-Oberstar
Using a Kaggle dataset from the Inter-American Development Bank, we design a machine learning model to classify household-level poverty using a Proxy Means Test methodology. After cleaning the data and collapsing it to the household level, we use several oversampling techniques and cross-validation to improve model performance given the imbalance across poverty categories. After testing random forests, logistic regression, naive Bayes, and k-nearest neighbors with different combinations of hyperparameters, we select logistic regression as our best-performing model. We also test ensemble methods and explore a binary poverty categorization. Finally, we note the limitations of our approach and offer recommendations for further exploration.
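As a rough illustration of how oversampling and cross-validation can be combined, the sketch below duplicates minority-class rows at random inside each training fold and leaves the validation fold untouched. This is not the project's code: the function names (`random_oversample`, `k_fold_indices`, `oversampled_folds`), the toy labels, and the choice of plain random oversampling are all assumptions made for illustration, and the full report describes the techniques actually used.

```python
import random
from collections import Counter, defaultdict


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    present in `labels` is as large as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k folds."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def oversampled_folds(rows, labels, k=4, seed=0):
    """Yield (train_rows, train_labels, val_rows, val_labels) per fold.
    Only the training part is oversampled, so duplicated rows never
    appear in the validation fold."""
    for fold in k_fold_indices(len(rows), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(rows)) if i not in held_out]
        train_rows, train_labels = random_oversample(
            [rows[i] for i in train], [labels[i] for i in train], seed
        )
        yield (train_rows, train_labels,
               [rows[i] for i in fold], [labels[i] for i in fold])


if __name__ == "__main__":
    # Hypothetical toy data: 20 households with imbalanced category labels.
    toy_rows = list(range(20))
    toy_labels = [4] * 12 + [2] * 5 + [1] * 3
    for train_rows, train_labels, val_rows, val_labels in oversampled_folds(
        toy_rows, toy_labels
    ):
        print(Counter(train_labels), len(val_rows))
```

Resampling inside each fold, rather than once before splitting, keeps copies of a validation household out of the training data, which would otherwise inflate the cross-validated scores.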
We describe our complete approach and results in a full report.
Professor: Chenhao Tan
Teaching Assistant: Zander Meitus
Data Source: Inter-American Development Bank data publicly hosted on Kaggle.