Contributors: Jackie Glasheen, Kathryn Link-Oberstar, Jennifer Yeaton
We explore distribution drift in Divvy bike ridership data from 2014 through 2019. Using a random sample of one million trips (approximately 5% of the dataset), we examine trends in trip duration and find a notable shift in its distribution between 2014-2017 and 2018-2019, especially during the summer months. We train three machine learning models to predict trip duration: K-Nearest Neighbors (KNN), Random Forest, and a Multi-Layer Perceptron (MLP). We evaluate each model using Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE), and we adjust for recent trends to address the observed drift. The MLP performs best of the three, which suggests that it handles high-dimensional data well and adapts to non-linear patterns in the presence of distribution drift. Our findings highlight the importance of accounting for temporal changes in data distributions when developing predictive models.
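As a rough illustration of the evaluation described above, the sketch below computes MSE, MAE, and RMSE for a set of predictions and measures the gap between two samples of trip durations with a two-sample Kolmogorov-Smirnov statistic. It is a minimal sketch, not the project's code: the function names `regression_errors` and `ks_statistic` are hypothetical, and the choice of the Kolmogorov-Smirnov statistic as a drift measure is our assumption here, not something stated in the paper.

```python
import math
from bisect import bisect_right


def regression_errors(y_true, y_pred):
    """Return MSE, MAE, and RMSE for paired true and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(r * r for r in residuals) / len(residuals)
    mae = sum(abs(r) for r in residuals) / len(residuals)
    return {"mse": mse, "mae": mae, "rmse": math.sqrt(mse)}


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples (0 to 1).

    Assumption: used here as one simple way to quantify drift between
    two periods; the project may have measured drift differently.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    # The empirical CDFs only change at observed values, so it is
    # enough to compare them at every point in either sample.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )
```

In practice, `ks_statistic` could be applied to trip durations from an earlier period (for example 2014-2017) and a later one (2018-2019); a value near 0 indicates similar distributions, while a value near 1 indicates almost no overlap. `regression_errors` could then be computed separately on each period to see whether a model's error grows after the shift.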
Read our final paper HERE
This project was completed as part of coursework for Mathematical Foundations of Machine Learning (Computer Science 35300) at the University of Chicago.