Citibike Rides - NY bike rides analysis

The general idea is to merge a given citibike dataset with a weather dataset for the same time frame and attempt to extract relevant stats such as the impact of weather on bike rides.

Dataset sources

Merge/Join strategy

Both datasets' timestamps are converted to a common pre-defined format (%d-%m-%Y %H:%M), which drops the seconds. Because the weather data contains one entry per minute for the same time frame as the bike rides data, every bike ride has a matching weather entry.

Note: the weather data for every bike ride refers to the time the ride was started.
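The strategy above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`starttime`, `timestamp`, `rain`, `bikeid`) and the sample rows are assumptions made for the example.

```python
import pandas as pd

MINUTE_FORMAT = "%d-%m-%Y %H:%M"  # common key format, seconds dropped

# Hypothetical sample data; the real column names may differ.
rides = pd.DataFrame({
    "starttime": ["2016-03-01 08:15:42", "2016-03-01 08:17:05"],
    "bikeid": [101, 102],
})
weather = pd.DataFrame({
    "timestamp": [
        "2016-03-01 08:15:00",
        "2016-03-01 08:16:00",
        "2016-03-01 08:17:00",
    ],
    "rain": [0, 0, 1],
})

# Reduce both timestamps to the same minute-level string key.
rides["minute"] = pd.to_datetime(rides["starttime"]).dt.strftime(MINUTE_FORMAT)
weather["minute"] = pd.to_datetime(weather["timestamp"]).dt.strftime(MINUTE_FORMAT)

# Left join: each ride picks up the weather entry for its start minute.
merged = rides.merge(weather[["minute", "rain"]], on="minute", how="left")
```

A left join keeps every ride even if a weather minute were missing, in which case the weather columns for that ride would be empty rather than the ride being dropped.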

Analysis

The following aspects are analyzed:

Weather (rain in particular) impact on bike rides
Number of bike rides per day, hour and total amount
Gender distribution of customers/riders

Note: the rain impact calculations rely on a univariate and highly oversimplified analysis. It can give a general idea of the impact of such weather conditions, but it is by no means precise.
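As an illustration of the ride counts listed above, here is a small standard-library sketch. The timestamp format and the sample values are assumptions, and the project itself may well compute these with pandas instead.

```python
from collections import Counter
from datetime import datetime

# Hypothetical ride start times.
start_times = [
    "2016-03-01 08:15:42",
    "2016-03-01 08:47:10",
    "2016-03-01 17:05:00",
    "2016-03-02 08:30:00",
]
parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in start_times]

# Count rides per calendar day and per hour of the day.
rides_per_day = Counter(t.date().isoformat() for t in parsed)
rides_per_hour = Counter(t.hour for t in parsed)
total_rides = len(parsed)
```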

Getting Started

These instructions will get a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

Requirements for the software and other tools to build and run

Docker

OR alternatively:

Python3
pip3

Installing

Building the docker image

docker build -t citibike --rm .

OR alternatively: installing the pip requirements

python3 -m pip install --user -r requirements.txt

Running

Linux/MacOS:

docker run --name citibike --rm -v $(pwd):/home/citibike citibike

Windows (cmd.exe):

docker run --name citibike --rm -v %cd%:/home/citibike citibike

OR alternatively: running the python3 app directly:

python3 app.py

Example output

Most of the information included in the JSON report is pretty self-explanatory, with the exception perhaps of the rain percentages. The basic idea is to establish the % of rain time based on the weather data for the given period and compare it to the % of bike rides started in rainy weather conditions. In an ideal (and severely oversimplified) world, the impact of rain could be measured by the ratio of these two percentages.
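The comparison described above can be written out as a short sketch. The function name, the dictionary keys and the direction of the ratio (rainy-ride share divided by rain-time share) are assumptions for illustration; the actual report may name or compute these differently.

```python
def rain_impact(rainy_minutes, total_minutes, rainy_rides, total_rides):
    """Compare the share of rainy time with the share of rides started in rain."""
    rain_time_pct = 100 * rainy_minutes / total_minutes
    rainy_rides_pct = 100 * rainy_rides / total_rides
    # With no rain at all in the period the ratio is undefined.
    ratio = rainy_rides_pct / rain_time_pct if rain_time_pct else None
    return {
        "rain_time_pct": rain_time_pct,
        "rainy_rides_pct": rainy_rides_pct,
        "ratio": ratio,
    }


# Hypothetical numbers: it rained 10% of the time, but only 5% of rides
# started in the rain.
report = rain_impact(rainy_minutes=100, total_minutes=1000,
                     rainy_rides=50, total_rides=1000)
```

With these made-up numbers the ratio is 0.5: half as many rides started in the rain as would be expected if rain had no effect. A ratio close to 1 would suggest little impact.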

The JSON report as well as the generated charts are saved in the output/ directory.

Architecture thoughts

The most natural solution for scaling such a system could be to deploy multiple nodes that each get fed bike ride and weather datasets 1 pair at a time. The nodes can then independently perform the necessary operations for every dataset pair they receive: cleanup and merge the datasets, run the analysis and feed the partial report to a master node that aggregates all incoming data into a global report.
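A rough single-machine sketch of this idea, with a thread pool standing in for the worker nodes. The shape of a dataset pair, the report keys and the helper names are all hypothetical; a real deployment would need an actual transport between the nodes and the master.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def analyze_pair(pair):
    """Worker step: stands in for cleanup, merge and analysis of one pair."""
    ride_minutes, rainy_minutes = pair  # hypothetical, heavily simplified shape
    rainy_rides = sum(1 for minute in ride_minutes if minute in rainy_minutes)
    return Counter(total_rides=len(ride_minutes), rainy_rides=rainy_rides)


def aggregate(partial_reports):
    """Master step: sum the additive counters of all partial reports."""
    total = Counter()
    for partial in partial_reports:
        total.update(partial)
    return total


pairs = [
    (["08:15", "08:16", "09:00"], {"08:16"}),
    (["10:00", "10:01"], {"10:00", "10:01"}),
]
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_reports = list(pool.map(analyze_pair, pairs))

global_report = aggregate(partial_reports)
```

Only additive quantities (counts, sums) can be merged this way. Percentages and ratios would have to be recomputed on the master from the aggregated totals rather than averaged across partial reports.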

Possible improvements

Increase the depth of analysis. As it currently stands, the only weather information taken into account is the conditions at the time a given ride starts. It could be useful to also look at the conditions at the end of (and perhaps during) the ride. The time of day can also have a significant impact on the outcome, since weather conditions at night are far less likely to affect bike rides (the majority of which take place during the day).

It is also worth mentioning that "clear weather conditions" here really means non-rainy weather conditions, so snowy weather currently counts as clear weather. This might be worth updating.
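One possible direction for both points, as a sketch only: label each weather entry as snow, rain or clear, and look up the label at both ends of a ride. The function names and the `(rain, snow)` tuple layout are assumptions, not part of the current code.

```python
from datetime import datetime

MINUTE_FORMAT = "%d-%m-%Y %H:%M"  # same key format as the merge step


def classify(rain, snow):
    """Label one weather entry so that snow no longer counts as clear."""
    if snow:
        return "snow"
    if rain:
        return "rain"
    return "clear"


def ride_conditions(start, end, weather_by_minute):
    """Look up the conditions at both ends of a ride.

    weather_by_minute maps a minute key to a hypothetical (rain, snow) tuple.
    """
    at_start = classify(*weather_by_minute[start.strftime(MINUTE_FORMAT)])
    at_end = classify(*weather_by_minute[end.strftime(MINUTE_FORMAT)])
    # The start hour is kept so rides can later be split by time of day.
    return {"start": at_start, "end": at_end, "hour": start.hour}


weather_by_minute = {
    "01-03-2016 08:15": (False, False),
    "01-03-2016 08:40": (False, True),
}
conditions = ride_conditions(
    datetime(2016, 3, 1, 8, 15, 42),
    datetime(2016, 3, 1, 8, 40, 3),
    weather_by_minute,
)
```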

Input flexibility: when launched, the program analyzes the 3 (hardcoded) example dataset pairs found in the datasets/ folder and exits. It would be useful to allow for variable input.
Chart generation: the plt.savefig call does not behave consistently and sometimes generates strange/empty charts. Should probably get to the bottom of this.

There simply aren't enough "rainy rides" in the example datasets to justify the charts that the program currently generates, so it might be a good idea to find a more relevant way of visualizing the information.

Add unit tests: yikes
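For the input flexibility point above, one option would be a small command-line interface built with argparse. The flag names `--pair` and `--output-dir` and the file names are hypothetical; app.py does not currently accept any arguments.

```python
import argparse


def parse_args(argv=None):
    """Parse dataset pairs from the command line instead of hardcoding them."""
    parser = argparse.ArgumentParser(
        description="Analyze bike rides against weather data."
    )
    parser.add_argument(
        "--pair",
        nargs=2,
        action="append",
        required=True,
        metavar=("RIDES_CSV", "WEATHER_CSV"),
        help="a rides file and its matching weather file; may be repeated",
    )
    parser.add_argument(
        "--output-dir",
        default="output",
        help="directory for the JSON report and charts",
    )
    return parser.parse_args(argv)


# Passing an explicit list makes the parser easy to try out and to test.
args = parse_args([
    "--pair", "rides_a.csv", "weather_a.csv",
    "--pair", "rides_b.csv", "weather_b.csv",
])
```

Accepting `argv` as a parameter also makes this function a natural first candidate for a unit test.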

Built With

Python3 - Code
pandas - Data manipulation and analysis
matplotlib - Chart visualization
Docker - Building and running

Authors

Theodor Dimov - Initial work - GitHub

