MBTA Bus Performance: Data Capture and Analysis
Documentation for the MBTA Bus Performance project lives on GitHub at:
- MBTA_Bus_Project_Proposal.md.
- Formal papers and data involving the MBTA reside in the MBTA_docs directory.
- Items that are less formal in nature (meeting minutes, random notes, project planning) are kept on the MBTA Bus Performance Wiki.
This project contains two main components: data analysis with Apache Pig, and a UI tool to display the analysis results. This tutorial introduces how to deploy the analysis on a Linux system that ships with Pig (we will use the Hortonworks Data Platform) and how to run it. To deploy the website tool on an Apache HTTPD server, please refer to mbta-busses-website.
Some background on the platform we will use in this tutorial:
What is Hortonworks:
Hortonworks is a business computer software company based in Palo Alto, California. The company focuses on the development and support of Apache Hadoop, a framework that allows for the distributed processing of large data sets across clusters of computers[1]. It is not necessary to use the Hortonworks Data Platform to deploy this project; we use it because it is an off-the-shelf Hadoop and Pig environment.
What is Pig:
Pig is a high-level scripting language that is used with Apache Hadoop. Pig excels at describing data analysis problems as data flows. Pig is complete in that you can do all the required data manipulations in Apache Hadoop with Pig. In addition, through the User Defined Functions (UDF) facility in Pig, you can have Pig invoke code in many languages such as JRuby, Jython, and Java. Conversely, you can execute Pig scripts from within other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems[2].
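To give a flavor of Pig Latin, here is a minimal, hypothetical sketch of the kind of aggregation this project performs. The column names and schema below are assumptions for illustration, not the project's actual scripts:

```pig
-- Hypothetical sketch: average headway per route from a CSV of arrivals.
-- The schema (route, stop, arrival_time, headway) is an assumption.
records = LOAD '20140301.csv' USING PigStorage(',')
          AS (route:chararray, stop:chararray, arrival_time:chararray, headway:double);

-- Group all records by route, then compute the mean headway per group.
by_route    = GROUP records BY route;
avg_headway = FOREACH by_route GENERATE group AS route, AVG(records.headway) AS mean_headway;

-- Write the result out as tab-separated values.
STORE avg_headway INTO 'result/avg_headway' USING PigStorage('\t');
```

The real scripts in the repository add time- and date-range filtering on top of this pattern.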
Here is a nice official tutorial.
Once the Hortonworks Data Platform (HDP) is set up, we can clone the source code from our repository onto the Linux machine. First, SSH from the host into the HDP sandbox using the command below (the password is "hadoop"):
$ ssh [email protected] -p 2222
Install the Git client:
$ sudo yum install git
Clone mbta-busses repository:
$ git clone https://github.com/BU-EC500-SP15/mbta-busses.git
Verify the repository by opening the mbta-busses directory:
$ cd mbta-busses
From the mbta-busses directory, copy the one-day data set from the repository into HDFS:
$ hadoop fs -copyFromLocal ./DataSet/20140301.csv /
Verify the file in HDFS:
$ hadoop fs -ls /
By default you should see the files in HDFS as below:
Found 11 items
-rw-r--r-- 1 root hdfs 4300707 2015-05-03 00:08 /20140301.csv
drwxrwxrwx - yarn hadoop 0 2014-12-16 19:05 /app-logs
drwxr-xr-x - hdfs hdfs 0 2014-12-16 19:11 /apps
drwxr-xr-x - hdfs hdfs 0 2014-12-16 19:41 /demo
drwxr-xr-x - hdfs hdfs 0 2014-12-16 19:06 /hdp
drwxr-xr-x - mapred hdfs 0 2014-12-16 19:05 /mapred
drwxr-xr-x - hdfs hdfs 0 2014-12-16 19:05 /mr-history
drwxr-xr-x - hdfs hdfs 0 2014-12-16 19:31 /ranger
drwxr-xr-x - hdfs hdfs 0 2014-12-16 19:07 /system
drwxrwxrwx - hdfs hdfs 0 2015-05-02 19:07 /tmp
drwxr-xr-x - hdfs hdfs 0 2015-05-02 19:07 /user
We can then process the data with Pig scripts for different metrics, and we provide shell scripts that trigger each Pig script individually. Given a few parameters, these shell scripts generate analysis results for a specific time period. Below are examples of running these scripts.
Run all the metrics for the full 2 years (this takes roughly 8 hours, so you may want to run it overnight via crontab):
$ nohup sh RunAll.sh &
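To schedule the full run overnight via crontab, as suggested above, an entry along these lines works. The repository path and log file name here are assumptions; edit your crontab with `crontab -e`:

```
# Hypothetical crontab entry: start the full analysis at 1 AM (paths are assumptions).
0 1 * * * cd /root/mbta-busses && sh RunAll.sh >> runall.log 2>&1
```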
Alternatively, you can run each 2-year metrics analysis individually:
$ sh MonthlyHeadway.sh --> Performance of Headways
$ sh MonthlyAvgWaitTime.sh --> Performance of AvgWaitTime
$ sh RunTime.sh --> Performance of RunTime
$ sh DiffTime.sh --> Performance of Difference Between Scheduled and Actual
To compute metrics for a specific time period, you must pass the parameters "your data file on Hadoop", "begin time", "end time", "begin date", and "end date" to these shell scripts. Here is an example of calling Headway.sh; the other scripts follow the same pattern:
$ sh Headway.sh "your data file on Hadoop" "begin time" "end time" "begin date" "end date" (times are in minutes)
e.g. $ sh Headway.sh oneyear.csv 0 1440 2013-01-01 2013-02-01
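The parameter handling in these scripts can be sketched as a plain-shell wrapper. The function name, variable names, and the commented-out `pig` invocation below are assumptions about how the project's scripts are wired, not the actual contents of Headway.sh:

```shell
# Hypothetical sketch of the parameter validation a script like Headway.sh might do.
# Times are minutes since midnight (0-1440); dates are YYYY-MM-DD.
headway() {
    if [ $# -ne 5 ]; then
        echo "Usage: headway <data.csv> <begin_min> <end_min> <begin_date> <end_date>" >&2
        return 1
    fi
    data=$1; t0=$2; t1=$3; d0=$4; d1=$5
    if [ "$t0" -lt 0 ] || [ "$t1" -gt 1440 ]; then
        echo "times must be within 0-1440 minutes" >&2
        return 1
    fi
    echo "analyzing $data from $d0 +${t0}min to $d1 +${t1}min"
    # The real script would invoke Pig here, e.g.:
    # pig -param input="$data" -param t0="$t0" -param t1="$t1" \
    #     -param d0="$d0" -param d1="$d1" Headway.pig
}

headway oneyear.csv 0 1440 2013-01-01 2013-02-01
```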
All the results are stored as TSV files in a result directory, and we can publish them to our website to render the charts dynamically.
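Note that Pig typically writes each output as a directory of `part-*` files rather than a single file. A minimal local sketch of collecting them into one TSV is below; the directory name, file names, and data are made up for illustration, and on HDFS the equivalent one-step command is `hadoop fs -getmerge <dir> <file>`:

```shell
# Create a fake Pig output directory with two part files (illustrative data only).
mkdir -p result/Headway
printf '57\t2013-01-05\t8.5\n' > result/Headway/part-r-00000
printf '57\t2013-01-06\t9.0\n' > result/Headway/part-r-00001

# Concatenate the parts into a single TSV the website can load.
cat result/Headway/part-* > result/Headway.tsv
wc -l < result/Headway.tsv   # number of rows in the merged file
```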