estwanick/MarketAnalysis

Project Description

Use Hadoop MapReduce to predict the monthly stock volatility of a set of stocks

Instructions for running

  • Set up an EMR cluster

  • In the advanced options for EMR, select a bootstrap script

    • Under Additional Options, select Bootstrap Actions
      • For Add bootstrap action, select Custom action
      • Add this to the script location: s3://monthly-volatility-mapreduce/boostrap.sh
      • This installs SciPy, NumPy, and scikit-learn on all nodes in the cluster
  • After the cluster is set up

    • Under Steps, use the following settings
      • Step type: Custom JAR
      • Name: Stage(n) MapReduce
      • JAR Location: command-runner.jar
      • Arguments: copy and paste one of the commands below; run them in order (Stage 1, Stage 2, Stage 3)
      • Action on Failure: Continue
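The step settings above can also be expressed as a step definition for the EMR API. The following is a minimal sketch, not part of this repository: the helper name `build_streaming_step` is hypothetical, and the dictionary follows the shape that boto3's `add_job_flow_steps` accepts.

```python
import shlex


def build_streaming_step(stage, command):
    """Build an EMR step definition for one stage (hypothetical helper).

    `command` is one of the hadoop-streaming command lines from this README.
    """
    return {
        "Name": f"Stage {stage} MapReduce",
        "ActionOnFailure": "CONTINUE",      # matches "Action on Failure: Continue"
        "HadoopJarStep": {
            "Jar": "command-runner.jar",    # matches "JAR Location"
            "Args": shlex.split(command),   # the command, split into arguments
        },
    }


# Submitting the step would need AWS credentials and a running cluster, e.g.:
#   import boto3
#   boto3.client("emr").add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
# where cluster_id is a placeholder for your own cluster's ID.
```

Steps in one cluster run in the order they were added, so adding Stage 1, Stage 2, and Stage 3 in sequence preserves the required order.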

Stage 1

hadoop-streaming -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator -D stream.map.output.field.separator=, -D map.output.key.field.separator=, -files s3://monthly-volatility-mapreduce/stage1/map.py,s3://monthly-volatility-mapreduce/stage1/reduce.py -mapper map.py -reducer reduce.py -input s3://monthly-volatility-data/tech50 -output s3://monthly-volatility-output/stage1results
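For readers new to Hadoop Streaming: the mapper and reducer are ordinary scripts that read lines from standard input and write lines to standard output. The sketch below is illustrative only and is not the repository's `map.py` or `reduce.py`. It assumes a hypothetical input format of `TICKER,YYYY-MM-DD,close` and uses one common definition of monthly volatility (the sample standard deviation of daily returns within a month); the function names are made up for this example.

```python
import math
from collections import defaultdict


def map_line(line):
    """Map 'TICKER,YYYY-MM-DD,close' to 'TICKER,YYYY-MM,DD,close'.

    The ticker comes first so that it becomes the streaming key.
    """
    ticker, date, close = line.strip().split(",")
    return f"{ticker},{date[:7]},{date[8:]},{close}"


def monthly_volatility(mapped_lines):
    """Yield 'TICKER,YYYY-MM,volatility' for each ticker and month.

    Buffers everything in memory, which is fine for a sketch only.
    """
    closes = defaultdict(list)
    for line in mapped_lines:
        ticker, month, day, close = line.strip().split(",")
        closes[(ticker, month)].append((day, float(close)))
    for (ticker, month), days in sorted(closes.items()):
        prices = [price for _, price in sorted(days)]  # order by day of month
        returns = [(b - a) / a for a, b in zip(prices, prices[1:])]
        if len(returns) < 2:
            continue  # not enough data for a sample standard deviation
        mean = sum(returns) / len(returns)
        variance = sum((r - mean) ** 2 for r in returns) / (len(returns) - 1)
        yield f"{ticker},{month},{math.sqrt(variance)}"
```

In a real streaming job, each function would sit in its own script, loop over `sys.stdin`, and `print` each result.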

Stage 2

hadoop-streaming -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator -D stream.map.output.field.separator=, -D stream.num.map.output.key.fields=1 -D map.output.key.field.separator=, -files s3://monthly-volatility-mapreduce/stage2/map.py,s3://monthly-volatility-mapreduce/stage2/reduce.py -mapper map.py -reducer reduce.py -input s3://monthly-volatility-output/stage1results -output s3://monthly-volatility-output/stage2results

Stage 3

hadoop-streaming -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator -D stream.map.output.field.separator=, -D stream.num.map.output.key.fields=1 -D map.output.key.field.separator=, -files s3://monthly-volatility-mapreduce/stage3/map.py,s3://monthly-volatility-mapreduce/stage3/reduce.py -mapper map.py -reducer reduce.py -input s3://monthly-volatility-output/stage2results -output s3://monthly-volatility-output/stage3results
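The `-D` options in the commands above control how Hadoop Streaming splits each mapper output line into a key and a value. With `stream.map.output.field.separator=,` and `stream.num.map.output.key.fields=1`, the text before the first comma is the key and the rest is the value. The function below is a small sketch that mimics that rule for illustration; the function name and sample line are made up, and this is not Hadoop's own code.

```python
def split_key_value(line, separator=",", num_key_fields=1):
    """Mimic Hadoop Streaming's key/value split of a mapper output line.

    The first `num_key_fields` fields form the key and the remainder is the
    value. If the line has too few separators, the whole line is the key and
    the value is empty.
    """
    parts = line.split(separator)
    if len(parts) <= num_key_fields:
        return line, ""
    key = separator.join(parts[:num_key_fields])
    value = separator.join(parts[num_key_fields:])
    return key, value


# Hypothetical mapper output line, for illustration only.
key, value = split_key_value("AAPL,2015-01,0.02")
```

Here `key` is `"AAPL"` and `value` is `"2015-01,0.02"`, so all records for one ticker reach the same reducer.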

Results

Cluster with 1 Master node and 4 Slave Nodes on Top 50 technology companies

- Stage 1 
    - Execution time 1 minute
- Stage 2
    - Execution time 1 minute
- Stage 3
    - Execution time 1 minute

Cluster with 1 Master node and 8 Slave Nodes on Top 50 technology companies

- Stage 1
    - Execution time 1 minute
- Stage 2
    - Execution time 1 minute
- Stage 3
    - Execution time 1 minute

Cluster with 1 Master node and 4 Slave Nodes on Top 100 technology companies

- Stage 1 
    - Execution time 2 minutes
- Stage 2
    - Execution time 1 minute
- Stage 3
    - Execution time 1 minute

Cluster with 1 Master node and 8 Slave Nodes on Top 100 technology companies

- Stage 1 
    - Execution time 1 minute
- Stage 2
    - Execution time 1 minute 
- Stage 3
    - Execution time less than 1 minute

Cluster with 1 Master node and 4 Slave Nodes on Top 400+ technology companies

- Stage 1 
    - Execution time 7 minutes
- Stage 2
    - Execution time 1 minute
- Stage 3
    - Execution time 1 minute 

Cluster with 1 Master node and 8 Slave Nodes on Top 400+ technology companies

- Stage 1 
    - Execution time 4 minutes
- Stage 2
    - Execution time 1 minute
- Stage 3
    - Execution time 48 seconds
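To put the timings above in perspective, the sketch below computes speedup and scaling efficiency for Stage 1 on the 400+ company data set (7 minutes on 4 nodes, 4 minutes on 8 nodes). The helper names are made up for this example, and because the timings are reported to the nearest minute, the results are rough.

```python
def speedup(baseline_minutes, scaled_minutes):
    """How many times faster the larger cluster finished."""
    return baseline_minutes / scaled_minutes


def efficiency(baseline_minutes, scaled_minutes, baseline_nodes, scaled_nodes):
    """Speedup divided by the factor by which the node count grew."""
    return speedup(baseline_minutes, scaled_minutes) / (scaled_nodes / baseline_nodes)


# Stage 1, 400+ companies: 4 nodes took 7 minutes, 8 nodes took 4 minutes.
stage1_speedup = speedup(7, 4)                # 1.75
stage1_efficiency = efficiency(7, 4, 4, 8)    # 0.875
```

Doubling the nodes made Stage 1 about 1.75 times faster. Stages 2 and 3 already finish in about a minute, so the extra nodes make little measurable difference there.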

Fault Tolerance

Description of test 1

  • Cluster with 1 Master node and 4 Slave Nodes on Top 400+ technology companies
  • Run Stage 1
  • 2 nodes fail
  • EMR replaces with 2 new nodes

Results

  • Execution time 19 minutes
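For comparison, the same configuration (4 nodes, 400+ companies, Stage 1) took 7 minutes without failures in the Results section. A quick calculation of the recovery cost, with variable names chosen for this example:

```python
# Stage 1 on 4 nodes, 400+ companies, as reported in this README.
normal_minutes = 7      # run without failures
failure_minutes = 19    # run in which 2 nodes failed and were replaced

extra_minutes = failure_minutes - normal_minutes             # 12
slowdown = round(failure_minutes / normal_minutes, 2)        # 2.71
```

The job still completed, but it took roughly 2.7 times as long as the run without failures.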

Description of test 2

  • Cluster with 1 Master node and 4 Slave Nodes on Top 400+ technology companies
  • Run Stage 1
  • All data nodes fail (4 nodes)

Results

  • (Job failed) All slaves in the job flow were terminated

Description of test 3

  • Cluster with 1 Master node and 4 Slave Nodes on Top 400+ technology companies
  • Run Stage 1
  • Name node fails

Results

  • (Job failed) The master node was terminated and cannot recover

Other information (S3 Buckets are no longer public)

S3 Data Resources

Amazon EMR Settings

About

Predict Monthly Volatility for 400+ Stocks
