Skip to content

manishamde/simplesparkapp

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Simple Spark Application

A simple Spark application that counts the occurrence of each word in a corpus and then counts the occurrence of each character in the most popular words. Includes the same program implemented in Java and Scala.

To make a jar:

mvn package

To run from a gateway node in a CDH5 cluster:

source /etc/spark/conf/spark-env.sh

export JAVA_HOME=/usr/java/jdk1.7.0_45-cloudera

# system jars:
CLASSPATH=/etc/hadoop/conf
CLASSPATH=$CLASSPATH:$HADOOP_HOME/*:$HADOOP_HOME/lib/*
CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-mapreduce/*:$HADOOP_HOME/../hadoop-mapreduce/lib/*
CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-yarn/*:$HADOOP_HOME/../hadoop-yarn/lib/*
CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-hdfs/*:$HADOOP_HOME/../hadoop-hdfs/lib/*
CLASSPATH=$CLASSPATH:$SPARK_HOME/assembly/lib/*

# app jar:
CLASSPATH=$CLASSPATH:target/sparkwordcount-0.0.1-SNAPSHOT.jar

$JAVA_HOME/bin/java -cp $CLASSPATH -Dspark.master=local com.cloudera.sparkwordcount.SparkWordCount <input file> 2

This will run the application in a single local process. If the cluster is running a Spark standalone cluster manager, you can replace "-Dspark.master=local" with "-Dspark.master=spark://<master host>:<master port>".

If the cluster is running YARN, you can replace "-Dspark.master=local" with "-Dspark.master=yarn-client".

Future releases will include a "spark-submit" script that abstracts away the pain of building the classpath and specifying the cluster manager.

About

Simple Spark Application

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published