Simple Spark Application

A simple Spark application that counts the occurrence of each word in a corpus and then counts the occurrence of each character in the most popular words. Includes the same program implemented in Java and Scala.

To make a jar:

mvn package

To run from a gateway node in a CDH5 cluster:

source /etc/spark/conf/spark-env.sh

export JAVA_HOME=/usr/java/jdk1.7.0_45-cloudera

# system jars:
CLASSPATH=/etc/hadoop/conf
CLASSPATH=$CLASSPATH:$HADOOP_HOME/*:$HADOOP_HOME/lib/*
CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-mapreduce/*:$HADOOP_HOME/../hadoop-mapreduce/lib/*
CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-yarn/*:$HADOOP_HOME/../hadoop-yarn/lib/*
CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-hdfs/*:$HADOOP_HOME/../hadoop-hdfs/lib/*
CLASSPATH=$CLASSPATH:$SPARK_HOME/assembly/lib/*

# app jar:
CLASSPATH=$CLASSPATH:target/sparkwordcount-0.0.1-SNAPSHOT.jar

$JAVA_HOME/bin/java -cp $CLASSPATH -Dspark.master=local com.cloudera.sparkwordcount.SparkWordCount <input file> 2

This will run the application in a single local process. If the cluster is running a Spark standalone cluster manager, you can replace "-Dspark.master=local" with "-Dspark.master=spark://<master host>:<master port>".

If the cluster is running YARN, you can replace "-Dspark.master=local" with "-Dspark.master=yarn-client".

Future releases will include a "spark-submit" script that abstracts away the pain of building the classpath and specifying the cluster manager.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
data		data
src/main		src/main
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Simple Spark Application

About

Releases

Packages

manishamde/simplesparkapp

Folders and files

Latest commit

History

Repository files navigation

Simple Spark Application

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages