Apache Spark Streaming LDA

This application analyzes Twitter data in real time with Spark Streaming and LDA, and stores the results in MySQL.

Set up Kafka

Download and install Kafka 0.8.1.1 built for Scala 2.10 (the kafka_2.10-0.8.1.1 release).

Start Zookeeper
cd /path/to/kafka
bin/zookeeper-server-start.sh config/zookeeper.properties

Start Kafka
bin/kafka-server-start.sh config/server.properties

Create Kafka topic
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic tweets
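
To confirm that the topic exists before starting the producer, a small check can be piped onto the topic listing. This is only a sketch: topic_exists is a hypothetical helper name, and it assumes that kafka-topics.sh --list prints one topic name per line.

```shell
# Hypothetical helper: succeeds if the topic named in $1 appears on stdin
# as a whole line (assumes one topic name per line in the listing).
topic_exists() {
    grep -Fxq -- "$1"
}

# Usage, run from the Kafka directory:
#   bin/kafka-topics.sh --list --zookeeper localhost:2181 | topic_exists tweets \
#       && echo "topic tweets is ready"
```

Matching the whole line with -x keeps a topic such as tweets-old from being mistaken for tweets.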

Twitter Kafka pipeline

Create a Twitter account, then create a Twitter app at apps.twitter.com to obtain your consumer key and access token.

Set up a Twitter-to-Kafka pipeline to collect tweets using:
https://github.com/NFLabs/kafka-twitter

Run Kafka Twitter Producer with:
cd kafka-twitter-master
./gradlew run -Pargs="conf/producer.conf"

MySQL

Install MySQL, log in, create a user for the app, and grant it permissions on the database.
Then select the database:

mysql> USE default;
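
The user creation and grant step mentioned above could look like the following sketch. The user name sparkapp, the host localhost, and the password are placeholders chosen for illustration, not values the app requires; the database name default is taken from the USE statement. The helper (a hypothetical name) only prints SQL, so you can review it before piping it into mysql.

```shell
# Hypothetical helper: print SQL that creates an app user and grants it
# access to one database. $1 = user, $2 = password, $3 = database.
# Assumes the values contain no single quotes or backticks.
mysql_user_sql() {
    cat <<EOF
CREATE USER '$1'@'localhost' IDENTIFIED BY '$2';
GRANT ALL PRIVILEGES ON \`$3\`.* TO '$1'@'localhost';
EOF
}

# Usage (placeholders, adjust to your setup):
#   mysql_user_sql sparkapp 'change-me' default | mysql -u root -p
```

The database name is wrapped in backticks in the GRANT statement because DEFAULT is a reserved word in MySQL. Granting all privileges is the simplest choice for a sketch; a narrower grant may suit your setup better.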

Spark creates the tables if it does not find them in the given database. If you prefer, you can create them yourself with these commands:

mysql> CREATE TABLE LDAResults(
-> peak_at TIMESTAMP,
-> LDA TEXT,
-> hashtags TEXT );
mysql> CREATE TABLE Tweets(
-> username TEXT,
-> created_at TIMESTAMP,
-> text TEXT,
-> hashtags TEXT,
-> lang TEXT,
-> `partition` TIMESTAMP );

The backticks around partition are needed on MySQL 5.6 and later, where PARTITION is a reserved word.

To keep the database from growing without bound, I suggest scheduling a cleanup event that runs every day and deletes data older than x days.
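
One way to set up such a cleanup is with the MySQL event scheduler. The sketch below prints the SQL for two daily events; the helper name, the event names purge_old_tweets and purge_old_lda_results, and the choice of created_at and peak_at as the age columns are assumptions, not part of the app. Turning the scheduler on with SET GLOBAL requires an account with sufficient privileges.

```shell
# Hypothetical helper: print SQL for daily MySQL events that delete rows
# older than $1 days from the Tweets and LDAResults tables.
cleanup_event_sql() {
    # Accept only a non-empty string of digits as the retention period.
    case "$1" in
        ''|*[!0-9]*) echo "usage: cleanup_event_sql DAYS" >&2; return 1 ;;
    esac
    cat <<EOF
SET GLOBAL event_scheduler = ON;
CREATE EVENT IF NOT EXISTS purge_old_tweets
ON SCHEDULE EVERY 1 DAY
DO DELETE FROM Tweets WHERE created_at < NOW() - INTERVAL $1 DAY;
CREATE EVENT IF NOT EXISTS purge_old_lda_results
ON SCHEDULE EVERY 1 DAY
DO DELETE FROM LDAResults WHERE peak_at < NOW() - INTERVAL $1 DAY;
EOF
}

# Usage (database name and account are placeholders):
#   cleanup_event_sql 7 | mysql -u root -p default
```

Each event body is a single statement, so no DELIMITER change is needed when the SQL is piped into the mysql client.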

Set up Apache Spark and the App

Download and build Apache Spark; this app was developed against version 1.4.0. Download mysql-connector-java.jar (version 5.1.36 was used) and add it to the Spark classpath.
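
One possible way to put the connector on the classpath is through conf/spark-defaults.conf. The two property names below are standard Spark settings, but whether the app was originally run this way is not stated here, so treat this as a sketch; connector_conf is a hypothetical helper and the jar path is a placeholder.

```shell
# Hypothetical helper: print spark-defaults.conf entries that put the
# connector jar at path $1 on the driver and executor classpaths.
connector_conf() {
    printf 'spark.driver.extraClassPath   %s\n' "$1"
    printf 'spark.executor.extraClassPath %s\n' "$1"
}

# Usage, run from the Spark directory (the jar path is a placeholder):
#   connector_conf /path/to/mysql-connector-java.jar >> conf/spark-defaults.conf
```

The jar must exist at the same path on every machine that runs an executor, since these settings only add a path and do not ship the file.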

Set your MySQL connection details in Launcher.scala and LDAObject.scala, then build the application jar:

cd spark-twitter-lda/
sbt assembly

Launch Apache Spark standalone server.

cd /path/to/spark
sbin/start-all.sh

For the peak detection to work correctly, start the application on an even 10-minute mark of the clock (such as 12:00 or 12:10). Launch the app with:

./bin/spark-submit --class main.scala.Launcher --master spark://localhost.localdomain:7077 /path/to/sparktwitterlda.jar
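
Hitting the 10-minute mark by hand is easy to miss, so the wait can be computed instead. The sketch below assumes that "even 10 minutes" means a wall-clock time whose minutes are a multiple of ten, that your local UTC offset is itself a multiple of 10 minutes, and that date +%s is available (it is on GNU and BSD systems). The helper name is hypothetical.

```shell
# Hypothetical helper: given a Unix timestamp in $1, print how many seconds
# remain until the next multiple of 600 seconds (0 if already on one).
seconds_until_10min_mark() {
    echo $(( (600 - $1 % 600) % 600 ))
}

# Usage, run from the Spark directory:
#   sleep "$(seconds_until_10min_mark "$(date +%s)")"
#   ./bin/spark-submit --class main.scala.Launcher \
#       --master spark://localhost.localdomain:7077 /path/to/sparktwitterlda.jar
```

spark-submit itself takes some seconds to start the application, so the app begins slightly after the mark; if that matters for your data, start the wait a little early.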

Visualize your data

For data visualization you can use any tool that can connect to MySQL. I created a simple graph with d3.js and PHP.

hussnainahmed/spark-twitter-lda
