This repository contains several projects aimed at exploring Apache Spark functionality in a Big Data setting.
As part of these projects, I explore various concepts underlying Apache Spark, such as:
- RDD creation and manipulation (using transformations and actions)
- Creating a Spark context and an SQL context in Spark
- Performing streaming using Spark
The data for the projects is sourced from Twitter and a few open-source repositories. Due to their large size, the data files are not uploaded to the Git server.
Python, Apache Spark, PySpark, SQL, tweepy, Streaming, Socket, json
In this task, Spark transformations and actions are explored on Twitter user data to find the most active users on Twitter and the user with the highest retweet count.
Load the 'Amazon responded' customer data, consisting of more than 400 thousand tweets made by consumers. The data includes the fields tweet_created_at, user_screen_name, and user_id_str.
- The above data is explored using Spark actions to find the daily active users by running SQL queries through the Spark SQLContext, as shown below.
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import lit

sc = SparkContext.getOrCreate()
sqlCon = SQLContext(sc)
spark = SparkSession.builder.getOrCreate()

# Query: keep only the users (table2) that also appear in the tweet data (table1)
result = spark.sql("SELECT value FROM table2 WHERE value IN (SELECT user_id_str FROM table1)")
result = result.withColumn("Whether_active", lit("Yes"))

# Active users
active_users = result.count()  # active users among those chosen for the experiment
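The daily-active-user and most-retweeted-user queries themselves are not reproduced in this section. Below is a minimal sketch of the daily-active-user and most-active-user aggregations, assuming `table1` is a temporary view registered over the loaded tweet data and that `tweet_created_at` is stored in a standard timestamp format (view and column names follow the snippet above; neither query is taken from the original code file):

```python
# A sketch, not the original query: distinct active users per day
daily_active = spark.sql("""
    SELECT to_date(tweet_created_at) AS day,
           COUNT(DISTINCT user_id_str) AS daily_active_users
    FROM table1
    GROUP BY to_date(tweet_created_at)
    ORDER BY day
""")
daily_active.show()

# A sketch: the single most active user, ranked by tweet count
most_active = spark.sql("""
    SELECT user_screen_name, COUNT(*) AS tweet_count
    FROM table1
    GROUP BY user_screen_name
    ORDER BY tweet_count DESC
    LIMIT 1
""")
most_active.show()
```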
The code file can be found here
Building a simple application that reads live streams from Twitter (for a given Twitter handle), then processes the tweets using an Apache Spark Streaming context in order to achieve the following tasks:
- listing hashtags
- highlighting the trending hashtag
- conducting a simple sentiment analysis using the Bing Liu Opinion Lexicon
This involves a two-step approach:
First, a connection socket is created on localhost, from which tweets (matching a given Twitter handle/hashtag condition) are listened to using a custom tweet listener. The code block below shows the custom listener class that reads the JSON data sent by Twitter and forwards the tweet text over the socket.
import json
from tweepy.streaming import StreamListener

class TweetsListener(StreamListener):
    def __init__(self, csocket):
        self.client_socket = csocket
        self.counter = 0
        self.limit = 1000

    def on_status(self, status):
        # stop the stream once the tweet limit is reached
        self.counter += 1
        if self.counter < self.limit:
            return True
        else:
            return False

    def on_data(self, data):
        try:
            # parse the raw JSON from Twitter and forward the tweet text
            msg = json.loads(data)
            print(msg['text'].encode('utf-8'))
            self.client_socket.send(msg['text'].encode('utf-8'))
            return True
        except BaseException as e:
            print("Error in on_data: %s" % str(e))
            return True

    def on_error(self, status):
        print(status)
        return True
The full code for the Twitter Listener can be found here
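To show how the listener is wired up, here is a minimal usage sketch. It assumes tweepy 3.x and valid Twitter API credentials; the credential variables, port, and tracked hashtag are illustrative placeholders, not values from the original code file:

```python
import socket
import tweepy

# illustrative placeholders: real Twitter API credentials are required
consumer_key, consumer_secret = "<consumer_key>", "<consumer_secret>"
access_token, access_secret = "<access_token>", "<access_secret>"

# open a socket on localhost for the Spark streaming job to read from
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("localhost", 5555))
s.listen(1)
conn, addr = s.accept()  # blocks until the Spark job connects

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

# stream tweets matching a condition into the listener (tweepy 3.x API)
stream = tweepy.Stream(auth, TweetsListener(conn))
stream.filter(track=["#example"])
```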
Second, the tweets received over the socket are collected through a Spark context and a Streaming context.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# create the Spark context and a streaming context with a 10-second batch interval
sc = SparkContext("local[2]")
ssc = StreamingContext(sc, 10)
IP = "localhost"
port = 5555

# logging control
sc.setLogLevel("ERROR")

# read the text stream sent by the tweet listener
lines = ssc.socketTextStream(IP, port)
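The hashtag-listing and trending-hashtag steps are not reproduced here. Below is a minimal sketch using standard DStream operations; the 60-second window and 10-second slide are illustrative choices, not taken from the original code:

```python
# count hashtags over a sliding window (sketch; window sizes are illustrative)
hashtags = (lines.flatMap(lambda line: line.split(" "))
                 .filter(lambda word: word.startswith("#"))
                 .map(lambda tag: (tag.lower(), 1))
                 .reduceByKeyAndWindow(lambda a, b: a + b, None, 60, 10))

# print each window's hashtags, most frequent (i.e. trending) first
hashtags.transform(
    lambda rdd: rdd.sortBy(lambda kv: kv[1], ascending=False)
).pprint()
```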
The streamed tweets are passed to a function that computes sentiment scores, using the Bing Liu Opinion Lexicon as a reference. Refer to the code below.
def calculate_sentiment_scores(text):
    # dict_neg and dict_pos hold the negative and positive word lists
    # from the Bing Liu Opinion Lexicon
    tokens = text.split(" ")
    neg_score = 0
    pos_score = 0
    for word in dict_neg:
        if word in tokens:
            neg_score = neg_score + 1
    for word in dict_pos:
        if word in tokens:
            pos_score = pos_score + 1
    sentiment_score = pos_score - neg_score
    return str(sentiment_score)
The sentiment score calculated for each tweet is then transformed into a Spark RDD, and the streaming context is terminated, as sketched below.
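A minimal sketch of this last step, assuming the streaming setup above (the 60-second timeout is an illustrative choice, not from the original code):

```python
# score each tweet in the stream (sketch)
scores = lines.map(lambda tweet: (tweet, calculate_sentiment_scores(tweet)))
scores.pprint()

ssc.start()
ssc.awaitTermination(60)           # run for a while (timeout is illustrative)
ssc.stop(stopSparkContext=False)   # terminate the streaming context
```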
The code file can be found here