Add the soc-LiveJournal1Adj.txt and userdata.txt files to Databricks or your local folder. Export JAR files from the projects and run them using the following commands.
Input files:
- soc-LiveJournal1Adj.txt
The input contains the adjacency list and has multiple lines in the following format:
<User><TAB><Friends>
where <User> is a unique integer ID (userid) corresponding to a unique user.
- userdata.txt
The userdata.txt file contains dummy data consisting of:
column1: userid
column2: firstname
column3: lastname
column4: address
column5: city
column6: state
column7: zipcode
column8: country
column9: username
column10: date of birth
Program 1: A MapReduce program in Hadoop that implements a simple "Mutual/Common friend list of two friends". This program finds the mutual friends between two friends.
Let's take an example of the friend lists of A, B, and C.
Friends of A are B, C, D, E, F.
Friends of B are A, C, F.
Friends of C are A, B, E.
So A and B have C, F as their mutual friends. A and C have B, E as their mutual friends. B and C have only A as their mutual friend.
In the map phase, we split the friend list of each user and create a pair with each friend.
Let's process A's friend list
(Friends of A are B, C, D, E, F)
Key | Value
A,B | B, C, D, E, F
A,C | B, C, D, E, F
A,D | B, C, D, E, F
A,E | B, C, D, E, F
A,F | B, C, D, E, F
Let's process B's friend list
(Friends of B are A, C, F)
Key | Value
A,B | A, C, F
B,C | A, C, F
B,F | A, C, F
We have created a pair of B with each of its friends and sorted each pair alphabetically, so the first key (B,A) becomes (A,B).
After the map phase, shuffling groups the data items by key; identical keys go to the same reducer.
A,B | B, C, D, E, F
A,B | A, C, F
These are shuffled into the {A,B} group and sent to the same reducer.
A,B | {B, C, D, E, F}, {A, C, F}
So finally, at the reducer, we have two lists corresponding to two people. Now we need to find their intersection to get the mutual friends.
To optimize the solution, i.e. to make the intersection faster, I have used a concept similar to the merge operation in merge sort. I sort the friend lists in the map phase, so on the reducer side we receive two sorted lists. This lets us use a merge-like operation that takes only the matching values, instead of checking all possible combinations in O(N²).
Please make sure that the keys are sorted alphabetically, so that the friend lists for the two persons arrive at the same reducer.
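The map/reduce logic above can be sketched in plain Python (a logic sketch only; the actual program is a Hadoop MapReduce job, and the function names here are illustrative):

```python
from itertools import groupby

def map_phase(user, friends):
    """Emit a sorted (user, friend) pair as key, with the sorted friend list as value."""
    sorted_friends = sorted(friends)
    for f in friends:
        key = tuple(sorted((user, f)))  # (B, A) becomes (A, B)
        yield key, sorted_friends

def merge_intersect(a, b):
    """Merge-style intersection of two sorted lists, O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def mutual_friends(adjacency):
    # Shuffle: group the mapper output by key, as Hadoop does between map and reduce.
    pairs = sorted(kv for user, friends in adjacency.items()
                      for kv in map_phase(user, friends))
    result = {}
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        lists = [v for _, v in group]
        if len(lists) == 2:  # both users' friend lists arrived at this reducer
            result[key] = merge_intersect(lists[0], lists[1])
    return result

adj = {'A': ['B', 'C', 'D', 'E', 'F'],
       'B': ['A', 'C', 'F'],
       'C': ['A', 'B', 'E']}
print(mutual_friends(adj))
```

On the example above this yields {('A','B'): ['C','F'], ('A','C'): ['B','E'], ('B','C'): ['A']}, matching the hand-worked result.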
Program 2. Find the top-10 friend pairs by their total number of common friends. For each top-10 friend pair, print detailed information in decreasing order of the total number of common friends.
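Selecting the top 10 pairs from the common-friend counts can be sketched as follows (a logic sketch with illustrative data, not the actual program; `common` stands in for the Program 1 output):

```python
import heapq

def top_pairs(common, n=10):
    """Return the n friend pairs with the most common friends, in decreasing order."""
    # common maps a (userA, userB) pair to its list of mutual friends.
    return heapq.nlargest(n, common.items(), key=lambda kv: len(kv[1]))

# Illustrative sample, reusing the small example from Program 1.
common = {('A', 'B'): ['C', 'F'],
          ('A', 'C'): ['B', 'E'],
          ('B', 'C'): ['A']}
for pair, friends in top_pairs(common):
    print(pair, len(friends), friends)
```

`heapq.nlargest` avoids fully sorting all pairs when only the top 10 are needed.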
Program 3. List the 'user id' and 'rating' of users that reviewed businesses classified as "Colleges & Universities" in their list of categories.
The dataset comprises three CSV files, namely user.csv, business.csv, and review.csv; in each file the columns are separated using '::'.
business.csv contains basic information about local businesses. It has the following columns: "business_id"::"full_address"::"categories"
'business_id': (a unique identifier for the business),
'full_address': (localized address),
'categories': [(localized category names)]
review.csv file contains the star rating given by a user to a business. Use user_id to associate this review with others by the same user. Use business_id to associate this review with others of the same business.
review.csv has the following columns: "review_id"::"user_id"::"business_id"::"stars"
'review_id': (a unique identifier for the review),
'user_id': (the identifier of the authoring user),
'business_id': (the identifier of the reviewed business),
'stars': (star rating, integer 1-5), the rating given by the user to a business
user.csv contains aggregate information about a single user across all of Yelp. It has the following columns: "user_id"::"name"::"url"
'user_id': (unique user identifier),
'name': (first name, last initial, like 'Matt J.'); this column has been anonymized to preserve privacy,
'url': (URL of the user on Yelp)
Note: '::' is the column separator in the files.
Required files are 'business' and 'review'.
Sample output:
User id | Rating
0WaCdhr3aXb0G0niwTMGTg | 4.0
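The filter-and-join logic over business.csv and review.csv can be sketched in plain Python (a logic sketch only, not the actual program; the sample rows below are made up for illustration):

```python
def college_reviews(business_lines, review_lines):
    """Return (user_id, stars) for reviews of businesses whose categories
    include 'Colleges & Universities' (columns are '::'-separated)."""
    college_ids = set()
    for line in business_lines:
        business_id, full_address, categories = line.split('::')
        if 'Colleges & Universities' in categories:
            college_ids.add(business_id)
    results = []
    for line in review_lines:
        review_id, user_id, business_id, stars = line.split('::')
        if business_id in college_ids:
            results.append((user_id, float(stars)))
    return results

# Illustrative sample rows, not taken from the real dataset.
businesses = ["b1::123 Main St, Troy, NY 12180::List['Colleges & Universities']",
              "b2::9 Elm St, Albany, NY 12207::List['Restaurants']"]
reviews = ["r1::0WaCdhr3aXb0G0niwTMGTg::b1::4",
           "r2::u2::b2::5"]
print(college_reviews(businesses, reviews))
```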
Program 4. List the business_id, full address, and categories of the top 10 businesses located in "NY" by average rating.
business_id | full address | categories | avg rating
xdf12344444444, CA 91711 List['Local Services', 'Carpet Cleaning'] 5.0
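The aggregation for Program 4 (average the stars per business, keep "NY" addresses, take the top 10) can be sketched in plain Python (a logic sketch only; the sample rows are made up for illustration):

```python
import heapq
from collections import defaultdict

def top_ny_businesses(business_lines, review_lines, n=10):
    """Top-n NY businesses by average star rating, as
    (business_id, full_address, categories, avg_rating) tuples."""
    totals = defaultdict(lambda: [0.0, 0])  # business_id -> [sum of stars, count]
    for line in review_lines:
        _, _, business_id, stars = line.split('::')
        totals[business_id][0] += float(stars)
        totals[business_id][1] += 1
    rows = []
    for line in business_lines:
        business_id, full_address, categories = line.split('::')
        if 'NY' in full_address and business_id in totals:
            s, c = totals[business_id]
            rows.append((business_id, full_address, categories, s / c))
    return heapq.nlargest(n, rows, key=lambda r: r[3])

# Illustrative sample rows, not taken from the real dataset.
businesses = ["b1::12 Main St, Troy, NY 12180::List['Local Services', 'Carpet Cleaning']",
              "b2::77 Sunset Blvd, Claremont, CA 91711::List['Restaurants']"]
reviews = ["r1::u1::b1::5", "r2::u2::b1::4", "r3::u3::b2::5"]
for row in top_ny_businesses(businesses, reviews):
    print(row)
```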
Run the sparkstream_word_count.py program and provide it with the Kafka host and the topic name.
Command: ./spark-submit sparkstream_word_count.py localhost:9092 test
"test" is the topic name and localhost:9092 is the Kafka server address.
Run the Stream Producer to produce continuous data and publish it on the topic test.
Please add the Guardian website APIs to the program by registering for the Guardian API and creating an API key.