Add the soc-LiveJournal1Adj.txt and userdata.txt files to Databricks or your local folder. Export JAR files from the projects and run them using the following commands.
Input files:
- soc-LiveJournal1Adj.txt
The input contains the adjacency list and has multiple lines in the following format:
<User><TAB><Friends>
where <User> is a unique integer ID (userid) corresponding to a unique user.
- userdata.txt
The userdata.txt file contains dummy data consisting of:
column1: userid
column2: firstname
column3: lastname
column4: address
column5: city
column6: state
column7: zipcode
column8: country
column9: username
column10: date of birth
Program 1: A MapReduce program in Hadoop that implements a simple "Mutual/Common friend list of two friends". This program finds the mutual friends between two friends.
Let's take an example of the friend lists of A, B, and C.
Friends of A are B, C, D, E, F.
Friends of B are A, C, F.
Friends of C are A, B, E.
So A and B have C, F as their mutual friends. A and C have B, E as their mutual friends. B and C have only A as their mutual friend.
In the map phase, we split the friend list of each user and create a pair with each friend.
Let's process A's friend list
(Friends of A are B, C, D, E, F)
Key | Value
A,B | B, C, D, E, F
A,C | B, C, D, E, F
A,D | B, C, D, E, F
A,E | B, C, D, E, F
A,F | B, C, D, E, F
Let's process B's friend list
(Friends of B are A, C, F)
Key | Value
A,B | A, C, F
B,C | A, C, F
B,F | A, C, F
We have created a pair of B with each of its friends and sorted each pair alphabetically, so the first key (B,A) becomes (A,B).
After the map phase, shuffling groups the data items by key; identical keys go to the same reducer.
A,B | B, C, D, E, F
A,B | A, C, F
These are shuffled into the {A,B} group and sent to the same reducer.
A,B | {B, C, D, E, F}, {A, C, F}
So finally, at the reducer, we have two lists corresponding to two people. Now we need to find their intersection to get the mutual friends.
To optimize the solution, i.e. to make the intersection faster, I have used a concept similar to the merge operation in merge sort. I sort the friend lists in the map phase, so on the reducer side we receive two sorted lists. This lets us use a merge-like operation that takes only the matching values, instead of checking all possible combinations in O(N²).
Please make sure that the keys are sorted alphabetically, so that the friend lists for the two persons arrive at the same reducer.
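The map/reduce logic above can be sketched in plain Python (a logic sketch only; the actual program is a Hadoop MapReduce job, and the function names here are illustrative):

```python
from itertools import groupby

def map_phase(user, friends):
    """Emit a sorted (user, friend) pair as key, with the sorted friend list as value."""
    sorted_friends = sorted(friends)
    for f in friends:
        key = tuple(sorted((user, f)))  # (B, A) becomes (A, B)
        yield key, sorted_friends

def merge_intersect(a, b):
    """Merge-style intersection of two sorted lists, O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def mutual_friends(adjacency):
    # Shuffle: group the mapper output by key, as Hadoop does between map and reduce.
    pairs = sorted(kv for user, friends in adjacency.items()
                      for kv in map_phase(user, friends))
    result = {}
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        lists = [v for _, v in group]
        if len(lists) == 2:  # both users' friend lists arrived at this reducer
            result[key] = merge_intersect(lists[0], lists[1])
    return result

adj = {'A': ['B', 'C', 'D', 'E', 'F'],
       'B': ['A', 'C', 'F'],
       'C': ['A', 'B', 'E']}
print(mutual_friends(adj))
```

On the example above this yields {('A','B'): ['C','F'], ('A','C'): ['B','E'], ('B','C'): ['A']}, matching the hand-worked result.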
Program 2. Find the top-10 friend pairs by their total number of common friends. For each top-10 friend pair, print detailed information in decreasing order of the total number of common friends.
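Selecting the top 10 pairs from the common-friend counts can be sketched as follows (a logic sketch with illustrative data, not the actual program; `common` stands in for the Program 1 output):

```python
import heapq

def top_pairs(common, n=10):
    """Return the n friend pairs with the most common friends, in decreasing order."""
    # common maps a (userA, userB) pair to its list of mutual friends.
    return heapq.nlargest(n, common.items(), key=lambda kv: len(kv[1]))

# Illustrative sample, reusing the small example from Program 1.
common = {('A', 'B'): ['C', 'F'],
          ('A', 'C'): ['B', 'E'],
          ('B', 'C'): ['A']}
for pair, friends in top_pairs(common):
    print(pair, len(friends), friends)
```

`heapq.nlargest` avoids fully sorting all pairs when only the top 10 are needed.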
Program 3. List the 'user id' and 'rating' of users that reviewed businesses classified as "Colleges & Universities" in their list of categories.
The dataset comprises three CSV files, namely user.csv, business.csv, and review.csv; in each file the columns are separated using '::'.
business.csv contains basic information about local businesses. It has the following columns: "business_id"::"full_address"::"categories"
'business_id': (a unique identifier for the business),
'full_address': (localized address),
'categories': [(localized category names)]
review.csv file contains the star rating given by a user to a business. Use user_id to associate this review with others by the same user. Use business_id to associate this review with others of the same business.
review.csv has the following columns: "review_id"::"user_id"::"business_id"::"stars"
'review_id': (a unique identifier for the review),
'user_id': (the identifier of the authoring user),
'business_id': (the identifier of the reviewed business),
'stars': (star rating, integer 1-5), the rating given by the user to a business
user.csv contains aggregate information about a single user across all of Yelp. It has the following columns: "user_id"::"name"::"url"
'user_id': (unique user identifier),
'name': (first name, last initial, like 'Matt J.'); this column has been anonymized to preserve privacy,
'url': (URL of the user on Yelp)
Note: '::' is the column separator in the files.
Required files are 'business' and 'review'.
Sample output:
User id | Rating
0WaCdhr3aXb0G0niwTMGTg | 4.0
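The filter-and-join logic over business.csv and review.csv can be sketched in plain Python (a logic sketch only, not the actual program; the sample rows below are made up for illustration):

```python
def college_reviews(business_lines, review_lines):
    """Return (user_id, stars) for reviews of businesses whose categories
    include 'Colleges & Universities' (columns are '::'-separated)."""
    college_ids = set()
    for line in business_lines:
        business_id, full_address, categories = line.split('::')
        if 'Colleges & Universities' in categories:
            college_ids.add(business_id)
    results = []
    for line in review_lines:
        review_id, user_id, business_id, stars = line.split('::')
        if business_id in college_ids:
            results.append((user_id, float(stars)))
    return results

# Illustrative sample rows, not taken from the real dataset.
businesses = ["b1::123 Main St, Troy, NY 12180::List['Colleges & Universities']",
              "b2::9 Elm St, Albany, NY 12207::List['Restaurants']"]
reviews = ["r1::0WaCdhr3aXb0G0niwTMGTg::b1::4",
           "r2::u2::b2::5"]
print(college_reviews(businesses, reviews))
```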
Program 4. List the business_id, full address, and categories of the top 10 businesses located in "NY" by average rating.
business_id | full address | categories | avg rating
xdf12344444444, CA 91711 List['Local Services', 'Carpet Cleaning'] 5.0
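The aggregation for Program 4 (average the stars per business, keep "NY" addresses, take the top 10) can be sketched in plain Python (a logic sketch only; the sample rows are made up for illustration):

```python
import heapq
from collections import defaultdict

def top_ny_businesses(business_lines, review_lines, n=10):
    """Top-n NY businesses by average star rating, as
    (business_id, full_address, categories, avg_rating) tuples."""
    totals = defaultdict(lambda: [0.0, 0])  # business_id -> [sum of stars, count]
    for line in review_lines:
        _, _, business_id, stars = line.split('::')
        totals[business_id][0] += float(stars)
        totals[business_id][1] += 1
    rows = []
    for line in business_lines:
        business_id, full_address, categories = line.split('::')
        if 'NY' in full_address and business_id in totals:
            s, c = totals[business_id]
            rows.append((business_id, full_address, categories, s / c))
    return heapq.nlargest(n, rows, key=lambda r: r[3])

# Illustrative sample rows, not taken from the real dataset.
businesses = ["b1::12 Main St, Troy, NY 12180::List['Local Services', 'Carpet Cleaning']",
              "b2::77 Sunset Blvd, Claremont, CA 91711::List['Restaurants']"]
reviews = ["r1::u1::b1::5", "r2::u2::b1::4", "r3::u3::b2::5"]
for row in top_ny_businesses(businesses, reviews):
    print(row)
```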
Run the sparkstream_word_count.py program and provide it with the Kafka host and the topic name.
Command: ./spark-submit sparkstream_word_count.py localhost:9092 test
"test" is the topic name and localhost:9092 is the Kafka server address.
Run the Stream Producer to produce continuous data and publish it on the topic test.
Please add the Guardian website APIs to the program by registering for the Guardian API and creating an API key.