Project to get started with spark using Python and Scala.
Yelp.com website and the Yelp mobile app, which publish crowd-sourced reviews about businesses.
Dataset used in this project.
The goal of these tasks is to get you familiar with Spark operation types (e.g., transformations and actions).
Task 1: Data Exploration.
- The total number of reviews.
- The number of reviews in 2018.
- The number of distinct users who wrote reviews.
- The top 10 users who wrote the largest numbers of reviews and the number of reviews they wrote.
- The number of distinct businesses that have been reviewed.
- The top 10 businesses that had the largest numbers of reviews and the number of reviews they had.
Task 2: Partition.
- In this task, show the number of partitions for the RDD used for Task 1 Question F and the number of items per partition.
- Use a customized partition function to improve the performance of map and reduce tasks. A time duration (for executing Task 1 Question F) comparison between the default partition and the customized partition (RDD built using the partition function).
Task 3: Exploration on Multiple Datasets.
- Explore two datasets together containing review information (test_review.json) and business information (business.json). What are the average stars for each city?
- Compare the execution time of using two methods to print top 10 cities with highest stars.
Method 1: Collect all the data, sort in python/scala, and then print the first 10 cities.
Method 2: Sort in Spark, take the first 10 cities, and then print these 10 cities.