Solving analytical questions on the semi-structured MovieLens dataset containing a million records using Spark and Scala. This features the use of Spark RDD, Spark SQL and Spark Dataframes executed on Spark-Shell (REPL) using Scala API. We aim to draw useful insights about users and movies by leveraging different forms of Spark APIs.
- Linux (Ubuntu 15.04)
- Hadoop 2.7.2
- Spark 2.0.2
- Scala 2.11
-
Simply clone the repository
git clone https://github.com/Thomas-George-T/Movies-Analytics-in-Spark-and-Scala.git
-
In the repo, Navigate to Spark RDD, Spark SQL or Spark Dataframe locations as needed.
-
Run the execute script to view results
sh execute.sh
-
The
execute.sh
will pass the scala code through spark-shell and then display the findings in the terminal from the results folder.
- What are the top 10 most viewed movies?
- What are the distinct list of genres available?
- How many movies for each genre?
- How many movies are starting with numbers or letters (Example: Starting with 1/2/3../A/B/C..Z)?
- List the latest released movies
- Create tables for movies.dat, users.dat and ratings.dat: Saving Tables from Spark SQL
- Find the list of the oldest released movies.
- How many movies are released each year?
- How many number of movies are there for each rating?
- How many users have rated each movie?
- What is the total rating for each movie?
- What is the average rating for each movie?
- Prepare Movies data: Extracting the Year and Genre from the Text
- Prepare Users data: Loading a double delimited csv file
- Prepare Ratings data: Programmatically specifying a schema for the dataframe
- Import Data from URL: Scala
- Save table without defining DDL in Hive
- Broadcast Variable example
- Accumulator example
- Databricks Community Edition
Note: The results were collected and repartitioned into the same text file: This is not a recommended practice since performance is highly impacted but it is done here for the sake of readability.
This project was featured on Data Machina Issue #130 listed at number 3 under ScalaTOR. Thank you for the listing
This repository is licensed under Apache License 2.0 - see License for more details