This project covers the complete data life cycle, from raw data extraction to the deployment of a recommender system: extraction, transformation, loading into a database, integration into a data warehouse, and finally using the data to generate recommendations. The entire pipeline is automated and incorporates a machine learning model for personalized recommendations, providing an end-to-end solution for movie data analysis and user engagement.
- Clone the Repository:
```bash
git clone https://github.com/your-repo/movie-recommender.git
cd movie-recommender
```
- Install Dependencies: This project uses Poetry for dependency management. Install it if you don’t have it already:
```bash
curl -sSL https://install.python-poetry.org | python3 -
poetry install
```
- Set Up Environment Variables: Create a .env file in the root directory and add the following variables:
```
MOVIE_DATA_PATH=metadata_with_imdb_metadata.csv
EMBEDDING_MODEL=all-MiniLM-L6-v2
MODEL_PATH=models/
FAISS_INDEX_FILE=faiss_index.bin
EMBEDDINGS_FILE=movie_embeddings.pkl
MLFLOW_TRACKING_URI=http://localhost:5000
MLFLOW_EXPERIMENT_NAME=movie_recommender
MLFLOW_RUN_NAME=faiss_recommender
```
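At runtime the Python scripts can pick these values up from the .env file. The snippet below is a minimal sketch, assuming python-dotenv is installed; the fallback defaults simply mirror the values above:

```python
import os
from dotenv import load_dotenv

# Read the variables defined in the project's .env file.
load_dotenv()

MOVIE_DATA_PATH = os.getenv("MOVIE_DATA_PATH", "metadata_with_imdb_metadata.csv")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2")
FAISS_INDEX_FILE = os.getenv("FAISS_INDEX_FILE", "faiss_index.bin")
```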
The data is sourced from various formats including:
- JSON
- CSV
- XLSX
These files contain different aspects of the movie data and are standardized into a common format for further processing. New data is received daily at 1 AM.
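As an illustration of this standardization step, the sketch below reads each format into pandas and harmonizes the column names. The file names are placeholders, not the actual daily drops:

```python
import pandas as pd

# Placeholder file names for the three source formats.
sources = {
    "ratings.json": pd.read_json,
    "movies.csv": pd.read_csv,
    "box_office.xlsx": pd.read_excel,
}

frames = []
for path, reader in sources.items():
    df = reader(path)
    # Standardize column names so every source follows the same convention.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    frames.append(df)
```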
The automation scheduling is handled by the following scripts:
- automate_transformation.sh: processes and transforms the incoming data.
- daily_schedule.sh: schedules the automation so that processed data is stored in its date-specific folder.
For more details on the automation, see the Preprocessing_scripts folder.
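As a rough illustration of the date-specific layout (the real logic lives in the shell scripts above), a Python sketch might move each day's output like this; the incoming and processed folder names are assumptions:

```python
from datetime import date
from pathlib import Path
import shutil

# Illustrative paths; the real ones are defined in the shell scripts.
incoming = Path("incoming")
processed_root = Path("processed")

# Each run stores its output under a date-specific folder, e.g. processed/2024-01-31/.
target = processed_root / date.today().isoformat()
target.mkdir(parents=True, exist_ok=True)

for source_file in incoming.glob("*"):
    shutil.move(str(source_file), str(target / source_file.name))
```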
The log_files folder contains execution output logs of the data processing. These logs document the processing activities but do not include the data itself.
- Standardization: The initial step involved transforming all incoming data formats into a consistent format.
- Metadata Adjustment: A metadata file containing key movie data was modified to ensure completeness and accuracy.
- Normalization: The data was normalized to ensure it was suitable for loading into the database.
For more details on the transformation scripts, visit the Preprocessing_scripts folder.
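For illustration only, a metadata adjustment and normalization pass might look like the sketch below; the column names (title, runtime, genres) and the "|" separator are assumptions, not the actual schema:

```python
import pandas as pd

movies = pd.read_csv("metadata_with_imdb_metadata.csv")

# Metadata adjustment: remove duplicates and fill gaps so key fields are complete.
movies = movies.drop_duplicates(subset=["title"])
movies["runtime"] = movies["runtime"].fillna(movies["runtime"].median())

# Normalization: split multi-valued genre strings into one row per (title, genre)
# so they can be loaded into a separate relational table.
genres = (
    movies[["title", "genres"]]
    .assign(genres=lambda d: d["genres"].str.split("|"))
    .explode("genres")
)
```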
The normalized data was stored in the Database_all_in folder, which contains:
- Queries: SQL queries for data retrieval and manipulation.
- Schema and Creation: Scripts for creating the database schema and initial data loading.
The database was designed to efficiently store and retrieve movie-related data. More details can be found in the Database_all_in folder.
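A hedged sketch of the loading step with SQLAlchemy is shown below; the connection string, credentials, and the movies table name are placeholders to adapt to your SQL Server setup:

```python
import pandas as pd
from sqlalchemy import create_engine

# Illustrative connection string; adjust server, database, credentials, and driver.
engine = create_engine(
    "mssql+pyodbc://user:password@localhost/MoviesDB"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

movies = pd.read_csv("metadata_with_imdb_metadata.csv")

# Append the cleaned rows to the movies table created by the schema scripts.
movies.to_sql("movies", engine, if_exists="append", index=False)
```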
After setting up the database, the data was loaded into a data warehouse located in the DWH directory. This warehouse is designed using a star schema, which organizes data into fact tables and dimension tables.
This facilitates easier analysis and reporting, as well as integration with analytical tools. For more information, check the DWH folder.
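For example, a typical star-schema query joins the fact table to a dimension table; the fact_ratings and dim_movie names below are illustrative, not the actual DWH schema:

```python
from sqlalchemy import create_engine, text

engine = create_engine(
    "mssql+pyodbc://user:password@localhost/MoviesDWH"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

# Average rating per genre: join the fact table to the movie dimension.
query = text("""
    SELECT d.genre, AVG(f.rating) AS avg_rating
    FROM fact_ratings AS f
    JOIN dim_movie AS d ON f.movie_key = d.movie_key
    GROUP BY d.genre
""")

with engine.connect() as conn:
    for genre, avg_rating in conn.execute(query):
        print(genre, avg_rating)
```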
Following the data warehouse setup, the next steps involve leveraging this data to build a recommender system. The scripts related to the recommender model can be found in the Recommender_system folder.
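In outline, the recommender encodes movie descriptions with Sentence Transformers and looks up similar titles in a FAISS index. The sketch below condenses that idea, assuming an "overview" text column and cosine similarity via normalized embeddings; the actual scripts in the Recommender_system folder may differ:

```python
import faiss
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

movies = pd.read_csv("metadata_with_imdb_metadata.csv")
texts = movies["overview"].fillna("").tolist()  # "overview" column is an assumption

# Encode the movie descriptions with the model configured in the .env file.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts, convert_to_numpy=True, normalize_embeddings=True)
embeddings = embeddings.astype(np.float32)

# Inner product on normalized vectors is cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# Query with the first movie; skip position 0, which is the movie itself.
scores, neighbors = index.search(embeddings[:1], 6)
print(movies.iloc[neighbors[0][1:]]["title"].tolist())
```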
The following technologies were used to build this project:
- SQL
- SSIS
- Python 3.8
- Pandas
- Poetry for dependency management
- SQLAlchemy for database interaction with SQL Server
- FastAPI
- Bash
- Sentence Transformers for generating embeddings
- Faiss for similarity search
- MLflow for model tracking and experiment logging (see the sketch after this list)
- Streamlit for building the user interface
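As a sketch of how the MLflow pieces fit together, a run might be logged with the tracking URI, experiment name, and run name from the .env file; the logged parameter and artifact paths are illustrative:

```python
import mlflow

# Use the tracking server, experiment, and run name defined in the .env file.
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("movie_recommender")

with mlflow.start_run(run_name="faiss_recommender"):
    mlflow.log_param("embedding_model", "all-MiniLM-L6-v2")
    # Store the serialized index and embeddings as run artifacts.
    mlflow.log_artifact("faiss_index.bin")
    mlflow.log_artifact("movie_embeddings.pkl")
```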