Skip to content
This repository has been archived by the owner on Mar 25, 2024. It is now read-only.

A full data warehouse infrastructure with ETL pipelines running inside docker on Apache Airflow for data orchestration, AWS Redshift for cloud data warehouse and Metabase to serve the needs of data visualizations such as analytical dashboards.

License

Notifications You must be signed in to change notification settings

iam-mhaseeb/Skytrax-Data-Warehouse

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Skytrax Data Warehouse

A full data warehouse infrastructure with ETL pipelines running inside docker on Apache Airflow for data orchestration, AWS Redshift for cloud data warehouse and Metabase to serve the needs of data visualizations such as analytical dashboards.

skytrax-warehouse

Architecture

Architecture

Data Warehouse Consists of various modules:

Overview

Data is obtained from here. The data collected is stored on local disk and is timely moved to the Landing Bucket on AWS S3. ETL jobs are written in SQL and scheduled in airflow to run every hour to keep data fresh in cloud data warehouse.

Data Modeling

Following are the fact and dimension tables created:

Dimension Table

aircrafts
airlines
passengers
airports
lounges

Fact Tables

fact_ratings

ETL Flow

  • Data Collected from here is moved to landing zone s3 buckets.
  • ETL job has s3 module which copies data from landing zone to stagging in Redshift.
  • Once the data is moved to Redshift, a task in airflow is triggered which reads the data from stagging area and apply transformation.
  • Using the Redshift staging tables and UPSERT operation is performed on the dimensional & fact Data Warehouse tables to update the data.
  • ETL job execution is completed once the Data Warehouse is updated.
  • Airflow DAG runs the data quality check on Warehouse tables between the ETL job to ensure right data.
  • Dag execution completes once the Data Warehouse is updated.

Environment Setup

Hardware Used

Redshift: For Redshift I used 2 Node cluster with Instance Types dc2.large

Setting Up Infrastructure

Run the following commands in terminal to setup whole infrastructure locally:

  1. git clone https://github.com/iam-mhaseeb/Skytrax-Data-Warehouse
  2. cd Skytrax-Data-Warehouse
  3. Considering you have docker service installed and running run docker-compose up. It will take sometime to pull latest images & install everything automatically in docker.

Setting up Redshift

You can follow the AWS Guide to run a Redshift cluster.

How to run

Airflow

Make sure docker containers are running. Open the Airflow UI by hitting http://localhost:8080 in browser and setup required connections.

You should be able to see skytrax_etl_pipeline Dag like in pictures below:

Skytrax Pipeline DAG DAG View

You can explore dag further in different views like below:

DAG View: DAG

DAG Tree View: DAG Tree

DAG Gantt View: DAG Gantt View

Metabase

Make sure docker containers are running. Open the Metabase UI by hitting http://localhost:3000 in browser & setup your metabase account and database.

You should be able to play with data after running dag successfully like I made dashboard in pictures below:

Dashboard1: DAG

Dashboard2: DAG Tree

Scenarios

  • Data increase by 100x. read > write. write > read

    • Redshift: Analytical database, optimized for aggregation, also good performance for read-heavy workloads
    • Introduce EMR cluster size to handle bigger volume of data
  • Pipelines would be run on 7am daily. how to update dashboard? would it still work?

    • DAG is scheduled to run every hour and can be configured to run every morning at 7 AM if required.
    • Data quality operators are used at appropriate position. In case of DAG failures email triggers can be configured to let the team know about pipeline failures.
  • Make it available to 100+ people

    • We can set the concurrency limit for your Amazon Redshift cluster. While the concurrency limit is 50 parallel queries for a single period of time, this is on a per cluster basis, meaning you can launch as many clusters as fit for you business.

Authors

License

This project is licensed under the MIT License - see the LICENSE file for details

About

A full data warehouse infrastructure with ETL pipelines running inside docker on Apache Airflow for data orchestration, AWS Redshift for cloud data warehouse and Metabase to serve the needs of data visualizations such as analytical dashboards.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages