This project demonstrates a complete data pipeline on Databricks: it extracts data from an external URL, transforms it with SQL and Python, and loads it into a structured format for analysis. The project includes a CI/CD setup that enforces code quality, reproducibility, and testing. The pipeline identifies trends in alcohol consumption and drug use across countries and age groups, with a focus on actionable insights from complex SQL queries.
- Data Source: drinks and drug use tables.
- Data Sink: Transformed data is stored in Delta tables on Databricks.
- Transformation: Missing (NA) values are filled in and new features are created.
- Visualization: Analysis results are visualized using Python's Matplotlib and Seaborn.
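
The transformation step described above (filling missing values and deriving new features) can be sketched in plain Python. This is a minimal illustration, not the code in `mylib/`: the column names `beer_servings`, `spirit_servings`, and `wine_servings`, the derived `total_servings` feature, and the default of `0.0` for missing values are all assumptions made for the example.

```python
# Hypothetical column names, chosen for illustration only.
SERVING_COLUMNS = ("beer_servings", "spirit_servings", "wine_servings")


def fill_missing(row: dict, default: float = 0.0) -> dict:
    """Return a copy of row with missing serving counts replaced by default."""
    cleaned = dict(row)
    for column in SERVING_COLUMNS:
        value = cleaned.get(column)
        if value is None or value == "":
            cleaned[column] = default
        else:
            cleaned[column] = float(value)
    return cleaned


def add_total_servings(row: dict) -> dict:
    """Return a copy of row with a derived total_servings feature."""
    enriched = dict(row)
    enriched["total_servings"] = sum(enriched[c] for c in SERVING_COLUMNS)
    return enriched


def transform_rows(rows: list) -> list:
    """Fill missing values first, then derive the new feature for each row."""
    return [add_total_servings(fill_missing(row)) for row in rows]
```

Calling `transform_rows` on a row whose `spirit_servings` is empty and whose `wine_servings` is `None` yields a row where both are `0.0` and `total_servings` equals the beer count alone. Filling before deriving keeps the sum from failing on missing inputs.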
- Extract data from a URL.
- Load data into a Databricks Delta table.
- Apply transformations for data aggregation and filtering.
- Visualize the results and save plots.
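
The extract, load, and transform steps above can be sketched locally as follows. This is an illustration under stated assumptions, not the project's implementation: an in-memory SQLite database stands in for the Databricks Delta table so the example runs without a cluster, and the table name `drinks`, the columns `country`, `beer_servings`, and `wine_servings`, and the sample rows are all hypothetical. The visualization step is omitted.

```python
import csv
import io
import sqlite3
import urllib.request


def extract(url: str) -> str:
    """Download raw CSV text from url (requires network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def load(conn: sqlite3.Connection, csv_text: str) -> int:
    """Load CSV rows into a local stand-in table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS drinks "
        "(country TEXT, beer_servings REAL, wine_servings REAL)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    # Empty cells are treated as 0, mirroring the fill-NA step.
    rows = [
        (r["country"], float(r["beer_servings"] or 0), float(r["wine_servings"] or 0))
        for r in reader
    ]
    conn.executemany("INSERT INTO drinks VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def transform(conn: sqlite3.Connection, min_total: float) -> list:
    """Aggregate servings per country and keep totals at or above min_total."""
    query = """
        SELECT country, SUM(beer_servings + wine_servings) AS total_servings
        FROM drinks
        GROUP BY country
        HAVING SUM(beer_servings + wine_servings) >= ?
        ORDER BY total_servings DESC
    """
    return conn.execute(query, (min_total,)).fetchall()


if __name__ == "__main__":
    # Made-up sample data so the sketch runs offline; extract() is not called.
    sample_csv = "country,beer_servings,wine_servings\nA,10,5\nB,1,\nA,2,3\n"
    connection = sqlite3.connect(":memory:")
    load(connection, sample_csv)
    print(transform(connection, min_total=5))  # prints [('A', 20.0)]
```

On Databricks the same `GROUP BY` and `HAVING` query shape would run against a Delta table rather than SQLite; only the connection and load mechanics would differ.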
- `mylib/`: Python scripts for SQL queries, data extraction, and transformations.
- `.devcontainer/`: Configuration for the development container.
- `Makefile`: Provides commands for setup, formatting, linting, testing, and running SQL queries:
  - `make install`: Installs dependencies.
  - `make format`: Formats Python files.
  - `make lint`: Lints Python files.
  - `make test`: Runs unit tests.
  - `make all`: Runs all tasks (install, format, lint, and test).
- `.github/workflows/CICD.yml`: CI/CD pipeline configuration using GitHub Actions.
- `README.md`: Setup instructions, usage guidelines, and project description.
- Clone the repository: `git clone https://github.com/nogibjj/Allen_Wang_miniproj_11.git`, then `cd Allen_Wang_miniproj_11`
- Install dependencies: `make install`
- Format code: `make format`
- Lint code: `make lint`
- Test code: `make test`