
Data Engineering for Beginners

Code for the Data Engineering for Beginners (DE101) e-book at https://de101.startdataengineering.com/.

Setup

The code for the SQL, Python, and data model sections is written using Spark SQL. To run the code, you will need the prerequisites listed below.

Prerequisites

  1. git version >= 2.37.1
  2. Docker version >= 20.10.17 and Docker Compose v2 >= v2.10.2.
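
You can verify that the installed versions meet these minimums before proceeding:

git --version # should report 2.37.1 or newer
docker --version # should report 20.10.17 or newer
docker compose version # should report v2.10.2 or newer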

Windows users: please set up WSL and a local Ubuntu virtual machine following the instructions here. Install the above prerequisites in your Ubuntu terminal; if you have trouble installing Docker, follow the steps here (only Step 1 is necessary). Please install the make command with sudo apt install make -y (if it's not already present).
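
On Ubuntu (including WSL), git and make can typically be installed via apt; Docker itself is covered by the linked guide:

sudo apt update
sudo apt install -y git make # Docker is installed separately, per Step 1 of the linked guide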

Starting and stopping containers

Fork this repository (data_engineering_for_beginners_code).

[Screenshot: GitHub fork button]

After forking, clone the repo to your local machine and start the containers as shown below:

git clone https://github.com/your-user-name/data_engineering_for_beginners_code.git
cd data_engineering_for_beginners_code
docker compose up -d # to start the docker containers
sleep 30 # wait for the containers to be ready
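
To confirm the containers came up before opening Jupyter, you can list the running services:

docker compose ps # all services should show a running (or healthy) state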

Running code via Jupyter Notebooks

Open the Starter Jupyter Notebook at http://localhost:8888/lab/tree/notebooks/starter-notebook.ipynb and try out the commands from the Data Engineering for Beginners e-book as shown below.

[Screenshot: Notebook Template]

If you are creating a new notebook, make sure to select the Python 3 (ipykernel) Notebook.

When you are done, stop the Docker containers with the command below:

docker compose down 

Airflow & dbt

For the Airflow, dbt & capstone section, go into the airflow directory and run the make commands as shown below.

Note: All the code in the dbt, Airflow, and capstone chapters is to be run via a terminal in the data_engineering_for_beginners_code/airflow directory.

docker compose down # stop the Spark/Jupyter notebook containers before starting Airflow's
cd airflow
make restart # This will ask for your password to create some folders

You can open the Airflow UI at http://localhost:8080 and log in with airflow as both the username and password. In the Airflow UI, you can run the DAG.
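
If the UI does not load, a quick check from the terminal is Airflow's health endpoint (assuming the default webserver configuration):

curl -s http://localhost:8080/health # should return JSON reporting the metadatabase and scheduler as healthy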

After the DAG has run, run make dbt-docs in the terminal to have dbt serve the docs, which you can view at http://localhost:8081.
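
As a quick sanity check that the docs server is reachable, you can request the page and inspect the HTTP status code:

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8081 # 200 means the dbt docs are being served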

You can stop the containers & return to the parent directory as shown below:

make down
cd ..
