This work is supported by Dataswati
This repo was designed for the ODSC Europe Workshop: Build an ML Pipeline with Airflow and Kubernetes
All the code material for the workshop:
- The ML code is in the dataswati folder; it fits a water potability classifier on a dataset from Kaggle
- The Airflow DAG code is in the dags folder (we enable Airflow git-sync to retrieve this code); see the sketch after this list
- The Dockerfile builds the image used by the KubernetesPodOperator; this image is synchronized with Docker Hub
- override_values.yaml overrides the official Airflow Helm chart values to enable git-sync on the dags folder
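As a rough idea of what a DAG in the dags folder can look like, here is a minimal sketch using the KubernetesPodOperator. The DAG id, task id, image name, and command below are illustrative assumptions, not the exact values used in this repo.

```python
# Minimal DAG sketch using the KubernetesPodOperator.
# DAG id, task id, image, and command are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG(
    dag_id="water_potability_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    make_dataset = KubernetesPodOperator(
        task_id="make_dataset",
        name="make-dataset",
        namespace="default",
        # Image built from the Dockerfile and pushed to Docker Hub (placeholder name)
        image="yourdockerhubuser/odsc-airflow-k8s:latest",
        cmds=["python", "-m", "dataswati.data.make_dataset"],
        get_logs=True,
    )
```

Because git-sync is enabled through override_values.yaml, any DAG file pushed to the dags folder is picked up by the Airflow scheduler without rebuilding the image.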
1.1 Install a lightweight version of Kubernetes: microk8s
LINUX
sudo snap install microk8s --classic
sudo usermod -a -G microk8s $USER
sudo chown -f -R $USER ~/.kube
su - $USER
WINDOWS
Download the installer here and follow the instructions
MACOS
brew install ubuntu/microk8s/microk8s
microk8s install
microk8s status --wait-ready
Helm is the package manager for Kubernetes
microk8s enable helm3 dns storage
See the documentation for the official Airflow Helm chart
Use the -f flag to override values in the official Airflow chart with override_values.yaml. This enables git-sync
so that Airflow can retrieve the DAGs from the dags folder
microk8s helm3 repo add apache-airflow https://airflow.apache.org
microk8s helm3 install airflow2 apache-airflow/airflow -f AirflowKubernetes/override_values.yaml
To access the Web UI, apply the port forwarding as indicated:
microk8s kubectl port-forward svc/airflow2-webserver 8080:8080 --namespace default
You can now access the Airflow UI, where you will see the DAG
Go to Admin => Variables and create a variable called HOST_PATH
with the path to your dataswati folder (e.g. /home/Luis/dev/odsc/AirflowKubernetes/dataswati).
This path determines where on your computer the pod volumes will be mounted from.
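To give an idea of how HOST_PATH is typically consumed, here is a hedged sketch of reading the variable inside a DAG and mounting that folder into the pods launched by the KubernetesPodOperator. The volume name and mount path are assumptions.

```python
# Sketch: use the HOST_PATH Airflow Variable to mount the dataswati folder into the pods.
# Volume name and mount path are placeholders.
from airflow.models import Variable
from kubernetes.client import models as k8s

host_path = Variable.get("HOST_PATH")  # e.g. /home/Luis/dev/odsc/AirflowKubernetes/dataswati

data_volume = k8s.V1Volume(
    name="dataswati-volume",
    host_path=k8s.V1HostPathVolumeSource(path=host_path),
)
data_volume_mount = k8s.V1VolumeMount(
    name="dataswati-volume",
    mount_path="/usr/local/dataswati",  # assumed mount path inside the container
)

# Pass volumes=[data_volume] and volume_mounts=[data_volume_mount]
# to each KubernetesPodOperator task so the pods read and write under HOST_PATH.
```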
Typical machine learning pipeline: there are training data and unseen data, and both datasets go through the same transformations (imputation and feature engineering). The training data is then used to train different model types with random hyperparameter search and cross-validation; only the best model of each type is kept, and these models are used to make predictions on the unseen data (at the same time we can evaluate the predictions, because we actually have the targets for the unseen data).
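The sketch below illustrates this flow with scikit-learn. The file paths, target column, candidate models, and search spaces are assumptions for illustration, not the exact code from the dataswati package.

```python
# Illustrative end-to-end sketch of the pipeline described above (scikit-learn).
# Paths, column names, models, and search spaces are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")    # training data (placeholder path)
unseen = pd.read_csv("unseen.csv")  # unseen data (placeholder path)
X_train, y_train = train.drop(columns="Potability"), train["Potability"]
X_unseen, y_unseen = unseen.drop(columns="Potability"), unseen["Potability"]

candidates = {
    "random_forest": (RandomForestClassifier(), {"model__n_estimators": [100, 300, 500]}),
    "logistic_regression": (LogisticRegression(max_iter=1000), {"model__C": [0.1, 1.0, 10.0]}),
}

best_models = {}
for name, (model, param_distributions) in candidates.items():
    # Both datasets go through the same transformations: imputation,
    # then scaling as a stand-in for feature engineering.
    pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("model", model),
    ])
    search = RandomizedSearchCV(pipeline, param_distributions, n_iter=3, cv=5, random_state=0)
    search.fit(X_train, y_train)
    best_models[name] = search.best_estimator_  # keep only the best model of each type

# Predict on the unseen data and evaluate, since we actually have its targets.
for name, model in best_models.items():
    print(name, model.score(X_unseen, y_unseen))
```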
This part of the repository was generated using Cookiecutter Data Science, which scaffolds a data science project in a matter of minutes and provides useful functionality with a lot of helpers. The code is split into 3 submodules (a sketch of how they fit together follows the list):
- data, where we process the existing data,
- features, where we add new features to the data,
- models, where we create and train ML models and use them for prediction.
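Purely to show how these submodules might chain together, here is a hypothetical sketch; every module path and function name below is a placeholder, and the actual entry points in the dataswati package may differ.

```python
# Hypothetical chaining of the three submodules; all names below are placeholders.
from dataswati.data import make_dataset        # clean and impute the existing data
from dataswati.features import build_features  # add new features to the data
from dataswati.models import train_model, predict_model  # train models and predict

df = make_dataset("data/raw/water_potability.csv")
df = build_features(df)
model = train_model(df)
predictions = predict_model(model, df)
```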
Here we use a dataset from Kaggle: Water Potability. The advantage is having access to the notebooks on Kaggle, which provide a baseline prediction score and nice EDAs; you can also find a pandas profiling report and pycaret tests in the exploration notebook
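If you want to regenerate a profiling report yourself, a minimal sketch with pandas-profiling looks like this; the CSV path is an assumption and should point to wherever you downloaded the Kaggle file.

```python
# Sketch: generate a pandas-profiling report for the Water Potability dataset.
# The CSV path is a placeholder.
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("data/raw/water_potability.csv")
profile = ProfileReport(df, title="Water Potability - Profiling Report")
profile.to_file("water_potability_profile.html")
```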