This work is supported by Dataswati
This repo was designed for the ODSC Europe Workshop: Build an ML Pipeline with Airflow and Kubernetes
All the code material for the workshop:
- The ML code is in the dataswati folder; it fits a water potability classifier on a dataset from Kaggle
- The Airflow DAG code is in the dags folder (we enable Airflow git-sync to retrieve this code); see the sketch after this list
- The Dockerfile builds the image used by the KubernetesPodOperator; this image is synchronized with Docker Hub
- override_values.yaml overrides the official Airflow Helm chart values to enable git-sync on the dags folder
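As a rough idea of what a DAG in the dags folder can look like, here is a minimal sketch using the KubernetesPodOperator. The DAG id, task id, image name, and command below are illustrative assumptions, not the exact values used in this repo.

```python
# Minimal DAG sketch using the KubernetesPodOperator.
# DAG id, task id, image, and command are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG(
    dag_id="water_potability_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    make_dataset = KubernetesPodOperator(
        task_id="make_dataset",
        name="make-dataset",
        namespace="default",
        # Image built from the Dockerfile and pushed to Docker Hub (placeholder name)
        image="yourdockerhubuser/odsc-airflow-k8s:latest",
        cmds=["python", "-m", "dataswati.data.make_dataset"],
        get_logs=True,
    )
```

Because git-sync is enabled through override_values.yaml, any DAG file pushed to the dags folder is picked up by the Airflow scheduler without rebuilding the image.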
1.1 Install a lightweight version of Kubernetes: microk8s
LINUX
sudo snap install microk8s --classic
sudo usermod -a -G microk8s $USER
sudo chown -f -R $USER ~/.kube
su - $USER
WINDOWS
Download the installer here and follow the instructions
MACOS
brew install ubuntu/microk8s/microk8s
microk8s install
microk8s status --wait-ready
Helm is the package manager for Kubernetes
microk8s enable helm3 dns storage
See the documentation for the official Airflow Helm chart
Use the -f flag to override values in the official Airflow chart with override_values.yaml. This enables git-sync
so that Airflow can retrieve the DAGs from the dags folder
microk8s helm3 repo add apache-airflow https://airflow.apache.org
microk8s helm3 install airflow2 apache-airflow/airflow -f AirflowKubernetes/override_values.yaml
To access the Web UI, apply the port forwarding as indicated:
microk8s kubectl port-forward svc/airflow2-webserver 8080:8080 --namespace default
You can now access the Airflow UI, where you will see the DAG
Go to Admin => Variables and create a variable called HOST_PATH
with the path to your dataswati folder (e.g. /home/Luis/dev/odsc/AirflowKubernetes/dataswati).
This path determines where on your computer the pod volumes will be mounted from.
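To give an idea of how HOST_PATH is typically consumed, here is a hedged sketch of reading the variable inside a DAG and mounting that folder into the pods launched by the KubernetesPodOperator. The volume name and mount path are assumptions.

```python
# Sketch: use the HOST_PATH Airflow Variable to mount the dataswati folder into the pods.
# Volume name and mount path are placeholders.
from airflow.models import Variable
from kubernetes.client import models as k8s

host_path = Variable.get("HOST_PATH")  # e.g. /home/Luis/dev/odsc/AirflowKubernetes/dataswati

data_volume = k8s.V1Volume(
    name="dataswati-volume",
    host_path=k8s.V1HostPathVolumeSource(path=host_path),
)
data_volume_mount = k8s.V1VolumeMount(
    name="dataswati-volume",
    mount_path="/usr/local/dataswati",  # assumed mount path inside the container
)

# Pass volumes=[data_volume] and volume_mounts=[data_volume_mount]
# to each KubernetesPodOperator task so the pods read and write under HOST_PATH.
```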
Typical machine learning pipeline: there are training data and unseen data, and both datasets go through the same transformations (imputation and feature engineering). The training data is then used to train different model types with random hyperparameter search and cross-validation; only the best model of each type is kept, and these models are used to make predictions on the unseen data (at the same time we can evaluate the predictions, because we actually have the targets for the unseen data).
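The sketch below illustrates this flow with scikit-learn. The file paths, target column, candidate models, and search spaces are assumptions for illustration, not the exact code from the dataswati package.

```python
# Illustrative end-to-end sketch of the pipeline described above (scikit-learn).
# Paths, column names, models, and search spaces are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")    # training data (placeholder path)
unseen = pd.read_csv("unseen.csv")  # unseen data (placeholder path)
X_train, y_train = train.drop(columns="Potability"), train["Potability"]
X_unseen, y_unseen = unseen.drop(columns="Potability"), unseen["Potability"]

candidates = {
    "random_forest": (RandomForestClassifier(), {"model__n_estimators": [100, 300, 500]}),
    "logistic_regression": (LogisticRegression(max_iter=1000), {"model__C": [0.1, 1.0, 10.0]}),
}

best_models = {}
for name, (model, param_distributions) in candidates.items():
    # Both datasets go through the same transformations: imputation,
    # then scaling as a stand-in for feature engineering.
    pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("model", model),
    ])
    search = RandomizedSearchCV(pipeline, param_distributions, n_iter=3, cv=5, random_state=0)
    search.fit(X_train, y_train)
    best_models[name] = search.best_estimator_  # keep only the best model of each type

# Predict on the unseen data and evaluate, since we actually have its targets.
for name, model in best_models.items():
    print(name, model.score(X_unseen, y_unseen))
```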
This part of the repository was generated using Cookiecutter Data Science, which scaffolds a data science project in a matter of minutes and provides useful functionality with a lot of helpers. The code is split into 3 submodules (a sketch of how they fit together follows the list):
- data, where we process the existing data,
- features, where we add new features to the data,
- models, where we create and train ML models and use them for prediction.
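Purely to show how these submodules might chain together, here is a hypothetical sketch; every module path and function name below is a placeholder, and the actual entry points in the dataswati package may differ.

```python
# Hypothetical chaining of the three submodules; all names below are placeholders.
from dataswati.data import make_dataset        # clean and impute the existing data
from dataswati.features import build_features  # add new features to the data
from dataswati.models import train_model, predict_model  # train models and predict

df = make_dataset("data/raw/water_potability.csv")
df = build_features(df)
model = train_model(df)
predictions = predict_model(model, df)
```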
Here we use a dataset from Kaggle: Water Potability. The advantage is having access to the notebooks on Kaggle, which provide a baseline prediction score and nice EDAs; you can also find a pandas profiling report and pycaret tests in the exploration notebook
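If you want to regenerate a profiling report yourself, a minimal sketch with pandas-profiling looks like this; the CSV path is an assumption and should point to wherever you downloaded the Kaggle file.

```python
# Sketch: generate a pandas-profiling report for the Water Potability dataset.
# The CSV path is a placeholder.
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("data/raw/water_potability.csv")
profile = ProfileReport(df, title="Water Potability - Profiling Report")
profile.to_file("water_potability_profile.html")
```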