This repository contains a fork of mumoshu/kube-airflow that provides a production-ready Helm chart for running Airflow with the Celery executor on a Kubernetes cluster.
- Based on work from mumoshu/kube-airflow
- Leverages the Docker Airflow image puckel/docker-airflow
Make sure Helm is installed and initialized. You may need to set the TILLER_NAMESPACE environment variable.
Deploy to Kubernetes using:
make helm-upgrade HELM_RELEASE_NAME=af1 NAMESPACE=yournamespace HELM_VALUES=/path/to/your/own/values.yaml
The deployment uses the Helm trick described here to force a redeployment when the configmap template file changes.
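The idea behind this trick can be sketched in Python. This is only an illustration of the general technique (hashing the rendered configmap into a pod-template annotation, so that a content change also changes the pod spec and triggers a rollout). The annotation name checksum/config and the helper names are assumptions; the chart's actual template is not reproduced here.

```python
import hashlib


def config_checksum(rendered_configmap: str) -> str:
    """Return the SHA-256 hex digest of a rendered configmap."""
    return hashlib.sha256(rendered_configmap.encode("utf-8")).hexdigest()


def pod_annotations(rendered_configmap: str) -> dict:
    """Build the pod-template annotations carrying the configmap checksum.

    "checksum/config" is an illustrative annotation name, not necessarily
    the one used by this chart.
    """
    return {"checksum/config": config_checksum(rendered_configmap)}


before = pod_annotations("AIRFLOW__CORE__LOAD_EXAMPLES: 'False'\n")
after = pod_annotations("AIRFLOW__CORE__LOAD_EXAMPLES: 'True'\n")
unchanged = pod_annotations("AIRFLOW__CORE__LOAD_EXAMPLES: 'False'\n")

# A changed configmap yields a different annotation, hence a new pod spec.
print(before != after)
# An identical configmap yields the same annotation, hence no rollout.
print(before == unchanged)
```

Because the annotation is part of the pod template, Kubernetes sees a changed template whenever the hash changes and replaces the pods.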
The chart provides an ingress configuration that lets you customize the installation by adapting the config.yaml to your setup. Please read the comments in the values.yaml file for more detail on how to configure your reverse proxy.
This Helm chart automatically prefixes all names with the release name to avoid collisions.
This chart exposes 2 endpoints:
- Airflow Web UI
- Flower, a debug UI for Celery
Both can be placed either at the root of a domain or at a sub path, for example:
http://mycompany.com/airflow/
http://mycompany.com/airflow/flower
NOTE: Mounting the Airflow UI under a subpath requires an Airflow version > 1.9.x. As of March 2018, this is not available in the official package, so you will have to use an image where Airflow has been updated to its current HEAD. You can use the following one:
stibbons31/docker-airflow-dev:2.0dev
Please also note that the Airflow UI and Flower do not behave the same way:
- The Airflow Web UI behaves transparently: to configure it, you only need to specify the ingress.web.path value.
- Flower cannot handle this scheme directly and requires a URL rewrite mechanism in front of it. In short, it generates the right URLs in the returned HTML but cannot respond to requests on those URLs. This is common in software that was not designed to run anywhere other than a root URL or a localhost port. To use it, see the values.yaml file for details on how to configure your ingress controller to rewrite the URL (or "strip" the prefix path).
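What "stripping" the prefix path means can be sketched as follows. This is a simplified Python model of what a prefix-stripping reverse proxy does before forwarding a request to Flower, not the configuration of any particular ingress controller; the function name strip_prefix is hypothetical.

```python
def strip_prefix(path: str, prefix: str) -> str:
    """Remove a leading path prefix, as a prefix-stripping proxy would."""
    prefix = prefix.rstrip("/")
    if path == prefix or path.startswith(prefix + "/"):
        stripped = path[len(prefix):]
        # An empty remainder means the prefix itself was requested.
        return stripped or "/"
    # Paths outside the prefix are forwarded unchanged.
    return path


# The browser requests the prefixed URL found in Flower's HTML...
print(strip_prefix("/airflow/flower/tasks", "/airflow/flower"))
# ...and Flower receives the root-relative path it knows how to serve.
print(strip_prefix("/airflow/flower/", "/airflow/flower"))
print(strip_prefix("/airflow/admin/", "/airflow/flower"))
```

The first call yields /tasks and the second yields /, which are the paths Flower can answer; the third path is not under the Flower prefix and is left untouched.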
The airflow.cfg configuration can be changed by defining environment variables of the following form: AIRFLOW__<section>__<key>.
See the Airflow documentation for more information.
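As a small illustration of this naming scheme, the Python helper below builds the variable name from a section and a key. The helper name airflow_env_name is hypothetical, and the two settings are only examples; note the double underscores and the upper-casing.

```python
def airflow_env_name(section: str, key: str) -> str:
    """Build the environment variable name overriding [section] key."""
    return "AIRFLOW__{}__{}".format(section.upper(), key.upper())


# [core] load_examples
print(airflow_env_name("core", "load_examples"))
# [webserver] base_url
print(airflow_env_name("webserver", "base_url"))
```

The first call gives AIRFLOW__CORE__LOAD_EXAMPLES and the second gives AIRFLOW__WEBSERVER__BASE_URL.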
This Helm chart lets you add these additional settings with the value key airflow.config.
Beware that changing these values won't trigger a redeployment automatically (see the section above, "Helm Deployment"). In this case, you may need to force the redeployment (--recreate-pods) or use the Configmap Controller.
Celery workers use a StatefulSet instead of a Deployment. This gives each worker a stable DNS name through a Kubernetes headless Service and allows the webserver to request logs from each worker individually. It requires exposing a port (8793) and ensuring that each worker pod's DNS name is resolvable from the web server pod, which is what the StatefulSet is for.
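To make the stable names concrete, here is a minimal Python sketch of the DNS names Kubernetes assigns to StatefulSet pods behind a headless Service. The StatefulSet and Service name airflow-worker, the namespace airflow-dev, and the cluster domain cluster.local are assumptions for illustration; the actual names depend on your release name and cluster.

```python
def worker_hostnames(statefulset, service, namespace, replicas,
                     cluster_domain="cluster.local"):
    """List the per-pod DNS names of a StatefulSet behind a headless Service."""
    return [
        "{}-{}.{}.{}.svc.{}".format(statefulset, i, service, namespace,
                                    cluster_domain)
        for i in range(replicas)
    ]


def log_server_base_urls(hostnames, port=8793):
    """Base URLs the webserver would contact to fetch worker logs."""
    return ["http://{}:{}".format(host, port) for host in hostnames]


# Assumed names, for illustration only.
hosts = worker_hostnames("airflow-worker", "airflow-worker", "airflow-dev", 2)
for url in log_server_base_urls(hosts):
    print(url)
```

With a Deployment, pod names carry a random suffix and have no per-pod DNS record, so the webserver could not address a given worker this way.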
To use Airflow you need to add your custom DAG files. There are 3 options to do this:
- Use a git-sync sidecar
- Mount a Persistent Volume (PV)
- Embed the DAGs into the Docker container
Git-sync pulls a git repository into a local directory. In this scenario, you store your DAG files in a git repository, and they are automatically synchronized into Airflow.
You can store your DAG files on an external volume, and mount this volume into the relevant Pods (scheduler, web, worker). In this scenario, your CI/CD pipeline should update the DAG files in the PV. Since all Pods should have the same collection of DAG files, it is recommended to create just one PV that is shared. This ensures that the Pods are always in sync about the DagBag.
To share a PV with multiple Pods, the PV needs to have accessMode 'ReadOnlyMany' or 'ReadWriteMany'. If you are on AWS, you can use Elastic File System (EFS). If you are on Azure, you can use Azure File Storage (AFS).
If you want more control over the way you deploy your DAGs, you can use embedded DAGs, where the DAGs are baked into the Docker image deployed as scheduler and workers.
Be aware that this requires heavier tooling than git-sync, especially if you use CI/CD:
- your CI/CD should be able to build a new Docker image each time your DAGs are updated.
- your CI/CD should be able to control the deployment of this new image in your Kubernetes cluster
Example procedure:
- Fork this project
- Place your DAGs inside the dags folder of this project, and update /requirements.txt to install new dependencies if needed (see below)
- Add a build script connected to your CI that builds the new Docker image
- Deploy on your Kubernetes cluster
If you want to add specific Python dependencies to use in your DAGs, you need to mount a /requirements.txt file at the root of the image.
See the docker-airflow readme for more information.
This project uses a Makefile to perform all major operations. It is mostly here as a reference to see which commands need to be run.
You can start a test on minikube using the following commands:
make minikube-start
make dashboard
make helm-install-traefik
make helm-init
make test
make update-etc-host
make minikube-service-list
You can browse to the airflow webserver using:
make minikube-browse-web
The Airflow webserver is not mounted at the root of the URL. You need to append /airflow to the URL in the opened window:
http://192.168.99.100:31706/airflow/
Flower is also configured in a subpath of the URL: /airflow/flower. But it behaves badly if a reverse proxy is not properly configured. You can see a full description in the test/minikube-values.yaml file.
In this example, the expected behavior is:
- Flower appears at the root, for example: http://192.168.99.100:32677/
- Links point to the subpath, for instance: http://192.168.99.100:32677/airflow/flower/tasks instead of: http://192.168.99.100:32677/tasks
This example is configured to use Traefik as the ingress controller that performs the reverse proxy operations, which is especially tricky for Flower.
For example, if we have this list of available services:
$ make minikube-service-list
minikube service list
|-------------|-------------------------|--------------------------------|
| NAMESPACE | NAME | URL |
|-------------|-------------------------|--------------------------------|
| airflow-dev | airflow-flower | http://192.168.99.100:32088 |
| airflow-dev | airflow-postgresql | No node port |
| airflow-dev | airflow-redis | No node port |
| airflow-dev | airflow-web | http://192.168.99.100:30189 |
| airflow-dev | airflow-worker | No node port |
| default | kubernetes | No node port |
| kube-system | default-http-backend | http://192.168.99.100:30001 |
| kube-system | kube-dns | No node port |
| kube-system | kubernetes-dashboard | http://192.168.99.100:30000 |
| kube-system | tiller-deploy | No node port |
| kube-system | traefik-ingress-service | http://192.168.99.100:31333 |
| | | http://192.168.99.100:30616 |
| kube-system | traefik-web-ui | No node port |
|-------------|-------------------------|--------------------------------|
The line that interests us is the port of the first URL exposed by traefik-ingress-service. It is the main ingress. It will not be port 80 because of the way minikube works.
The second port is the Traefik dashboard.
Provided your /etc/hosts is properly set (e.g. by make update-etc-host):
$ cat /etc/hosts
192.168.99.100 minikube traeffik-ui.minikube
You can then manually go to the following URLs:
- Airflow Web server: http://minikube:31333/airflow/admin/
- Flower: http://minikube:31333/airflow/flower/
and see how both behave nicely!
Update the value of celery.num_workers, then run:
make helm-upgrade
Fork, improve and PR. ;-)