Installation

Python job with Apache Beam and Spark Cluster on Kubernetes

Why

There is currently no step-by-step guide on how to configure Apache Beam with Spark cluster on Kubernetes for Python job.

Installation

This guide only work for on real K8s cluster. If you are running a MacOS/Window, you will need to boot up Virtualbox with Linux iso, install Ubuntu and Kubernetes.
Make sure your K8s support ReadWriteMany storage. You can install NFS or Longhorn on K8s. Below are the optional steps to install Longhorn on the Kubernetes nodes

    sudo apt-get install open-iscsi nfs-common
    kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml

Create a K8s namespace
```
kubectl create namespace spark-beam
```
Create storage, if you are using Virtualbox and self installed Kubernetes and Longhorn. Uncommented the storageClassName on the file before execute it
```
 kubectl apply -f pvc.yaml --namespace spark-beam
```
If you are using a real K8s cluster and your team administrator already installed a network storage system
```
kubectl apply -f pvc.yaml --namespace spark-beam
```
Check storage is successfully created before preceed STATUS=Bound
```
kubectl get pvc,pv --namespace spark-beam
```

Deploy a Spark cluster with one master and one worker

kubectl apply -f deployment.yaml --namespace spark-beam

Wait until all pods are running, STATUS=Running

kubectl get pods,services --namespace spark-beam

Run Example

Install Apache beam python library
```
pip install apache-beam==2.33.0
```
Get the IP and input it into example.py for --job_endpoint and --artifact_endpoint. No need to change to the port number
```
kubectl get nodes -o wide
```
Run the example
```
python example.py
```

Build Custom Spark Image (Optional)

git clone [email protected]:bitnami/bitnami-docker-spark.git

cd bitnami-docker-spark
git checkout 2.4.6-debian-10-r13
cd 2/debian-10

Edit the Dockerfile by including below step to install docker-ce. The editted file should look like the one in this repo

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y libltdl7
RUN apt-get update; \
apt-get -y install apt-transport-https ca-certificates curl gnupg software-properties-common; \
curl -fsSL https://download.docker.com/linux/debian/gpg | apt-key add -; \
apt-key fingerprint 0EBFCD88; \
add-apt-repository \
    "deb [arch=amd64] https://download.docker.com/linux/debian \
    $(lsb_release -cs) \
    stable" ;\
apt-get update; \
apt-get -y install docker-ce

finally build and push the image to Docker Hub

docker build . -t docker.io/<account>/spark-custom-2.4.6
docker push  docker.io/<account>/spark-custom-2.4.6

Footnote

Need to use Linux as the host for Kubernetes as we are running Docker in K8s
Only work for Spark 2 and not version 3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Python job with Apache Beam and Spark Cluster on Kubernetes

Why

Installation

Run Example

Build Custom Spark Image (Optional)

Footnote

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
Dockerfile		Dockerfile
deployment.yaml		deployment.yaml
example.py		example.py
pvc.yaml		pvc.yaml
readme.md		readme.md

cometta/python-apache-beam-spark

Folders and files

Latest commit

History

Repository files navigation

Python job with Apache Beam and Spark Cluster on Kubernetes

Why

Installation

Run Example

Build Custom Spark Image (Optional)

Footnote

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages