Spark allows users to specify custom Kubernetes schedulers.
- Create additional Kubernetes custom resources for driver/executor scheduling.
- Set scheduler hints according to configuration or existing Pod info dynamically.

#### Using Volcano as a Customized Scheduler for Spark on Kubernetes

**This feature is currently experimental. In future versions, there may be behavioral changes around configuration and feature step improvements.**

##### Prerequisites
* Spark on Kubernetes with [Volcano](https://volcano.sh/en) as a custom scheduler is supported since Spark v3.3.0 and Volcano v1.5.1. Below is an example of installing Volcano 1.5.1:

```bash
# x86_64
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.5.1/installer/volcano-development.yaml

# arm64
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.5.1/installer/volcano-development-arm64.yaml
```
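
To confirm the installation, you can check that the Volcano components are running; `volcano-system` is the namespace used by the installer manifests above:

```bash
# The admission, controller, and scheduler pods should all reach Running
kubectl get pods -n volcano-system
```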

##### Build
To create a Spark distribution with Volcano support, like those distributed by the Spark [Downloads page](https://spark.apache.org/downloads.html), see also ["Building Spark"](https://spark.apache.org/docs/latest/building-spark.html):

```bash
./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phive -Phive-thriftserver -Pkubernetes -Pvolcano
```

##### Usage
Spark on Kubernetes allows using Volcano as a custom scheduler. Users can use Volcano to
support more advanced resource scheduling: queue scheduling, resource reservation, priority scheduling, and more.

To use Volcano as a custom scheduler, the user needs to specify the following configuration options:

```bash
# Specify volcano scheduler and PodGroup template
--conf spark.kubernetes.scheduler.name=volcano
--conf spark.kubernetes.scheduler.volcano.podGroupTemplateFile=/path/to/podgroup-template.yaml
# Specify driver/executor VolcanoFeatureStep
--conf spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep
--conf spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep
```
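
For context, here is a sketch of how these options might be combined in a full `spark-submit` invocation, following the submission examples elsewhere in this page; the API server address, container image, and jar path are placeholders, not values from this section:

```bash
# A sketch only: <k8s-apiserver-host>, <spark-image>, and the jar path are
# placeholders, and the PodGroup template path must exist on the submitting host
$ ./bin/spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=2 \
    --conf spark.kubernetes.container.image=<spark-image> \
    --conf spark.kubernetes.scheduler.name=volcano \
    --conf spark.kubernetes.scheduler.volcano.podGroupTemplateFile=/path/to/podgroup-template.yaml \
    --conf spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
    --conf spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
    local:///path/to/examples.jar
```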

##### Volcano Feature Step
The Volcano feature step helps users create a Volcano PodGroup and sets a driver/executor pod annotation linking the pods to this [PodGroup](https://volcano.sh/en/docs/podgroup/).

Note that currently only driver/job level PodGroup is supported in Volcano Feature Step.
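
For illustration, once a job is submitted with the feature step enabled, the resulting PodGroup and the linking annotation can be inspected; the annotation key `scheduling.k8s.io/group-name` is assumed here per Volcano convention, and the pod name is a placeholder:

```bash
# List the PodGroups created for submitted Spark jobs
kubectl get podgroups

# Inspect the annotation linking the driver pod to its PodGroup
# (pod name is a placeholder; annotation key assumed per Volcano convention)
kubectl get pod <driver-pod-name> \
  -o jsonpath='{.metadata.annotations.scheduling\.k8s\.io/group-name}'
```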

##### Volcano PodGroup Template
Volcano defines the PodGroup spec using a [CRD yaml](https://volcano.sh/en/docs/podgroup/#example).

Similar to [Pod template](#pod-template), Spark users can use Volcano PodGroup Template to define the PodGroup spec configurations.
To do so, specify the Spark property `spark.kubernetes.scheduler.volcano.podGroupTemplateFile` to point to a file accessible to the `spark-submit` process.
Below is an example of PodGroup template:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
spec:
  # Set minMember to 1 so that the driver pod alone satisfies the PodGroup
  # (only driver/job level PodGroup is currently supported)
minMember: 1
  # Specify minResources to support resource reservation (both the driver pod
  # and executor pod resources should be considered).
  # This helps ensure that the available resources meet the minimum requirements of
  # the Spark job, avoiding the situation where the driver is scheduled but then
  # cannot obtain enough executors to progress.
minResources:
cpu: "2"
memory: "3Gi"
  # Specify the priority; it determines the job's priority in the queue during scheduling
priorityClassName: system-node-critical
  # Specify the queue; it indicates the resource queue to which the job should be submitted
queue: default
```

### Stage Level Scheduling Overview

Stage level scheduling is supported on Kubernetes when dynamic allocation is enabled. This also requires <code>spark.dynamicAllocation.shuffleTracking.enabled</code> to be enabled since Kubernetes doesn't support an external shuffle service at this time. The order in which containers for different profiles are requested from Kubernetes is not guaranteed. Note that since dynamic allocation on Kubernetes requires the shuffle tracking feature, executors from previous stages that used a different ResourceProfile may not idle timeout because they hold shuffle data. This could result in using more cluster resources, and in the worst case, if there are no remaining resources on the Kubernetes cluster, Spark could potentially hang. You may consider looking at the config <code>spark.dynamicAllocation.shuffleTracking.timeout</code> to set a timeout, but that could result in data having to be recomputed if the shuffle data is really needed.
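
For reference, a minimal sketch of the settings this paragraph describes (the timeout value is illustrative, not a recommendation):

```bash
# Dynamic allocation plus shuffle tracking, required because Kubernetes has
# no external shuffle service at this time
--conf spark.dynamicAllocation.enabled=true
--conf spark.dynamicAllocation.shuffleTracking.enabled=true
# Optional: cap how long executors holding shuffle data are retained; shuffle
# data may need to be recomputed if it is still required after the timeout
--conf spark.dynamicAllocation.shuffleTracking.timeout=3600s
```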