[SPARK-39023] [K8s] Add Executor Pod inter-pod anti-affinity #36358
Conversation
dcoliversun
left a comment
Could you please add a unit test?
Thanks for your review, I'm going to add a unit test later.
martin-g
left a comment
Is this solution generic enough? I.e. would it solve this problem for all applications, or will some applications need customizations/modifications?
One can achieve the same today by using the PodTemplate config. The advantage is that the application can provide a config that is specific to its needs.
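For context, a rough sketch of how this could be expressed today through a pod template (passed via spark.kubernetes.executor.podTemplateFile). The my-app-group label is a hypothetical static label the user would also have to set on executors, e.g. via spark.kubernetes.executor.label.*, because the per-run applicationId is not known when the template is written:

apiVersion: v1
kind: Pod
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                my-app-group: my-spark-job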
val sparkPod =
  new AntiAffinityFeatureStep(executorConf).configurePod(SparkPod.initialPod())

assert(sparkPod.pod.getSpec.getAffinity.getPodAntiAffinity != null)
Please cache sparkPod.pod.getSpec.getAffinity.getPodAntiAffinity as a local variable. It will simplify the code a lot!
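For illustration, the suggested refactor could look roughly like this (the variable name is arbitrary):

val podAntiAffinity = sparkPod.pod.getSpec.getAffinity.getPodAntiAffinity
assert(podAntiAffinity != null)
// further assertions can reuse podAntiAffinity instead of repeating the full accessor chain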
Thanks for your review!
This solution is not tied to a specific application. It is meant to alleviate data skew when running on Kubernetes; in essence, the skew is caused by scheduling that does not balance the amount of shuffle data across nodes.
This feature needs the applicationId to keep executor pods of the same application apart, so I chose to add the anti-affinity while building the executor pod instead of using a template YAML. BTW, I'd like to know more about the PodTemplate config approach you mentioned.
https://spark.apache.org/docs/latest/running-on-kubernetes.html#pod-template
Thanks for your reply, I know this and have used it in other ways. Perhaps I didn't make it clear above: this PR needs to keep the executor anti-affinity at Application granularity, so we need to use the applicationId.
Can one of the admins verify this patch?
You may use
I see. Thank you for your introduction! In my experience, shuffle skew caused by Kubernetes scheduling occurs frequently when running Spark on Kubernetes. With the pod-template approach, users have to already know about this problem, rather than discovering it through the Spark configuration docs; a built-in configuration lets them alleviate the problem by turning on a flag instead of learning and writing pod templates. Relying only on pod templates may make it difficult for users to ease shuffle skew on Kubernetes.
Hi, @zwangsheng. Thank you for making a PR.
However, the Apache Spark community wants to avoid feature duplication like this.
The proposed feature is already delivered to many production environments via PodTemplate and has been used by customers without any problem. Adding another configuration only confuses users.
| .doc("If enable, register executor with anti affinity. This anti affinity will help " + | ||
| "Kubernetes assign executors of the same Application to different nodes " + | ||
| "as much as possible") | ||
| .version("3.2.1") |
In addition, Apache Spark follows the Semantic Versioning policy, which means new features and improvements should use the version of the master branch, currently 3.4.0.
| "as much as possible") | ||
| .version("3.2.1") | ||
| .booleanConf | ||
| .createWithDefault(false) |
The default value is correct because AntiAffinity could hurt the EKS AutoScale feature.
@dongjoon-hyun Thanks for your reply. I understand the above and accept it. Thanks everyone for reviewing this PR!!! I will close this PR and look forward to meeting in another PR.
Thank you so much, @zwangsheng.
What changes were proposed in this pull request?
Add Inter-Pod anti-affinity to Executor Pod.
Why are the changes needed?
Why do we need this?
When Spark on Kubernetes is running, executor pods can cluster on a few nodes under certain conditions (uneven resource allocation in Kubernetes, high load on some nodes and low load on others), causing shuffle data skew. This leads to Spark application failures or performance bottlenecks, such as shuffle fetch timeouts and connection refusals once the connection limit is reached.
How does this PR help?
Add the AntiAffinity feature to Executor Pod to ensure simple anti-affinity scheduling at Application granularity.
Executor Pod Yaml represents:
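A reconstructed sketch of the relevant affinity section (the exact weight and topology key used by this PR are assumptions; spark-app-selector is the label Spark sets to the application id on executor pods):

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              spark-app-selector: <applicationId>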
Why should we use this?
The functionality mentioned in this PR was tested on a cluster with three Kubernetes nodes (node-1, node-2, node-3).
Whether cluster resources are sufficient or insufficient, the effect is the same with the feature enabled or disabled: Kubernetes assigns pods to nodes with low load based on global resources. When only one node has a low load, Kubernetes will schedule all executor pods onto that node.
Here are the experiment results:
(The experiments show which node the pods are scheduled to.)
Experiment 1:
Three nodes hold a small, equal amount of load.
Enable Feature:
Disable Feature:
Experiment 2:
Node 1 has no idle resources; Node 2 and Node 3 hold a small, equal amount of load.
Enable Feature:
Disable Feature:
If some nodes are busy or the load is unbalanced, leaving the feature off means that Kubernetes keeps picking the node with the lowest load and placing pods there until it is no longer the lowest-load node. With the feature enabled, pods are still allocated to low-load nodes first, but the Application-granularity anti-affinity then steers additional pods toward other low-load nodes that do not already host executors of the same application.
Experiment 3:
Node 1 has no idle resources; Node 2 has a higher load than Node 3.
Enable Feature:
Disable Feature:
The experimental results above show that under normal circumstances, turning the feature on makes no difference compared with leaving it off; in extreme cases, enabling the feature alleviates the accumulation of pods on a single node and prevents performance bottlenecks to a certain extent.
Will this make any difference?
This PR adds a preferredDuringSchedulingIgnoredDuringExecution pod anti-affinity term to the executor pod. It layers an Application-granularity preferred anti-affinity on top of Kubernetes' global resource-oriented scheduling (and any other customized scheduling policies). We hope Kubernetes can spread the executor pods of the same application across nodes as much as possible while still adhering to the original scheduling rules.
Why choose this?
We are concerned with how pods are distributed relative to each other. According to the Kubernetes documentation on Assigning Pods to Nodes, Kubernetes provides requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution:
requiredDuringSchedulingIgnoredDuringExecution: The scheduler can't schedule the Pod unless the rule is met. This functions like nodeSelector, but with a more expressive syntax.
preferredDuringSchedulingIgnoredDuringExecution: The scheduler tries to find a node that meets the rule. If a matching node is not available, the scheduler still schedules the Pod.
It's not hard to see that preferredDuringSchedulingIgnoredDuringExecution is more in line with the issue raised in this PR: we want Kubernetes to spread executor pods as much as possible, but in the worst case, where only one node has resources left, we still need to be able to place executor pods on that node.
We need to apply the anti-affinity to executor pods at the granularity of the Application, so we add it after the applicationId has been generated and before the executor pod is built; a sketch of such a feature step follows below. Because of this applicationId requirement, we cannot pin the rule into pod-template.yaml.
From the perspective of the normal scheduling policy, there is no great negative impact; the experiments and reasoning above show that the new feature does not affect the existing scheduling behavior.
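For illustration only, a minimal sketch of what such a feature step could look like using the fabric8 builders. This is not the exact code of this PR; the weight, the kubernetes.io/hostname topology key, and the use of the spark-app-selector label are assumptions:

package org.apache.spark.deploy.k8s.features

import io.fabric8.kubernetes.api.model.{AffinityBuilder, PodBuilder}

import org.apache.spark.deploy.k8s.{KubernetesExecutorConf, SparkPod}

class AntiAffinityFeatureStep(conf: KubernetesExecutorConf) extends KubernetesFeatureConfigStep {

  override def configurePod(pod: SparkPod): SparkPod = {
    // Soft (preferred) anti-affinity against other pods of the same application,
    // keyed on the spark-app-selector label that Spark sets to the applicationId.
    val affinity = new AffinityBuilder()
      .withNewPodAntiAffinity()
        .addNewPreferredDuringSchedulingIgnoredDuringExecution()
          .withWeight(100)                              // assumed weight
          .withNewPodAffinityTerm()
            .withTopologyKey("kubernetes.io/hostname")  // assumed topology key: spread per node
            .withNewLabelSelector()
              .addToMatchLabels("spark-app-selector", conf.appId)
            .endLabelSelector()
          .endPodAffinityTerm()
        .endPreferredDuringSchedulingIgnoredDuringExecution()
      .endPodAntiAffinity()
      .build()

    // Attach the affinity to the pod spec built so far and return the updated SparkPod.
    val podWithAffinity = new PodBuilder(pod.pod)
      .editOrNewSpec()
        .withAffinity(affinity)
      .endSpec()
      .build()

    SparkPod(podWithAffinity, pod.container)
  }
}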
However, to some extent, spreading out executor pods will affect the locality of shuffle data; that is, the number of blocks read from the local host will be reduced.
Judging from the experiments and results so far, this trade-off is worthwhile.
If we do not spread out the executors, they will gather on a few nodes under certain circumstances, leading to shuffle data skew and significantly degrading task performance. More seriously, once the connection limit is reached, executor block fetches fail, stages fail, and even the whole application can fail.
Spreading out executors increases network connections and traffic at the cluster level. However, the increased consumption is scattered across the cluster rather than concentrated on a few nodes, so the overall stability is acceptable.
The amount of shuffle data on a node is related to the number of executor pods it hosts, so the shuffle skew problem can currently be alleviated by controlling the placement of executor pods.
At present, anti-affinity is only used as an indirect lever; later this may be driven directly by the amount of shuffle data.
Does this PR introduce any user-facing change?
Yes
How was this patch tested?
Local