
Conversation

@zwangsheng
Contributor

What changes were proposed in this pull request?

Add Inter-Pod anti-affinity to Executor Pod.

Why are the changes needed?

Why do we need this?

When Spark runs on Kubernetes, Executor Pods can cluster on a few nodes under certain conditions (uneven resource allocation in Kubernetes, high load on some nodes and low load on others), causing Shuffle data skew. This leads to performance bottlenecks or application failures, such as Shuffle fetch timeouts and connection refusals once the connection limit is reached.

How does this PR help?

Add anti-affinity to Executor Pods to provide simple anti-affinity scheduling at Application granularity.

The resulting Executor Pod YAML looks like this:

podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: spark-app-selector
          operator: In
          values:
          - spark-test   # appId
      topologyKey: kubernetes.io/hostname
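
For reference, below is a rough sketch (not the PR's actual code) of how a feature step might build this anti-affinity with the fabric8 builders; the class name AntiAffinityFeatureStep comes from the test quoted later in this review, while the exact wiring here is illustrative.

import io.fabric8.kubernetes.api.model.{AffinityBuilder, PodBuilder}

import org.apache.spark.deploy.k8s.{KubernetesExecutorConf, SparkPod}
import org.apache.spark.deploy.k8s.features.KubernetesFeatureConfigStep

// Illustrative sketch: prefer spreading executors of one application across nodes.
class AntiAffinityFeatureStep(conf: KubernetesExecutorConf)
  extends KubernetesFeatureConfigStep {

  override def configurePod(pod: SparkPod): SparkPod = {
    // Soft rule (weight 100): avoid nodes already running pods with the same appId label.
    // matchLabels here is equivalent to the matchExpressions "In" form shown in the YAML above.
    val affinity = new AffinityBuilder()
      .withNewPodAntiAffinity()
        .addNewPreferredDuringSchedulingIgnoredDuringExecution()
          .withWeight(100)
          .withNewPodAffinityTerm()
            .withNewLabelSelector()
              .addToMatchLabels("spark-app-selector", conf.appId)
            .endLabelSelector()
            .withTopologyKey("kubernetes.io/hostname")
          .endPodAffinityTerm()
        .endPreferredDuringSchedulingIgnoredDuringExecution()
      .endPodAntiAffinity()
      .build()

    val podWithAffinity = new PodBuilder(pod.pod)
      .editOrNewSpec()
        .withAffinity(affinity)
      .endSpec()
      .build()

    SparkPod(podWithAffinity, pod.container)
  }
}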

Why should we use this?

The functionality in this PR was tested on a cluster with three Kubernetes nodes (node-1, node-2, node-3).

When cluster resources are uniformly sufficient or insufficient, the behavior is the same whether the feature is enabled or not: Kubernetes assigns Pods to nodes with low load based on global resources, and when only one node has low load, Kubernetes schedules all executor Pods to that node.

Here are the experiment results (each table shows which node each executor pod was scheduled to):

Experiment 1:
All three nodes carry a small, equal load.

Enable Feature:

round exec-1 exec-2 exec-3 exec-4 exec-5 exec-6 exec-7
1 node-1 node-2 node-3 node-2 node-1 node-3 node-1
2 node-1 node-2 node-3 node-2 node-1 node-3 node-1
3 node-1 node-2 node-3 node-2 node-1 node-3 node-1
4 node-1 node-2 node-3 node-2 node-1 node-3 node-1

Disable Feature:

round exec-1 exec-2 exec-3 exec-4 exec-5 exec-6 exec-7
1 node-1 node-2 node-3 node-2 node-1 node-3 node-1
2 node-1 node-2 node-3 node-2 node-1 node-3 node-1
3 node-1 node-2 node-3 node-2 node-1 node-3 node-1
4 node-1 node-2 node-3 node-2 node-1 node-3 node-1

Experiment 2:
Node 1 has no idle resources; Node 2 and Node 3 carry a small, equal load.

Enable Feature:

round exec-1 exec-2 exec-3 exec-4
1 node-2 node-3 node-2 node-3
2 node-2 node-3 node-2 node-3
3 node-2 node-3 node-2 node-3
4 node-2 node-3 node-2 node-3

Disable Feature:

round exec-1 exec-2 exec-3 exec-4
1 node-2 node-3 node-2 node-3
2 node-2 node-3 node-2 node-3
3 node-2 node-3 node-2 node-3
4 node-2 node-3 node-2 node-3

If some nodes are busy or the load is unbalanced, leaving the feature off means Kubernetes keeps picking the node with the lowest load and placing Pods there until it is no longer the least-loaded node. With the feature enabled, Pods are first allocated to low-load nodes, and then, following the Application-granularity anti-affinity, to other low-load nodes not already chosen.

Experiment 3:
Node 1 has no idle resources, and Node 2 has a higher load than Node 3.

Enable Feature:

round exec-1 exec-2 exec-3 exec-4
1 node-3 node-2 node-3 node-2
2 node-3 node-2 node-3 node-2
3 node-3 node-2 node-3 node-2

Disable Feature:

round exec-1 exec-2 exec-3 exec-4
1 node-3 node-3 node-3 node-3
2 node-3 node-3 node-3 node-3
3 node-3 node-3 node-3 node-3

According to the above experimental results, under normal circumstances enabling the feature makes no difference compared with leaving it off; in extreme cases, enabling it alleviates the accumulation of Pods on a single node and, to a certain extent, prevents performance bottlenecks.

Will this make any difference?

This PR adds a preferredDuringSchedulingIgnoredDuringExecution pod anti-affinity to the Executor Pods of an Application, on top of Kubernetes' global resource-oriented scheduling (and any other customized scheduling policies). We hope Kubernetes can spread the Executor Pods of an Application across nodes as much as possible while still adhering to its original scheduling rules.

Why choose this?

Why choose Inter-pod affinity and anti-affinity?

We are currently concerned with how Pods are gathered or spread relative to each other, which is exactly what inter-pod affinity and anti-affinity control.

Why choose preferredDuringSchedulingIgnoredDuringExecution?

According to the Kubernetes documentation on Assigning Pods to Nodes, Kubernetes provides requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution:

  • requiredDuringSchedulingIgnoredDuringExecution: The scheduler can't schedule the Pod unless the rule is met. This functions like nodeSelector, but with a more expressive syntax.

  • preferredDuringSchedulingIgnoredDuringExecution: The scheduler tries to find a node that meets the rule. If a matching node is not available, the scheduler still schedules the Pod.

It's not hard to see that preferredDuringSchedulingIgnoredDuringExecution is more in line with the goal of this PR. We want Kubernetes to spread Executor Pods out as much as possible, but in the worst case, when only one node has resources left, we still need to be able to place Executor Pods on that node.

Why would we want to influence Kubernetes scheduling from Spark code?

We need anti-affinity between the Executor Pods of one Application, so we add the anti-affinity after the applicationId has been generated and before the Executors are allocated.

Because the applicationId is only known at runtime, we cannot pin it into pod-template.yaml.

What are the negative effects?

From the perspective of the normal scheduling policy, there is no significant negative impact. Both the experiments and the reasoning above show that the new feature does not affect the existing scheduling behavior.

However, to some extent, spreading Executor Pods apart affects the locality of Shuffle data; that is, the number of blocks read from the local host is reduced.

Judging from the experiments and results so far, this trade-off is worthwhile.

If we do not spread executors out, they will gather on a few nodes under certain circumstances, leading to Shuffle data skew and significantly degrading task performance. More seriously, once the connection limit is reached, Executor block fetches fail, Stages fail, and even the whole Application can fail.

Spreading executors out increases network connections and traffic at the cluster level. However, the extra consumption is scattered across the cluster rather than concentrated on a few nodes, so overall stability is acceptable.

Why does the placement of Executor Pods affect Shuffle data skew?

The amount of Shuffle data on a node is related to the number of Executor Pods placed on it, so the Shuffle skew problem can currently be alleviated by controlling how Executor Pods are distributed.

At present, anti-affinity is only a first, indirect step; later the placement may be driven directly by the amount of shuffle data.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Local

Contributor

@dcoliversun left a comment


Could you please add a unit test?

@zwangsheng
Contributor Author

Could you please add a unit test?

Thanks for your review, I'm going to add a unit test later.

Member

@martin-g left a comment


Is this solution generic enough? I.e. would it solve this problem for all applications, or will some applications need customizations/modifications?

One can achieve the same now by using the PodTemplate config. The advantage is that the application can provide a config that is specific for its needs.

val sparkPod =
  new AntiAffinityFeatureStep(executorConf).configurePod(SparkPod.initialPod())

assert(sparkPod.pod.getSpec.getAffinity.getPodAntiAffinity != null)
Member


Please cache sparkPod.pod.getSpec.getAffinity.getPodAntiAffinity as a local variable. It will simplify the code a lot!
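
Since only part of the test appears in this excerpt, a hypothetical sketch of what the suggestion might look like (the extra assertions are illustrative, not taken from the PR):

val antiAffinity = sparkPod.pod.getSpec.getAffinity.getPodAntiAffinity
assert(antiAffinity != null)

// Reuse the cached value instead of repeating the long getter chain.
val term = antiAffinity.getPreferredDuringSchedulingIgnoredDuringExecution.get(0)
assert(term.getWeight == 100)
assert(term.getPodAffinityTerm.getTopologyKey == "kubernetes.io/hostname")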

@zwangsheng
Contributor Author

Thanks for your review!

Is this solution generic enough? I.e. would it solve this problem for all applications, or will some applications need customizations/modifications?

This solution is not aimed at any particular application. It is meant to alleviate data skew when running on Kubernetes. In essence, the skew is caused by scheduling that does not balance the amount of shuffle data.

One can achieve the same now by using the PodTemplate config. The advantage is that the application can provide a config that is specific for its needs.

This feature needs the applicationId to keep executor Pods apart from each other, so I chose to add the anti-affinity while building the executor Pod instead of using a template YAML.

BTW, I'd like to know more about the PodTemplate config approach you mentioned.

@martin-g
Member

BTW, I'd like to know more about the PodTemplate config approach you mentioned.

https://spark.apache.org/docs/latest/running-on-kubernetes.html#pod-template

@zwangsheng
Contributor Author

https://spark.apache.org/docs/latest/running-on-kubernetes.html#pod-template

Thanks for your reply. I know this feature and have used it in other ways. Perhaps I didn't make it clear above: this PR needs to keep Executor anti-affinity at Application granularity, so we need the Application Id, which is only generated after the Driver starts; a fixed Pod-Template approach is therefore not appropriate.
In general, thank you for your advice.

@AmplabJenkins

Can one of the admins verify this patch?

@martin-g
Member

You may use {{APPID}} as a placeholder in the template. See #35704

@zwangsheng
Contributor Author

You may use {{APPID}} as a placeholder in the template. See #35704

I see. Thank you for your introduction!

Currently, I consider that shuffle skew caused by Kubernetes scheduling occurs frequently when running Spark on Kubernetes. With the Pod-template approach, users first have to know about this problem and learn how to write Pod templates, whereas a configuration parameter lets them discover it in the Spark configuration docs and alleviate it by simply turning it on. Requiring Pod templates may therefore make it harder for users to mitigate shuffle skew on Kubernetes.

Member

@dongjoon-hyun left a comment


Hi, @zwangsheng. Thank you for making a PR.
However, the Apache Spark community wants to avoid feature duplication like this.
The proposed feature is already delivered to many production environments via PodTemplate and has been used by customers without any problem. Adding another configuration would only confuse users.

.doc("If enable, register executor with anti affinity. This anti affinity will help " +
"Kubernetes assign executors of the same Application to different nodes " +
"as much as possible")
.version("3.2.1")
Member


In addition, Apache Spark follows the Semantic Versioning policy, which means new features and improvements should carry the version of the master branch, currently 3.4.0.

"as much as possible")
.version("3.2.1")
.booleanConf
.createWithDefault(false)
Member


The default value is correct because anti-affinity could hurt the EKS AutoScale feature.
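
Putting the quoted fragments and the review feedback together, the full config entry might look roughly like this; the config key name is assumed for illustration and is not shown in this excerpt:

// Hypothetical key name; the PR's actual config name is not visible in this excerpt.
val KUBERNETES_EXECUTOR_ANTI_AFFINITY_ENABLED =
  ConfigBuilder("spark.kubernetes.executor.antiAffinity.enabled")
    .doc("If enabled, register executors with anti-affinity. This anti-affinity will help " +
      "Kubernetes assign executors of the same Application to different nodes " +
      "as much as possible.")
    .version("3.4.0")  // new features should target the master branch version
    .booleanConf
    .createWithDefault(false)  // off by default, since anti-affinity could hurt autoscaling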

@zwangsheng
Contributor Author

Hi, @zwangsheng. Thank you for making a PR.
However, the Apache Spark community wants to avoid feature duplication like this.
The proposed feature is already delivered to many production environments via PodTemplate and has been used by customers without any problem. Adding another configuration would only confuse users.

@dongjoon-hyun Thanks for your reply. I can understand the above and accept it.

Thanks all for reviewing this PR!!!

I will close this PR and look forward to meeting you in another PR.

@zwangsheng closed this May 20, 2022
@dongjoon-hyun
Member

Thank you so much, @zwangsheng .
