Inter-PodAffinity is calculated on multiple pods #68725
Conversation
/retest
Removed "[WIP]" from title. @bsalamat @ahmad-diaa code is ready for review. |
 if !matchExists {
 	// This pod may be the first pod in a series that have affinity to themselves. In order
 	// to not leave such pods in pending state forever, we check that if no other pod
 	// in the cluster matches the namespace and selector of this pod and the pod matches
 	// its own terms, then we allow the pod to pass the affinity check.
-	if !(len(topologyPairsPotentialAffinityPods.topologyPairToPods) == 0 && targetPodMatchesAffinityOfPod(pod, pod)) {
+	if !targetPodMatchesAffinityOfPod(pod, pod) {
This is now incorrect. In the previous code we were checking that when there was no pod in the whole cluster that matched the affinity of the incoming pod AND the incoming pod didn't match its own affinity, we would return "NotMatch". This PR has eliminated the first part of the condition. So, if the incoming pod's affinity is not satisfied on this node, we will not return "NotMatch" when the pod matches its own affinity, regardless of the fact that there might be other pods in the cluster that match the incoming pod's affinity.
So is that to say: currently it needs to get an "aggregated" match result in advance, representing "whether some node matches the incoming pod's affinity"?
- If the "aggregated" match result is true, then give the final fit judgement depending on matchExists alone.
- If the "aggregated" match result is false, and matchExists is also false, then proceed to check whether the pod's podAffinity matches its own labels (see the sketch below).
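For illustration only, a minimal Go sketch of that two-case decision; aggregatedMatchExists, matchExists, and podMatchesItself are hypothetical inputs standing in for the real predicate's state, not identifiers from the scheduler code:

// podAffinityFits sketches the two cases above: when some node in the
// cluster can satisfy the incoming pod's affinity, only a match on this
// node makes it fit; when no node can, the affinity is waived only if the
// pod matches its own affinity terms.
func podAffinityFits(aggregatedMatchExists, matchExists, podMatchesItself bool) bool {
	if aggregatedMatchExists {
		// Some node satisfies the affinity; this node fits only if the
		// match is on this node.
		return matchExists
	}
	// No node satisfies the affinity (so matchExists is false everywhere);
	// fall back to the self-match check.
	return podMatchesItself
}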
@bsalamat I believe case https://gist.github.com/Huang-Wei/008dae22f0bba635ed4f1ad151f9768c is what you meant.
it needs to get an "aggregated" match result in advance, representing "whether some node matches the incoming pod's affinity".
This isn't that easy to implement based on the current code.
Firstly, we can't simply do a pre-compute because, for some of the nodes, earlier predicates would have failed before reaching the InterPodAffinity predicate. So an unconditional pre-compute is impractical and inefficient.
Secondly, as in the current logic, it's computed on demand. But it should be possible to achieve this: if !matchExists, hold on until we get a signal indicating that either 1) some other node passed the PodAffinity check, or 2) the nodes running the InterPodAffinity predicate (note: not all nodes) have all failed the PodAffinity check.
Secondly, as in the current logic, it's computed on demand. But it should be possible to achieve this: if !matchExists, hold on until we get a signal indicating that either 1) some other node passed the PodAffinity check, or 2) the nodes running the InterPodAffinity predicate (note: not all nodes) have all failed the PodAffinity check.
The current predicates framework assumes predicates can be run independently on each node, so it's pretty challenging to implement the above logic inside the InterPodAffinity predicate.
Another option is to do a post-handling step after all predicates finish. But this requires that (1) the InterPodAffinity predicate is the last predicate (which is the case right now), and (2) inside the InterPodAffinity predicate, the PodAffinity check is also handled as the last step (specifically, in satisfiesPodsAffinityAntiAffinity(), the logic handling affinity should be moved after the logic handling antiAffinity).
A quick implementation is here.
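As a rough illustration of that post-handling idea (a sketch under the two assumptions above; the types and names such as nodeAffinityResult and postProcessAffinity are invented for this example and are not the scheduler's API):

// nodeAffinityResult records, per node, what the per-node predicate run
// already determined: whether an existing pod on that node's topology
// matched the incoming pod's affinity, and whether all other checks
// (anti-affinity and earlier predicates) passed.
type nodeAffinityResult struct {
	node        string
	matchExists bool
	otherChecks bool
}

// postProcessAffinity decides fits only after all per-node runs finish,
// so it can use the cluster-wide signal "did any node match?" before
// applying the self-match waiver.
func postProcessAffinity(results []nodeAffinityResult, podMatchesItself bool) map[string]bool {
	anyNodeMatched := false
	for _, r := range results {
		if r.matchExists {
			anyNodeMatched = true
			break
		}
	}
	fits := make(map[string]bool, len(results))
	for _, r := range results {
		switch {
		case !r.otherChecks:
			fits[r.node] = false
		case r.matchExists:
			fits[r.node] = true
		default:
			// Waive the affinity only when no node matched at all and the
			// pod matches its own affinity terms.
			fits[r.node] = !anyNodeMatched && podMatchesItself
		}
	}
	return fits
}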
@bsalamat I believe case https://gist.github.com/Huang-Wei/008dae22f0bba635ed4f1ad151f9768c is what you meant.
Yes, except that the pod is schedulable on both nodes. It has affinity to itself. Since its affinity doesn't match any pod in the cluster other than itself, its affinity becomes a no-op and it can be scheduled on any node as long as it passes the other predicates.
The current predicates framework assumes predicates can be run independently on each node, so it's pretty challenging to implement the above logic inside the InterPodAffinity predicate.
Yes. I agree. I don't have any good solution in mind at the moment.
Another option is to do a post-handling step after all predicates finish. But this requires that (1) the InterPodAffinity predicate is the last predicate (which is the case right now), and (2) inside the InterPodAffinity predicate, the PodAffinity check is also handled as the last step (specifically, in satisfiesPodsAffinityAntiAffinity(), the logic handling affinity should be moved after the logic handling antiAffinity).
Yes, but that's very brittle. These assumptions may hold in the current implementation, but they could change in the future. We shouldn't corner ourselves by treating these assumptions as invariants.
Thanks @bsalamat. That's almost exactly my understanding as well. Let me think about it more and come up with a more elegant solution.
BTW, regarding
Yes, except that the pod is schedulable on both nodes.
Maybe you misread that case - nodeA is actually a fit :) So only nodeA is a fit.
Maybe you misread that case - nodeA is actually a fit :) So only nodeA is a fit.
Yes, I didn't notice that. You are right. NodeA is the only one.
@bsalamat @misterikkit do you have any further comments on this PR?
This PR has changed a lot since the last time I reviewed it. Here are my early comments; I still need to check the main part of the algorithm. My biggest concern so far is the introduction of a new structure in the cache, which can use a lot of memory in larger clusters. Please see my comment below.
@@ -1651,3 +1654,21 @@ func (c *VolumeBindingChecker) predicate(pod *v1.Pod, meta PredicateMetadata, no
	klog.V(5).Infof("All PVCs found matches for pod %v/%v, node %q", pod.Namespace, pod.Name, node.Name)
	return true, nil, nil
}

// BuildTopologyInfo builds a TopologyInfo based on a nodeInfoMap
func BuildTopologyInfo(nodeInfoMap map[string]*schedulernodeinfo.NodeInfo) schedulernodeinfo.TopologyInfo {
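For context, a plausible shape for such a helper, assuming TopologyInfo maps a (label key, label value) pair to a set of node names. To keep the sketch self-contained it takes plain label maps instead of the scheduler's NodeInfo type, so it is illustrative rather than the PR's actual code:

// TopologyPair is a node-label key/value pair; TopologyInfo maps each pair
// to the set of node names carrying that label (shapes assumed for this
// sketch).
type TopologyPair struct {
	Key   string
	Value string
}

type TopologyInfo map[TopologyPair]map[string]struct{}

// buildTopologyInfo walks every node's labels and records which nodes carry
// each (key, value) pair.
func buildTopologyInfo(nodeLabels map[string]map[string]string) TopologyInfo {
	info := TopologyInfo{}
	for nodeName, labels := range nodeLabels {
		for k, v := range labels {
			pair := TopologyPair{Key: k, Value: v}
			if info[pair] == nil {
				info[pair] = map[string]struct{}{}
			}
			info[pair][nodeName] = struct{}{}
		}
	}
	return info
}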
Looks like this is used only in tests. In that case, it should be moved to one of the test files.
As tests in multiple packages need it, it's now moved to the same package as TopologyInfo.
@@ -0,0 +1,111 @@
/*
The question is whether we should place topology_info.go and its tests in the "internal" directory or here. The "nodeinfo" directory is not in the internal directory, because other modules use it. We should put topology_info under nodeinfo if we intend to allow other modules (autoscaler, node, etc) to use it and we want to provide backward compatibility. Otherwise, it shouldn't be here.
I guess this file was initially written before the refactoring. It has been moved to the "internal" package.
	Value string
}

// TopologyInfo denotes a mapping from TopologyPair to a string set
The string set is a set of node names. Please state that in the comment. The current comment is not very helpful.
Done.
I reused the structure for both pod names (in meta.affinityQuery) and node names (in meta.topologyInfo). I will update the comments to mention both usages.
@@ -63,6 +63,8 @@ type schedulerCache struct {
	nodeTree *NodeTree
	// A map from image name to its imageState.
	imageStates map[string]*imageState
	// A map from node label to node names
	topologyInfo schedulernodeinfo.TopologyInfo
Do we really need to put this in the cache? Basically, we are building and maintaining this structure in the scheduler memory even if no pods in the cluster use inter-pod affinity. This increases scheduler's memory usage. Is there any strong reason for having it here?
The reason is that if we update the (global) topologyInfo only upon scheduling of an inter-affinity pod, it would incur an O(n) cost at that moment. In the current design, the cost is amortized across operations (updates to node labels).
Compared with that amortized time cost, the constant space cost (the memory footprint you pointed out) concerns me more. Let me see if it can be optimized.
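To illustrate the trade-off, a sketch of the amortized approach, reusing the TopologyPair/TopologyInfo shapes from the earlier sketch: the cached map is adjusted whenever a node's labels change, so no single scheduling operation pays the full O(n) rebuild. The method name and signature are assumptions, not the scheduler cache's real API:

// updateNodeLabels incrementally maintains the cached map on a node label
// change: the node is removed from pairs it no longer carries and added to
// the new ones, touching only this node's labels.
func (info TopologyInfo) updateNodeLabels(nodeName string, oldLabels, newLabels map[string]string) {
	for k, v := range oldLabels {
		if newLabels[k] == v {
			continue // pair unchanged, nothing to do
		}
		pair := TopologyPair{Key: k, Value: v}
		if nodes, ok := info[pair]; ok {
			delete(nodes, nodeName)
			if len(nodes) == 0 {
				delete(info, pair)
			}
		}
	}
	for k, v := range newLabels {
		if oldLabels[k] == v {
			continue // already recorded
		}
		pair := TopologyPair{Key: k, Value: v}
		if info[pair] == nil {
			info[pair] = map[string]struct{}{}
		}
		info[pair][nodeName] = struct{}{}
	}
}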
/hold
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: Huang-Wei. The full list of commands accepted by this bot can be found here; the pull request process is described here.
Needs approval from an approver in each of these files.
I have attempted several times to review this PR, but unfortunately it has become so large, and made so many changes that didn't need to be in a single PR, that it is very hard to follow and review.
I believe some of the refactoring and optimizations didn't need to be in the same PR. For example, introducing topology_info in the cache and updating it in the node event handlers didn't need to be part of this PR; it could have been a follow-up PR.
Generally, it is not a great idea to combine performance optimizations with a major functionality change in the same PR.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What this PR does / why we need it:
Inter-PodAffinity is now calculated on multiple pods instead of a single pod.
Which issue(s) this PR fixes:
Fixes #68701
Special notes for your reviewer:
Release note:
/sig scheduling
/kind feature