Inter-PodAffinity is calculated on multiple pods #68725
Conversation
/retest
Removed "[WIP]" from title. @bsalamat @ahmad-diaa code is ready for review. |
 if !matchExists {
 	// This pod may be the first pod in a series that have affinity to themselves. In order
 	// to not leave such pods in pending state forever, we check that if no other pod
 	// in the cluster matches the namespace and selector of this pod and the pod matches
 	// its own terms, then we allow the pod to pass the affinity check.
-	if !(len(topologyPairsPotentialAffinityPods.topologyPairToPods) == 0 && targetPodMatchesAffinityOfPod(pod, pod)) {
+	if !targetPodMatchesAffinityOfPod(pod, pod) {
This is now incorrect. In the previous code we were checking that when there was no pod in the whole cluster that matched the affinity of the incoming pod AND the incoming pod didn't match its own affinity, we would return "NotMatch". This PR has eliminated the first part of the condition. So, if the incoming pod's affinity is not satisfied on this node, we will not return "NotMatch" when the pod matches its own affinity, regardless of the fact that there might be other pods in the cluster that match the incoming pod's affinity.
So is that to say: currently it needs to get an "aggregated" match result in advance, representing "whether some node matches the incoming pod's affinity"?
- If the "aggregated" match result is true, then give the final fit judgement depending on matchExists alone.
- If the "aggregated" match result is false, and matchExists is also false, then proceed to check whether the pod's podAffinity matches its own labels (see the sketch below).
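For illustration only, a minimal Go sketch of that two-case decision; aggregatedMatchExists, matchExists, and podMatchesItself are hypothetical inputs standing in for the real predicate's state, not identifiers from the scheduler code:

// podAffinityFits sketches the two cases above: when some node in the
// cluster can satisfy the incoming pod's affinity, only a match on this
// node makes it fit; when no node can, the affinity is waived only if the
// pod matches its own affinity terms.
func podAffinityFits(aggregatedMatchExists, matchExists, podMatchesItself bool) bool {
	if aggregatedMatchExists {
		// Some node satisfies the affinity; this node fits only if the
		// match is on this node.
		return matchExists
	}
	// No node satisfies the affinity (so matchExists is false everywhere);
	// fall back to the self-match check.
	return podMatchesItself
}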
@bsalamat I believe case https://gist.github.com/Huang-Wei/008dae22f0bba635ed4f1ad151f9768c is what you meant.
it needs to get an "aggregated" match result in advance, representing "whether some node matches the incoming pod's affinity".
This isn't that easy to implement based on the current code.
Firstly, we can't simply do a pre-compute because, for some of the nodes, earlier predicates would have failed before reaching the InterPodAffinity predicate. So an unconditional pre-compute is impractical and inefficient.
Secondly, as in the current logic, it's computed on demand. But it should be possible to achieve this: if !matchExists, hold on until we get a signal indicating that either 1) some other node passed the PodAffinity check, or 2) the nodes running the InterPodAffinity predicate (note: not all nodes) have all failed the PodAffinity check.
Secondly, as in the current logic, it's computed on demand. But it should be possible to achieve this: if !matchExists, hold on until we get a signal indicating that either 1) some other node passed the PodAffinity check, or 2) the nodes running the InterPodAffinity predicate (note: not all nodes) have all failed the PodAffinity check.
The current predicates framework assumes predicates can be run independently on each node, so it's pretty challenging to implement the above logic inside the InterPodAffinity predicate.
Another option is to do a post-handling step after all predicates finish. But this requires that (1) the InterPodAffinity predicate is the last predicate (which is the case right now), and (2) inside the InterPodAffinity predicate, the PodAffinity check is also handled as the last step (specifically, in satisfiesPodsAffinityAntiAffinity(), the logic handling affinity should be moved after the logic handling antiAffinity).
A quick implementation is here.
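As a rough illustration of that post-handling idea (a sketch under the two assumptions above; the types and names such as nodeAffinityResult and postProcessAffinity are invented for this example and are not the scheduler's API):

// nodeAffinityResult records, per node, what the per-node predicate run
// already determined: whether an existing pod on that node's topology
// matched the incoming pod's affinity, and whether all other checks
// (anti-affinity and earlier predicates) passed.
type nodeAffinityResult struct {
	node        string
	matchExists bool
	otherChecks bool
}

// postProcessAffinity decides fits only after all per-node runs finish,
// so it can use the cluster-wide signal "did any node match?" before
// applying the self-match waiver.
func postProcessAffinity(results []nodeAffinityResult, podMatchesItself bool) map[string]bool {
	anyNodeMatched := false
	for _, r := range results {
		if r.matchExists {
			anyNodeMatched = true
			break
		}
	}
	fits := make(map[string]bool, len(results))
	for _, r := range results {
		switch {
		case !r.otherChecks:
			fits[r.node] = false
		case r.matchExists:
			fits[r.node] = true
		default:
			// Waive the affinity only when no node matched at all and the
			// pod matches its own affinity terms.
			fits[r.node] = !anyNodeMatched && podMatchesItself
		}
	}
	return fits
}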
@bsalamat I believe case https://gist.github.com/Huang-Wei/008dae22f0bba635ed4f1ad151f9768c is what you meant.
Yes, except that the pod is schedulable on both nodes. It has affinity to itself. Since its affinity doesn't match any pod in the cluster other than itself, its affinity becomes a no-op and it can be scheduled on any node as long as it passes the other predicates.
The current predicates framework assumes predicates can be run independently on each node, so it's pretty challenging to implement the above logic inside the InterPodAffinity predicate.
Yes. I agree. I don't have any good solution in mind at the moment.
Another option is to do a post-handling step after all predicates finish. But this requires that (1) the InterPodAffinity predicate is the last predicate (which is the case right now), and (2) inside the InterPodAffinity predicate, the PodAffinity check is also handled as the last step (specifically, in satisfiesPodsAffinityAntiAffinity(), the logic handling affinity should be moved after the logic handling antiAffinity).
Yes, but that's very brittle. These assumptions may hold in the current implementation, but they could change in the future. We shouldn't corner ourselves by treating these assumptions as invariants.
Thanks @bsalamat. That's almost exactly my understanding as well. Let me think about it more and come up with a more elegant solution.
BTW, regarding
Yes, except that the pod is schedulable on both nodes.
Maybe you misread that case - nodeA is actually a fit :) So only nodeA is a fit.
Maybe you misread that case - nodeA is actually a fit :) So only nodeA is a fit.
Yes, I didn't notice that. You are right. NodeA is the only one.
@bsalamat @misterikkit do you have any further comments on this PR?
This PR has changed a lot since the last time I reviewed it. Here are my early comments; I still need to check the main part of the algorithm. My biggest concern so far is the introduction of a new structure in the cache, which can use a lot of memory in larger clusters. Please see my comment below.
@@ -1651,3 +1654,21 @@ func (c *VolumeBindingChecker) predicate(pod *v1.Pod, meta PredicateMetadata, no
	klog.V(5).Infof("All PVCs found matches for pod %v/%v, node %q", pod.Namespace, pod.Name, node.Name)
	return true, nil, nil
}

// BuildTopologyInfo builds a TopologyInfo based on a nodeInfoMap
func BuildTopologyInfo(nodeInfoMap map[string]*schedulernodeinfo.NodeInfo) schedulernodeinfo.TopologyInfo {
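For context, a plausible shape for such a helper, assuming TopologyInfo maps a (label key, label value) pair to a set of node names. To keep the sketch self-contained it takes plain label maps instead of the scheduler's NodeInfo type, so it is illustrative rather than the PR's actual code:

// TopologyPair is a node-label key/value pair; TopologyInfo maps each pair
// to the set of node names carrying that label (shapes assumed for this
// sketch).
type TopologyPair struct {
	Key   string
	Value string
}

type TopologyInfo map[TopologyPair]map[string]struct{}

// buildTopologyInfo walks every node's labels and records which nodes carry
// each (key, value) pair.
func buildTopologyInfo(nodeLabels map[string]map[string]string) TopologyInfo {
	info := TopologyInfo{}
	for nodeName, labels := range nodeLabels {
		for k, v := range labels {
			pair := TopologyPair{Key: k, Value: v}
			if info[pair] == nil {
				info[pair] = map[string]struct{}{}
			}
			info[pair][nodeName] = struct{}{}
		}
	}
	return info
}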
Looks like this is used only in tests. In that case, it should be moved to one of the test files.
As tests in multiple packages need it, it's now moved to the same package as TopologyInfo.
@@ -0,0 +1,111 @@
/*
The question is whether we should place topology_info.go and its tests in the "internal" directory or here. The "nodeinfo" directory is not in the internal directory, because other modules use it. We should put topology_info under nodeinfo if we intend to allow other modules (autoscaler, node, etc) to use it and we want to provide backward compatibility. Otherwise, it shouldn't be here.
I guess this file was initially written before the refactoring. It has been moved to the "internal" package.
	Value string
}

// TopologyInfo denotes a mapping from TopologyPair to a string set
The string set is a set of node names. Please state that in the comment. The current comment is not very helpful.
Done.
I reused the structure for both pod names (in meta.affinityQuery) and node names (in meta.topologyInfo). I will update the comments to mention both usages.
@@ -63,6 +63,8 @@ type schedulerCache struct {
	nodeTree *NodeTree
	// A map from image name to its imageState.
	imageStates map[string]*imageState
	// A map from node label to node names
	topologyInfo schedulernodeinfo.TopologyInfo
Do we really need to put this in the cache? Basically, we are building and maintaining this structure in the scheduler memory even if no pods in the cluster use inter-pod affinity. This increases scheduler's memory usage. Is there any strong reason for having it here?
The reason is that if we update the (global) topologyInfo only upon scheduling of an inter-affinity pod, it would incur an O(n) cost at that moment. In the current design, the cost is amortized across operations (updates to node labels).
Compared with that amortized time cost, the constant space cost (the memory footprint you pointed out) concerns me more. Let me see if it can be optimized.
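To illustrate the trade-off, a sketch of the amortized approach, reusing the TopologyPair/TopologyInfo shapes from the earlier sketch: the cached map is adjusted whenever a node's labels change, so no single scheduling operation pays the full O(n) rebuild. The method name and signature are assumptions, not the scheduler cache's real API:

// updateNodeLabels incrementally maintains the cached map on a node label
// change: the node is removed from pairs it no longer carries and added to
// the new ones, touching only this node's labels.
func (info TopologyInfo) updateNodeLabels(nodeName string, oldLabels, newLabels map[string]string) {
	for k, v := range oldLabels {
		if newLabels[k] == v {
			continue // pair unchanged, nothing to do
		}
		pair := TopologyPair{Key: k, Value: v}
		if nodes, ok := info[pair]; ok {
			delete(nodes, nodeName)
			if len(nodes) == 0 {
				delete(info, pair)
			}
		}
	}
	for k, v := range newLabels {
		if oldLabels[k] == v {
			continue // already recorded
		}
		pair := TopologyPair{Key: k, Value: v}
		if info[pair] == nil {
			info[pair] = map[string]struct{}{}
		}
		info[pair][nodeName] = struct{}{}
	}
}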
/hold
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: Huang-Wei. The full list of commands accepted by this bot can be found here; the pull request process is described here.
Needs approval from an approver in each of these files.
I have attempted several times to review this PR, but unfortunately it has become so large, and made so many changes that didn't need to be in a single PR, that it is very hard to follow and review.
I believe some of the refactoring and optimizations didn't need to be in the same PR. For example, introducing topology_info in the cache and updating it in the node event handlers didn't need to be part of this PR; it could have been a follow-up PR.
Generally, it is not a great idea to combine performance optimizations with a major functionality change in the same PR.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What this PR does / why we need it:
Inter-PodAffinity is now calculated on multiple pods instead of a single pod.
Which issue(s) this PR fixes:
Fixes #68701
Special notes for your reviewer:
Release note:
/sig scheduling
/kind feature