
Inter-PodAffinity is calculated on multiple pods #68725

Closed

Conversation

Huang-Wei
Member

What this PR does / why we need it:
Inter-PodAffinity is now calculated on multiple pods instead of a single pod.

Which issue(s) this PR fixes:
Fixes #68701

Special notes for your reviewer:

Release note:

Inter-PodAffinity is now calculated on multiple pods instead of a single pod.

/sig scheduling
/kind feature

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. kind/feature Categorizes issue or PR as related to a new feature. labels Sep 17, 2018
@Huang-Wei Huang-Wei changed the title Inter-PodAffinity is calculated on multiple pods [WIP] Inter-PodAffinity is calculated on multiple pods Sep 17, 2018
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 17, 2018
@Huang-Wei Huang-Wei force-pushed the podaffinity-aggregated-match branch from 9257a20 to a09dba2 Compare September 17, 2018 17:06
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 17, 2018
@Huang-Wei Huang-Wei force-pushed the podaffinity-aggregated-match branch from 14ec663 to 6983cfc Compare September 18, 2018 00:25
@bsalamat bsalamat self-assigned this Sep 21, 2018
@Huang-Wei Huang-Wei force-pushed the podaffinity-aggregated-match branch from 6983cfc to bb3af7d Compare September 27, 2018 00:46
@Huang-Wei Huang-Wei force-pushed the podaffinity-aggregated-match branch from bb3af7d to 2e88364 Compare September 27, 2018 19:30
@Huang-Wei Huang-Wei force-pushed the podaffinity-aggregated-match branch from 2e88364 to 05adaf6 Compare September 28, 2018 18:50
@Huang-Wei Huang-Wei changed the title [WIP] Inter-PodAffinity is calculated on multiple pods Inter-PodAffinity is calculated on multiple pods Sep 28, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 28, 2018
@Huang-Wei
Member Author

/retest

@Huang-Wei
Member Author

Removed "[WIP]" from title.

@bsalamat @ahmad-diaa code is ready for review.

if !matchExists {
// This pod may be the first pod in a series that have affinity to themselves. In order
// to not leave such pods in pending state forever, we check that if no other pod
// in the cluster matches the namespace and selector of this pod and the pod matches
// its own terms, then we allow the pod to pass the affinity check.
- if !(len(topologyPairsPotentialAffinityPods.topologyPairToPods) == 0 && targetPodMatchesAffinityOfPod(pod, pod)) {
+ if !targetPodMatchesAffinityOfPod(pod, pod) {
Member

This is now incorrect. In the previous code we were checking that when there was no pod in the whole cluster that matched the affinity of the incoming pod AND the incoming pod didn't match its own affinity, we would return "NotMatch". This PR has eliminated the first part of the condition. So, if the incoming pod's affinity is not satisfied on this node, we will not return "NotMatch" when the pod matches its own affinity, regardless of the fact that there might be other pods in the cluster that match the incoming pod's affinity.
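A minimal boolean sketch of the scenario described above, using placeholder names (noneInCluster, selfMatch) rather than the scheduler's actual variables:

```go
package main

import "fmt"

func main() {
	// Scenario: some pod elsewhere in the cluster matches the incoming pod's
	// affinity, nothing on this node does, and the pod also matches itself.
	noneInCluster := false // stands in for len(topologyPairsPotentialAffinityPods.topologyPairToPods) == 0
	selfMatch := true      // stands in for targetPodMatchesAffinityOfPod(pod, pod)

	oldAllow := noneInCluster && selfMatch // previous code: the node is rejected
	newAllow := selfMatch                  // this PR: the node is allowed, which is the objection

	fmt.Println(oldAllow, newAllow) // false true
}
```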

Member Author

So, is that to say the predicate needs an "aggregated" match result in advance, representing "whether some node in the cluster matches the incoming pod's affinity"?

  • If the "aggregated" match result is true, then the final fit judgement simply depends on matchExists.
  • If the "aggregated" match result is false, and matchExists is also false, then proceed to check whether the pod's affinity matches its own labels (see the sketch after this list).
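A hypothetical sketch of that decision table; podAffinityFits, aggregatedMatch, matchExists and selfMatch are illustrative names, not the predicate's actual code:

```go
package main

import "fmt"

// podAffinityFits mirrors the two bullets: if any node in the cluster matches
// the incoming pod's affinity, the verdict on this node is simply matchExists;
// otherwise fall back to checking the pod against its own affinity terms.
func podAffinityFits(aggregatedMatch, matchExists, selfMatch bool) bool {
	if aggregatedMatch {
		return matchExists
	}
	return matchExists || selfMatch
}

func main() {
	// No pod in the cluster matches; the pod matches itself: affinity is a no-op.
	fmt.Println(podAffinityFits(false, false, true)) // true
	// Another node matches the affinity but this one does not: not a fit.
	fmt.Println(podAffinityFits(true, false, true)) // false
}
```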

Member Author

> So, is that to say the predicate needs an "aggregated" match result in advance, representing "whether some node in the cluster matches the incoming pod's affinity"?

This isn't that easy to implement based on the current code.

First, we can't simply pre-compute it, because for some nodes earlier predicates would have already failed before reaching the InterPodAffinity predicate. So an unconditional pre-compute is impractical and inefficient.

Second, as in the current logic, it's computed on demand. But it should be possible, when !matchExists, to hold off until we get a signal indicating either 1) some other node passed the PodAffinity check, or 2) all of the nodes that ran the InterPodAffinity predicate (note: not all nodes) have failed the PodAffinity check.

Member Author

> Second, as in the current logic, it's computed on demand. But it should be possible, when !matchExists, to hold off until we get a signal indicating either 1) some other node passed the PodAffinity check, or 2) all of the nodes that ran the InterPodAffinity predicate (note: not all nodes) have failed the PodAffinity check.

The current predicates framework assumes predicates can be run independently on each node, so it's pretty challenging to implement the above logic inside the InterPodAffinity predicate.

Another option is to do a post-handling pass after all predicates finish. But this requires (1) that the InterPodAffinity predicate is the last predicate (which is the case right now) and (2) that inside the InterPodAffinity predicate, the PodAffinity check is also handled as the last step (specifically, in satisfiesPodsAffinityAntiAffinity(), the affinity handling logic should be moved after the antiAffinity handling logic).

A quick implementation is here.
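Not the implementation linked above; just a rough sketch of the post-handling idea under the stated assumptions (InterPodAffinity runs last, affinity evaluated after anti-affinity), with made-up types:

```go
package sketch

// nodeResult is a placeholder for what a post-pass would need per node.
type nodeResult struct {
	name          string
	passedOthers  bool // every check except the deferred PodAffinity match
	affinityMatch bool // existing pods in this node's topology satisfy the affinity terms
}

// postFilter runs once after all per-node predicates have finished. If at
// least one surviving node matches the incoming pod's affinity, only matching
// nodes fit; if none does and the pod matches its own affinity, the affinity
// requirement becomes a no-op.
func postFilter(results []nodeResult, selfMatch bool) []string {
	anyMatch := false
	for _, r := range results {
		if r.passedOthers && r.affinityMatch {
			anyMatch = true
			break
		}
	}
	var fits []string
	for _, r := range results {
		if !r.passedOthers {
			continue
		}
		if r.affinityMatch || (!anyMatch && selfMatch) {
			fits = append(fits, r.name)
		}
	}
	return fits
}
```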

Member

> @bsalamat I believe case https://gist.github.com/Huang-Wei/008dae22f0bba635ed4f1ad151f9768c is what you meant.

Yes, except that the pod is schedulable on both nodes. It has affinity to itself. Since its affinity doesn't match any pod in the cluster other than itself, its affinity becomes a no-op and it can be scheduled on any node as long as it passes the other predicates.

> The current predicates framework assumes predicates can be run independently on each node, so it's pretty challenging to implement the above logic inside the InterPodAffinity predicate.

Yes. I agree. I don't have any good solution in mind at the moment.

> Another option is to do a post-handling pass after all predicates finish. But this requires (1) that the InterPodAffinity predicate is the last predicate (which is the case right now) and (2) that inside the InterPodAffinity predicate, the PodAffinity check is also handled as the last step (specifically, in satisfiesPodsAffinityAntiAffinity(), the affinity handling logic should be moved after the antiAffinity handling logic).

Yes, but that's very brittle. These assumptions may be valid in the current implementation, but they could change in the future. We shouldn't corner ourselves by treating these assumptions as invariants.

Member Author

Thanks @bsalamat. That's pretty much my understanding as well. Let me think more and come up with a more elegant solution.

BTW: regarding

> Yes, except that the pod is schedulable on both nodes.

Maybe you mis-read that case - nodeA is actually a fit :) So only nodeA is a fit.

Member

> Maybe you mis-read that case - nodeA is actually a fit :) So only nodeA is a fit.

Yes, I didn't notice that. You are right. NodeA is the only one.

@Huang-Wei Huang-Wei force-pushed the podaffinity-aggregated-match branch from a5e1083 to d848163 Compare October 2, 2018 18:35
@k8s-ci-robot k8s-ci-robot removed the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 2, 2018
@Huang-Wei Huang-Wei force-pushed the podaffinity-aggregated-match branch from 4de1fd8 to 56d1898 Compare December 23, 2018 20:35
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 23, 2018
@Huang-Wei
Member Author

@bsalamat @misterikkit do you have any further comments on this PR?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 21, 2019
Member
@bsalamat bsalamat left a comment

This PR has changed a lot since the last time I reviewed it. Here are my early comments. I still need to check the main part of the algorithm. My biggest concern so far is the introduction of a new structure in the cache, which can use a lot of memory in larger clusters. Please see my comment below.

@@ -1651,3 +1654,21 @@ func (c *VolumeBindingChecker) predicate(pod *v1.Pod, meta PredicateMetadata, no
klog.V(5).Infof("All PVCs found matches for pod %v/%v, node %q", pod.Namespace, pod.Name, node.Name)
return true, nil, nil
}

// BuildTopologyInfo builds a TopologyInfo based on a nodeInfoMap
func BuildTopologyInfo(nodeInfoMap map[string]*schedulernodeinfo.NodeInfo) schedulernodeinfo.TopologyInfo {
Member

Looks like this is used only in tests. In that case, it should be moved to one of the test files.

Member Author

As tests in multiple packages need it, it's now moved to the same package as TopologyInfo.

@@ -0,0 +1,111 @@
/*
Member

The question is whether we should place topology_info.go and its tests in the "internal" directory or here. The "nodeinfo" directory is not in the internal directory, because other modules use it. We should put topology_info under nodeinfo if we intend to allow other modules (autoscaler, node, etc) to use it and we want to provide backward compatibility. Otherwise, it shouldn't be here.

Member Author

I guess this file was initially written before the refactoring. Moved to the "internal" package.

Value string
}

// TopologyInfo denotes a mapping from TopologyPair to a string set
Member

The string set is a set of node names. Please state that in the comment. The current comment is not very helpful.

Member Author
@Huang-Wei Huang-Wei Mar 22, 2019

Done.

I reused the structure for both pod names (in meta.affinityQuery) and node names (in meta.topologyInfo).

Will update the comments to mention both usages.
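For reference, a sketch of the structures being discussed, with names assumed from the diff above; this illustrates the shape, not the PR's exact definitions:

```go
package sketch

// TopologyPair is a node label key/value pair, e.g. {"zone", "us-east-1a"}.
type TopologyPair struct {
	Key   string
	Value string
}

// TopologyInfo maps a TopologyPair to a set of names. In meta.topologyInfo the
// set holds node names; the same shape is reused for pod names in meta.affinityQuery.
type TopologyInfo map[TopologyPair]map[string]struct{}

// AddNode records every label of a node as a topology pair that points at it.
func (t TopologyInfo) AddNode(nodeName string, labels map[string]string) {
	for k, v := range labels {
		pair := TopologyPair{Key: k, Value: v}
		if t[pair] == nil {
			t[pair] = map[string]struct{}{}
		}
		t[pair][nodeName] = struct{}{}
	}
}
```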

@@ -63,6 +63,8 @@ type schedulerCache struct {
nodeTree *NodeTree
// A map from image name to its imageState.
imageStates map[string]*imageState
// A map from node label to node names
toplogyInfo schedulernodeinfo.TopologyInfo
Member

Do we really need to put this in the cache? Basically, we are building and maintaining this structure in the scheduler's memory even if no pods in the cluster use inter-pod affinity. This increases the scheduler's memory usage. Is there any strong reason for having it here?

Member Author

The reason is that if we update the (global) topologyInfo only upon scheduling of an inter-pod-affinity pod, it incurs an instant O(n) cost at that point. In the current design, the cost is amortized as a small overhead on each operation (updates to node labels).

Compared to the amortized time cost, the persistent space cost (the memory footprint you pointed out) concerns me more. Let me see if it can be optimized.
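An illustrative sketch of the amortized-update idea, reusing the hypothetical TopologyPair/TopologyInfo shapes from the earlier sketch: patch the cached index on each node label change instead of rebuilding it with an O(#nodes) scan when an affinity pod arrives.

```go
package sketch

// UpdateNodeLabels keeps the cached index in sync with a node's label change.
func (t TopologyInfo) UpdateNodeLabels(nodeName string, oldLabels, newLabels map[string]string) {
	// Remove pairs that no longer apply to this node.
	for k, v := range oldLabels {
		if newLabels[k] == v {
			continue
		}
		pair := TopologyPair{Key: k, Value: v}
		delete(t[pair], nodeName)
		if len(t[pair]) == 0 {
			delete(t, pair)
		}
	}
	// Add pairs introduced by the new labels.
	for k, v := range newLabels {
		if oldLabels[k] == v {
			continue
		}
		pair := TopologyPair{Key: k, Value: v}
		if t[pair] == nil {
			t[pair] = map[string]struct{}{}
		}
		t[pair][nodeName] = struct{}{}
	}
}
```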

@Huang-Wei Huang-Wei force-pushed the podaffinity-aggregated-match branch from 56d1898 to dab2e74 Compare March 22, 2019 05:41
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 22, 2019
@Huang-Wei
Member Author

/hold
to avoid accidental merge.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 22, 2019
@Huang-Wei Huang-Wei force-pushed the podaffinity-aggregated-match branch from dab2e74 to 4dd1775 Compare March 26, 2019 20:45
@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Mar 27, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Huang-Wei
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approvers:

If they are not already assigned, you can assign the PR to them by writing /assign in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 27, 2019
Member
@bsalamat bsalamat left a comment

I have attempted several times to review this PR, but unfortunately it has become so large, and includes so many changes that didn't need to be in a single PR, that it is very hard to follow and review.
I believe some of the refactoring and optimizations didn't need to be in the same PR. For example, introducing topology_info into the cache and updating it in the node event handlers didn't need to be part of this PR; it could have been a follow-up PR.
Generally, it is not a great idea to combine performance optimizations with a major functionality change in the same PR.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 25, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 25, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.
kind/feature Categorizes issue or PR as related to a new feature.
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
release-note Denotes a PR that will be considered when it comes time to generate release notes.
sig/apps Categorizes an issue or PR as relevant to SIG Apps.
sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.
size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Support inter-Pod affinity to one or more Pods
6 participants