From bd64321a1857a2905ef79eda30bebb990bcf59d1 Mon Sep 17 00:00:00 2001 From: Kensei Nakada Date: Sat, 4 Mar 2023 14:11:52 +0900 Subject: [PATCH 01/13] add general section --- .../2023-04-11-topology-spread-features.md | 43 +++++++++++++++++++ 1 file changed, 43 insertions(+) create mode 100644 content/en/blog/_posts/2023-04-11-topology-spread-features.md diff --git a/content/en/blog/_posts/2023-04-11-topology-spread-features.md b/content/en/blog/_posts/2023-04-11-topology-spread-features.md new file mode 100644 index 0000000000000..8d38440562cd8 --- /dev/null +++ b/content/en/blog/_posts/2023-04-11-topology-spread-features.md @@ -0,0 +1,43 @@ +--- +layout: blog +title: "TBD" // TODO: have a cool title. +date: 2023-04-11 +slug: topology-spread-new-features +evergreen: true +--- + +**Authors:** [Alex Wang](https://github.com/denkensk)(), [Kante Yin](https://github.com/kerthcet)(), [Kensei Nakada](https://github.com/sanposhiho)(Mercari) + +In Kubernetes v1.19, [Pod Topology Spread Constraints](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/) went to GA. +It is the feature to control how Pods are spread to each failure-domain (regions, zones, nodes etc). + +As time passes, we've got further feedbacks from users, +and we're actively working on improving the Topology Spread via three KEPs from v1.25. +All of these features have reached beta in Kubernetes v1.27 and been enabled by default. + +This blog post is going to introduce each feature and the usecase/issue behind them. + +## KEP-3022: min domains in Pod Topology Spread + +TODO(sanposhiho): write it + +## KEP-3094: Take taints/tolerations into consideration when calculating PodTopologySpread skew + +TODO(kerthcet): write it + +## KEP-3243: Respect PodTopologySpread after rolling upgrades + +TODO(denkensk): write it + +## Getting involved + +These features are managed by the [SIG/Scheduling](https://github.com/kubernetes/community/tree/master/sig-scheduling). + +Please join us and share your feedback. We look forward to hearing from you! + +## How can I learn more? 
+ +- [Pod Topology Spread Constraints | Kubernetes](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#container-resource-metrics) +- [KEP-3022: min domains in Pod Topology Spread](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/3022-min-domains-in-pod-topology-spread) +- [KEP-3094: Take taints/tolerations into consideration when calculating PodTopologySpread skew](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/3094-pod-topology-spread-considering-taints) +- [KEP-3243: Respect PodTopologySpread after rolling upgrades](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/3243-respect-pod-topology-spread-after-rolling-upgrades) \ No newline at end of file From 528e1138e99fd24b0350a0ca3cbe5c2bec5fd998 Mon Sep 17 00:00:00 2001 From: Kensei Nakada Date: Sat, 1 Apr 2023 15:34:10 +0900 Subject: [PATCH 02/13] add min domains --- .../2023-04-11-topology-spread-features.md | 30 +++++++++++++++++-- 1 file changed, 28 insertions(+), 2 deletions(-) diff --git a/content/en/blog/_posts/2023-04-11-topology-spread-features.md b/content/en/blog/_posts/2023-04-11-topology-spread-features.md index 8d38440562cd8..290a5ac2b73a9 100644 --- a/content/en/blog/_posts/2023-04-11-topology-spread-features.md +++ b/content/en/blog/_posts/2023-04-11-topology-spread-features.md @@ -6,7 +6,7 @@ slug: topology-spread-new-features evergreen: true --- -**Authors:** [Alex Wang](https://github.com/denkensk)(), [Kante Yin](https://github.com/kerthcet)(), [Kensei Nakada](https://github.com/sanposhiho)(Mercari) +**Authors:** [Alex Wang](https://github.com/denkensk)(Shopee), [Kante Yin](https://github.com/kerthcet)(DaoCloud), [Kensei Nakada](https://github.com/sanposhiho)(Mercari) In Kubernetes v1.19, [Pod Topology Spread Constraints](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/) went to GA. It is the feature to control how Pods are spread to each failure-domain (regions, zones, nodes etc). @@ -19,7 +19,33 @@ This blog post is going to introduce each feature and the usecase/issue behind t ## KEP-3022: min domains in Pod Topology Spread -TODO(sanposhiho): write it +Pod Topology Spread has the `maxSkew` parameter to define the degree to which Pods may be unevenly distributed. + +But, there wasn't a way to control the number of domains over which we should spread. +Some users want to force spreading Pods over a minimum number of domains, and if there aren't enough already present, make the cluster-autoscaler provision them. + +Then, we introduced the `minDomains` parameter in the Pod Topology Spread. +Via `minDomains` parameter, you can define the minimum number of domains. + +For example, there are 3 Nodes with the enough capacity, +and newly created replicaset has the following `topologySpreadConstraints` in template. + +```yaml +topologySpreadConstraints: +- maxSkew: 1 + minDomains: 5 # requires 5 Nodes at least. + whenUnsatisfiable: DoNotSchedule # minDomains is valid only when DoNotSchedule is used. + topologyKey: kubernetes.io/hostname + labelSelector: + matchLabels: + foo: bar +``` + +This case, 3 Pods will be scheduled to those 3 Nodes, +but other 2 Pods from this replicaset will be unschedulable until more Nodes join the cluster. + +The cluster autoscaler provisions new Nodes based on these unschedulable Pods, +and as a result, the replicas are finally spread over 5 Nodes. 
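
To see the constraint end to end, it can be embedded in a full workload. Below is a minimal, hypothetical Deployment sketch; the name, labels, and container image are placeholders and not part of the original example:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: min-domains-demo               # hypothetical name
spec:
  replicas: 5
  selector:
    matchLabels:
      foo: bar
  template:
    metadata:
      labels:
        foo: bar
    spec:
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9   # placeholder image
      topologySpreadConstraints:
      - maxSkew: 1
        minDomains: 5                      # demand at least 5 distinct hostnames
        whenUnsatisfiable: DoNotSchedule   # minDomains only takes effect with DoNotSchedule
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            foo: bar
```

With only 3 Nodes available, 2 of the 5 replicas stay Pending, which is exactly the signal the cluster autoscaler uses to provision additional Nodes.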
## KEP-3094: Take taints/tolerations into consideration when calculating PodTopologySpread skew From ba1d6bd99907f4cb0c09c1a44d9e4fb1045126ab Mon Sep 17 00:00:00 2001 From: Kante Yin Date: Mon, 3 Apr 2023 00:48:27 +0800 Subject: [PATCH 03/13] Add section about nodeInclusionPolicy Signed-off-by: Kante Yin --- .../2023-04-11-topology-spread-features.md | 50 +++++++++++++++++-- 1 file changed, 45 insertions(+), 5 deletions(-) diff --git a/content/en/blog/_posts/2023-04-11-topology-spread-features.md b/content/en/blog/_posts/2023-04-11-topology-spread-features.md index 290a5ac2b73a9..990e3ff784fa8 100644 --- a/content/en/blog/_posts/2023-04-11-topology-spread-features.md +++ b/content/en/blog/_posts/2023-04-11-topology-spread-features.md @@ -8,7 +8,7 @@ evergreen: true **Authors:** [Alex Wang](https://github.com/denkensk)(Shopee), [Kante Yin](https://github.com/kerthcet)(DaoCloud), [Kensei Nakada](https://github.com/sanposhiho)(Mercari) -In Kubernetes v1.19, [Pod Topology Spread Constraints](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/) went to GA. +In Kubernetes v1.19, [Pod Topology Spread Constraints](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/) went to GA. It is the feature to control how Pods are spread to each failure-domain (regions, zones, nodes etc). As time passes, we've got further feedbacks from users, @@ -47,17 +47,57 @@ but other 2 Pods from this replicaset will be unschedulable until more Nodes joi The cluster autoscaler provisions new Nodes based on these unschedulable Pods, and as a result, the replicas are finally spread over 5 Nodes. -## KEP-3094: Take taints/tolerations into consideration when calculating PodTopologySpread skew +## Take taints/tolerations into consideration when calculating PodTopologySpread skew -TODO(kerthcet): write it +Before this, when we deploy a pod with `podTopologySpread` configured, we'll take all +affinity nodes(satisfied with pod nodeAffinity and nodeSelector) into consideration +in filtering and scoring, but a node with pod untolerated taint may also be a candidate +because we didn't take care of node taints, which will lead to the pod pending. + +To avoid this and make a more fine-gained decision in scheduling, we introduced two new fields in +`TopologySpreadConstraint` to define node inclusion policies including nodeAffinity and nodeTaint. + +It mostly looks like: + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: example-pod +spec: + # Configure a topology spread constraint + topologySpreadConstraints: + - maxSkew: + # ... + nodeAffinityPolicy: [Honor|Ignore] + nodeTaintsPolicy: [Honor|Ignore] + # other Pod fields go here +``` + +**nodeAffinityPolicy** indicates how we'll treat Pod's nodeAffinity/nodeSelector in pod topology spreading. +If `Honor`, we'll filter out nodes not matching nodeAffinity/nodeSelector in calculation. +If `Ignore`, these nodes will be included instead. + +For backwards-compatibility, nodeAffinityPolicy is default to `Honor`. + +**nodeTaintsPolicy** indicates how we'll treat node taints in pod topology spreading. +If `Honor`, only tainted nodes for which the incoming pod has a toleration, will be included in calculation. +If `Ignore`, we'll not consider the node taints at all in calculation, so a node with pod untolerated taint +will also be included. + +For backwards-compatibility, nodeTaintsPolicy is default to the `Ignore`. + +The feature was introduced in v1.25 as alpha level. 
By default, it was disabled, so if you want to use this feature in v1.25, +you have to enable the feature gate `NodeInclusionPolicyInPodTopologySpread` actively. In the following v1.26, we graduated +this feature to beta and it was enabled by default since. ## KEP-3243: Respect PodTopologySpread after rolling upgrades TODO(denkensk): write it -## Getting involved +## Getting involved -These features are managed by the [SIG/Scheduling](https://github.com/kubernetes/community/tree/master/sig-scheduling). +These features are managed by the [SIG/Scheduling](https://github.com/kubernetes/community/tree/master/sig-scheduling). Please join us and share your feedback. We look forward to hearing from you! From c35cd20175cc3a86d3622794db869a3eaa294d3a Mon Sep 17 00:00:00 2001 From: Alex Wang Date: Mon, 3 Apr 2023 15:46:20 +0800 Subject: [PATCH 04/13] blog: add section about matchLabelKeys Signed-off-by: Alex Wang --- .../2023-04-11-topology-spread-features.md | 33 ++++++++++++++++++- 1 file changed, 32 insertions(+), 1 deletion(-) diff --git a/content/en/blog/_posts/2023-04-11-topology-spread-features.md b/content/en/blog/_posts/2023-04-11-topology-spread-features.md index 990e3ff784fa8..9be4953732f5b 100644 --- a/content/en/blog/_posts/2023-04-11-topology-spread-features.md +++ b/content/en/blog/_posts/2023-04-11-topology-spread-features.md @@ -93,7 +93,38 @@ this feature to beta and it was enabled by default since. ## KEP-3243: Respect PodTopologySpread after rolling upgrades -TODO(denkensk): write it +Pod Topology Spread uses the fields `topologyKey` or `labelSelector` to identify the group of pods over which +spreading will be calculated. But it applies to all pods in a Deployment irrespective of their owning +ReplicaSet. As a result, when a new revision is rolled out, spreading will apply across pods from both the +old and new ReplicaSets, and so by the time the new ReplicaSet is completely rolled out and the old one is +rolled back, the actual spreading we are left with may not match expectations because the deleted pods from +the older ReplicaSet will cause skewed distribution for the remaining pods. + +In order to solve this problem and to make more accurate decisions in scheduling, we added a new named +`matchLabelKeys` to `topologySpreadConstraints`. `matchLabelKeys` is a list of pod label keys to select +the pods over which spreading will be calculated. The keys are used to lookup values from the pod labels, +those key-value labels are ANDed with `labelSelector` to select the group of existing pods over +which spreading will be calculated for the incoming pod. + +With `matchLabelKeys`, you don't need to update the `pod.spec` between different revisions. +The controller/operator just needs to set different values to the same label key for different revisions. +The scheduler will assume the values automatically based on `matchLabelKeys`. +For example, if you are configuring a Deployment, you can use the label keyed with +[pod-template-hash](https://kubernetes.io//docs/concepts/workloads/controllers/deployment/#pod-template-hash-label), +which is added automatically by the Deployment controller, to distinguish between different +revisions in a single Deployment. 
+ +```yaml +topologySpreadConstraints: + - maxSkew: 1 + topologyKey: kubernetes.io/hostname + whenUnsatisfiable: DoNotSchedule + labelSelector: + matchLabels: + app: foo + matchLabelKeys: + - pod-template-hash +``` ## Getting involved From dcfe5ae35f97dc1719e7df3bd83ad9dac46cd8fa Mon Sep 17 00:00:00 2001 From: Kensei Nakada Date: Mon, 3 Apr 2023 16:50:49 +0900 Subject: [PATCH 05/13] updat title and slug --- content/en/blog/_posts/2023-04-11-topology-spread-features.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/en/blog/_posts/2023-04-11-topology-spread-features.md b/content/en/blog/_posts/2023-04-11-topology-spread-features.md index 9be4953732f5b..a159117e310d0 100644 --- a/content/en/blog/_posts/2023-04-11-topology-spread-features.md +++ b/content/en/blog/_posts/2023-04-11-topology-spread-features.md @@ -1,8 +1,8 @@ --- layout: blog -title: "TBD" // TODO: have a cool title. +title: "Kubernetes 1.27: More fine-grained pod topology spread policies reached beta" date: 2023-04-11 -slug: topology-spread-new-features +slug: fine-grained-pod-topology-spread-features-beta evergreen: true --- From bbe2382abfe1f96ef1ddca60ad7fd90bc684d348 Mon Sep 17 00:00:00 2001 From: Kensei Nakada Date: Mon, 3 Apr 2023 16:52:53 +0900 Subject: [PATCH 06/13] change the section header --- content/en/blog/_posts/2023-04-11-topology-spread-features.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/en/blog/_posts/2023-04-11-topology-spread-features.md b/content/en/blog/_posts/2023-04-11-topology-spread-features.md index a159117e310d0..bd908eece73d5 100644 --- a/content/en/blog/_posts/2023-04-11-topology-spread-features.md +++ b/content/en/blog/_posts/2023-04-11-topology-spread-features.md @@ -47,7 +47,7 @@ but other 2 Pods from this replicaset will be unschedulable until more Nodes joi The cluster autoscaler provisions new Nodes based on these unschedulable Pods, and as a result, the replicas are finally spread over 5 Nodes. -## Take taints/tolerations into consideration when calculating PodTopologySpread skew +## KEP-3094: Take taints/tolerations into consideration when calculating PodTopologySpread skew Before this, when we deploy a pod with `podTopologySpread` configured, we'll take all affinity nodes(satisfied with pod nodeAffinity and nodeSelector) into consideration From 4e1b4f7f43a70cdb86cbcc439292cd0d7866c286 Mon Sep 17 00:00:00 2001 From: Kensei Nakada Date: Wed, 5 Apr 2023 21:56:27 +0900 Subject: [PATCH 07/13] fix based on the suggestion --- .../_posts/2023-04-11-topology-spread-features.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/content/en/blog/_posts/2023-04-11-topology-spread-features.md b/content/en/blog/_posts/2023-04-11-topology-spread-features.md index bd908eece73d5..422e8dd286951 100644 --- a/content/en/blog/_posts/2023-04-11-topology-spread-features.md +++ b/content/en/blog/_posts/2023-04-11-topology-spread-features.md @@ -11,11 +11,11 @@ evergreen: true In Kubernetes v1.19, [Pod Topology Spread Constraints](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/) went to GA. It is the feature to control how Pods are spread to each failure-domain (regions, zones, nodes etc). -As time passes, we've got further feedbacks from users, -and we're actively working on improving the Topology Spread via three KEPs from v1.25. -All of these features have reached beta in Kubernetes v1.27 and been enabled by default. 
+As time passed, we received feedback from users, +and, as a result, we're actively working on improving the Topology Spread feature via three KEPs. +All of these features have reached beta in Kubernetes v1.27 and are enabled by default. -This blog post is going to introduce each feature and the usecase/issue behind them. +This blog post introduces each feature and the use case behind each of them. ## KEP-3022: min domains in Pod Topology Spread @@ -27,8 +27,8 @@ Some users want to force spreading Pods over a minimum number of domains, and if Then, we introduced the `minDomains` parameter in the Pod Topology Spread. Via `minDomains` parameter, you can define the minimum number of domains. -For example, there are 3 Nodes with the enough capacity, -and newly created replicaset has the following `topologySpreadConstraints` in template. +For example, assume there are 3 Nodes with the enough capacity, +and a newly created replicaset has the following `topologySpreadConstraints` in template. ```yaml topologySpreadConstraints: @@ -41,7 +41,7 @@ topologySpreadConstraints: foo: bar ``` -This case, 3 Pods will be scheduled to those 3 Nodes, +In this case, 3 Pods will be scheduled to those 3 Nodes, but other 2 Pods from this replicaset will be unschedulable until more Nodes join the cluster. The cluster autoscaler provisions new Nodes based on these unschedulable Pods, From 81482cfcef87e1c71a431ed140dd7006e3a63ea3 Mon Sep 17 00:00:00 2001 From: Alex Wang Date: Wed, 5 Apr 2023 22:59:17 +0800 Subject: [PATCH 08/13] blog: update content about matchLabelKeys Signed-off-by: Alex Wang --- .../2023-04-11-topology-spread-features.md | 27 +++++++++++-------- 1 file changed, 16 insertions(+), 11 deletions(-) diff --git a/content/en/blog/_posts/2023-04-11-topology-spread-features.md b/content/en/blog/_posts/2023-04-11-topology-spread-features.md index 422e8dd286951..328017e2ffc2e 100644 --- a/content/en/blog/_posts/2023-04-11-topology-spread-features.md +++ b/content/en/blog/_posts/2023-04-11-topology-spread-features.md @@ -93,18 +93,23 @@ this feature to beta and it was enabled by default since. ## KEP-3243: Respect PodTopologySpread after rolling upgrades -Pod Topology Spread uses the fields `topologyKey` or `labelSelector` to identify the group of pods over which -spreading will be calculated. But it applies to all pods in a Deployment irrespective of their owning -ReplicaSet. As a result, when a new revision is rolled out, spreading will apply across pods from both the -old and new ReplicaSets, and so by the time the new ReplicaSet is completely rolled out and the old one is -rolled back, the actual spreading we are left with may not match expectations because the deleted pods from -the older ReplicaSet will cause skewed distribution for the remaining pods. - -In order to solve this problem and to make more accurate decisions in scheduling, we added a new named +Pod Topology Spread uses the field `labelSelector` to identify the group of pods over which +spreading will be calculated. When using topology spreading with Deployments, it is common +practice to use the `labelSelector` of the Deployment as the `labelSelector` in the topology +spread constraints. However, this implies that all pods of a Deployment are part of the spreading +calculation, regardless of whether they belong to different revisions. 
As a result, when a new revision +is rolled out, spreading will apply across pods from both the old and new ReplicaSets, and so by the +time the new ReplicaSet is completely rolled out and the old one is rolled back, the actual spreading +we are left with may not match expectations because the deleted pods from the older ReplicaSet will cause +skewed distribution for the remaining pods. To avoid this problem, in the past users needed to add a +revision label to Deployment and update it manually at each rolling upgrade (both the label on the +podTemplate and the `labelSelector` in the `topologySpreadConstraints`). + +To solve this problem once and for all, and to make more accurate decisions in scheduling, we added a new named `matchLabelKeys` to `topologySpreadConstraints`. `matchLabelKeys` is a list of pod label keys to select -the pods over which spreading will be calculated. The keys are used to lookup values from the pod labels, -those key-value labels are ANDed with `labelSelector` to select the group of existing pods over -which spreading will be calculated for the incoming pod. +the pods over which spreading will be calculated. The keys are used to lookup values from the labels of +the Pod being scheduled, those key-value labels are ANDed with `labelSelector` to select the group of +existing pods over which spreading will be calculated for the incoming pod. With `matchLabelKeys`, you don't need to update the `pod.spec` between different revisions. The controller/operator just needs to set different values to the same label key for different revisions. From a1419760642e99897ffec04010de197327b4c758 Mon Sep 17 00:00:00 2001 From: Kante Yin Date: Thu, 6 Apr 2023 14:49:15 +0800 Subject: [PATCH 09/13] Address the comment about NodeInclusionPolicy Signed-off-by: Kante Yin --- .../2023-04-11-topology-spread-features.md | 41 ++++++++++--------- 1 file changed, 21 insertions(+), 20 deletions(-) diff --git a/content/en/blog/_posts/2023-04-11-topology-spread-features.md b/content/en/blog/_posts/2023-04-11-topology-spread-features.md index 328017e2ffc2e..44e27aaeb187f 100644 --- a/content/en/blog/_posts/2023-04-11-topology-spread-features.md +++ b/content/en/blog/_posts/2023-04-11-topology-spread-features.md @@ -19,15 +19,15 @@ This blog post introduces each feature and the use case behind each of them. ## KEP-3022: min domains in Pod Topology Spread -Pod Topology Spread has the `maxSkew` parameter to define the degree to which Pods may be unevenly distributed. +Pod Topology Spread has the `maxSkew` parameter to define the degree to which Pods may be unevenly distributed. -But, there wasn't a way to control the number of domains over which we should spread. +But, there wasn't a way to control the number of domains over which we should spread. Some users want to force spreading Pods over a minimum number of domains, and if there aren't enough already present, make the cluster-autoscaler provision them. -Then, we introduced the `minDomains` parameter in the Pod Topology Spread. -Via `minDomains` parameter, you can define the minimum number of domains. +Then, we introduced the `minDomains` parameter in the Pod Topology Spread. +Via `minDomains` parameter, you can define the minimum number of domains. -For example, assume there are 3 Nodes with the enough capacity, +For example, assume there are 3 Nodes with the enough capacity, and a newly created replicaset has the following `topologySpreadConstraints` in template. 
```yaml @@ -35,7 +35,7 @@ topologySpreadConstraints: - maxSkew: 1 minDomains: 5 # requires 5 Nodes at least. whenUnsatisfiable: DoNotSchedule # minDomains is valid only when DoNotSchedule is used. - topologyKey: kubernetes.io/hostname + topologyKey: kubernetes.io/hostname labelSelector: matchLabels: foo: bar @@ -49,15 +49,16 @@ and as a result, the replicas are finally spread over 5 Nodes. ## KEP-3094: Take taints/tolerations into consideration when calculating PodTopologySpread skew -Before this, when we deploy a pod with `podTopologySpread` configured, we'll take all -affinity nodes(satisfied with pod nodeAffinity and nodeSelector) into consideration -in filtering and scoring, but a node with pod untolerated taint may also be a candidate -because we didn't take care of node taints, which will lead to the pod pending. +Before this enhancement, when you deploy a pod with `podTopologySpread` configured, kube-scheduler would +take all inclined nodes(satisfied with pod nodeAffinity and nodeSelector) into consideration +in filtering and scoring, but would not care about whether the node taints are tolerated by the incoming pod or not. +This may lead to a node with untolerated taint best fit the pod in podTopologySpread plugin, and as a result, +the pod will stuck in pending for it violates the nodeTaint plugin. -To avoid this and make a more fine-gained decision in scheduling, we introduced two new fields in -`TopologySpreadConstraint` to define node inclusion policies including nodeAffinity and nodeTaint. + To allow more fine-gained decisions about which Nodes to account for when calculating spreading skew, we introduced + two new fields in `TopologySpreadConstraint` to define node inclusion policies including nodeAffinity and nodeTaint. -It mostly looks like: +A manifest that applies these policies looks like the following: ```yaml apiVersion: v1 @@ -75,17 +76,17 @@ spec: ``` **nodeAffinityPolicy** indicates how we'll treat Pod's nodeAffinity/nodeSelector in pod topology spreading. -If `Honor`, we'll filter out nodes not matching nodeAffinity/nodeSelector in calculation. -If `Ignore`, these nodes will be included instead. +If `Honor`, kube-scheduler will filter out nodes not matching nodeAffinity/nodeSelector in the calculation of spreading skew. +If `Ignore`, all nodes will be included, regardless of whether they match the Pod's nodeAffinity/nodeSelector or not. -For backwards-compatibility, nodeAffinityPolicy is default to `Honor`. +For backwards-compatibility, nodeAffinityPolicy defaults to `Honor`. **nodeTaintsPolicy** indicates how we'll treat node taints in pod topology spreading. -If `Honor`, only tainted nodes for which the incoming pod has a toleration, will be included in calculation. -If `Ignore`, we'll not consider the node taints at all in calculation, so a node with pod untolerated taint -will also be included. +If `Honor`, only tainted nodes for which the incoming pod has a toleration, will be included in the calculation of spreading skew. +If `Ignore`, kube-scheduler will not consider the node taints at all in the calculation of spreading skew, so a node with +pod untolerated taint will also be included. -For backwards-compatibility, nodeTaintsPolicy is default to the `Ignore`. +For backwards-compatibility, nodeTaintsPolicy defaults to the `Ignore`. The feature was introduced in v1.25 as alpha level. By default, it was disabled, so if you want to use this feature in v1.25, you have to enable the feature gate `NodeInclusionPolicyInPodTopologySpread` actively. 
In the following v1.26, we graduated From ae626b96c61d9e397d98b140853e2fff149cbb0f Mon Sep 17 00:00:00 2001 From: Kensei Nakada Date: Fri, 7 Apr 2023 08:54:39 +0900 Subject: [PATCH 10/13] fix based on reviews --- .../_posts/2023-04-11-topology-spread-features.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/content/en/blog/_posts/2023-04-11-topology-spread-features.md b/content/en/blog/_posts/2023-04-11-topology-spread-features.md index 44e27aaeb187f..f8fa99cdc5cce 100644 --- a/content/en/blog/_posts/2023-04-11-topology-spread-features.md +++ b/content/en/blog/_posts/2023-04-11-topology-spread-features.md @@ -9,7 +9,7 @@ evergreen: true **Authors:** [Alex Wang](https://github.com/denkensk)(Shopee), [Kante Yin](https://github.com/kerthcet)(DaoCloud), [Kensei Nakada](https://github.com/sanposhiho)(Mercari) In Kubernetes v1.19, [Pod Topology Spread Constraints](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/) went to GA. -It is the feature to control how Pods are spread to each failure-domain (regions, zones, nodes etc). +It is the feature to control how Pods are spread in the cluster topology or failure domains (regions, zones, nodes etc). As time passed, we received feedback from users, and, as a result, we're actively working on improving the Topology Spread feature via three KEPs. @@ -50,10 +50,10 @@ and as a result, the replicas are finally spread over 5 Nodes. ## KEP-3094: Take taints/tolerations into consideration when calculating PodTopologySpread skew Before this enhancement, when you deploy a pod with `podTopologySpread` configured, kube-scheduler would -take all inclined nodes(satisfied with pod nodeAffinity and nodeSelector) into consideration +take the Nodes that satisfy the Pod's nodeAffinity and nodeSelector into consideration in filtering and scoring, but would not care about whether the node taints are tolerated by the incoming pod or not. -This may lead to a node with untolerated taint best fit the pod in podTopologySpread plugin, and as a result, -the pod will stuck in pending for it violates the nodeTaint plugin. +This may lead to a node with untolerated taint as the only candidate for spreading, and as a result, +the pod will stuck in Pending if it doesn't tolerate the taint. To allow more fine-gained decisions about which Nodes to account for when calculating spreading skew, we introduced two new fields in `TopologySpreadConstraint` to define node inclusion policies including nodeAffinity and nodeTaint. @@ -106,14 +106,14 @@ skewed distribution for the remaining pods. To avoid this problem, in the past u revision label to Deployment and update it manually at each rolling upgrade (both the label on the podTemplate and the `labelSelector` in the `topologySpreadConstraints`). -To solve this problem once and for all, and to make more accurate decisions in scheduling, we added a new named +To solve this problem with a simpler API, we added a new field named `matchLabelKeys` to `topologySpreadConstraints`. `matchLabelKeys` is a list of pod label keys to select the pods over which spreading will be calculated. The keys are used to lookup values from the labels of the Pod being scheduled, those key-value labels are ANDed with `labelSelector` to select the group of existing pods over which spreading will be calculated for the incoming pod. With `matchLabelKeys`, you don't need to update the `pod.spec` between different revisions. 
-The controller/operator just needs to set different values to the same label key for different revisions. +The controller or operator managing rollouts just needs to set different values to the same label key for different revisions. The scheduler will assume the values automatically based on `matchLabelKeys`. For example, if you are configuring a Deployment, you can use the label keyed with [pod-template-hash](https://kubernetes.io//docs/concepts/workloads/controllers/deployment/#pod-template-hash-label), From 2d14bb7c383b02cd924b9312eb9d9083a74d9af7 Mon Sep 17 00:00:00 2001 From: Kensei Nakada Date: Mon, 10 Apr 2023 10:31:38 +0900 Subject: [PATCH 11/13] fix link --- .../en/blog/_posts/2023-04-11-topology-spread-features.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/content/en/blog/_posts/2023-04-11-topology-spread-features.md b/content/en/blog/_posts/2023-04-11-topology-spread-features.md index f8fa99cdc5cce..2bd8e45f7dd62 100644 --- a/content/en/blog/_posts/2023-04-11-topology-spread-features.md +++ b/content/en/blog/_posts/2023-04-11-topology-spread-features.md @@ -8,7 +8,7 @@ evergreen: true **Authors:** [Alex Wang](https://github.com/denkensk)(Shopee), [Kante Yin](https://github.com/kerthcet)(DaoCloud), [Kensei Nakada](https://github.com/sanposhiho)(Mercari) -In Kubernetes v1.19, [Pod Topology Spread Constraints](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/) went to GA. +In Kubernetes v1.19, [Pod Topology Spread Constraints](/docs/concepts/scheduling-eviction/topology-spread-constraints/) went to GA. It is the feature to control how Pods are spread in the cluster topology or failure domains (regions, zones, nodes etc). As time passed, we received feedback from users, @@ -116,7 +116,7 @@ With `matchLabelKeys`, you don't need to update the `pod.spec` between different The controller or operator managing rollouts just needs to set different values to the same label key for different revisions. The scheduler will assume the values automatically based on `matchLabelKeys`. For example, if you are configuring a Deployment, you can use the label keyed with -[pod-template-hash](https://kubernetes.io//docs/concepts/workloads/controllers/deployment/#pod-template-hash-label), +[pod-template-hash](/docs/concepts/workloads/controllers/deployment/#pod-template-hash-label), which is added automatically by the Deployment controller, to distinguish between different revisions in a single Deployment. @@ -140,7 +140,7 @@ Please join us and share your feedback. We look forward to hearing from you! ## How can I learn more? 
-- [Pod Topology Spread Constraints | Kubernetes](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#container-resource-metrics) +- [Pod Topology Spread Constraints | Kubernetes](/docs/concepts/scheduling-eviction/topology-spread-constraints/) - [KEP-3022: min domains in Pod Topology Spread](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/3022-min-domains-in-pod-topology-spread) - [KEP-3094: Take taints/tolerations into consideration when calculating PodTopologySpread skew](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/3094-pod-topology-spread-considering-taints) - [KEP-3243: Respect PodTopologySpread after rolling upgrades](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/3243-respect-pod-topology-spread-after-rolling-upgrades) \ No newline at end of file From f03ac425ebefcda184408057be07226fb326d0ac Mon Sep 17 00:00:00 2001 From: Kensei Nakada Date: Thu, 13 Apr 2023 12:23:02 +0900 Subject: [PATCH 12/13] fix based on the suggestion --- ...=> 2023-04-17-topology-spread-features.md} | 57 ++++++++++--------- 1 file changed, 31 insertions(+), 26 deletions(-) rename content/en/blog/_posts/{2023-04-11-topology-spread-features.md => 2023-04-17-topology-spread-features.md} (71%) diff --git a/content/en/blog/_posts/2023-04-11-topology-spread-features.md b/content/en/blog/_posts/2023-04-17-topology-spread-features.md similarity index 71% rename from content/en/blog/_posts/2023-04-11-topology-spread-features.md rename to content/en/blog/_posts/2023-04-17-topology-spread-features.md index 2bd8e45f7dd62..170496742c8f7 100644 --- a/content/en/blog/_posts/2023-04-11-topology-spread-features.md +++ b/content/en/blog/_posts/2023-04-17-topology-spread-features.md @@ -1,17 +1,16 @@ --- layout: blog title: "Kubernetes 1.27: More fine-grained pod topology spread policies reached beta" -date: 2023-04-11 +date: 2023-04-17 slug: fine-grained-pod-topology-spread-features-beta -evergreen: true --- **Authors:** [Alex Wang](https://github.com/denkensk)(Shopee), [Kante Yin](https://github.com/kerthcet)(DaoCloud), [Kensei Nakada](https://github.com/sanposhiho)(Mercari) -In Kubernetes v1.19, [Pod Topology Spread Constraints](/docs/concepts/scheduling-eviction/topology-spread-constraints/) went to GA. -It is the feature to control how Pods are spread in the cluster topology or failure domains (regions, zones, nodes etc). +In Kubernetes v1.19, [Pod topology spread constraints](/docs/concepts/scheduling-eviction/topology-spread-constraints/) +went to general availability (GA). -As time passed, we received feedback from users, +As time passed, we - SIG Scheduling - received feedback from users, and, as a result, we're actively working on improving the Topology Spread feature via three KEPs. All of these features have reached beta in Kubernetes v1.27 and are enabled by default. @@ -24,16 +23,18 @@ Pod Topology Spread has the `maxSkew` parameter to define the degree to which Po But, there wasn't a way to control the number of domains over which we should spread. Some users want to force spreading Pods over a minimum number of domains, and if there aren't enough already present, make the cluster-autoscaler provision them. -Then, we introduced the `minDomains` parameter in the Pod Topology Spread. +Kubernetes v1.24 introduced the `minDomains` parameter for pod topology spread constraints, +as an alpha feature. Via `minDomains` parameter, you can define the minimum number of domains. 
 For example, assume there are 3 Nodes with enough capacity,
-and a newly created replicaset has the following `topologySpreadConstraints` in template.
+and a newly created ReplicaSet has the following `topologySpreadConstraints` in its Pod template.
 
 ```yaml
+...
 topologySpreadConstraints:
 - maxSkew: 1
-  minDomains: 5 # requires 5 Nodes at least.
+  minDomains: 5 # requires 5 Nodes at least (because each Node has a unique hostname).
   whenUnsatisfiable: DoNotSchedule # minDomains is valid only when DoNotSchedule is used.
   topologyKey: kubernetes.io/hostname
   labelSelector:
     matchLabels:
       foo: bar
 ```
 
 In this case, 3 Pods will be scheduled to those 3 Nodes,
 but the other 2 Pods from this ReplicaSet will be unschedulable until more Nodes join the cluster.
 
-The cluster autoscaler provisions new Nodes based on these unschedulable Pods,
+You can imagine that the cluster autoscaler provisions new Nodes based on these unschedulable Pods,
 and as a result, the replicas are finally spread over 5 Nodes.
 
-## KEP-3094: Take taints/tolerations into consideration when calculating PodTopologySpread skew
+## KEP-3094: Take taints/tolerations into consideration when calculating podTopologySpread skew
 
 Before this enhancement, when you deploy a pod with `podTopologySpread` configured, kube-scheduler would
 take the Nodes that satisfy the Pod's nodeAffinity and nodeSelector into consideration
 in filtering and scoring, but would not care about whether the node taints are tolerated by the incoming pod or not.
 This may lead to a node with untolerated taint as the only candidate for spreading, and as a result,
 the pod will get stuck in Pending if it doesn't tolerate the taint.
 
- To allow more fine-gained decisions about which Nodes to account for when calculating spreading skew, we introduced
- two new fields in `TopologySpreadConstraint` to define node inclusion policies including nodeAffinity and nodeTaint.
+To allow more fine-grained decisions about which Nodes to account for when calculating spreading skew,
+Kubernetes 1.25 introduced two new fields within `topologySpreadConstraints` to define node inclusion policies:
+`nodeAffinityPolicy` and `nodeTaintsPolicy`.
 
 A manifest that applies these policies looks like the following:
 
 ```yaml
 apiVersion: v1
 kind: Pod
 metadata:
   name: example-pod
 spec:
   # Configure a topology spread constraint
   topologySpreadConstraints:
     - maxSkew: <integer>
       # ...
       nodeAffinityPolicy: [Honor|Ignore]
       nodeTaintsPolicy: [Honor|Ignore]
   # other Pod fields go here
 ```
 
-**nodeAffinityPolicy** indicates how we'll treat Pod's nodeAffinity/nodeSelector in pod topology spreading.
-If `Honor`, kube-scheduler will filter out nodes not matching nodeAffinity/nodeSelector in the calculation of spreading skew.
-If `Ignore`, all nodes will be included, regardless of whether they match the Pod's nodeAffinity/nodeSelector or not.
+The `nodeAffinityPolicy` field indicates how Kubernetes treats a Pod's `nodeAffinity` or `nodeSelector` for
+pod topology spreading.
+If `Honor`, kube-scheduler filters out nodes not matching `nodeAffinity`/`nodeSelector` in the calculation of
+spreading skew.
+If `Ignore`, all nodes will be included, regardless of whether they match the Pod's `nodeAffinity`/`nodeSelector`
+or not.
 
-For backwards-compatibility, nodeAffinityPolicy defaults to `Honor`.
+For backwards compatibility, `nodeAffinityPolicy` defaults to `Honor`.
 
-**nodeTaintsPolicy** indicates how we'll treat node taints in pod topology spreading.
+The `nodeTaintsPolicy` field defines how Kubernetes considers node taints for pod topology spreading.
 If `Honor`, only tainted nodes for which the incoming pod has a toleration, will be included in the calculation of spreading skew.
 If `Ignore`, kube-scheduler will not consider the node taints at all in the calculation of spreading skew, so a node with
 a taint that the pod doesn't tolerate will also be included.
 
-For backwards-compatibility, nodeTaintsPolicy defaults to the `Ignore`.
+For backwards compatibility, `nodeTaintsPolicy` defaults to `Ignore`.
 
-The feature was introduced in v1.25 as alpha level. By default, it was disabled, so if you want to use this feature in v1.25,
-you have to enable the feature gate `NodeInclusionPolicyInPodTopologySpread` actively. In the following v1.26, we graduated
-this feature to beta and it was enabled by default since.
+The feature was introduced in v1.25 as alpha. By default, it was disabled, so if you want to use this feature in v1.25,
+you had to explicitly enable the feature gate `NodeInclusionPolicyInPodTopologySpread`. In the following v1.26
+release, that associated feature graduated to beta and is enabled by default.
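
As a concrete illustration of `nodeTaintsPolicy: Honor`, consider a Pod that tolerates a specific taint: only Nodes whose taints this Pod tolerates (plus untainted Nodes) are counted when the skew is calculated. This is a hedged sketch; the taint key/value and labels are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  labels:
    foo: bar
spec:
  tolerations:
  - key: "dedicated"                 # hypothetical taint, for illustration only
    operator: "Equal"
    value: "special-workloads"
    effect: "NoSchedule"
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    nodeTaintsPolicy: Honor          # count only Nodes whose taints this Pod tolerates
    labelSelector:
      matchLabels:
        foo: bar
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9 # placeholder image
```

With `Ignore` (the default), Nodes carrying taints this Pod does not tolerate would still be counted as spreading candidates, which can leave the Pod Pending (the situation described above).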
 
-## KEP-3243: Respect PodTopologySpread after rolling upgrades
+## KEP-3243: Respect Pod topology spread after rolling upgrades
 
 Pod Topology Spread uses the field `labelSelector` to identify the group of pods over which
 spreading will be calculated. When using topology spreading with Deployments, it is common
 practice to use the `labelSelector` of the Deployment as the `labelSelector` in the topology
 spread constraints. However, this implies that all pods of a Deployment are part of the spreading
 calculation, regardless of whether they belong to different revisions. As a result, when a new revision
 is rolled out, spreading will apply across pods from both the old and new ReplicaSets, and so by the
 time the new ReplicaSet is completely rolled out and the old one is rolled back, the actual spreading
 we are left with may not match expectations because the deleted pods from the older ReplicaSet will cause
 skewed distribution for the remaining pods. To avoid this problem, in the past users needed to add a
 revision label to Deployment and update it manually at each rolling upgrade (both the label on the
 pod template and the `labelSelector` in the `topologySpreadConstraints`).
 
 To solve this problem with a simpler API, Kubernetes v1.25 introduced a new field named
 `matchLabelKeys` to `topologySpreadConstraints`. `matchLabelKeys` is a list of pod label keys to select
 the pods over which spreading will be calculated. The keys are used to lookup values from the labels of
 the Pod being scheduled, those key-value labels are ANDed with `labelSelector` to select the group of
 existing pods over which spreading will be calculated for the incoming pod.
 
 With `matchLabelKeys`, you don't need to update the `pod.spec` between different revisions.
 The controller or operator managing rollouts just needs to set different values to the same label key for different revisions.
 The scheduler will assume the values automatically based on `matchLabelKeys`.
 For example, if you are configuring a Deployment, you can use the label keyed with
 [pod-template-hash](/docs/concepts/workloads/controllers/deployment/#pod-template-hash-label),
 which is added automatically by the Deployment controller, to distinguish between different
 revisions in a single Deployment.
 
 ```yaml
 topologySpreadConstraints:
   - maxSkew: 1
     topologyKey: kubernetes.io/hostname
     whenUnsatisfiable: DoNotSchedule
     labelSelector:
       matchLabels:
         app: foo
     matchLabelKeys:
       - pod-template-hash
 ```
 
 ## Getting involved
 
-These features are managed by the [SIG/Scheduling](https://github.com/kubernetes/community/tree/master/sig-scheduling).
+These features are managed by Kubernetes [SIG Scheduling](https://github.com/kubernetes/community/tree/master/sig-scheduling).
 
 Please join us and share your feedback. We look forward to hearing from you!
 
 ## How can I learn more?
-- [Pod Topology Spread Constraints | Kubernetes](/docs/concepts/scheduling-eviction/topology-spread-constraints/) +- [Pod Topology Spread Constraints](/docs/concepts/scheduling-eviction/topology-spread-constraints/) in the Kubernetes documentation - [KEP-3022: min domains in Pod Topology Spread](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/3022-min-domains-in-pod-topology-spread) - [KEP-3094: Take taints/tolerations into consideration when calculating PodTopologySpread skew](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/3094-pod-topology-spread-considering-taints) - [KEP-3243: Respect PodTopologySpread after rolling upgrades](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/3243-respect-pod-topology-spread-after-rolling-upgrades) \ No newline at end of file From 184185755a90f027ae11537f5c5930d809fc2621 Mon Sep 17 00:00:00 2001 From: Tim Bannister Date: Thu, 13 Apr 2023 20:42:37 +0100 Subject: [PATCH 13/13] Fix typography --- content/en/blog/_posts/2023-04-17-topology-spread-features.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/en/blog/_posts/2023-04-17-topology-spread-features.md b/content/en/blog/_posts/2023-04-17-topology-spread-features.md index 170496742c8f7..9edaada138de0 100644 --- a/content/en/blog/_posts/2023-04-17-topology-spread-features.md +++ b/content/en/blog/_posts/2023-04-17-topology-spread-features.md @@ -5,7 +5,7 @@ date: 2023-04-17 slug: fine-grained-pod-topology-spread-features-beta --- -**Authors:** [Alex Wang](https://github.com/denkensk)(Shopee), [Kante Yin](https://github.com/kerthcet)(DaoCloud), [Kensei Nakada](https://github.com/sanposhiho)(Mercari) +**Authors:** [Alex Wang](https://github.com/denkensk) (Shopee), [Kante Yin](https://github.com/kerthcet) (DaoCloud), [Kensei Nakada](https://github.com/sanposhiho) (Mercari) In Kubernetes v1.19, [Pod topology spread constraints](/docs/concepts/scheduling-eviction/topology-spread-constraints/) went to general availability (GA).
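
As a closing illustration of how the three beta features compose, here is a hedged sketch of a single Deployment that uses `minDomains`, the node inclusion policies, and `matchLabelKeys` together. The Deployment name, labels, replica count, and image are hypothetical; only the constraint fields come from the features described in the patches above.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-demo                      # hypothetical name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: spread-demo
  template:
    metadata:
      labels:
        app: spread-demo
    spec:
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9 # placeholder image
      topologySpreadConstraints:
      - maxSkew: 1
        minDomains: 3                    # KEP-3022: demand at least 3 zones
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule # required for minDomains to take effect
        nodeAffinityPolicy: Honor        # KEP-3094: count only Nodes matching the Pod's nodeAffinity/nodeSelector
        nodeTaintsPolicy: Honor          # KEP-3094: count only Nodes whose taints the Pod tolerates
        labelSelector:
          matchLabels:
            app: spread-demo
        matchLabelKeys:
        - pod-template-hash              # KEP-3243: spread within a single Deployment revision
```

With this combination, skew is computed per Deployment revision, only over Nodes the Pod can actually land on, and across at least 3 zones.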