From 268411f6ab7d5ef01bb60ff922ff36894e8f0c31 Mon Sep 17 00:00:00 2001 From: Marek Grabowski Date: Fri, 8 Sep 2017 12:05:02 +0100 Subject: [PATCH 1/7] Add documentation for TaintNodesByCondition --- docs/concepts/architecture/nodes.md | 20 ++++++++++++++++++- .../configuration/taint-and-toleration.md | 20 ++++++++++++++++--- .../workloads/controllers/daemonset.md | 16 ++++++++------- 3 files changed, 45 insertions(+), 11 deletions(-) diff --git a/docs/concepts/architecture/nodes.md b/docs/concepts/architecture/nodes.md index 1d361db045e69..11788d5792cc5 100644 --- a/docs/concepts/architecture/nodes.md +++ b/docs/concepts/architecture/nodes.md @@ -65,7 +65,22 @@ The node condition is represented as a JSON object. For example, the following r If the Status of the Ready condition is "Unknown" or "False" for longer than the `pod-eviction-timeout`, an argument is passed to the [kube-controller-manager](/docs/admin/kube-controller-manager) and all of the Pods on the node are scheduled for deletion by the Node Controller. The default eviction timeout duration is **five minutes**. In some cases when the node is unreachable, the apiserver is unable to communicate with the kubelet on it. The decision to delete the pods cannot be communicated to the kubelet until it re-establishes communication with the apiserver. In the meantime, the pods which are scheduled for deletion may continue to run on the partitioned node. -In versions of Kubernetes prior to 1.5, the node controller would [force delete](/docs/concepts/workloads/pods/pod/#force-deletion-of-pods) these unreachable pods from the apiserver. However, in 1.5 and higher, the node controller does not force delete pods until it is confirmed that they have stopped running in the cluster. One can see these pods which may be running on an unreachable node as being in the "Terminating" or "Unknown" states. 
In cases where Kubernetes cannot deduce from the underlying infrastructure if a node has permanently left a cluster, the cluster administrator may need to delete the node object by hand. Deleting the node object from Kubernetes causes all the Pod objects running on it to be deleted from the apiserver, freeing up their names. +In versions of Kubernetes prior to 1.5, the node controller would [force delete](/docs/concepts/workloads/pods/pod/#force-deletion-of-pods) +these unreachable pods from the apiserver. However, in 1.5 and higher, the node controller does not force delete pods until it is +confirmed that they have stopped running in the cluster. One can see these pods which may be running on an unreachable node as being in +the "Terminating" or "Unknown" states. In cases where Kubernetes cannot deduce from the underlying infrastructure if a node has +permanently left a cluster, the cluster administrator may need to delete the node object by hand. Deleting the node object from +Kubernetes causes all the Pod objects running on it to be deleted from the apiserver, freeing up their names. + +In version 1.8 a possibility to automatically create [taints](/docs/concepts/configuration/taint-and-toleration) representing Conditions +was added as an alpha feature. Enabling it makes scheduler ignore Conditions when considering a Node, instead it looks at the taints and +Pod's tolerations. This allows users to decide whether they want to keep old behavior and don't schedule their Pods on Nodes with some +Conditions, or rather corresponding taints, or if they want to add a toleration and allow it. To enable this behavior you need to pass +an additional feature gate flag `--feature-gates=...,TaintNodesByCondition=true` to apiserver, controller-manager and scheduler. 
+ +Note that because of small delay +(usually <1s) between time when Condition is observed and Taint is created it's possible that enabling this feature will slightly +increase number of Pods that are successfully scheduled but rejected by the Kubelet. ### Capacity @@ -174,6 +189,9 @@ NodeController is responsible for adding taints corresponding to node problems l node unreachable or not ready. See [this documentation](/docs/concepts/configuration/taint-and-toleration) for details about `NoExecute` taints and the alpha feature. +Since Kubernetes 1.8 NodeController may be made responsible for creating taints represeting +Node Conditions. This is an alpha feature as of 1.8. + ### Self-Registration of Nodes When the kubelet flag `--register-node` is true (the default), the kubelet will attempt to diff --git a/docs/concepts/configuration/taint-and-toleration.md b/docs/concepts/configuration/taint-and-toleration.md index f16f69b88fca9..83d73cb9aa66a 100644 --- a/docs/concepts/configuration/taint-and-toleration.md +++ b/docs/concepts/configuration/taint-and-toleration.md @@ -249,9 +249,23 @@ admission controller](https://git.k8s.io/kubernetes/plugin/pkg/admission/default * `node.alpha.kubernetes.io/unreachable` * `node.alpha.kubernetes.io/notReady` - * `node.kubernetes.io/memoryPressure` - * `node.kubernetes.io/diskPressure` - * `node.kubernetes.io/outOfDisk` (*only for critical pods*) This ensures that DaemonSet pods are never evicted due to these problems, which matches the behavior when this feature is disabled. + +## Taint Nodes by Condition + +In Kubernetes 1.8 we added an alpha feature that makes NodeController create taints corresponding to node conditions, and disables the +condition check in the scheduler (instead the scheduler checks the taints). This assures that Conditions don't affect what's scheduled +onto the node and the user can choose to ignore some of the node's problems (represented as Conditions) by adding appropriate pod +tolerations. 
+ +To make sure that turning on this feature doesn't break Daemon sets from 1.8 DaemonSet controller will automatically add following +`NoSchedule` tolerations to all deamons: + + * `node.kubernetes.io/memory-pressure` + * `node.kubernetes.io/disk-pressure` + * `node.kubernetes.io/out-of-disk` (*only for critical pods*) + +Above settings are ones that keep backward compatibility, but we understand they may not fit all user's use cases, which is why cluster +admin may choose to add arbitrary tolerations to DaemonSets. diff --git a/docs/concepts/workloads/controllers/daemonset.md b/docs/concepts/workloads/controllers/daemonset.md index 26bc660eefa6f..a74c4ae1a1c1f 100644 --- a/docs/concepts/workloads/controllers/daemonset.md +++ b/docs/concepts/workloads/controllers/daemonset.md @@ -103,19 +103,21 @@ but they are created with `NoExecute` tolerations for the following taints with - `node.alpha.kubernetes.io/notReady` - `node.alpha.kubernetes.io/unreachable` - - `node.alpha.kubernetes.io/memoryPressure` - - `node.alpha.kubernetes.io/diskPressure` - -When the support to critical pods is enabled and the pods in a DaemonSet are -labelled as critical, the Daemon pods are created with an additional -`NoExecute` toleration for the `node.alpha.kubernetes.io/outOfDisk` taint with -no `tolerationSeconds`. This ensures that when the `TaintBasedEvictions` alpha feature is enabled, they will not be evicted when there are node problems such as a network partition. (When the `TaintBasedEvictions` feature is not enabled, they are also not evicted in these scenarios, but due to hard-coded behavior of the NodeController rather than due to tolerations). 
+ They also tolerate following `NoSchedule` taints: + - `node.kubernetes.io/memory-pressure` + - `node.kubernetes.io/disk-pressure` + +When the support to critical pods is enabled and the pods in a DaemonSet are +labelled as critical, the Daemon pods are created with an additional +`NoSchedule` toleration for the `node.kubernetes.io/out-of-disk` taint. + +Note that all above `NoSchedule` taints above are created only in version 1.8 or leater if alpha feature `TaintNodesByCondition` is enabled. ## Communicating with Daemon Pods From 7e150d6ed645b240e90767011b2b0d316b443ade Mon Sep 17 00:00:00 2001 From: Steve Perry Date: Mon, 11 Sep 2017 13:35:34 -0700 Subject: [PATCH 2/7] Update nodes.md --- docs/concepts/architecture/nodes.md | 25 +++++++++++++++---------- 1 file changed, 15 insertions(+), 10 deletions(-) diff --git a/docs/concepts/architecture/nodes.md b/docs/concepts/architecture/nodes.md index 11788d5792cc5..b6c0d15b56aa8 100644 --- a/docs/concepts/architecture/nodes.md +++ b/docs/concepts/architecture/nodes.md @@ -72,15 +72,20 @@ the "Terminating" or "Unknown" states. In cases where Kubernetes cannot deduce f permanently left a cluster, the cluster administrator may need to delete the node object by hand. Deleting the node object from Kubernetes causes all the Pod objects running on it to be deleted from the apiserver, freeing up their names. -In version 1.8 a possibility to automatically create [taints](/docs/concepts/configuration/taint-and-toleration) representing Conditions -was added as an alpha feature. Enabling it makes scheduler ignore Conditions when considering a Node, instead it looks at the taints and -Pod's tolerations. This allows users to decide whether they want to keep old behavior and don't schedule their Pods on Nodes with some -Conditions, or rather corresponding taints, or if they want to add a toleration and allow it. 
To enable this behavior you need to pass
-an additional feature gate flag `--feature-gates=...,TaintNodesByCondition=true` to apiserver, controller-manager and scheduler.
+Version 1.8 introduces an aplpha feature that automatically creates
+[taints](/docs/concepts/configuration/taint-and-toleration) that represent conditions.
+To enable this behavior, pass an additional feature gate flag `--feature-gates=...,TaintNodesByCondition=true`
+the API server, controller manager, and scheduler.
+When `TaintNodesByCondition` is enabled, the scheduler ignores conditions when considering a Node; instead
+it looks at the Node's taints and a Pod's tolerations.

-Note that because of small delay
-(usually <1s) between time when Condition is observed and Taint is created it's possible that enabling this feature will slightly
-increase number of Pods that are successfully scheduled but rejected by the Kubelet.
+Now users can choose between the old scheduling model and a new, more flexible scheduling model.
+A Pod that does not have any tolerations gets scheduled according to the old model. But a Pod that
+tolerates the taints of a particular Node can be scheduled on that Node.
+
+Note that because of a small delay, usually less than one second, between the time when a condition is observed and a taint
+is created, it's possible that enabling this feature will slightly increase the number of Pods that are successfully
+scheduled but rejected by the kubelet.

 ### Capacity

@@ -189,8 +194,8 @@ NodeController is responsible for adding taints corresponding to node problems l
 node unreachable or not ready. See [this documentation](/docs/concepts/configuration/taint-and-toleration)
 for details about `NoExecute` taints and the alpha feature.

-Since Kubernetes 1.8 NodeController may be made responsible for creating taints represeting
-Node Conditions. This is an alpha feature as of 1.8.
+Starting in version 1.8, the node controller can be made responsible for creating taints that represent +Node conditions. This is an alpha feature of version 1.8. ### Self-Registration of Nodes From 9e97beabc2698a5c75943b20895b95c9ef5a7377 Mon Sep 17 00:00:00 2001 From: Steve Perry Date: Mon, 11 Sep 2017 13:43:35 -0700 Subject: [PATCH 3/7] Update taint-and-toleration.md --- docs/concepts/configuration/taint-and-toleration.md | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/docs/concepts/configuration/taint-and-toleration.md b/docs/concepts/configuration/taint-and-toleration.md index 83d73cb9aa66a..d20db73621a52 100644 --- a/docs/concepts/configuration/taint-and-toleration.md +++ b/docs/concepts/configuration/taint-and-toleration.md @@ -255,17 +255,14 @@ which matches the behavior when this feature is disabled. ## Taint Nodes by Condition -In Kubernetes 1.8 we added an alpha feature that makes NodeController create taints corresponding to node conditions, and disables the -condition check in the scheduler (instead the scheduler checks the taints). This assures that Conditions don't affect what's scheduled -onto the node and the user can choose to ignore some of the node's problems (represented as Conditions) by adding appropriate pod -tolerations. +Version 1.8 introduces an alpha feature that causes the node controller to create taints corresponding to +Node conditions. When this feature is enabled, the scheduler does not check conditions; instead the scheduler checks taints. This assures that conditions don't affect what's scheduled onto the Node. The user can choose to ignore some of the Node's problems (represented as conditions) by adding appropriate Pod tolerations. 
-To make sure that turning on this feature doesn't break Daemon sets from 1.8 DaemonSet controller will automatically add following
-`NoSchedule` tolerations to all deamons:
+To make sure that turning on this feature doesn't break DaemonSets, starting in version 1.8, the DaemonSet controller automatically adds the following `NoSchedule` tolerations to all deamons:

  * `node.kubernetes.io/memory-pressure`
  * `node.kubernetes.io/disk-pressure`
  * `node.kubernetes.io/out-of-disk` (*only for critical pods*)

-Above settings are ones that keep backward compatibility, but we understand they may not fit all user's use cases, which is why cluster
-admin may choose to add arbitrary tolerations to DaemonSets.
+The above settings ensure backward compatibility, but we understand they may not fit all users' needs, which is why
+a cluster admin may choose to add arbitrary tolerations to DaemonSets.

From eeafda33a3ea3ec800dcc3b2519e9b349e481ad0 Mon Sep 17 00:00:00 2001
From: Steve Perry
Date: Mon, 11 Sep 2017 13:48:57 -0700
Subject: [PATCH 4/7] Update daemonset.md

---
 docs/concepts/workloads/controllers/daemonset.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/concepts/workloads/controllers/daemonset.md b/docs/concepts/workloads/controllers/daemonset.md
index a74c4ae1a1c1f..35f0146223bd0 100644
--- a/docs/concepts/workloads/controllers/daemonset.md
+++ b/docs/concepts/workloads/controllers/daemonset.md
@@ -117,7 +117,7 @@ When the support to critical pods is enabled and the pods in a DaemonSet are
 labelled as critical, the Daemon pods are created with an additional
 `NoSchedule` toleration for the `node.kubernetes.io/out-of-disk` taint.

-Note that all above `NoSchedule` taints above are created only in version 1.8 or leater if alpha feature `TaintNodesByCondition` is enabled.
+Note that all of the above `NoSchedule` taints are created only in version 1.8 or later if the alpha feature `TaintNodesByCondition` is enabled.
## Communicating with Daemon Pods From 670944ed1de777283c82be0030594f8ba3d6568e Mon Sep 17 00:00:00 2001 From: Steve Perry Date: Tue, 12 Sep 2017 09:40:05 -0700 Subject: [PATCH 5/7] Update nodes.md --- docs/concepts/architecture/nodes.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/concepts/architecture/nodes.md b/docs/concepts/architecture/nodes.md index b6c0d15b56aa8..328a71ef56699 100644 --- a/docs/concepts/architecture/nodes.md +++ b/docs/concepts/architecture/nodes.md @@ -72,10 +72,10 @@ the "Terminating" or "Unknown" states. In cases where Kubernetes cannot deduce f permanently left a cluster, the cluster administrator may need to delete the node object by hand. Deleting the node object from Kubernetes causes all the Pod objects running on it to be deleted from the apiserver, freeing up their names. -Version 1.8 introduces an aplpha feature that automatically creates +Version 1.8 introduces an alpha feature that automatically creates [taints](/docs/concepts/configuration/taint-and-toleration) that represent conditions. To enable this behavior, pass an additional feature gate flag `--feature-gates=...,TaintNodesByCondition=true` -the API server, controller manager, and scheduler. +to the API server, controller manager, and scheduler. When `TaintNodesByCondition` is enabled, the scheduler ignores conditions when considering a Node; instead it looks at the Node's taints and a Pod's tolerations. 
From 1a28e2023408fa29076d11b452bea7a9037a1517 Mon Sep 17 00:00:00 2001 From: Steve Perry Date: Tue, 12 Sep 2017 09:41:54 -0700 Subject: [PATCH 6/7] Update taint-and-toleration.md --- docs/concepts/configuration/taint-and-toleration.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/concepts/configuration/taint-and-toleration.md b/docs/concepts/configuration/taint-and-toleration.md index d20db73621a52..7399f86159179 100644 --- a/docs/concepts/configuration/taint-and-toleration.md +++ b/docs/concepts/configuration/taint-and-toleration.md @@ -188,7 +188,7 @@ running on the node as follows The above behavior is a beta feature. In addition, Kubernetes 1.6 has alpha support for representing node problems. In other words, the node controller -automatically taints a node when certain condition is true. The builtin taints +automatically taints a node when certain condition is true. The built-in taints currently include: * `node.alpha.kubernetes.io/notReady`: Node is not ready. This corresponds to @@ -258,7 +258,7 @@ which matches the behavior when this feature is disabled. Version 1.8 introduces an alpha feature that causes the node controller to create taints corresponding to Node conditions. When this feature is enabled, the scheduler does not check conditions; instead the scheduler checks taints. This assures that conditions don't affect what's scheduled onto the Node. The user can choose to ignore some of the Node's problems (represented as conditions) by adding appropriate Pod tolerations. 
-To make sure that turning on this feature doesn't break DaemonSets, starting in version 1.8, the DaemonSet controller automatically adds the following `NoSchedule` tolerations to all deamons: +To make sure that turning on this feature doesn't break DaemonSets, starting in version 1.8, the DaemonSet controller automatically adds the following `NoSchedule` tolerations to all daemons: * `node.kubernetes.io/memory-pressure` * `node.kubernetes.io/disk-pressure` From ac03dd9b72443f119437a2a33dfbcd3b2b64855c Mon Sep 17 00:00:00 2001 From: Steve Perry Date: Tue, 12 Sep 2017 09:42:21 -0700 Subject: [PATCH 7/7] Update daemonset.md --- docs/concepts/workloads/controllers/daemonset.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/concepts/workloads/controllers/daemonset.md b/docs/concepts/workloads/controllers/daemonset.md index 35f0146223bd0..6cf00bfaffb10 100644 --- a/docs/concepts/workloads/controllers/daemonset.md +++ b/docs/concepts/workloads/controllers/daemonset.md @@ -110,6 +110,7 @@ they will not be evicted when there are node problems such as a network partitio due to hard-coded behavior of the NodeController rather than due to tolerations). They also tolerate following `NoSchedule` taints: + - `node.kubernetes.io/memory-pressure` - `node.kubernetes.io/disk-pressure`
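---

Editor's sketch (not part of the patch series): the tolerations these patches describe can be written out concretely. Below is a minimal DaemonSet manifest carrying the `NoSchedule` tolerations that the patches say the DaemonSet controller adds automatically when `TaintNodesByCondition` is enabled. Only the taint keys and effect come from the patches above; the DaemonSet name, labels, and image are hypothetical placeholders.

```yaml
# Sketch of a DaemonSet with the condition-taint tolerations discussed above.
# Only the taint keys are taken from the patches; the name, labels, and image
# are illustrative placeholders.
apiVersion: apps/v1beta2          # DaemonSet API group in Kubernetes 1.8
kind: DaemonSet
metadata:
  name: example-daemon            # hypothetical name
spec:
  selector:
    matchLabels:
      app: example-daemon
  template:
    metadata:
      labels:
        app: example-daemon
    spec:
      tolerations:
      # Tolerate the condition taints so the daemon can still be scheduled
      # onto nodes reporting memory or disk pressure.
      - key: node.kubernetes.io/memory-pressure
        operator: Exists
        effect: NoSchedule
      - key: node.kubernetes.io/disk-pressure
        operator: Exists
        effect: NoSchedule
      containers:
      - name: daemon
        image: k8s.gcr.io/pause   # placeholder image
```

With `operator: Exists`, no `value` field is needed; the toleration matches any taint with that key and effect.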