diff --git a/content/en/docs/concepts/configuration/scheduler-perf-tuning.md b/content/en/docs/concepts/configuration/scheduler-perf-tuning.md
new file mode 100644
index 0000000000000..628cbb40dc8fe
--- /dev/null
+++ b/content/en/docs/concepts/configuration/scheduler-perf-tuning.md
@@ -0,0 +1,110 @@
+---
+reviewers:
+- bsalamat
+title: Scheduler Performance Tuning
+content_template: templates/concept
+weight: 70
+---
+
+{{% capture overview %}}
+
+{{< feature-state for_k8s_version="1.12" >}}
+
+Kube-scheduler is the Kubernetes default scheduler. It is responsible for
+placing Pods on suitable Nodes in a cluster. Nodes in a cluster that meet the
+scheduling requirements of a Pod are called feasible Nodes for that Pod. The
+scheduler finds feasible Nodes for a Pod, runs a set of functions to score
+them, and picks the Node with the highest score among the feasible ones to
+run the Pod. It then notifies the API server about this decision in a process
+called "Binding".
+
+{{% /capture %}}
+
+{{% capture body %}}
+
+## Percentage of Nodes To Score
+
+Before Kubernetes 1.12, kube-scheduler checked the feasibility of all the
+nodes in a cluster and then scored the feasible ones. Kubernetes 1.12 adds a
+feature that allows the scheduler to stop looking for more feasible nodes
+once it finds a certain number of them. This improves the scheduler's
+performance in large clusters. The number is specified as a percentage of the
+cluster size and is controlled by a configuration option called
+`percentageOfNodesToScore`. Valid values are between 1 and 100; values
+outside this range are treated as 100. The default value of this option is
+50. You can change this value in the scheduler configuration, but read
+further to find out whether you should.
+
+```yaml
+apiVersion: componentconfig/v1alpha1
+kind: KubeSchedulerConfiguration
+algorithmSource:
+  provider: DefaultProvider
+
+...
+
+percentageOfNodesToScore: 50
+```
+
+{{< note >}}
+**Note**: In clusters with zero or few feasible nodes, the scheduler still
+checks all the nodes, simply because there are not enough feasible nodes to
+stop the scheduler's search early.
+{{< /note >}}
+
+**To disable this feature**, you can set `percentageOfNodesToScore` to 100.
+
+### Tuning percentageOfNodesToScore
+
+As stated above, `percentageOfNodesToScore` must be a value between 1 and
+100, with a default of 50. There is also a hardcoded minimum of 50 nodes that
+is applied internally: the scheduler tries to find at least 50 feasible nodes
+regardless of the value of `percentageOfNodesToScore`. As a result, lowering
+this option in clusters with several hundred nodes will not have much impact
+on the number of feasible nodes the scheduler tries to find. This is
+intentional, as this option is unlikely to improve performance noticeably in
+smaller clusters. In large clusters with over 1000 nodes, setting this value
+to a lower number may show a noticeable performance improvement.
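+
+To see how the percentage and the hardcoded minimum combine, the following Go
+sketch computes the number of feasible Nodes the scheduler looks for before it
+stops searching. This is a minimal illustration of the behavior described
+above, not the actual kube-scheduler source; the function and constant names
+are invented for this example.
+
+```go
+package main
+
+import "fmt"
+
+// Illustrative sketch of the search cutoff; not actual kube-scheduler code.
+
+// minFeasibleNodesToFind mirrors the hardcoded minimum described above.
+const minFeasibleNodesToFind = 50
+
+// numFeasibleNodesToFind returns how many feasible Nodes the scheduler looks
+// for before it stops checking the rest of the cluster.
+func numFeasibleNodesToFind(numAllNodes, percentageOfNodesToScore int) int {
+    if percentageOfNodesToScore <= 0 || percentageOfNodesToScore > 100 {
+        percentageOfNodesToScore = 100 // out-of-range values are treated as 100
+    }
+    numNodes := numAllNodes * percentageOfNodesToScore / 100
+    if numNodes < minFeasibleNodesToFind {
+        numNodes = minFeasibleNodesToFind // the internal minimum applies
+    }
+    if numNodes > numAllNodes {
+        numNodes = numAllNodes // never more Nodes than the cluster has
+    }
+    return numNodes
+}
+
+func main() {
+    fmt.Println(numFeasibleNodesToFind(5000, 30)) // 1500: lower values pay off
+    fmt.Println(numFeasibleNodesToFind(300, 30))  // 90
+    fmt.Println(numFeasibleNodesToFind(100, 30))  // 50: the minimum dominates
+}
+```
+
+Note how, in the 100-node case, lowering the percentage to 30 makes no
+difference at all; this is why tuning the option matters mostly in large
+clusters.
+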
+An important consideration when setting this value is that when a smaller
+number of nodes in a cluster are checked for feasibility, some nodes are not
+sent to be scored for a given Pod. As a result, a Node which could possibly
+score higher for running the given Pod might not even be passed to the
+scoring phase. This would result in a less than ideal placement of the Pod.
+For this reason, the value should not be set to very low percentages. A
+general rule of thumb is to never set the value to anything lower than 30.
+Lower values should be used only when the scheduler's throughput is critical
+for your application and the score of nodes is not important. In other words,
+you prefer to run the Pod on any Node as long as it is feasible.
+
+We do not recommend lowering this value from its default if your cluster has
+only several hundred Nodes. It is unlikely to improve the scheduler's
+performance significantly.
+
+### How the scheduler iterates over Nodes
+
+This section is intended for those who want to understand the internal
+details of this feature.
+
+In order to give all the Nodes in a cluster a fair chance of being considered
+for running Pods, the scheduler iterates over the Nodes in a round-robin
+fashion. You can imagine that the Nodes are in an array. The scheduler starts
+from the beginning of the array and checks the feasibility of the Nodes until
+it finds enough of them, as specified by `percentageOfNodesToScore`. For the
+next Pod, the scheduler continues from the point in the array where it
+stopped while checking the feasibility of the previous Pod.
+
+If Nodes are in multiple zones, the scheduler alternates among the zones,
+taking the next Node from each zone in turn. As an example, consider six
+nodes in two zones:
+
+```
+Zone 1: Node 1, Node 2, Node 3, Node 4
+Zone 2: Node 5, Node 6
+```
+
+The order in which the Nodes are evaluated is:
+
+Node 1, Node 5, Node 2, Node 6, Node 3, Node 4
+
+After going over all the Nodes, the scheduler goes back to Node 1.
+
+{{% /capture %}}
diff --git a/data/concepts.yml b/data/concepts.yml
index 4519addde536b..d8af248394689 100644
--- a/data/concepts.yml
+++ b/data/concepts.yml
@@ -85,6 +85,7 @@ toc:
   - docs/concepts/configuration/secret.md
   - docs/concepts/configuration/organize-cluster-access-kubeconfig.md
   - docs/concepts/configuration/pod-priority-preemption.md
+  - docs/concepts/configuration/scheduler-perf-tuning.md
 - title: Services, Load Balancing, and Networking
   landing_page: /docs/concepts/services-networking/service/