diff --git a/_build_cfg.yml b/_build_cfg.yml index c07c2cc12653..b0b6c2a59f71 100644 --- a/_build_cfg.yml +++ b/_build_cfg.yml @@ -127,6 +127,8 @@ Topics: File: admin_install_openshift - Name: Managing Nodes File: manage_nodes + - Name: Scheduler + File: scheduler --- Name: CLI Reference @@ -159,8 +161,6 @@ Topics: File: builds - Name: Deployments File: deployments - - Name: Scheduler - File: scheduler - Name: Integrating External Services File: integrating_external_services - Name: Authorization diff --git a/admin_guide/scheduler.adoc b/admin_guide/scheduler.adoc new file mode 100644 index 000000000000..2ea854660e30 --- /dev/null +++ b/admin_guide/scheduler.adoc @@ -0,0 +1,234 @@ += Scheduler +{product-author} +{product-version} +:data-uri: +:icons: +:experimental: +:toc: macro +:toc-title: + +toc::[] + +== Overview +The Kubernetes pod scheduler is responsible for scheduling new pods onto nodes within the cluster. It reads data from the pod and tries to find a node that is a good fit based on configured policies. It is completely independent and exists as a standalone, pluggable solution. It does not modify the pod; it simply creates a binding that ties the pod to the selected node. + +== Generic Scheduler +The generic scheduler is the default, platform-provided scheduler "engine" that selects a node to host the pod in a three-step operation: + +=== Step 1: Filter the nodes +The available nodes are filtered based on the constraints or requirements specified. This is done by running each node through the list of filter functions called 'predicates'. + +=== Step 2: Prioritize the filtered list of nodes +This is achieved by passing each node through a series of 'priority' functions that assign it a score between 0 and 10, with 0 indicating a bad fit and 10 indicating a good fit to host the pod. The scheduler configuration can also take in a simple "weight" (positive numeric value) for each priority function. The node score provided by each priority function is multiplied by that "weight" (default weight is 1), and the weighted scores from all the priority functions are then added together to produce a total score for the node. The weight attribute can be used by administrators to give higher importance to some priority functions. + +=== Step 3: Select the best fit node +The nodes are sorted based on their scores and the node with the highest score is selected to host the pod. If multiple nodes have the same highest score, then one of them is selected at random. + + +== Available Predicates +There are several predicates provided out of the box in Kubernetes. Some of these predicates can be customized by providing certain parameters. Multiple predicates can be combined to provide additional filtering of nodes. + +=== Static Predicates +These predicates do not take any configuration parameters or inputs from the user. They are specified in the scheduler configuration using their exact name. + +**_PodFitsPorts_** deems a node to be fit for hosting a pod based on the absence of port conflicts. +---- +{"name" : "PodFitsPorts"} +---- +**_PodFitsResources_** determines a fit based on resource availability. Nodes can declare their resource capacities, and pods can specify what resources they require. Fit is based on requested, rather than used, resources. +---- +{"name" : "PodFitsResources"} +---- +**_NoDiskConflict_** determines fit based on non-conflicting disk volumes. It evaluates whether a pod can fit given the volumes it requests and those that are already mounted.
+---- +{"name" : "NoDiskConflict"} +---- +**_MatchNodeSelector_** determines fit based on the node selector query that is defined in the pod. +---- +{"name" : "MatchNodeSelector"} +---- +**_HostName_** determines fit based on the presence of the Host parameter and a string match with the name of the host. +---- +{"name" : "HostName"} +---- + +=== Configurable Predicates +These predicates can be configured by the user to adjust their behavior. They can be given any user-defined name. The type of the predicate is identified by the argument it takes. Since these are configurable, multiple predicates of the same type (but with different configuration parameters) can be combined as long as their user-defined names are different. + +**_ServiceAffinity_** filters out nodes that do not belong to the specified topological level defined by the provided labels. This predicate takes in a list of labels and ensures affinity within the nodes (that have the same label values) for pods belonging to the same service. If the pod specifies a value for the labels in its NodeSelector, then the nodes matching those labels are the ones where the pod is scheduled. If the pod does not specify the labels in its NodeSelector, then the first pod can be placed on any node based on availability and all subsequent pods of the service will be scheduled on nodes that have the same label values. +---- +{"name" : "Zone", "argument" : {"serviceAffinity" : {"labels" : ["zone"]}}} +---- +**_LabelsPresence_** checks whether a particular node has a certain label defined, regardless of its value. The presence parameter controls whether the label must be present (true) or absent (false) for a node to be considered fit. +---- +{"name" : "ZoneRequired", "argument" : {"labelsPresence" : {"labels" : ["zone"], "presence" : true}}} +---- + +== Available Priority Functions +A custom set of priority functions can be specified to configure the scheduler. There are several priority functions provided out-of-the-box in Kubernetes. Some of these priority functions can be customized by providing certain parameters. Multiple priority functions can be combined and different weights can be given to each in order to impact the prioritization. A weight must be specified for each priority function and cannot be 0 or negative. + +=== Static Priority Functions +These priority functions do not take any configuration parameters or inputs from the user. They are specified in the scheduler configuration using their exact name as well as a weight. + +**_LeastRequestedPriority_** favors nodes with fewer requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes nodes that have the highest available/remaining capacity. +---- +{"name" : "LeastRequestedPriority", "weight" : 1} +---- +**_ServiceSpreadingPriority_** spreads pods by minimizing the number of pods belonging to the same service that land on the same node. +---- +{"name" : "ServiceSpreadingPriority", "weight" : 1} +---- +**_EqualPriority_** gives an equal score of one to all nodes and is neither required nor recommended outside of testing. +---- +{"name" : "EqualPriority", "weight" : 1} +---- + +=== Configurable Priority Functions +These priority functions can be configured by the user by providing certain parameters. They can be given any user-defined name. The type of the priority function is identified by the argument it takes. Since these are configurable, multiple priority functions of the same type (but with different configuration parameters) can be combined as long as their user-defined names are different.
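+ +For example, the following sketch combines two priority functions of the same serviceAntiAffinity type (described below) under different user-defined names and weights. The names RegionSpread and RackSpread and the region and rack labels are purely illustrative, and the fragment would sit inside the "priorities" section of a full policy file: +---- +"priorities" : [ + {"name" : "RegionSpread", "weight" : 2, "argument" : {"serviceAntiAffinity" : {"label" : "region"}}}, + {"name" : "RackSpread", "weight" : 1, "argument" : {"serviceAntiAffinity" : {"label" : "rack"}}} +] +----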
+ +**_ServiceAntiAffinity_** takes a label and ensures a good spread of the pods belonging to the same service across the group of nodes based on the label values. It gives the same score to all nodes that have the same value for the specified label. It gives a higher score to nodes within a group with the least concentration of pods. +---- +{"name" : "RackSpread", "weight" : 1, "argument" : {"serviceAntiAffinity" : {"label" : "rack"}}} +---- +**_LabelsPreference_** gives preference to nodes that have a particular label defined, regardless of its value. Setting the presence parameter to false gives preference to nodes that do not have the label. +---- +{"name" : "RackPreferred", "weight" : 1, "argument" : {"labelPreference" : {"label" : "rack", "presence" : true}}} +---- + + +== Scheduler Policy +The selection of the predicate and priority functions defines the policy for the scheduler. Administrators can provide a JSON file that specifies the predicates and priority functions to configure the scheduler. The path to the scheduler policy file can be specified in the master configuration file. In the absence of a scheduler policy file, the default configuration is applied. + +The predicates and priority functions defined in the scheduler configuration file completely override the default scheduler policy. If any of the default predicates and priority functions are required, they have to be explicitly specified in the scheduler configuration file. + + +=== Default Scheduler Policy +The default scheduler policy includes the following predicates: + +1. PodFitsPorts +1. PodFitsResources +1. NoDiskConflict +1. MatchNodeSelector +1. HostName + +The default scheduler policy includes the following priority functions. Each of them has a weight of '1' applied to it: + +1. LeastRequestedPriority +1. BalancedResourceAllocation +1. ServiceSpreadingPriority + + +== Use Cases +One of the important use cases for scheduling within OpenShift is to support flexible affinity and anti-affinity policies. + +=== Infrastructure Topological Levels +Administrators can define multiple topological levels for their infrastructure (nodes). This is done by specifying labels on nodes (e.g., region=r1, zone=z1, rack=s1), as shown in the node sketch at the end of this section. These label names have no particular meaning and administrators are free to name their infrastructure levels anything (e.g., city/building/room). Also, administrators can define any number of levels for their infrastructure topology, with three levels usually being adequate (e.g., regions --> zones --> racks). Lastly, administrators can specify affinity and anti-affinity rules at each of these levels in any combination. + +=== Affinity +Administrators can configure the scheduler to specify affinity at any topological level, or even at multiple levels. Affinity at a particular level indicates that all pods that belong to the same service will be scheduled onto nodes that belong to the same level. This handles any latency requirements of applications by allowing administrators to ensure that peer pods do not end up being too geographically separated. If no node is available within the same affinity group to host the pod, then the pod will not get scheduled. + +=== Anti Affinity +Administrators can configure the scheduler to specify anti-affinity at any topological level, or even at multiple levels. Anti-affinity (or 'spread') at a particular level indicates that all pods that belong to the same service will be spread across nodes that belong to that level. This ensures that the application is well spread for high availability purposes. The scheduler will try to balance the service pods across all applicable nodes as evenly as possible.
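+ +The topological levels above are expressed purely through node labels. As a minimal sketch, a node participating in the region/zone/rack topology described earlier might carry labels such as the following (the node name, label keys, and label values are illustrative only, and the exact node object schema depends on the Kubernetes API version in use): +---- +{ + "apiVersion" : "v1", + "kind" : "Node", + "metadata" : { + "name" : "node1.example.com", + "labels" : {"region" : "r1", "zone" : "z1", "rack" : "s1"} + } +} +----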
+ + +== Sample Policy Configurations +The following configuration shows the default scheduler policy, as it would appear if specified via the scheduler policy file: +---- +{ + "kind" : "Policy", + "version" : "v1", + "predicates" : [ + {"name" : "PodFitsPorts"}, + {"name" : "PodFitsResources"}, + {"name" : "NoDiskConflict"}, + {"name" : "MatchNodeSelector"}, + {"name" : "HostName"} + ], + "priorities" : [ + {"name" : "LeastRequestedPriority", "weight" : 1}, + {"name" : "BalancedResourceAllocation", "weight" : 1}, + {"name" : "ServiceSpreadingPriority", "weight" : 1} + ] +} +---- + +IMPORTANT: In all of the sample configurations below, the list of predicates and priority functions is truncated to include only those that pertain to the use case specified. In practice, a complete and meaningful scheduler policy should include most, if not all, of the default predicates and priority functions listed above. + +Three topological levels defined as region (affinity) --> zone (affinity) --> rack (anti-affinity): +---- +{ + "kind" : "Policy", + "version" : "v1", + "predicates" : [ + ... + {"name" : "RegionZoneAffinity", "argument" : {"serviceAffinity" : {"labels" : ["region", "zone"]}}} + ], + "priorities" : [ + ... + {"name" : "RackSpread", "weight" : 1, "argument" : {"serviceAntiAffinity" : {"label" : "rack"}}} + ] +} +---- + +Three topological levels defined as city (affinity) --> building (anti-affinity) --> room (anti-affinity): +---- +{ + "kind" : "Policy", + "version" : "v1", + "predicates" : [ + ... + {"name" : "CityAffinity", "argument" : {"serviceAffinity" : {"labels" : ["city"]}}} + ], + "priorities" : [ + ... + {"name" : "BuildingSpread", "weight" : 1, "argument" : {"serviceAntiAffinity" : {"label" : "building"}}}, + {"name" : "RoomSpread", "weight" : 1, "argument" : {"serviceAntiAffinity" : {"label" : "room"}}} + ] +} +---- + +Only use nodes with the 'region' label defined, and prefer nodes with the 'zone' label defined: +---- +{ + "kind" : "Policy", + "version" : "v1", + "predicates" : [ + ... + {"name" : "RequireRegion", "argument" : {"labelsPresence" : {"labels" : ["region"], "presence" : true}}} + ], + "priorities" : [ + ... + {"name" : "ZonePreferred", "weight" : 1, "argument" : {"labelPreference" : {"label" : "zone", "presence" : true}}} + ] +} +---- + +An example combining static and configurable predicates and priority functions: +---- +{ + "kind" : "Policy", + "version" : "v1", + "predicates" : [ + ... + {"name" : "RegionAffinity", "argument" : {"serviceAffinity" : {"labels" : ["region"]}}}, + {"name" : "RequireRegion", "argument" : {"labelsPresence" : {"labels" : ["region"], "presence" : true}}}, + {"name" : "BuildingNodesAvoid", "argument" : {"labelsPresence" : {"labels" : ["building"], "presence" : false}}}, + {"name" : "PodFitsPorts"}, + {"name" : "MatchNodeSelector"} + ], + "priorities" : [ + ... + {"name" : "ZoneSpread", "weight" : 2, "argument" : {"serviceAntiAffinity" : {"label" : "zone"}}}, + {"name" : "ZonePreferred", "weight" : 1, "argument" : {"labelPreference" : {"label" : "zone", "presence" : true}}}, + {"name" : "ServiceSpreadingPriority", "weight" : 1} + ] +} +---- + + +== Scheduler Extensibility +As with most components in Kubernetes and OpenShift, the scheduler is built using a plugin model, and the current implementation is itself a plugin. There are two ways to extend the scheduler functionality.
+ +=== Enhancements +The scheduler functionality can be enhanced by adding new predicates and priority functions. They can either be contributed upstream or maintained separately. These predicates and priority functions would need to be registered with the scheduler factory and then specified in the scheduler policy file. + +=== Replacement +Since the scheduler is a plugin, it can be replaced in favor of an alternate implementation. The scheduler code has a clean separation that watches new pods as they get created and identifies the most suitable node to host them. It then creates bindings (pod to node bindings) for the pods using the master API. diff --git a/dev_guide/scheduler.adoc b/dev_guide/scheduler.adoc deleted file mode 100644 index 444345b9bba0..000000000000 --- a/dev_guide/scheduler.adoc +++ /dev/null @@ -1,126 +0,0 @@ -= Scheduler -{product-author} -{product-version} -:data-uri: -:icons: -:experimental: -:toc: macro -:toc-title: - -toc::[] - -== Overview -The Kubernetes pod scheduler is responsible for scheduling new pods onto nodes within the cluster. It reads data from the pod and tries to find a node that is a good fit based on configured policies. It is completely independent and exists as a standalone/pluggable solution. It does not modify the pod and just creates a binding for the pod that ties the pod to the particular node. - -== Generic Scheduler -The existing generic scheduler is the default platform-provided scheduler "engine" that selects a node to host the pod in a 3-step operation: - -=== Step 1: Filter the nodes -The available nodes are filtered based on the constraints or requirements specified. This is done by running each of the nodes through the list of filter functions called 'predicates'. - -=== Step 2: Prioritize the filtered list of nodes -This is achieved by passing each node through a series of 'priority' functions that assign it a score between 0 - 10, with 0 indicating a bad fit and 10 indicating a good fit to host the pod. The scheduler configuration can also take in a simple "weight" (positive numeric value) for each priority function. The node score provided by each priority function is multiplied by the "weight" (default weight is 1) and then combined by just adding the scores for each node provided by all the priority functions. This weight attribute can be used by administrators to give higher importance to some priority functions. - -=== Step 3: Select the best fit node -The nodes are sorted based on their scores and the node with the highest score is selected to host the pod. If multiple nodes have the same high score, then one of them is selected at random. - - -== Scheduler Policy -The selection of the predicate and priority functions defines the policy for the scheduler. Several predicate and priority functions are provided within Kubernetes. There are two ways to define the scheduler policy. Administrators can either specify an algorithm provider or they can provide a JSON configuration file that specifies the predicates and priority functions to configure the scheduler. Both the algorithm provider and the configuration file can be specified using the command line for the scheduler executable. - - -=== Algorithm Providers -'Providers' can be created to specify a pre-determined combination of the predicates and priority functions along with their weights. A few providers, including a default provider, have been created for providing the scheduler configuration. Creating a provider is not something that a user/administrator can do. 
However, administrators can select a particular algorithm provider for their scheduler configuration via the command line. - - -=== Available Predicates -A custom set of predicates can be specified to configure the scheduler. There are several predicates provided out-of-the-box in Kubernetes. Some of these predicates can be customized by providing certain parameters. Multiple predicates can be combined to provide additional filtering of nodes. - -==== Fixed Predicates - -1. 'PodFitsPorts' deems a node to be fit for hosting a pod based on the absence of port conflicts -1. 'PodFitsResources' determines a fit baased resource availability. The nodes can declare their resource capacities and then pods can specify what resources they require. Fit is based on requested, rather than used resources. -1. 'NoDiskConflict' determines fit based on non-conflicting disk volumes. It evaluates if a pod can fit due to the volumes it requests, and those that are already mounted. -1. 'MatchNodeSelector' determines fit based on node selector query that is defined in the pod. -1. 'HostName' determines fit based on the presence of the Host parameter and a string match with the name of the host - -==== Configurable Predicates - -1. 'ServiceAffinity' filters out nodes that do not belong to the specified topological level defined by the provided labels. This predicate takes in a list of labels and ensures affinity within the nodes (that have the same label values) for pods belonging to the same service. If the pod specifies a value for the labels in its NodeSelector, then the nodes matching those labels are the ones where the pod is scheduled. If the pod does not specify the labels in its NodeSelector, then the first pod can be placed on any node based on availability and all subsequent pods of the service will be scheduled on nodes that have the same label values. -1. 'LabelsPresence' checks whether a particular node has a certain label defined or not, regardless of value - - -=== Available Priority Functions -A custom set of priority functionis can be specified to configure the scheduler. There are several priority functions provided out-of-the-box in Kubernetes. Some of these priority functions can be customized by providing certain parameters. Multiple priority functions can be combined and weights can be given to each in order to impact the prioritization. The default weight is 1. - -==== Fixed Priority Functions - -1. 'LeastRequestedPriority' favors nodes with fewer requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes nodes that have the highest available/remaining capacity. -1. 'ServiceSpreadingPriority' spreads pods by minimizing the number of pods belonging to the same service onto the same machine -1. 'EqualPriority' gives an equal weight of one to all nodes - -==== Configurable Priority Functions - -1. 'ServiceAntiAffinity' takes a label and ensures a good spread of the pods belonging to the same service across the group of nodes based on the label values. It gives the same score to all nodes that have the same value for the specified label. It gives a highter score to nodes within a group with the least concentration of pods. - - -== Affinity and Anti Affinity -One of the important use cases for scheduling within OpenShift is to support flexible affinity and anti-affinity policies. Administrators can define multiple topological levels for their infrastructure (nodes). 
This is done by specifying labels on nodes (eg: region = r1, zone = z1, rack = s1). These label names have no particular meaning and administrators are free to name their infrastructure levels anything (eg, city/building/room). Also, administrators can define any number of nested levels for their infrastructure topology, with three levels being usually adequate (eg. regions --> zones --> racks). Lastly, administrators can specify affinity and anti-affinity rules at each of these levels in any combination. - - -== Sample Policy Configurations -Three topological levels defined as region (affinity) --> zone (affinity) --> rack (anti-affinity) ----- -{ - "predicates" : [ - {"name" : "RegionZoneAffinity", "argument" : {"serviceAffinity" : {"labels" : ["region", "zone"]}}} - ], - "priorities" : [ - {"name" : "RackSpread", "weight" : 1, "argument" : {"serviceAntiAffinity" : {"label" : "rack"}}} - ] -} ----- - -Three topological levels defined as city (affinity) --> building (anti-affinity) --> room (anti-affinity) ----- -{ - "predicates" : [ - {"name" : "CityAffinity", "argument" : {"serviceAffinity" : {"labels" : ["city"]}}} - ], - "priorities" : [ - {"name" : "BuildingSpread", "weight" : 1, "argument" : {"serviceAntiAffinity" : {"label" : "building"}}}, - {"name" : "RoomSpread", "weight" : 1, "argument" : {"serviceAntiAffinity" : {"label" : "room"}}} - ] -} ----- - -Only use nodes with the 'region' label defined and prefer nodes with the 'zone' label defined ----- -{ - "predicates" : [ - {"name" : "RequireRegion", "argument" : {"labelsPresence" : {"labels" : ["region"], "presence" : true}}}, - - ], - "priorities" : [ - {"name" : "ZonePreferred", "weight" : 1, "argument" : {"labelPreference" : {"label" : "zone", "presence" : true}}}, - ] -} ----- - -Combining fixed and configurable predicates and priority functions ----- -{ - "predicates" : [ - {"name" : "RegionAffinity", "argument" : {"serviceAffinity" : {"labels" : ["region"]}}}, - {"name" : "RequireRegion", "argument" : {"labelsPresence" : {"labels" : ["region"], "presence" : true}}}, - {"name" : "BuildingNodesAvoid", "argument" : {"labelsPresence" : {"labels" : ["building"], "presence" : false}}}, - {"name" : "PodFitsPorts"}, - {"name" : "MatchNodeSelector} - ], - "priorities" : [ - {"name" : "ZoneSpread", "weight" : 2, "argument" : {"serviceAntiAffinity" : {"label" : "zone"}}}, - {"name" : "ZonePreferred", "weight" : 1, "argument" : {"labelPreference" : {"label" : "zone", "presence" : true}}}, - {"name" : "ServiceSpreadingPriority", "weight" : 1} - ] -} -----