@@ -93,9 +93,9 @@ but slightly different non-goals.
9393 stable and continue to operate reliably, but may respond more slowly.
9494- ** This enhancement does not address mixing nodes with pinning and without,
9595 this feature will be enabled cluster wide and encapsulate both master and
96- worker pools. If it's not desired then the setting will still be turned on but
97- the management workloads will run on the whole CPU set for that desired
98- pool.**
96+ worker pools. If pinning is not desired, the feature will still be enabled by
97+ default, but the management workloads will run on the whole CPU set for that
98+ pool.**
9999- ** This enhancement assumes that the configuration of a management CPU pool is
100100 done as part of installing the cluster. It can be changed after the fact but
101101 we will need to stipulate that this is currently not supported. The intent
@@ -105,16 +105,29 @@ but slightly different non-goals.
105105
106106We will need to maintain a global identifier that is set during installation and
107107can not be easily removed after the fact. This approach will help remove
108- exposing this feature via an API and limiting the chances that a
109- misconfiguration can cause un-recoverable scenarios for our customers. At
110- install time we will also apply an initial machine config for workload
111- partitioning that sets a default CPUSet for the whole CPUSet. Effectively this
112- will behave as if workload partitioning is not turned on. With this approach we
113- eliminate the race condition that can occur if we apply a machine config after
114- the fact as nodes join the cluster. When a customer wishes to pin the management
108+ exposing this feature via an API and limit the chances that a misconfiguration
109+ can cause unrecoverable scenarios for our customers. At install time we will
110+ also apply an initial machine config for workload partitioning that sets a
111+ default CPUSet covering the whole CPU set. Effectively this will behave as if
112+ workload partitioning is not turned on. When a customer wishes to pin the management
115113workloads they will be able to do that via the existing Performance Profile.
116114Resizing partition size will not cause any issues after installation.
117115
116+ With this approach we eliminate the race condition that can occur if we apply
117+ the machine config after bootstrap via NTO. Since we create a "default" CRI-O
118+ and kubelet configuration that does not specify a CPUSet, customers do not have
119+ to worry about configuring correct bounds for their machines and risking
120+ misconfiguration.
121+
122+ Furthermore, as machines join the cluster they will have the feature turned on
123+ before kubelet and the API server come up, as they query the
124+ `machine-config-server` for their configurations before joining. This also
125+ allows us more flexibility and gives the customer an easier interface, since
126+ customers only need to interact with the Performance Profile to set their
127+ `reserved` and `isolated` CPUSets. This makes things less error prone, as not
128+ only can the CPUSets differ between workers and masters, but the machines
129+ themselves might have vastly different core counts.
130+
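As a point of reference, a minimal sketch of the Performance Profile interaction
described above is shown below; the CPU ranges, profile name, and node selector
are hypothetical values, not recommendations.

```yaml
# Hypothetical example only: the CPU ranges, name, and node selector are
# placeholders and must be sized for the actual hardware in each pool.
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: worker-cpu-partitioning
spec:
  cpu:
    # Management workloads are pinned to the reserved set; everything else
    # runs on the isolated set.
    reserved: "0-1"
    isolated: "2-15"
  nodeSelector:
    node-role.kubernetes.io/worker: ""
```

A separate profile with different ranges could target the master pool, since the
two pools may have very different core counts.
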
118131In order to implement this enhancement we are proposing changing 3 components
119132defined below.
120133
@@ -129,8 +142,8 @@ defined below.
129142 in openshift/kubernetes.
130143 - This change will be in support of checking the global identifier in order
131144 to modify the pod spec with the correct ` requests ` .
132- 3 . The
133- [ Performance Profile Controller] ( https://github.com/openshift/cluster-node-tuning-operator/blob/master/docs/performanceprofile/performance_profile.md )
145+ 3. The [Performance Profile
146+    Controller](https://github.com/openshift/cluster-node-tuning-operator/blob/master/docs/performanceprofile/performance_profile.md)
134147 part of [ Cluster Node Tuning
135148 Operator] ( https://github.com/openshift/cluster-node-tuning-operator )
136149 - This change will support adding the ability to explicitly pin
@@ -161,7 +174,7 @@ type InfrastructureStatus struct {
161174 // set via the Node Tuning Operator and the Performance Profile API
162175 // +kubebuilder:default=None
163176 // +kubebuilder:validation:Enum=None;AllNodes
164- CPUPartitioning PartitioningMode ` json:"cpuPartitioning"`
177+ CPUPartitioning CPUPartitioningMode ` json:"cpuPartitioning"`
165178}
166179
167180// PartitioningMode defines the CPU partitioning mode of the nodes.
@@ -175,48 +188,22 @@ const (
175188)
176189```
177190
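To illustrate the proposed identifier, the cluster-scoped `Infrastructure` object
would then carry the partitioning mode roughly as follows (a sketch derived from
the Go type above, showing only the relevant fields):

```yaml
# Sketch of the proposed status field on the cluster Infrastructure object.
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  # Set at install time; defaults to None. AllNodes enables CPU partitioning
  # cluster wide.
  cpuPartitioning: AllNodes
```
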
178- ### Openshift Installer
179-
180- We will need to modify the Openshift installer to set and generate the machine
181- configs for the initial setup. The generated machine config manifests will be
182- set to be wide open to the whole CPU set. However, because these manifests are
183- applied early on in the process we avoid race condition situations that might
184- arise if these are applied after installation.
185-
186- In the similar approach to the ` openshift/api ` change we propose adding a new
187- feature to the install configuration that will flag a cluster for CPU
188- partitioning during installation.
189-
190- ``` go
191- // CPUPartitioningMode is a strategy for how various endpoints for the cluster are exposed.
192- // +kubebuilder:validation:Enum=None;AllNodes;MasterNodes;WorkerNodes
193- type CPUPartitioningMode string
194-
195- const (
196- CPUPartitioningNone CPUPartitioningMode = " None"
197- CPUPartitioningAllNodes CPUPartitioningMode = " AllNodes"
198- )
199-
200- type InstallConfig struct {
201- // CPUPartitioning configures if a cluster will be used with CPU partitioning
202- //
203- // +kubebuilder:default=None
204- // +optional
205- CPUPartitioning CPUPartitioningMode ` json:"cpuPartitioning,omitempty"`
206- }
207- ```
208-
209191### Admission Controller
210192
211193We want to remove the checks in the admission controller which specifically
212- verifies that partitioning is only applied to single node topology configuration.
213- The design and configuration for any pod modification will remain the same, we
214- simply will allow you to apply partitioning on non single node topologies.
194+ verifies that partitioning is only applied to a single node topology
195+ configuration. The design and configuration for any pod modification will remain
196+ the same; we will simply allow partitioning to be applied on non single node
197+ topologies.
215198
216199We will use the global identifier to correctly modify the pod spec with the
217200` requests.cpu ` for the new ` requests[management.workload.openshift.io/cores] `
218201that are used by the workload partitioning feature.
219202
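To make the mutation concrete, a rough before and after sketch of a pod is shown
below. The pod name, container name, image, and request values are hypothetical,
the annotation shape mirrors the examples later in this document, and the exact
quantity conversion for the extended resource is simplified.

```yaml
# Before admission: a pod annotated for management workloads, created in a
# namespace annotated with workload.openshift.io/allowed: management.
apiVersion: v1
kind: Pod
metadata:
  name: example-operator
  annotations:
    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
spec:
  containers:
    - name: operator
      image: example.io/operator:latest
      resources:
        requests:
          cpu: 400m
          memory: 100Mi
---
# After admission: the cpu request is replaced by the extended resource
# (shown here as millicores) and a CRI-O annotation is added per container.
apiVersion: v1
kind: Pod
metadata:
  name: example-operator
  annotations:
    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
    resources.workload.openshift.io/operator: '{"cpushares": 400}'
spec:
  containers:
    - name: operator
      image: example.io/operator:latest
      resources:
        requests:
          management.workload.openshift.io/cores: "400"
          memory: 100Mi
```
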
203+ However, for Single Node we will continue to perform the conventional check in
204+ order to support the upgrade flow from 4.11 -> 4.12. After the 4.12 release that
205+ logic should no longer be needed and will be removed.
206+
220207### Performance Profile Controller
221208
222209Currently workload partitioning involves configuring CRI-O and Kubelet earlier
@@ -252,12 +239,11 @@ spec:
252239
253240To support upgrades and maintain better signaling for the cluster, the
254241Performance Profile Controller will also inspect the Nodes to update a global
255- identifier at start up. We will only update the identifier to true if our
256- criteria is met, otherwise we will not change the state to "off" as disabling this
257- feature is not supported. We will determine to be in workload pinning mode if
258- all the master nodes are configured with a capacity
259- (` management.workload.openshift.io/cores`) for workload partitioning. If that is
260- true we will set the global identifier otherwise we leave it as is.
242+ identifier at start up. We will only update the identifier to `AllNodes` if we
243+ are running in Single Node and our Node has the capacity resource
244+ (`management.workload.openshift.io/cores`); this covers 4.11 -> 4.12 upgrades.
245+ This logic should not be needed after 4.12 for any cluster. It should have no
246+ bearing on 4.11 HA/3NC clusters as this feature will not be backported.
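
In practice that check amounts to looking for the extended resource under the
node's capacity; a trimmed, hypothetical node status is sketched below (the
advertised quantity is a placeholder).

```yaml
# Trimmed sketch of a node where workload partitioning is already active; the
# check looks for the management.workload.openshift.io/cores capacity.
apiVersion: v1
kind: Node
metadata:
  name: control-plane-0
status:
  capacity:
    cpu: "8"
    management.workload.openshift.io/cores: "8000"
```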
261247
262248### Workflow Description
263249
@@ -313,8 +299,8 @@ Node-->>MCO: Finished Restart
3132993. Alice adds the default machine configs and desired PerformanceProfile
314300 manifest for workload partitioning to the `openshift` folder that was
315301 generated by the installer.
316- 4. Alice updates the `Infrastructure` CR status
317- to denote that workload partitioning is turned on.
302+ 4. Alice updates the `Infrastructure` CR status to denote that workload
303+ partitioning is turned on.
3183045. Alice then creates the cluster via the installer.
3193056. The installer will apply the manifests and during the bootstrapping process
320306 the MCO will apply the default configurations for workload partitioning, and
@@ -324,7 +310,8 @@ Node-->>MCO: Finished Restart
3243108. The MCO applies the updated workload partitioning configurations and restarts
325311 the relevant nodes.
3263129. Alice will now have a cluster that has been setup with workload partitioning
327- and the desired workloads pinned to the specified CPUSet in the PerformanceProfile.
313+ and the desired workloads pinned to the specified CPUSet in the
314+ PerformanceProfile.
328315
329316Applying CPU Partitioning Size Change
330317
@@ -357,8 +344,8 @@ NTO-->>API: PerformanceProfile Applied
357344 workloads to pin.
3583452. The NTO will generate the appropriate machine configs that include the
359346 Kubelet config and the CRIO config and apply the machine configs.
360- 3. Once the MCO applies the configs, the node is restarted and the cluster
361- has been updated with the desired workload pinning.
347+ 3. Once the MCO applies the configs, the node is restarted and the cluster has
348+ been updated with the desired workload pinning.
3623494. Alice will now have a cluster that has been setup with workload pinning.
363350
364351# ### Variation [optional]
@@ -373,34 +360,36 @@ management CPU pool.
373360> [Management Workload Partitioning](management-workload-partitioning.md)
374361
3753621. User sits down at their computer.
376- 2. **The user creates a `PerformanceProfile` resource with the desired `isolated`
377- and `reserved` CPUSet with the `cpu.workloads[Infrastructure]` added to the
378- enum list.**
379- 3. **The user will set the installer configuration for CPU partitioning to
380- AllNodes, `cpuPartitioning : AllNodes` .**
363+ 2. **The user creates a `PerformanceProfile` resource with the desired
364+    `isolated` and `reserved` CPUSet, and with `cpu.workloads[Infrastructure]`
365+    added to the enum list.**
366+ 3. **The user updates the `Infrastructure` CR status to denote that workload
367+    partitioning is turned on.**
3813684. The user runs the installer to create the standard manifests, adds their
382369 extra manifests from steps 2, then creates the cluster.
383- 5. **NTO will generate the machine config manifests and apply them.**
384- 6. The kubelet starts up and finds the configuration file enabling the new
370+ 5. The kubelet starts up and finds the configuration file enabling the new
385371 feature.
386- 7 . The kubelet advertises `management.workload.openshift.io/cores` extended
372+ 6. The kubelet advertises `management.workload.openshift.io/cores` extended
387373 resources on the node based on the number of CPUs in the host.
388- 8 . The kubelet reads static pod definitions. It replaces the `cpu` requests with
374+ 7. The kubelet reads static pod definitions. It replaces the `cpu` requests with
389375 ` management.workload.openshift.io/cores` requests of the same value and adds
390376 the `resources.workload.openshift.io/{container-name} : {"cpushares": 400}`
391377 annotations for CRI-O with the same values.
392- 9. Something schedules a regular pod with the
393- ` target.workload.openshift.io/management` annotation in a namespace with the
394- `workload.openshift.io/allowed : management` annotation.
395- 10. The admission hook modifies the pod, replacing the CPU requests with
378+ 8. **NTO will generate the machine config manifests and apply them.**
379+ 9. **MCO modifies the kubelet and CRI-O configurations of the relevant machine
380+    pools to use the updated `reserved` CPU cores and restarts the nodes.**
381+ 10. Something schedules a regular pod with the
382+     `target.workload.openshift.io/management` annotation in a namespace with the
383+     `workload.openshift.io/allowed: management` annotation.
384+ 11. The admission hook modifies the pod, replacing the CPU requests with
396385 ` management.workload.openshift.io/cores` requests and adding the
397386 `resources.workload.openshift.io/{container-name} : {"cpushares": 400}`
398387 annotations for CRI-O.
399- 11 . The scheduler sees the new pod and finds available
400- ` management.workload.openshift.io/cores` resources on the node. The scheduler
401- places the pod on the node.
402- 12 . Repeat steps 8-10 until all pods are running.
403- 13 . Cluster deployment comes up with management components constrained to subset
388+ 12. The scheduler sees the new pod and finds available
389+     `management.workload.openshift.io/cores` resources on the node. The
390+     scheduler places the pod on the node.
391+ 13. Repeat steps 10-12 until all pods are running.
392+ 14. Cluster deployment comes up with management components constrained to a subset
404393 of available CPUs.
405394
406395##### Partition Resize workflow
@@ -453,16 +442,16 @@ spec:
453442#### Changes to NTO
454443
455444The NTO PerformanceProfile will be updated to support a new flag which will
456- toggle the workload pinning to the `isolated ` cores. The idea here being to
445+ toggle the workload pinning to the `reserved` cores. The idea here being to
457446simplify the approach for how customers set this configuration. With PAO being
458447part of NTO now ([see here for more info](../node-tuning/pao-in-nto.md)) this
459448affords us the chance to consolidate the configuration for `kubelet` and `crio`.
460449
461450We will modify the code path that generates the [new machine
462451config](https://github.com/openshift/cluster-node-tuning-operator/blob/a780dfe07962ad07e4d50c852047ef8cf7b287da/pkg/performanceprofile/controller/performanceprofile/components/machineconfig/machineconfig.go#L91-L127)
463- using the performance profile. With the new `spec.workloads[Infrastructure]` enum we
464- will add the configuration for `crio` and `kubelet` to the final machine config
465- manifest. Then the existing code path will apply the change as normal.
452+ using the performance profile. With the new `spec.workloads[Infrastructure]`
453+ enum we will add the configuration for `crio` and `kubelet` to the final machine
454+ config manifest. Then the existing code path will apply the change as normal.
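
A rough sketch of the kind of machine config this could produce is shown below;
the object name, role label, file paths, and encoded contents are illustrative
assumptions rather than the final implementation.

```yaml
# Illustrative sketch only: names, labels, paths, and contents are assumptions.
# The real manifest would embed the rendered CRI-O and kubelet configuration,
# base64 encoded, in the data URLs.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 02-master-workload-partitioning
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        # CRI-O drop-in mapping the management workload annotation to the
        # reserved CPUSet (contents elided).
        - path: /etc/crio/crio.conf.d/01-workload-partitioning
          mode: 420
          contents:
            source: data:text/plain;charset=utf-8;base64,<crio-config>
        # Kubelet pinning configuration for the management CPUSet
        # (contents elided).
        - path: /etc/kubernetes/openshift-workload-pinning
          mode: 420
          contents:
            source: data:text/plain;charset=utf-8;base64,<kubelet-config>
```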
466455
467456#### API Server Admission Hook
468457
@@ -488,8 +477,11 @@ Changed Path:
488477
4894781. Check if `pod` is a static pod
490479 - Skips modification attempt if it is static.
491- 2. Checks if currently running cluster has global identifier for partitioning set
492- - Skips modification if identifier partitioning set to `None`
480+ 2. Checks if the currently running cluster has the global identifier for
481+    partitioning set
482+    - Skips modification if the identifier is set to `None`, unless running
483+      Single Node, in which case the old logic is used to maintain the 4.11 ->
484+      4.12 upgrade path.
4934853. Checks what resource limits and requests are set on the pod
494486 - Skips modification if QoS is guaranteed or both limits and requests are set
495487 - Skips modification if after update the QoS is changed
@@ -509,17 +501,15 @@ A risk we can run into is that a customer can apply a CPU set that is too small
509501or out of bounds, which can cause problems such as extremely poor performance or start
510502up errors. Mitigation of this scenario will be to provide proper guidance and
511503guidelines for customers who enable this enhancement. As mentioned in our goal
512- we do support re-configuring the CPUSet partition size after installation. The
513- performance team would need to be reached out to for more specific information
514- around upper and lower bounds of CPU sets for running an Openshift cluster
504+ we do support re-configuring the CPUSet partition size after installation.
515505
516506It is possible to build a cluster with the feature enabled and then add a node
517507in a way that does not configure the workload partitions only for that node. We
518508do not support this configuration as all nodes must have the feature turned on.
519- The risk that a customer will run into here is that if that node is not in a pool
520- configured with workload partitioning, then it might not be able to correctly
521- function at all. Things such as networking pods might not work
522- as those pods will have the custom `requests`
509+ The risk that a customer will run into here is that if that node is not in a
510+ pool configured with workload partitioning, then it might not be able to
511+ correctly function at all. Things such as networking pods might not work as
512+ those pods will have the custom `requests`
523513` management.workload.openshift.io/cores` . In this situation the mitigation is
524514for the customer to add the node to a pool that contains the configuration for
525515workload partitioning.
@@ -587,21 +577,22 @@ the `workload.openshift.io/allowed` annotation.
587577This new behavior will be added in 4.12 as part of the installation
588578configurations for customers to utilize.
589579
590- Enabling the feature after installation for HA/3NC is not supported in 4.12, so we do not
591- need to address what happens if an older cluster upgrades and then the feature
592- is turned on.
580+ Enabling the feature after installation for HA/3NC is not supported in 4.12, so
581+ we do not need to address what happens if an older cluster upgrades and then the
582+ feature is turned on.
593583
594- When upgrades occur for current single node deployments we will need to set the global
595- identifier during the upgrade. We will do this via the NTO and the trigger for
596- this event will be :
584+ When upgrades occur for current single node deployments we will need to set the
585+ global identifier during the upgrade. We will do this via the NTO and the
586+ trigger for this event will be:
597587
598- - If the `capacity` field set on the master node
588+ - If the `capacity` field is set on the master node and the cluster is running in Single Node
599589
600590We will not change the current machine configs for single node deployments if
601591they are already set, this will be done to avoid extra restarts. We will need to
602592be clear with customers however, if they add the
603- ` spec.workloads[Infrastructure]` we will then take that opportunity to
604- consolidate the machine configs and clean up the old way of deploying things.
593+ `spec.workloads[Infrastructure]` field, we will then generate the new machine
594+ config and an extra restart will happen. They will need to delete the old
595+ machine configs afterwards.
605596
606597### Version Skew Strategy
607598