---
title: wide-availability-workload-partitioning
authors:
  - "@eggfoobar"
reviewers:
  - TBD
approvers:
  - TBD
api-approvers:
  - TBD
creation-date: 2022-08-03
last-updated: 2022-08-08
tracking-link:
  - https://issues.redhat.com/browse/CNF-5562
see-also:
  - "/enhancements/workload-partitioning"
  - "/enhancements/node-tuning/pao-in-nto.md"
---

# Wide Availability Workload Partitioning

## Summary

This enhancement builds on top of the [Management Workload
Partitioning](management-workload-partitioning.md) enhancement and the [move of
PAO into NTO](../node-tuning/pao-in-nto.md) to provide the ability to do
workload partitioning in our wider cluster configurations. The previous workload
partitioning work was limited to Single Node cluster configurations; this
enhancement seeks to allow customers to configure workload partitioning on HA as
well as Compact (3NC) clusters.

## Motivation

Customers who want us to reduce the resource consumption of management workloads
have a fixed budget of CPU cores in mind. We want to use the normal scheduling
capabilities of Kubernetes to manage the number of pods that can be placed onto
those cores, and we want to avoid mixing management and normal workloads there.
Expanding on the workload partitioning already built, we should be able to
supply the same functionality to HA and 3NC clusters.

### User Stories

As a cluster creator I want to isolate the management pods of OpenShift in
compact (3NC) and HA clusters to specific CPU sets, so that I can isolate the
platform workloads from the application workloads due to the high performance
and determinism required of my applications.

### Goals

- This enhancement describes an approach for configuring OpenShift clusters to
  run with management workloads on a restricted set of CPUs.
- Clusters built in this way should pass the same Kubernetes and OpenShift
  conformance and functional end-to-end tests as similar deployments that are
  not isolating the management workloads.
- We want to be able to run different workload partitioning on masters and
  workers.
- Customers will be advised to reserve 4 hyperthreaded cores for masters and 2
  hyperthreaded cores for workers.
- We want a general approach that can be applied to all OpenShift control plane
  and per-node components via the PerformanceProfile.

### Non-Goals

This enhancement expands on the existing [Management Workload
Partitioning](management-workload-partitioning.md) and as such shares similar,
but slightly different, non-goals:

- This enhancement is focused on CPU resources. Other compressible resource
  types may need to be managed in the future, and those are likely to need
  different approaches.
- This enhancement does not address mixed node partitioning; this feature will
  be enabled cluster wide and encompass both the master and worker pools. If
  partitioning is not desired for a pool, the setting will still be turned on,
  but the management workloads will run on the whole CPU set of those nodes.
- This enhancement does not address non-compressible resource requests, such as
  for memory.
- This enhancement does not address ways to disable operators or operands
  entirely.
- This enhancement does not address reducing actual utilization, beyond
  providing a way to have a predictable upper bound. There is no expectation
  that a cluster configured to use a small number of cores for management
  services would offer exactly the same performance as the default. It must be
  stable and continue to operate reliably, but may respond more slowly.
- This enhancement assumes that the configuration of a management CPU pool is
  done as part of installing the cluster. It can be changed after the fact, but
  we will need to stipulate that doing so is currently not supported. The intent
  is for this to be supported as a day 0 feature.
- This enhancement describes partitioning concepts that could be expanded to be
  used for other purposes. Use cases for partitioning workloads for other
  purposes may be addressed by future enhancements.

## Proposal

In order to implement this enhancement we are focused on changing two
components:

1. The admission controller ([management cpus
   override](https://github.com/openshift/kubernetes/blob/a9d6306a701d8fa89a548aa7e132603f6cd89275/openshift-kube-apiserver/admission/autoscaling/managementcpusoverride/doc.go))
   in openshift/kubernetes.
1. The
   [PerformanceProfile](https://github.com/openshift/cluster-node-tuning-operator/blob/master/docs/performanceprofile/performance_profile.md)
   part of the [Cluster Node Tuning
   Operator](https://github.com/openshift/cluster-node-tuning-operator).

We want to remove the check in the admission controller that restricts
partitioning to the single node topology configuration. The design and
configuration for any pod modification will remain the same; we will simply
allow partitioning to be applied to non-single-node topologies.

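For context, the topology check in question is driven by the cluster
`Infrastructure` resource from the standard `config.openshift.io/v1` API. As a
hedged illustration (exactly how the admission plugin consumes these fields is
an implementation detail), a status such as the one below, reported by HA and
compact clusters, would no longer cause the plugin to skip pod mutation:

```yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  # Single node clusters report SingleReplica here; HA and compact (3NC)
  # clusters report HighlyAvailable, which the relaxed check would now accept.
  controlPlaneTopology: HighlyAvailable
  infrastructureTopology: HighlyAvailable
```
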
Workload pinning involves configuring CRI-O and Kubelet. Currently, this is done
through a machine config that contains both of those configurations. This can
pose problems because the CPU set value has to be copied into both of those
configurations. We want to simplify the current implementation and apply both of
these configurations via the `PerformanceProfile` CRD.

We want to add a new `workloads` field under the `cpu` field that contains the
configuration information for `enablePinning`. We are not sure where we will
want to take workload pinning in the future, so to allow flexibility we are
placing the configuration under `cpu.workloads`.

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: openshift-node-workload-partitioning-custom
spec:
  cpu:
    isolated: 2-3
    reserved: 0,1
    # New addition
    workloads:
      enablePinning: true
```

### Workflow Description

The end user will be expected to provide a `PerformanceProfile` manifest that
describes their desired `isolated` and `reserved` CPU sets and sets the
`workloads.enablePinning` flag to true. This manifest will be applied during the
installation process.

**High level sequence diagram:**

```mermaid
sequenceDiagram
Alice->>Installer: Provide PerformanceProfile manifest
Installer-->>NTO: Apply
NTO-->>MCO: Generated Machine Manifests
MCO-->>Node: Configure node
loop Apply
    Node->>Node: Set kubelet config
    Node->>Node: Set crio config
    Node->>Node: Kubelet advertises cores
end
Node-->>MCO: Finished Restart
MCO-->>NTO: Machine Manifests Applied
NTO-->>Installer: PerformanceProfile Applied
Installer-->>Alice: Cluster is Up!
```

- **Alice** is a human user who creates an OpenShift cluster.
- **Installer** is the assisted installer that applies the user's manifests.
- **NTO** is the Node Tuning Operator.
- **MCO** is the Machine Config Operator.
- **Node** is the Kubernetes node.

1. Alice sits down and provides the desired performance profile as an extra
   manifest to the installer.
1. The installer applies the manifest.
1. The NTO generates the appropriate machine configs, which include the kubelet
   config and the CRI-O config, in addition to its existing operation.
1. Once the MCO applies the configs, the node is restarted and the cluster
   installation continues to completion.
1. Alice will now have a cluster that has been set up with workload pinning.

#### Variation [optional]

This section outlines an end-to-end workflow for deploying a cluster with
workload partitioning enabled and how pods are correctly scheduled to run on the
management CPU pool.

1. User sits down at their computer.
1. The user creates a `PerformanceProfile` resource with the desired `isolated`
   and `reserved` CPU sets and with `cpu.workloads.enablePinning` set to true.
1. The user runs the installer to create the standard manifests, adds their
   extra manifest from step 2, then creates the cluster.
1. NTO generates the machine config manifests and applies them.
1. The kubelet starts up and finds the configuration file enabling the new
   feature.
1. The kubelet advertises `management.workload.openshift.io/cores` extended
   resources on the node based on the number of CPUs in the host.
1. The kubelet reads static pod definitions. It replaces the `cpu` requests with
   `management.workload.openshift.io/cores` requests of the same value and adds
   the `resources.workload.openshift.io/{container-name}: {"cpushares": 400}`
   annotations for CRI-O with the same values.
1. Something schedules a regular pod with the
   `target.workload.openshift.io/management` annotation in a namespace with the
   `workload.openshift.io/allowed: management` annotation.
1. The admission hook modifies the pod, replacing the CPU requests with
   `management.workload.openshift.io/cores` requests and adding the
   `resources.workload.openshift.io/{container-name}: {"cpushares": 400}`
   annotations for CRI-O (a sketch of this mutation appears after this list).
1. The scheduler sees the new pod and finds available
   `management.workload.openshift.io/cores` resources on the node. The scheduler
   places the pod on the node.
1. Repeat steps 8-10 until all pods are running.
1. Cluster deployment comes up with management components constrained to a
   subset of the available CPUs.

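To make the mutation in steps 8-9 concrete, below is a minimal sketch of a
management pod before and after admission. Only the annotation keys and the
extended resource name come from this enhancement; the namespace, pod name,
image, and request values are illustrative, and the
`{"effect": "PreferredDuringScheduling"}` annotation value follows the form used
by the existing single node implementation.

```yaml
# Namespace that opts in to hosting management workloads (name is illustrative).
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-example-operator
  annotations:
    workload.openshift.io/allowed: management
---
# Pod as submitted: a normal cpu request plus the target annotation.
apiVersion: v1
kind: Pod
metadata:
  name: example-operator
  namespace: openshift-example-operator
  annotations:
    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
spec:
  containers:
    - name: operator
      image: example.io/operator:latest
      resources:
        requests:
          cpu: 400m
          memory: 128Mi
---
# Pod after admission (sketch): the cpu request has been replaced with the
# extended resource, and CRI-O learns the original cpu shares through the
# per-container annotation.
apiVersion: v1
kind: Pod
metadata:
  name: example-operator
  namespace: openshift-example-operator
  annotations:
    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
    resources.workload.openshift.io/operator: '{"cpushares": 400}'
spec:
  containers:
    - name: operator
      image: example.io/operator:latest
      resources:
        requests:
          management.workload.openshift.io/cores: "400"
          memory: 128Mi
```
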
### API Extensions

- We want to extend the `PerformanceProfile` API to include a new `workloads`
  configuration under the `cpu` field.
- The behavior of existing resources should not change with this addition.
- New resources that make use of this new field will have the current machine
  config generated with the additional machine config manifests.
  - The `isolated` CPU set is used to add the CRI-O and Kubelet configuration
    files to the currently generated machine config.
  - If no `isolated` CPU set is provided and `enablePinning` is set to true, the
    default behavior is to use the full CPU set as if workloads were not pinned.

Example change:

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: openshift-node-workload-partitioning-custom
spec:
  cpu:
    isolated: 2-3
    reserved: 0,1
    # New addition
    workloads:
      enablePinning: true
```

### Implementation Details/Notes/Constraints [optional]

#### Changes to NTO

The NTO PerformanceProfile will be updated to support a new flag which toggles
workload pinning to the `isolated` cores. The idea here is to simplify how
customers set this configuration. With PAO now being part of NTO ([see here for
more info](../node-tuning/pao-in-nto.md)), this affords us the chance to
consolidate the configuration for `kubelet` and `crio`.

We will modify the code path that generates the [new machine
config](https://github.com/openshift/cluster-node-tuning-operator/blob/a780dfe07962ad07e4d50c852047ef8cf7b287da/pkg/performanceprofile/controller/performanceprofile/components/machineconfig/machineconfig.go#L91-L127)
from the performance profile. When the new `workloads.enablePinning` flag is
set, we will add the configuration for `crio` and `kubelet` to the final machine
config manifest. The existing code path will then apply the change as normal.

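As a rough sketch of what the generated manifest could look like, the
MachineConfig below carries the two extra files alongside what is produced
today. The object name, file paths, and encoding are assumptions borrowed from
the existing single node implementation, and the file contents are elided
placeholders rather than real payloads:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 50-performance-openshift-node-workload-partitioning-custom
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        # CRI-O drop-in that maps pods annotated with
        # target.workload.openshift.io/management onto the pinned CPU set.
        - path: /etc/crio/crio.conf.d/01-workload-partitioning
          mode: 420
          contents:
            source: data:text/plain;charset=utf-8;base64,<crio-workload-config>
        # Kubelet pinning file that makes the kubelet advertise the
        # management.workload.openshift.io/cores extended resource.
        - path: /etc/kubernetes/openshift-workload-pinning
          mode: 420
          contents:
            source: data:text/plain;charset=utf-8;base64,<kubelet-pinning-config>
```
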
#### API Server Admission Hook

The existing admission hook has 4 checks when it comes to workload pinning.

1. Checks if the `pod` is a static pod
   - Skips the modification attempt if it is static.
1. Checks if the currently running cluster topology is Single Node
   - Skips modification if it is anything other than Single Node.
1. Checks if all running nodes are managed
   - Skips modification if any of the nodes are not managed.
1. Checks what resource limits and requests are set on the pod
   - Skips modification if the QoS class is Guaranteed or both limits and
     requests are set.
   - Skips modification if the QoS class would change after the update.

We will need to alter the code in the admission controller to remove the check
for Single Node topology; the other checks should remain untouched.

### Risks and Mitigations

The same risks and mitigations highlighted in [Management Workload
Partitioning](management-workload-partitioning.md) apply to this enhancement as
well.

We need to make it very clear to customers that this feature is supported as a
day 0 configuration and that day n+1 alterations are not supported with this
enhancement. Part of that messaging should involve a clear indication that this
is a cluster wide feature.

### Drawbacks

This feature carries the same drawbacks as [Management Workload
Partitioning](management-workload-partitioning.md).

Several of the changes described above are patches that we may end up carrying
downstream indefinitely. Some version of a more general "CPU pool" feature may
be acceptable upstream, and we could reimplement management workload
partitioning to use that new implementation.

## Design Details

### Open Questions [optional]

N/A

### Test Plan

We will add a CI job with a cluster configuration that reflects the minimum of
2 CPU/4 vCPU masters and 1 CPU/2 vCPU workers. This job should ensure that
cluster deployments configured with management workload partitioning pass the
compliance tests.

We will add a CI job to ensure that all release payload workloads have the
`target.workload.openshift.io/management` annotation and that their namespaces
have the `workload.openshift.io/allowed` annotation.

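For reference, the annotation check described above amounts to verifying that
every payload workload and its namespace look roughly like the following; the
names and image are illustrative, and only the two annotation keys come from
this enhancement:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  # Illustrative namespace for a release payload component.
  name: openshift-example-operator
  annotations:
    workload.openshift.io/allowed: management
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-operator
  namespace: openshift-example-operator
spec:
  selector:
    matchLabels:
      app: example-operator
  template:
    metadata:
      labels:
        app: example-operator
      annotations:
        # Annotation the CI job checks for on release payload workloads.
        target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
    spec:
      containers:
        - name: operator
          image: example.io/operator:latest
```
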
### Graduation Criteria

#### Dev Preview -> Tech Preview

- Ability to utilize the enhancement end to end
- End user documentation, relative API stability
- Sufficient test coverage
- Gather feedback from users rather than just developers
- Enumerate service level indicators (SLIs), expose SLIs as metrics
- Write symptoms-based alerts for the component(s)

#### Tech Preview -> GA

- More testing (upgrade, downgrade, scale)
- Sufficient time for feedback
- Available by default
- Backhaul SLI telemetry
- Document SLOs for the component
- Conduct load testing
- User facing documentation created in
  [openshift-docs](https://github.com/openshift/openshift-docs/)

#### Removing a deprecated feature

- Announce deprecation and support policy of the existing feature
- Deprecate the feature

### Upgrade / Downgrade Strategy

This new behavior will be added in 4.12 as part of the installation
configurations for customers to utilize.

Enabling the feature after installation is not supported in 4.12, so we do not
need to address what happens if an older cluster upgrades and then the feature
is turned on.

### Version Skew Strategy

N/A

### Operational Aspects of API Extensions

The addition to the API is an optional field which should not require any
conversion or admission webhook changes. This change will only be used to allow
the user to explicitly define their intent and to simplify the machine manifests
by generating the extra machine manifests that are currently being created
independently of the `PerformanceProfile` CRD.

Furthermore, the design and scope of this enhancement mean that the existing
admission webhook will continue to apply the same warnings and error messages to
pods as described in the [failure modes](#failure-modes).

#### Failure Modes

In a failure situation, we want to try to keep the cluster operational.
Therefore, there are a few conditions under which the admission hook will strip
the workload annotations and add an annotation `workload.openshift.io/warning`
with a message warning the user that their partitioning instructions were
ignored (an illustrative example follows the list). These conditions are:

1. When a pod has the Guaranteed QoS class
1. When mutation would change the QoS class for the pod
1. When the feature is inactive because not all nodes are reporting the
   management resource

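As a hedged illustration of the first condition, a Guaranteed QoS pod that
carries the management annotation would come out of admission with that
annotation stripped and a warning recorded. The annotation keys are the ones
described above; the pod itself and the warning message text are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  # Illustrative pod; target.workload.openshift.io/management was stripped by
  # the admission hook and replaced with the warning annotation below.
  name: guaranteed-example
  namespace: openshift-example-operator
  annotations:
    workload.openshift.io/warning: "skipped pod CPU request modifications because the pod has the Guaranteed QoS class"
spec:
  containers:
    - name: app
      image: example.io/app:latest
      resources:
        # limits == requests, so the pod is Guaranteed QoS and is not mutated.
        requests:
          cpu: "1"
          memory: 512Mi
        limits:
          cpu: "1"
          memory: 512Mi
```
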
#### Support Procedures

N/A

## Implementation History

WIP

## Alternatives

N/A

## Infrastructure Needed [optional]

N/A