From 5829d9c543a487841d4d36401e2daa8c44228c04 Mon Sep 17 00:00:00 2001 From: Doug Hellmann Date: Mon, 26 Apr 2021 13:10:08 -0400 Subject: [PATCH 01/16] workload-partitioning: add plan for configuration and enablement ownership Signed-off-by: Doug Hellmann --- .../rationalizing-configuration.md | 212 ++++++++++++++++++ 1 file changed, 212 insertions(+) create mode 100644 enhancements/workload-partitioning/rationalizing-configuration.md

diff --git a/enhancements/workload-partitioning/rationalizing-configuration.md b/enhancements/workload-partitioning/rationalizing-configuration.md new file mode 100644 index 0000000000..255254af27 --- /dev/null +++ b/enhancements/workload-partitioning/rationalizing-configuration.md @@ -0,0 +1,212 @@ +--- +title: rationalizing-configuration +authors: + - "@dhellmann" + - "@mrunalp" + - "@browsell" + - "@MarSik" + - "@fromanirh" +reviewers: + - "@deads2k" + - "@sttts" + - maintainers of PAO +approvers: + - "@markmc" + - "@derekwaynecarr" +creation-date: 2021-04-26 +status: implementable +see-also: + - "/enhancements/management-workload-partitioning.md" +---

# Rationalizing Configuration of Workload Partitioning

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

The initial iteration of workload partitioning focused on a short path to a minimum viable implementation. This enhancement describes, at a high level, the loose ends in preparing the feature for GA, and explains the set of other design documents that need to be written separately during the next iteration.

## Motivation

The management workload partitioning feature introduced in OCP 4.8 enables pod workloads to be pinned to the kubelet reserved cpuset. The feature is enabled at install time and requires the reserved set of CPUs as input. That configuration is currently passed in via a manually created manifest containing a machine config resource. The set of CPUs used for management workload partitioning has to align with the reserved set configured by the performance-addon-operator (PAO) during day 2. For OCP 4.8, the onus is on the user to ensure these align; there is no interlock.

The current implementation is prone to user error, because it has two entities configuring and managing the same data. This design document describes phase 2 of the implementation plan for workload partitioning to improve the user experience when configuring the feature.

### Goals

1. Describe the remaining work at a high level for architectural review and planning.
2. Document the division of responsibility for workload management between components in the OpenShift release payload and PAO.
3. List the other enhancements that need to be written, and their scopes.

### Non-Goals

1. Resolve all of the details for each of the next steps. We're not going to talk a lot about names or implementation details here.

## Proposal

### API

The initial implementation is limited to single-node deployments. To support multi-node clusters, we need to solve a couple of concurrency problems.
The first is providing a way for the admission hook +responsible for mutating pod definitions to reliably know when the +feature should be enabled. The admission hook cannot store state on +its own, so we need a new API to indicate when the feature is +enabled. + +We eventually want to extend the partitioning feature to support +non-management workloads to include workload types defined by cluster +admins. For now, however, we want to keep the interface as simple as +possible. Therefore, the new API will be read-only, and only used so +that the admission hook (and other consumers) can tell when workload +partitioning is active. + +The new API will need a controller to manage it. We propose to add a +controller to the existing cluster-config-operator, to avoid adding +the overhead of another image, pod, etc. to the cluster. + +We propose that the API be owned by core OpenShift, and defined in the +`openshift/api` repository. A separate enhancement will be written to +describe the API and controller in more detail. + +### Owner for Enablement + +Workload management needs two node configuration steps which overlap +with work PAO is doing today: + +* Set reserved CPUs in kubelet config and enable the CPU manager + static policy +* Set systemd affinity to match the reserved CPU set + +PAO already has logic for managing CPU pools and configuring kubelet +and CRI-O. Since it already does a lot of what is needed to enable +workload partitioning, we propose to extend it to handle the +additional configuration currently being done manually by users, +including the machine config with settings for kubelet and CRI-O. + +The effect of this decision is that the easiest way to use the +workload partitioning feature will be through the +performance-addon-operator. This means that the implementation of the +feature is split across a few components, and not all of them are +delivered in all clusters, but it avoids duplicating a lot of the work +already done in the PAO. + +Workload partitioning is currently enabled as part of installing a +cluster. This provides predictable behavior from the beginning of the +life-cycle, and avoids extra reboots required to enable it in a +running cluster (an important consideration in environments with short +maintenance windows). To continue to support enabling the feature this +way, and to simplify the configuration, we propose to extend the PAO +with a `render` command, similar to the one in other operators, to +have it generate manifests for the OpenShift installer to consume. + +### Future Work + +1. Write an enhancement to describe the API the admission hook will + use to determine when workload partitioning is enabled. +2. Write an enhancement to describe the changes in + performance-addon-operator to manage the kubelet and CRI-O + configuration when workload partitioning is enabled, including the + `render` command. +3. Write an enhancement to describe an API for enabling workload + partitioning in an existing cluster. + +### User Stories + +N/A + +### Risks and Mitigations + +There is some risk of shipping a feature with part of the +implementation in the OpenShift release payload but the enabling tool +delivered separately. PAO is considered to be a "core" OpenShift +component, even though it is delivered separately. There is not a +separate SKU for it, for example. The PAO team has been working +closely with the node team on the implementation of this feature, so +we do not anticipate any issues delivering the finished work in this +way. 
+
## Design Details

### Test Plan

The other enhancements will provide details of the test plan(s) needed.

### Graduation Criteria

N/A

#### Dev Preview -> Tech Preview

- Ability to utilize the enhancement end to end
- End user documentation, relative API stability
- Sufficient test coverage
- Gather feedback from users rather than just developers
- Enumerate service level indicators (SLIs), expose SLIs as metrics
- Write symptoms-based alerts for the component(s)

#### Tech Preview -> GA

- More testing (upgrade, downgrade, scale)
- Sufficient time for feedback
- Available by default
- Backhaul SLI telemetry
- Document SLOs for the component
- Conduct load testing

#### Removing a deprecated feature

N/A

### Upgrade / Downgrade Strategy

N/A

### Version Skew Strategy

N/A

## Implementation History

- https://issues.redhat.com/browse/OCPPLAN-6065
- https://issues.redhat.com/browse/CNF-2084

## Drawbacks

None

## Alternatives

We could move more of the PAO implementation into the release payload, so that users who want the workload partitioning feature do not need another component. This would either duplicate a lot of the existing PAO work or make future delivery of PAO updates more complicated by tying them to OpenShift releases.

From 6e5c932cffe98dcf30181328e198a5e7501314d5 Mon Sep 17 00:00:00 2001 From: Doug Hellmann Date: Mon, 26 Apr 2021 14:59:16 -0400 Subject: [PATCH 02/16] workload-partitioning/rationalizing-configuration: link to PAO epic Link to the PAO epic for adding the render command Signed-off-by: Doug Hellmann --- .../workload-partitioning/rationalizing-configuration.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/enhancements/workload-partitioning/rationalizing-configuration.md b/enhancements/workload-partitioning/rationalizing-configuration.md index 255254af27..7a513716c5 100644 --- a/enhancements/workload-partitioning/rationalizing-configuration.md +++ b/enhancements/workload-partitioning/rationalizing-configuration.md @@ -134,7 +134,8 @@ have it generate manifests for the OpenShift installer to consume. 2. Write an enhancement to describe the changes in performance-addon-operator to manage the kubelet and CRI-O configuration when workload partitioning is enabled, including the - `render` command. + `render` command. See https://issues.redhat.com/browse/CNF-2164 to + track that work. 3. Write an enhancement to describe an API for enabling workload partitioning in an existing cluster.

From f48dee97f9443b7c3303934c1fd6b7b5ff61466b Mon Sep 17 00:00:00 2001 From: Doug Hellmann Date: Tue, 1 Jun 2021 16:33:52 -0400 Subject: [PATCH 03/16] workload-partitioning/rationalizing-configuration: assume external orchestration Change the configuration design to always assume something outside of the cluster is involved in enabling partitioning. Signed-off-by: Doug Hellmann --- .../rationalizing-configuration.md | 289 ++++++++++++++++-- 1 file changed, 267 insertions(+), 22 deletions(-)

diff --git a/enhancements/workload-partitioning/rationalizing-configuration.md b/enhancements/workload-partitioning/rationalizing-configuration.md index 7a513716c5..ba8ec306d4 100644 --- a/enhancements/workload-partitioning/rationalizing-configuration.md +++ b/enhancements/workload-partitioning/rationalizing-configuration.md @@ -50,7 +50,7 @@ with the reserved set configured by the performance-addon-operator (PAO) during day 2. For OCP 4.8, the onus is on the user to ensure these align; there is no interlock.
-The current implementation is prone to user error, because it has two +The current implementation is prone to user error because it has two entities configuring and managing the same data. This design document describes phase 2 of the implementation plan for workload partitioning to improve the user experience when configuring the feature.

@@ -63,11 +63,18 @@ to improve the user experience when configuring the feature. between components in the OpenShift release payload and PAO. 3. List the other enhancements that need to be written, and their scopes. +4. Minimize the complexity of installing a "normal" cluster while + still supporting this very special case.

### Non-Goals

1. Resolve all of the details for each of the next steps. We're not - going to talk a lot about names or implementation details here. + going to talk a lot about names, API fields, or implementation + details here. +2. Describe an API for (re)configuring workload partitioning in a + running cluster. We are going to continue to require the user to + enable partitioning as part of deploying the cluster for another + implementation phase.

## Proposal

### API

The initial implementation is limited to single-node deployments. To support multi-node clusters, we need to solve a couple of concurrency problems. The first is providing a way for the admission hook responsible for mutating pod definitions to reliably know when the feature should be enabled. The admission hook cannot store state on its own, so we need a new API to indicate when the feature is enabled.

We eventually want to extend the partitioning feature to support non-management workloads to include workload types defined by cluster admins. For now, however, we want to keep the interface as simple as -possible. Therefore, the new API will be read-only, and only used so -that the admission hook (and other consumers) can tell when workload -partitioning is active. +possible. Therefore, the new API will only have status fields and +will only be used so that the admission hook (and other consumers) can +tell when workload partitioning is active.

-The new API will need a controller to manage it. We propose to add a -controller to the existing cluster-config-operator, to avoid adding -the overhead of another image, pod, etc. to the cluster. +To manage the risk associated with rolling out a complicated feature +like this, we are going to continue to require it to be enabled as +part of deploying the cluster. The API does not need to support +configuring the workload partitioning in a running cluster, so no +controller is needed, for now.

We propose that the API be owned by core OpenShift, and defined in the `openshift/api` repository. A separate enhancement will be written to -describe the API and controller in more detail. +describe the API in more detail.

### Owner for Enablement

@@ -114,18 +123,39 @@ including the machine config with settings for kubelet and CRI-O. The effect of this decision is that the easiest way to use the workload partitioning feature will be through the performance-addon-operator. This means that the implementation of the -feature is split across a few components, and not all of them are -delivered in all clusters, but it avoids duplicating a lot of the work -already done in the PAO. +feature is split across a few components, which avoids duplicating a +lot of the work already done in the PAO.

Workload partitioning is currently enabled as part of installing a -cluster. This provides predictable behavior from the beginning of the -life-cycle, and avoids extra reboots required to enable it in a -running cluster (an important consideration in environments with short -maintenance windows). To continue to support enabling the feature this -way, and to simplify the configuration, we propose to extend the PAO -with a `render` command, similar to the one in other operators, to -have it generate manifests for the OpenShift installer to consume. +cluster, and that will not change as part of the implementation of +this design. Enabling partitioning early in this way provides +predictable behavior from the beginning of the life-cycle, and avoids +extra reboots required to enable it in a running cluster (an important +consideration in environments with short, fixed maintenance +windows). To continue to support enabling the feature during +deployment, and to simplify the configuration, we propose to extend +the PAO with a `render` command, similar to the one in other +operators, to have it generate manifests for the OpenShift installer +to consume. This avoids any need to change the OpenShift installer to +support this uncommon use case.

+The `render` command has an option to enable the management +partition. When the option is given, the `render` command creates a +manifest for the new enablement API with status data that includes the +`management` workload type to enable partitioning. In the future, it +might enable partitioning for all of the types mentioned in the input +PerformanceProfiles, or based on some other input.

+For each PerformanceProfile, the PAO will render MachineConfig +manifests with higher precedence than the default ones created by the +installer during bootstrapping. These manifests will include all of +the details for configuring `kubelet`, CRI-O, and `tuned` to know +about and partition workloads, including the cpusets for each workload +type.
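To make the shape of that configuration concrete, the following is a sketch of the kind of MachineConfig such rendering might produce. It borrows the file paths and CRI-O workload settings from the 4.8 implementation described in the management-workload-partitioning enhancement; the exact manifest names, contents, and encodings the PAO `render` command will emit are left to the follow-up enhancements.

```yaml
# Illustrative only: a MachineConfig of the kind the render step
# might emit for a pool whose reserved CPUs are 0-1. File paths
# follow the 4.8 implementation; the final layout is owned by the
# PAO enhancement.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 02-master-workload-partitioning
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/crio/crio.conf.d/01-workload-partitioning
          mode: 420
          contents:
            # base64-encoded TOML along the lines of:
            #   [crio.runtime.workloads.management]
            #   activation_annotation = "target.workload.openshift.io/management"
            #   annotation_prefix = "resources.workload.openshift.io"
            #   resources = { "cpushares" = 0, "cpuset" = "0-1" }
            source: data:text/plain;charset=utf-8;base64,...
        - path: /etc/kubernetes/openshift-workload-pinning
          mode: 420
          contents:
            # base64-encoded JSON along the lines of:
            #   {"management": {"cpuset": "0-1"}}
            source: data:text/plain;charset=utf-8;base64,...
```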
+The PAO `render` command also generates a manifest with the CRD for +PerformanceProfile so bootstrapping does not block when trying to +install the PerformanceProfile manifests.

### Future Work

1. Write an enhancement to describe the API the admission hook will use to determine when workload partitioning is enabled.
2. Write an enhancement to describe the changes in performance-addon-operator to manage the kubelet and CRI-O configuration when workload partitioning is enabled, including the `render` command. See https://issues.redhat.com/browse/CNF-2164 to track that work.

### User Stories

#### Enabling and Configuring at the Same Time

In this workflow, the user provides all of the information needed to fully partition the workloads from the very beginning.

1. User runs the installer to `create manifests` to get standard manifests without any partitioning.
2. The user creates PerformanceProfile manifests for each known MachineConfigPool and adds them to the installer inputs (a sketch of such a profile follows this list).
3. The user runs the PAO `render` command to read the PerformanceProfile manifests and create additional manifests, as described above.
4. User runs the installer to `create cluster`.
5. Installer bootstraps the cluster.
6. Bootstrapping runs the machine-config-operator (MCO) `render` command, which generates MachineConfig manifests with low precedence.
7. Bootstrapping uploads both sets of MachineConfig manifests, one after the other.
8. The machine-config-operator (MCO) applies MachineConfigs to nodes in the cluster.
9. Bootstrapping finishes and the cluster is launched.
10. If the MCO does not apply KubeletConfig in step 8, it must do it here.
11. The cluster has complete partitioning configuration for management workloads.
12. Kubelet starts and sees the config file enabling partitioning (delivered in the MachineConfig manifest generated by PAO). Kubelet advertises the workload resource type on the Node and mutates static pods with partitioning annotations.
13. The admission plugin uses the workload types on the new enablement API to decide when to mutate pods.
14. CRI-O sees pods with workload annotations and uses the resource request to set cpushares and cpuset.
15. On day 2, PAO is installed and takes ownership of the PerformanceProfile CRs.
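As an illustration of step 2, a minimal PerformanceProfile built on the existing `performance.openshift.io/v2` API might look like the following sketch; the CPU ranges are placeholders for a hypothetical six-CPU node.

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: workload-partitioning
spec:
  cpu:
    # CPUs reserved for management workloads and OS housekeeping.
    reserved: "0-1"
    # CPUs left for normal (non-management) workloads.
    isolated: "2-5"
  nodeSelector:
    node-role.kubernetes.io/master: ""
```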
#### Enabling During Installation, Configuring Later

In this workflow, the user provides enough information to *enable* workload partitioning but not enough to actually *configure* all nodes to partition the workloads into specific CPUs.

1. User runs the installer to `create manifests` to get standard manifests without any partitioning.
2. The user runs the PAO `render` command without any PerformanceProfile manifests to create the additional manifests, as described above. The manifests for CRI-O do not include cpuset configuration, because there are no PerformanceProfiles.
3. User runs the installer to `create cluster`.
4. Installer bootstraps the cluster.
5. Bootstrapping runs the machine-config-operator (MCO) `render` command, which generates MachineConfig manifests with low precedence.
6. Bootstrapping uploads both sets of MachineConfig manifests, one after the other.
7. The machine-config-operator (MCO) applies MachineConfigs to nodes in the cluster.
8. Bootstrapping finishes and the cluster is launched.
9. If the MCO does not apply KubeletConfig in step 7, it must do it here.
10. The cluster has partial partitioning enabled for `management` workloads. Pods for management workloads are mutated, but not actually partitioned.
11. Kubelet starts and sees the config file enabling partitioning. It advertises the workload resource type on the Node and mutates static pods with partitioning annotations.
12. Admission plugin uses the workload types on the workloads CR to decide when to mutate pods.
13. CRI-O sees pods with workload annotations and uses the resource request to set cpushares but not cpuset.
14. User installs the performance-addon-operator into the cluster.
15. User creates PerformanceProfiles and adds them to the cluster.
16. PAO generates new MachineConfigs, adding the cpuset information from the PerformanceProfiles to the other partitioning info from the new enablement API.
17. MCO rolls out new MachineConfigs, rebooting nodes as it goes.
18. The cluster has complete partitioning configuration for management workloads and the management workloads are partitioned due to the nodes rebooting.
19. Kubelet starts and sees the config file enabling partitioning. It advertises the workload resource type on the Node and mutates static pods with partitioning annotations.
20. Admission plugin uses the workload types on the workloads CR to decide when to mutate pods.
21. CRI-O sees pods with workload annotations and uses the resource request to set cpushares and cpuset, as sketched below.
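To make the admission and CRI-O steps in the workflows above concrete, the following sketch shows how the 4.8 implementation annotates and mutates a management pod. The container name and resource values here are illustrative only.

```yaml
# Before mutation: the pod opts in via annotation.
apiVersion: v1
kind: Pod
metadata:
  name: management-pod
  annotations:
    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
spec:
  containers:
    - name: app
      resources:
        requests:
          cpu: 100m
          memory: 100Mi
---
# After mutation (sketch): the CPU request becomes a request against
# the extended resource the kubelet advertises, and a per-container
# annotation records the CPU shares for CRI-O (102 shares = 100m).
apiVersion: v1
kind: Pod
metadata:
  name: management-pod
  annotations:
    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
    resources.workload.openshift.io/app: '{"cpushares": 102}'
spec:
  containers:
    - name: app
      resources:
        requests:
          management.workload.openshift.io/cores: "100"
          memory: 100Mi
```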
#### Enabling and Configuring Through the Assisted Installer Workflow

This section describes how a user will enable and configure workload partitioning for clusters deployed using the assisted installer workflow. We expect this workflow to be the most common approach used, especially for bulk deployments.

1. The user creates PerformanceProfile manifests for each known MachineConfigPool name

   * The MachineConfig (MC) for the generic worker MachineConfigPool (MCP) that includes partitioning without cpusets
   * Other MCPs inherit from the worker MCP and include partitioning with cpusets

2. *something* runs the PAO `render` command to read the PerformanceProfile manifests and create additional manifests, as described above.

   * The user may perform this step, or an orchestration system managing the assisted installer workflow automatically may run it.

3. User generates a set of assisted installer CRs to deploy the cluster.
4. The assisted installer services start the cluster installation with the artifacts from steps 1-3 as input (PerformanceProfile CRD, PerformanceProfiles, MachineConfig, and enablement API).
5. The assisted installer (or hive?) invokes the installer to `create cluster`.
6. The installer bootstraps the cluster.
7. Bootstrapping runs the machine-config-operator (MCO) `render` command, which generates MachineConfig manifests with low precedence.
8. Bootstrapping uploads both sets of MachineConfig manifests, one after the other.
9. The machine-config-operator (MCO) applies MachineConfigs to nodes in the cluster.
10. Bootstrapping finishes and the cluster is launched.
11. If the MCO does not apply KubeletConfig in step 9, it must do it here.
12. The cluster has complete partitioning configuration for management workloads.
13. Kubelet starts and sees the config file enabling partitioning (delivered in the MachineConfig manifest generated by PAO). Kubelet advertises the workload resource type on the Node and mutates static pods with partitioning annotations.
14. The admission plugin uses the workload types on the new enablement API to decide when to mutate pods.
15. CRI-O sees pods with workload annotations and uses the resource request to set cpushares and cpuset.
16. On day 2, PAO is installed and takes ownership of the PerformanceProfile CRs.

#### Day 2: Modify Reserved cpuset for a PerformanceProfile

In a cluster with partitioning enabled and fully configured, modifying the reserved cpuset for a PerformanceProfile is safe because the node is rebooted when the new MachineConfig is applied.

1. The user updates the cpuset in the PerformanceProfile(s).
2. PAO generates new MachineConfigs, adding the cpuset information from the PerformanceProfiles to the other partitioning info from the new enablement API.
3. The MCO rolls out new MachineConfigs, rebooting nodes as it goes.
4. The cluster has complete partitioning configuration for management workloads and the management workloads are partitioned due to the nodes rebooting.
5. Kubelet starts and sees the config file enabling partitioning. It advertises the workload resource type on the Node and mutates static pods with partitioning annotations.
6. Admission plugin uses the workload types on the workloads CR to decide when to mutate pods.
7. CRI-O sees pods with workload annotations and uses the resource request to set cpushares and cpuset.

#### Day 2: Add a New Node to an Existing MachineConfigPool

In a cluster with partitioning enabled and fully configured, adding a new node to an existing MachineConfigPool is safe because the MachineConfig will set up kubelet and CRI-O on the host correctly.

1. The MCO applies the MachineConfigs to the node.
2. Kubelet starts and sees the config file enabling partitioning.
It + advertises the workload resource type on the Node and mutates + static pods with partitioning annotations. +3. CRI-O sees pods with workload annotations and uses the resource + request to set cpushares and cpuset. + +#### Day 2: Add a New MachineConfigPool + +Adding a new MachineConfigPool is safe only if the MachineConfigPool +has a PerformanceProfile associated and the PAO is installed. + +1. PAO reads PerformanceProfile to generate new MachineConfig for the + pool with CRI-O and `kubelet` settings. +2. MachineConfigs are applied to appropriate nodes, rebooting them in + the process. +3. Kubelet starts and sees the config file enabling partitioning. It + advertises the workload resource type on the Node and mutates + static pods with partitioning annotations. +4. CRI-O sees pods with workload annotations and uses the resource + request to set cpushares and cpuset. + +If the MachineConfigPool does not match a PerformanceProfile, there +will be no cpuset information and the PAO will generate MachineConfigs +with partitioning enabled but not tied to a cpuset. This is somewhat +safe, but may lead to unexpected behavior. + +If the PAO is not installed, it will not generate the overriding +MachineConfigs for the new MachineConfigPool. The nodes in the pool +will not enable partitioning at all. This may not be safe, since +kubelet will not advertise the workload resource type but the +admission plugin will mutate pods to require it. The scheduler may +refuse to place management workloads on the nodes in the pool. ### Risks and Mitigations @@ -154,8 +384,23 @@ closely with the node team on the implementation of this feature, so we do not anticipate any issues delivering the finished work in this way. +If the PAO is not installed into a cluster with workload partitioning +enabled and configured, adding a new MachineConfigPool can be +unsafe. See the "Add a New MachineConfigPool" use case for details. + ## Design Details +### Open Questions + +1. What runs the PAO `render` command in the assisted installer + workflow? +2. How hard do we need to work to ensure that the PAO version matches + the release payload version for the cluster being deployed? + What/who is responsible for that? +3. We need to ensure MCO render handles the KubeletConfig CR generated + by PAO so kubelet has partitioning enabled without requiring an + extra reboot. + ### Test Plan The other enhancements will provide details of the test plan(s) From 90aae82ca896f264625c815eaac7b89faa7a4be0 Mon Sep 17 00:00:00 2001 From: Doug Hellmann Date: Tue, 1 Jun 2021 16:45:17 -0400 Subject: [PATCH 04/16] workload-partitioning/rationalizing-configuration: fix verb tense Signed-off-by: Doug Hellmann --- .../rationalizing-configuration.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/enhancements/workload-partitioning/rationalizing-configuration.md b/enhancements/workload-partitioning/rationalizing-configuration.md index ba8ec306d4..dbd062d397 100644 --- a/enhancements/workload-partitioning/rationalizing-configuration.md +++ b/enhancements/workload-partitioning/rationalizing-configuration.md @@ -139,12 +139,14 @@ operators, to have it generate manifests for the OpenShift installer to consume. This avoids any need to change the OpenShift installer to support this uncommon use case. -The `render` command has an option to enable the management -partition. 
When the option is given, the `render` command creates a -manifest for the new enablement API with status data that includes the -`management` workload type to enable partitioning. In the future, it -might enable partitioning for all of the types mentioned in the input -PerformanceProfiles, or based on some other input. +The `render` command will have an option to enable the management +partition. When the option is given, the `render` command will create +a manifest for the new enablement API with status data that includes +the `management` workload type to enable partitioning (the details of +that API will be defined in a separate design document). In the +future, the PAO might enable partitioning for all of the types +mentioned in the input PerformanceProfiles, or based on some other +input. For each PerformanceProfile, the PAO will render MachineConfig manifests with higher precedence than the default ones created by the

From 90aae82ca896f264625c815eaac7b89faa7a4be0 Mon Sep 17 00:00:00 2001 From: Doug Hellmann Date: Wed, 2 Jun 2021 11:40:32 -0400 Subject: [PATCH 05/16] workload-partitioning/rationalizing-configuration: update motivation Expand the motivation section to mention that we also want to support multi-node clusters. Signed-off-by: Doug Hellmann --- .../rationalizing-configuration.md | 21 ++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/enhancements/workload-partitioning/rationalizing-configuration.md b/enhancements/workload-partitioning/rationalizing-configuration.md index dbd062d397..3064c5f483 100644 --- a/enhancements/workload-partitioning/rationalizing-configuration.md +++ b/enhancements/workload-partitioning/rationalizing-configuration.md @@ -48,17 +48,24 @@ manually created manifest containing a machine config resource. The set of CPUs used for management workload partitioning has to align with the reserved set configured by the performance-addon-operator (PAO) during day 2. For OCP 4.8, the onus is on the user to ensure -these align; there is no interlock. +these align; there is no interlock. The current implementation is +prone to user error because it has two entities configuring and +managing the same data.

+The 4.8 implementation is also limited to single-node deployments. We +anticipate that users of highly-available clusters will also want to +minimize the overhead of the management components to take full +advantage of compute capacity for their own workloads.

+This design document describes phase 2 of the implementation plan for +workload partitioning to improve the user experience when configuring +the feature and to extend support to multi-node clusters.

### Goals

-1. Describe the remaining work at a high level for architectural - review and planning. +1. Describe, at a high level, the remaining work to improve the user + experience and expand support to multi-node deployments for + architectural review and planning. 2. Document the division of responsibility for workload management between components in the OpenShift release payload and PAO. 3.
List the other enhancements that need to be written, and their scopes. 4. Minimize the complexity of installing a "normal" cluster while still supporting this very special case.

From 5c69b28a0b5f779a11bac9b00f06579a087ee3 Mon Sep 17 00:00:00 2001 From: Doug Hellmann Date: Wed, 2 Jun 2021 11:43:49 -0400 Subject: [PATCH 06/16] workload-partitioning/rationalizing-configuration: expand risk description Better explain the effect of adding a new MachineConfigPool without a PerformanceProfile. Signed-off-by: Doug Hellmann --- .../workload-partitioning/rationalizing-configuration.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/enhancements/workload-partitioning/rationalizing-configuration.md b/enhancements/workload-partitioning/rationalizing-configuration.md index 3064c5f483..cdfc0dc926 100644 --- a/enhancements/workload-partitioning/rationalizing-configuration.md +++ b/enhancements/workload-partitioning/rationalizing-configuration.md @@ -373,7 +373,10 @@ has a PerformanceProfile associated and the PAO is installed. If the MachineConfigPool does not match a PerformanceProfile, there will be no cpuset information and the PAO will generate MachineConfigs with partitioning enabled but not tied to a cpuset. This is somewhat -safe, but may lead to unexpected behavior. +safe, but may lead to unexpected behavior. The mutated pods would +float across the full cpuset in the same way as if the feature +enabled, however the cpushares would not be deducted from available +CPUs, potentially leading to over commit scenarios.

From 4ace6249c15b38c2915cc4a5743291c827b355f9 Mon Sep 17 00:00:00 2001 From: Doug Hellmann Date: Wed, 2 Jun 2021 11:46:07 -0400 Subject: [PATCH 07/16] workload-partitioning/rationalizing-configuration: add MCO team as reviewers Signed-off-by: Doug Hellmann --- .../workload-partitioning/rationalizing-configuration.md | 1 + 1 file changed, 1 insertion(+)

diff --git a/enhancements/workload-partitioning/rationalizing-configuration.md b/enhancements/workload-partitioning/rationalizing-configuration.md index cdfc0dc926..1bc5518159 100644 --- a/enhancements/workload-partitioning/rationalizing-configuration.md +++ b/enhancements/workload-partitioning/rationalizing-configuration.md @@ -10,6 +10,7 @@ reviewers: - "@deads2k" - "@sttts" - maintainers of PAO + - maintainers of MCO approvers: - "@markmc" - "@derekwaynecarr"

From 8797677569ea46630aa091a182d38b74140005b0 Mon Sep 17 00:00:00 2001 From: Doug Hellmann Date: Fri, 4 Jun 2021 10:11:23 -0400 Subject: [PATCH 08/16] workload-partitioning/rationalizing-configuration: clarify risks Signed-off-by: Doug Hellmann --- .../rationalizing-configuration.md | 34 ++++++++++++------- 1 file changed, 21 insertions(+), 13 deletions(-)

diff --git a/enhancements/workload-partitioning/rationalizing-configuration.md b/enhancements/workload-partitioning/rationalizing-configuration.md index 1bc5518159..e927787512 100644 --- a/enhancements/workload-partitioning/rationalizing-configuration.md +++ b/enhancements/workload-partitioning/rationalizing-configuration.md @@ -249,7 +249,9 @@ to partition the workloads into specific CPUs. 12. Admission plugin uses the workload types on the workloads CR to decide when to mutate pods. 13. CRI-O sees pods with workload annotations and uses the resource - request to set cpushares but not cpuset. + request to set cpushares but not cpuset. See the [Risks + section](#risks-and-mitigations) for details of the effects this + may have on the cluster. 14.
User installs the performance-addon-operator into the cluster. 15. User creates PerformanceProfiles and adds them to the cluster. 16. PAO generates new MachineConfigs, adding the cpuset information @@ -373,7 +375,7 @@ has a PerformanceProfile associated and the PAO is installed. If the MachineConfigPool does not match a PerformanceProfile, there will be no cpuset information and the PAO will generate MachineConfigs -with partitioning enabled but not tied to a cpuset. This is somewhat -safe, but may lead to unexpected behavior. The mutated pods would -float across the full cpuset in the same way as if the feature -enabled, however the cpushares would not be deducted from available -CPUs, potentially leading to over commit scenarios. +with partitioning enabled but not tied to a cpuset. See the [Risks +section](#risks-and-mitigations) for details.

If the PAO is not installed, it will not generate the overriding -MachineConfigs for the new MachineConfigPool. The nodes in the pool -will not enable partitioning at all. This may not be safe, since -kubelet will not advertise the workload resource type but the -admission plugin will mutate pods to require it. The scheduler may -refuse to place management workloads on the nodes in the pool. +MachineConfigs for the new MachineConfigPool. See the [Risks +section](#risks-and-mitigations) for details.

### Risks and Mitigations

@@ -397,9 +393,21 @@ closely with the node team on the implementation of this feature, so we do not anticipate any issues delivering the finished work in this way.

+If a MachineConfigPool exists without a matching PerformanceProfile, +there will be no cpuset information and the PAO will generate +MachineConfigs with partitioning enabled but not tied to a cpuset. +This is somewhat safe, but may lead to unexpected behavior. The +mutated pods would float across the full cpuset in the same way as if +partitioning were not enabled and the cpushares would not be deducted +from available CPUs, potentially leading to overcommit scenarios.

If the PAO is not installed into a cluster with workload partitioning -enabled and configured, adding a new MachineConfigPool can be unsafe. -See the "Add a New MachineConfigPool" use case for details. +enabled and configured, adding a new MachineConfigPool can be unsafe. +The nodes in the pool will not enable partitioning at all: kubelet +will not advertise the workload resource +type but the admission plugin will mutate pods to require it. The +scheduler may refuse to place management workloads on the nodes in the +pool.
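For reference, the 4.8 implementation has the kubelet on a correctly configured node advertise the management workload capacity as an extended resource, which is what the scheduler consults when placing the mutated pods; a node missing the partitioning configuration advertises nothing for them to bind to. A sketch, with illustrative values:

```yaml
# Sketch only: what a partitioning-enabled node advertises. The
# quantity shown assumes an 8-CPU host expressed in millicores.
apiVersion: v1
kind: Node
metadata:
  name: worker-0
status:
  allocatable:
    management.workload.openshift.io/cores: "8000"
```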
From 45c76172d972dfc8d1c596f4c176af95e0eab289 Mon Sep 17 00:00:00 2001 From: Doug Hellmann Date: Tue, 15 Jun 2021 13:22:30 -0400 Subject: [PATCH 09/16] workload-partitioning/rationalizing-configuration: remove markmc as approver Mark asked to be removed as an approver for this enhancement. Signed-off-by: Doug Hellmann --- .../workload-partitioning/rationalizing-configuration.md | 1 - 1 file changed, 1 deletion(-)

diff --git a/enhancements/workload-partitioning/rationalizing-configuration.md b/enhancements/workload-partitioning/rationalizing-configuration.md index e927787512..c3b88a13af 100644 --- a/enhancements/workload-partitioning/rationalizing-configuration.md +++ b/enhancements/workload-partitioning/rationalizing-configuration.md @@ -12,7 +12,6 @@ reviewers: - maintainers of PAO - maintainers of MCO approvers: - - "@markmc" - "@derekwaynecarr" creation-date: 2021-04-26 status: implementable

From 6d3a60a6f539e4bb14a68b102a70824fd024b543 Mon Sep 17 00:00:00 2001 From: Doug Hellmann Date: Mon, 21 Jun 2021 15:41:37 -0400 Subject: [PATCH 10/16] workload-partitioning/rationalizing-configuration: non-goal for external control plane Signed-off-by: Doug Hellmann --- .../workload-partitioning/rationalizing-configuration.md | 1 + 1 file changed, 1 insertion(+)

diff --git a/enhancements/workload-partitioning/rationalizing-configuration.md b/enhancements/workload-partitioning/rationalizing-configuration.md index c3b88a13af..1659370bf6 100644 --- a/enhancements/workload-partitioning/rationalizing-configuration.md +++ b/enhancements/workload-partitioning/rationalizing-configuration.md @@ -82,6 +82,7 @@ the feature and to extend support to multi-node clusters. running cluster. We are going to continue to require the user to enable partitioning as part of deploying the cluster for another implementation phase. +3. Support for external control plane topologies.

## Proposal

From 51024f8693989e09a77ffd337d0544305d7c9f62 Mon Sep 17 00:00:00 2001 From: Doug Hellmann Date: Mon, 21 Jun 2021 15:44:47 -0400 Subject: [PATCH 11/16] workload-partitioning/rationalizing-configuration: precedent note Add a note based on Derek's comment that installing an OLM-based operator via an extra manifest is how third-party networking operators work. Signed-off-by: Doug Hellmann --- .../workload-partitioning/rationalizing-configuration.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/enhancements/workload-partitioning/rationalizing-configuration.md b/enhancements/workload-partitioning/rationalizing-configuration.md index 1659370bf6..06f1f20a64 100644 --- a/enhancements/workload-partitioning/rationalizing-configuration.md +++ b/enhancements/workload-partitioning/rationalizing-configuration.md @@ -145,7 +145,8 @@ deployment, and to simplify the configuration, we propose to extend the PAO with a `render` command, similar to the one in other operators, to have it generate manifests for the OpenShift installer to consume. This avoids any need to change the OpenShift installer to -support this uncommon use case. +support this uncommon use case, and is consistent with the way +third-party networking solutions are installed.

From 187e37932f7f9356acc9a8494b1b70020c6d94aa Mon Sep 17 00:00:00 2001 From: Doug Hellmann Date: Mon, 21 Jun 2021 15:48:06 -0400 Subject: [PATCH 12/16] workload-partitioning/rationalizing-configuration: clarify wording Clarify the wording about how the installer uses the PAO manifests.
Signed-off-by: Doug Hellmann --- .../workload-partitioning/rationalizing-configuration.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/enhancements/workload-partitioning/rationalizing-configuration.md b/enhancements/workload-partitioning/rationalizing-configuration.md index 06f1f20a64..dfaa55c666 100644 --- a/enhancements/workload-partitioning/rationalizing-configuration.md +++ b/enhancements/workload-partitioning/rationalizing-configuration.md @@ -144,7 +144,7 @@ windows). To continue to support enabling the feature during deployment, and to simplify the configuration, we propose to extend the PAO with a `render` command, similar to the one in other operators, to have it generate manifests for the OpenShift installer -to consume. This avoids any need to change the OpenShift installer to +to consume as extra manifests. This avoids any need to change the OpenShift installer to support this uncommon use case, and is consistent with the way third-party networking solutions are installed.

From e18c863161666c02b54e7140fff3e01c9ff118c8 Mon Sep 17 00:00:00 2001 From: Doug Hellmann Date: Mon, 21 Jun 2021 15:49:36 -0400 Subject: [PATCH 13/16] workload-partitioning/rationalizing-configuration: clarify PAO wording Make it clear that the PAO is already generating some of the manifests being described here. Signed-off-by: Doug Hellmann --- .../workload-partitioning/rationalizing-configuration.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/enhancements/workload-partitioning/rationalizing-configuration.md b/enhancements/workload-partitioning/rationalizing-configuration.md index dfaa55c666..7361c437a5 100644 --- a/enhancements/workload-partitioning/rationalizing-configuration.md +++ b/enhancements/workload-partitioning/rationalizing-configuration.md @@ -159,10 +159,10 @@ input. For each PerformanceProfile, the PAO will render MachineConfig manifests with higher precedence than the default ones created by the -installer during bootstrapping. These manifests will include all of -the details for configuring `kubelet`, CRI-O, and `tuned` to know -about and partition workloads, including the cpusets for each workload -type. +installer during bootstrapping, in the same way it does when running +in the cluster. These manifests will include all of the details for +configuring `kubelet`, CRI-O, and `tuned` to know about and partition +workloads, including the cpusets for each workload type.

From 2e5eb765d304e40b5f38157c4859d3a50ae67fba Mon Sep 17 00:00:00 2001 From: Doug Hellmann Date: Mon, 21 Jun 2021 15:56:17 -0400 Subject: [PATCH 14/16] workload-partitioning/rationalizing-configuration: clarify MCO step Clarify that a step performed by the MCO happens inside the cluster as a normal part of reconciliation. Signed-off-by: Doug Hellmann --- .../workload-partitioning/rationalizing-configuration.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/enhancements/workload-partitioning/rationalizing-configuration.md b/enhancements/workload-partitioning/rationalizing-configuration.md index 7361c437a5..8b175d4eff 100644 --- a/enhancements/workload-partitioning/rationalizing-configuration.md +++ b/enhancements/workload-partitioning/rationalizing-configuration.md @@ -240,7 +240,8 @@ to partition the workloads into specific CPUs. in the cluster. 8. Bootstrapping finishes and the cluster is launched. 9.
If the MCO does not apply KubeletConfig in step 7, it must do it - here. + here after the MCO is running in the cluster and can reconcile + normally. 10. The cluster has partial partitioning enabled for `management` workloads. Pods for management workloads are mutated, but not actually partitioned.

From 160aa5d53d648f4d043c475a44a94a17db3d1ae8 Mon Sep 17 00:00:00 2001 From: Doug Hellmann Date: Tue, 22 Jun 2021 14:10:36 -0400 Subject: [PATCH 15/16] workload-partitioning/rationalizing-configuration: more detail about read-only API Update the description of the API to include spec and status fields, and add a note about using an admission plugin to prevent changes or deletions. Signed-off-by: Doug Hellmann --- .../rationalizing-configuration.md | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/enhancements/workload-partitioning/rationalizing-configuration.md b/enhancements/workload-partitioning/rationalizing-configuration.md index 8b175d4eff..f3ea04bf99 100644 --- a/enhancements/workload-partitioning/rationalizing-configuration.md +++ b/enhancements/workload-partitioning/rationalizing-configuration.md @@ -99,9 +99,10 @@ enabled. We eventually want to extend the partitioning feature to support non-management workloads to include workload types defined by cluster admins. For now, however, we want to keep the interface as simple as -possible. Therefore, the new API will only have status fields and -will only be used so that the admission hook (and other consumers) can -tell when workload partitioning is active. +possible. Therefore, even though the new API will have both spec and +status fields, only the status fields will be used so that the +admission hook (and other consumers) can tell when workload +partitioning is active.
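Purely as an illustration of the shape this resource could take — the group, kind, and field names below are invented for this sketch, and the real schema is deferred to the separate API enhancement — the singleton object might look like:

```yaml
# Hypothetical sketch only; the actual group, kind, and field names
# will be defined in the follow-up openshift/api enhancement.
apiVersion: config.openshift.io/v1
kind: WorkloadPartitioning   # invented name for illustration
metadata:
  name: cluster
spec: {}                     # present but unused for now
status:
  # Consumers such as the admission hook read this list to decide
  # whether to mutate pods for a given workload type.
  workloads:
    - name: management
```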
To manage the risk associated with rolling out a complicated feature like this, we are going to continue to require it to be enabled as part of deploying the cluster. The API does not need to support configuring the workload partitioning in a running cluster, so no controller is needed, for now.

+Because of the extensive hardware-specific planning needed to use +partitioning successfully, there is no clear business case for ever +allowing the feature to be enabled in a running cluster. We will +therefore also include an admission hook to prevent anyone from +changing or deleting the API resource after it is created, following +the pattern established in +.

We propose that the API be owned by core OpenShift, and defined in the `openshift/api` repository. A separate enhancement will be written to describe the API in more detail.

From fe703ad7924eccf5eb27fcc50bc2ba44235a7727 Mon Sep 17 00:00:00 2001 From: Doug Hellmann Date: Tue, 22 Jun 2021 17:28:29 -0400 Subject: [PATCH 16/16] workload-partitioning/rationalizing-configuration: add risk about multiple MCPs Signed-off-by: Doug Hellmann --- .../workload-partitioning/rationalizing-configuration.md | 9 +++++++++ 1 file changed, 9 insertions(+)

diff --git a/enhancements/workload-partitioning/rationalizing-configuration.md b/enhancements/workload-partitioning/rationalizing-configuration.md index f3ea04bf99..a9fb2fd32d 100644 --- a/enhancements/workload-partitioning/rationalizing-configuration.md +++ b/enhancements/workload-partitioning/rationalizing-configuration.md @@ -420,6 +420,15 @@ type but the admission plugin will mutate pods to require it. The scheduler may refuse to place management workloads on the nodes in the pool.

+When a custom MachineConfigPool is used, it is possible to introduce +conflicts in content ownership between the custom pool and the base +worker pool. This affects workload partitioning because it relies on +kubelet and CRI-O configuration files that are already potential +sources of contention from different inputs. To mitigate this risk, we +recommend not placing any kubelet or CRI-O configuration in the base +worker settings and instead creating a separate pool for each +configuration variant.

## Design Details

### Open Questions