---
title: wide-availability-workload-partitioning
authors:
  - "@eggfoobar"
reviewers:
  - TBD
approvers:
  - TBD
api-approvers:
  - TBD
creation-date: 2022-08-03
last-updated: 2022-08-08
tracking-link:
  - https://issues.redhat.com/browse/CNF-5562
see-also:
  - "/enhancements/workload-partitioning"
  - "/enhancements/node-tuning/pao-in-nto.md"
---

# Wide Availability Workload Partitioning

## Summary

This enhancement builds on top of the [Management Workload
Partitioning](management-workload-partitioning.md) and the [move of PAO into
NTO](../node-tuning/pao-in-nto.md) enhancements to provide the ability to apply
workload partitioning to our wider cluster configurations. The previous
workload partitioning work was limited to Single Node cluster configurations;
this enhancement seeks to allow customers to configure workload partitioning on
HA as well as Compact (3NC) clusters.

## Motivation

Customers who want us to reduce the resource consumption of management
workloads have a fixed budget of CPU cores in mind. We want to use the normal
scheduling capabilities of kubernetes to manage the number of pods that can be
placed onto those cores, and we want to avoid mixing management and normal
workloads there. Expanding on the already built workload partitioning, we
should be able to supply the same functionality to HA and 3NC clusters.

### User Stories

As a cluster creator, I want to pin the management pods of OpenShift in
compact (3NC) and HA clusters to specific CPU sets, so that the platform
workloads are isolated from my application workloads, which require high
performance and determinism.

### Goals

- This enhancement describes an approach for configuring OpenShift clusters to
  run with management workloads on a restricted set of CPUs.
- Clusters built in this way should pass the same kubernetes and OpenShift
  conformance and functional end-to-end tests as similar deployments that are
  not isolating the management workloads.
- We want to be able to run different workload partitioning on masters and
  workers. Customers will be advised to reserve 4 hyperthreaded cores for
  masters and 2 hyperthreaded cores for workers (see the sketch after this
  list).
- We want a general approach that can be applied to all OpenShift control plane
  and per-node components via the PerformanceProfile.
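
For illustration, a minimal sketch of what that master/worker split could look
like, assuming the existing `nodeSelector` mechanism of the `PerformanceProfile`
CRD is used to target each pool and the `workloads.enablePinning` field proposed
below is available. The profile names and CPU set values are hypothetical:

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: workload-partitioning-master
spec:
  cpu:
    # 4 hyperthreaded cores (8 vCPUs) reserved for management workloads
    reserved: 0-7
    isolated: 8-63
    workloads:
      enablePinning: true
  nodeSelector:
    node-role.kubernetes.io/master: ""
---
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: workload-partitioning-worker
spec:
  cpu:
    # 2 hyperthreaded cores (4 vCPUs) reserved for management workloads
    reserved: 0-3
    isolated: 4-63
    workloads:
      enablePinning: true
  nodeSelector:
    node-role.kubernetes.io/worker: ""
```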

### Non-Goals

This enhancement expands on the existing [Management Workload
Partitioning](management-workload-partitioning.md) and as such shares similar,
but slightly different, non-goals:

- This enhancement is focused on CPU resources. Other compressible resource
  types may need to be managed in the future, and those are likely to need
  different approaches.
- This enhancement does not address mixed node partitioning; this feature will
  be enabled cluster wide and encompass both master and worker pools. If
  pinning is not desired on a node, the setting will still be turned on, but
  the management workloads will run on the whole CPU set for that node.
- This enhancement does not address non-compressible resource requests, such as
  for memory.
- This enhancement does not address ways to disable operators or operands
  entirely.
- This enhancement does not address reducing actual utilization, beyond
  providing a way to have a predictable upper bound. There is no expectation
  that a cluster configured to use a small number of cores for management
  services would offer exactly the same performance as the default. It must be
  stable and continue to operate reliably, but may respond more slowly.
- This enhancement assumes that the configuration of a management CPU pool is
  done as part of installing the cluster. It can be changed after the fact, but
  we will need to stipulate that doing so is currently not supported. The
  intent here is for this to be supported as a day 0 feature.
- This enhancement describes partitioning concepts that could be expanded to be
  used for other purposes. Use cases for partitioning workloads for other
  purposes may be addressed by future enhancements.

## Proposal

In order to implement this enhancement we are focused on changing 2 components:

1. The admission controller ([management cpus
   override](https://github.com/openshift/kubernetes/blob/a9d6306a701d8fa89a548aa7e132603f6cd89275/openshift-kube-apiserver/admission/autoscaling/managementcpusoverride/doc.go))
   in openshift/kubernetes.
1. The
   [PerformanceProfile](https://github.com/openshift/cluster-node-tuning-operator/blob/master/docs/performanceprofile/performance_profile.md)
   part of the [Cluster Node Tuning
   Operator](https://github.com/openshift/cluster-node-tuning-operator).

We want to remove the check in the admission controller that ensures
partitioning is only applied to the single node topology configuration. The
design and configuration for any pod modification will remain the same; we
simply will allow partitioning to be applied on non single node topologies.

Workload pinning involves configuring CRI-O and the Kubelet. Currently, this is
done through a machine config that contains both of those configurations. This
can pose problems, as the CPU set value has to be copied into those two other
configurations. We want to simplify the current implementation and apply both
of these configurations via the `PerformanceProfile` CRD.

We want to add a new `workloads` field under the `cpu` field that contains the
configuration information for `enablePinning`. We are not sure where we would
want to take workload pinning in the future, so to allow flexibility we want to
place the configuration under `cpu.workloads`.

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: openshift-node-workload-partitioning-custom
spec:
  cpu:
    isolated: 2-3
    reserved: 0,1
    # New addition
    workloads:
      enablePinning: true
```

### Workflow Description

The end user will be expected to provide a `PerformanceProfile` manifest that
describes their desired `isolated` and `reserved` CPU sets, with the
`workloads.enablePinning` flag set to true. This manifest will be applied
during the installation process.

**High level sequence diagram:**

```mermaid
sequenceDiagram
    Alice->>Installer: Provide PerformanceProfile manifest
    Installer-->>NTO: Apply
    NTO-->>MCO: Generated Machine Manifests
    MCO-->>Node: Configure node
    loop Apply
        Node->>Node: Set kubelet config
        Node->>Node: Set crio config
        Node->>Node: Kubelet advertises cores
    end
    Node-->>MCO: Finished Restart
    MCO-->>NTO: Machine Manifests Applied
    NTO-->>Installer: PerformanceProfile Applied
    Installer-->>Alice: Cluster is Up!
```

- **Alice** is a human user who creates an OpenShift cluster.
- **Installer** is the assisted installer that applies the user manifest.
- **NTO** is the Node Tuning Operator.
- **MCO** is the Machine Config Operator.
- **Node** is the kubernetes node.

1. Alice sits down and provides the desired performance profile as an extra
   manifest to the installer.
1. The installer applies the manifest.
1. The NTO generates the appropriate machine configs that include the kubelet
   config and the crio config to be applied, in addition to its existing
   operation.
1. Once the MCO applies the configs, the node is restarted and the cluster
   installation continues to completion.
1. Alice now has a cluster that has been set up with workload pinning.

#### Variation [optional]

This section outlines an end-to-end workflow for deploying a cluster with
workload partitioning enabled and how pods are correctly scheduled to run on
the management CPU pool. (A before/after sketch of the pod mutation follows
this list.)

1. User sits down at their computer.
1. The user creates a `PerformanceProfile` resource with the desired `isolated`
   and `reserved` CPU sets and with `cpu.workloads.enablePinning` set to true.
1. The user runs the installer to create the standard manifests, adds their
   extra manifest from step 2, then creates the cluster.
1. NTO generates the machine config manifests and applies them.
1. The kubelet starts up and finds the configuration file enabling the new
   feature.
1. The kubelet advertises `management.workload.openshift.io/cores` extended
   resources on the node based on the number of CPUs in the host.
1. The kubelet reads static pod definitions. It replaces the `cpu` requests
   with `management.workload.openshift.io/cores` requests of the same value and
   adds the `resources.workload.openshift.io/{container-name}: {"cpushares":
   400}` annotations for CRI-O with the same values.
1. Something schedules a regular pod with the
   `target.workload.openshift.io/management` annotation in a namespace with the
   `workload.openshift.io/allowed: management` annotation.
1. The admission hook modifies the pod, replacing the CPU requests with
   `management.workload.openshift.io/cores` requests and adding the
   `resources.workload.openshift.io/{container-name}: {"cpushares": 400}`
   annotations for CRI-O.
1. The scheduler sees the new pod and finds available
   `management.workload.openshift.io/cores` resources on the node. The
   scheduler places the pod on the node.
1. Repeat steps 8-10 until all pods are running.
1. Cluster deployment comes up with management components constrained to a
   subset of available CPUs.
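
To make the mutation concrete, here is a hedged before/after sketch of a pod
passing through the admission hook, using the annotation and resource names
above. The namespace, pod name, image, and request values are hypothetical:

```yaml
# Before admission: a pod opted in to management partitioning. Its namespace
# is assumed to carry the workload.openshift.io/allowed: management annotation.
apiVersion: v1
kind: Pod
metadata:
  name: example-operator
  namespace: openshift-example
  annotations:
    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
spec:
  containers:
  - name: operator
    image: quay.io/example/operator:latest
    resources:
      requests:
        cpu: 400m
        memory: 100Mi
---
# After admission: the cpu request is replaced by the extended resource of the
# same value, and a CRI-O hint annotation is added; memory is left untouched.
apiVersion: v1
kind: Pod
metadata:
  name: example-operator
  namespace: openshift-example
  annotations:
    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
    resources.workload.openshift.io/operator: '{"cpushares": 400}'
spec:
  containers:
  - name: operator
    image: quay.io/example/operator:latest
    resources:
      requests:
        management.workload.openshift.io/cores: "400"
        memory: 100Mi
```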

### API Extensions

- We want to extend the `PerformanceProfile` API to include a new `workloads`
  configuration under the `cpu` field.
- The behavior of existing resources should not change with this addition.
- New resources that make use of this new field will have the current machine
  config generated along with the additional machine config manifests.
- The `isolated` CPU set is used to add the CRI-O and Kubelet configuration
  files to the currently generated machine config.
- If no `isolated` CPU set is provided and `enablePinning` is set to true, the
  default behavior is to use the full CPU set, as if workloads were not pinned.

Example change:

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: openshift-node-workload-partitioning-custom
spec:
  cpu:
    isolated: 2-3
    reserved: 0,1
    # New addition
    workloads:
      enablePinning: true
```

### Implementation Details/Notes/Constraints [optional]

#### Changes to NTO

The NTO `PerformanceProfile` will be updated to support a new flag which will
toggle the pinning of workloads to the `isolated` cores. The idea here is to
simplify the approach for how customers set this configuration. With PAO now
being part of NTO ([see here for more info](../node-tuning/pao-in-nto.md)),
this affords us the chance to consolidate the configuration for `kubelet` and
`crio`.

We will modify the code path that generates the [new machine
config](https://github.com/openshift/cluster-node-tuning-operator/blob/a780dfe07962ad07e4d50c852047ef8cf7b287da/pkg/performanceprofile/controller/performanceprofile/components/machineconfig/machineconfig.go#L91-L127)
from the performance profile. When the new `workloads.enablePinning` flag is
set, we will add the configuration for `crio` and `kubelet` to the final
machine config manifest. The existing code path will then apply the change as
normal. (A sketch of such a generated manifest follows.)
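
For illustration, a minimal sketch of what the generated machine config could
look like, assuming NTO embeds the same CRI-O and Kubelet pinning files used by
the existing single node implementation. The manifest name, role, and encoded
contents are hypothetical placeholders:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 02-master-workload-partitioning
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      # CRI-O workload configuration; its CPU set is derived from the profile,
      # so the value no longer needs to be copied by hand
      - path: /etc/crio/crio.conf.d/01-workload-partitioning
        mode: 420
        contents:
          source: data:text/plain;charset=utf-8;base64,<encoded CRI-O config>
      # Kubelet pinning configuration carrying the same CPU set
      - path: /etc/kubernetes/openshift-workload-pinning
        mode: 420
        contents:
          source: data:text/plain;charset=utf-8;base64,<encoded kubelet config>
```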

#### API Server Admission Hook

The existing admission hook has 4 checks when it comes to workload pinning:

1. Checks if the `pod` is a static pod
   - Skips the modification attempt if it is static.
1. Checks if the currently running cluster topology is Single Node
   - Skips modification if it is anything other than Single Node.
1. Checks if all running nodes are managed
   - Skips modification if any of the nodes are not managed.
1. Checks what resource limits and requests are set on the pod
   - Skips modification if the QoS class is Guaranteed or both limits and
     requests are set.
   - Skips modification if the update would change the pod's QoS class.

We will need to alter the code in the admission controller to remove the check
for Single Node topology; the other checks should remain untouched. (A sketch
of a pod that check 4 leaves unmodified follows.)
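
As an illustration of check 4, a hypothetical pod whose equal requests and
limits give it the Guaranteed QoS class; the hook would skip it even though it
carries the management annotation. All names and values here are invented:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example
  namespace: openshift-example
  annotations:
    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
spec:
  containers:
  - name: app
    image: quay.io/example/app:latest
    resources:
      # requests == limits for every resource => Guaranteed QoS,
      # so the admission hook skips modification
      requests:
        cpu: 500m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 128Mi
```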

### Risks and Mitigations

The same risks and mitigations highlighted in [Management Workload
Partitioning](management-workload-partitioning.md) apply to this enhancement as
well.

We need to make it very clear to customers that this feature is supported as a
day 0 configuration and that day n+1 alterations are not supported with this
enhancement. Part of that messaging should involve a clear indication that this
should be a cluster wide feature.

### Drawbacks

This feature carries the same drawbacks as [Management Workload
Partitioning](management-workload-partitioning.md).

Several of the changes described above are patches that we may end up carrying
downstream indefinitely. Some version of a more general "CPU pool" feature may
be acceptable upstream, and we could reimplement management workload
partitioning to use that new implementation.

## Design Details

### Open Questions [optional]

N/A

### Test Plan

We will add a CI job with a cluster configuration that reflects the minimum of
2 CPU/4 vCPU masters and 1 CPU/2 vCPU workers. This job should ensure that
cluster deployments configured with management workload partitioning pass the
compliance tests.

We will add a CI job to ensure that all release payload workloads have the
`target.workload.openshift.io/management` annotation and their namespaces have
the `workload.openshift.io/allowed` annotation.

### Graduation Criteria

#### Dev Preview -> Tech Preview

- Ability to utilize the enhancement end to end
- End user documentation, relative API stability
- Sufficient test coverage
- Gather feedback from users rather than just developers
- Enumerate service level indicators (SLIs), expose SLIs as metrics
- Write symptoms-based alerts for the component(s)

#### Tech Preview -> GA

- More testing (upgrade, downgrade, scale)
- Sufficient time for feedback
- Available by default
- Backhaul SLI telemetry
- Document SLOs for the component
- Conduct load testing
- User facing documentation created in
  [openshift-docs](https://github.com/openshift/openshift-docs/)

#### Removing a deprecated feature

- Announce deprecation and support policy of the existing feature
- Deprecate the feature

### Upgrade / Downgrade Strategy

This new behavior will be added in 4.12 as part of the installation
configurations for customers to utilize.

Enabling the feature after installation is not supported in 4.12, so we do not
need to address what happens if an older cluster upgrades and then the feature
is turned on.

### Version Skew Strategy

N/A

### Operational Aspects of API Extensions

The addition to the API is an optional field which should not require any
conversion or admission webhook changes. This change will only be used to allow
the user to explicitly define their intent and to simplify the machine
manifests by generating the extra machine manifests that are currently being
created independently of the `PerformanceProfile` CRD.

Furthermore, the design and scope of this enhancement mean that the existing
admission webhook will continue to apply the same warnings and error messages
to Pods as described in the [failure modes](#failure-modes).

#### Failure Modes

In a failure situation, we want to try to keep the cluster operational.
Therefore, there are a few conditions under which the admission hook will strip
the workload annotations and add a `workload.openshift.io/warning` annotation
with a message warning the user that their partitioning instructions were
ignored (a sketch of such a pod follows this list). These conditions are:

1. When a pod has the Guaranteed QoS class
1. When mutation would change the QoS class for the pod
1. When the feature is inactive because not all nodes are reporting the
   management resource
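
For illustration, a hypothetical pod after the hook has stripped the workload
annotations under the first condition. Only the annotation key comes from this
proposal; the warning message text and all names are invented:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-operator
  namespace: openshift-example
  annotations:
    # target.workload.openshift.io/management has been stripped; the pod keeps
    # its plain cpu requests and is scheduled normally
    workload.openshift.io/warning: >-
      workload partitioning instructions ignored: pod has Guaranteed QoS
spec:
  containers:
  - name: operator
    image: quay.io/example/operator:latest
    resources:
      requests:
        cpu: 400m
        memory: 100Mi
      limits:
        cpu: 400m
        memory: 100Mi
```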

#### Support Procedures

N/A

## Implementation History

WIP

## Alternatives

N/A

## Infrastructure Needed [optional]

N/A