
Commit 2af83e6

Merge pull request kubernetes#48 from smarterclayton/controlplane
Compact Clusters enhancement
2 parents 84dfcdd + aba2d0f

1 file changed: enhancements/compact-clusters.md (+267, −0)

---
title: Compact Clusters
authors:
  - "@smarterclayton"
reviewers:
  - "@derekwaynecarr"
  - "@ironcladlou"
  - "@crawford"
  - "@hexfusion"
approvers:
  - "@derekwaynecarr"
creation-date: "2019-09-26"
last-updated: "2019-09-26"
status: implementable
see-also:
replaces:
superseded-by:
---

# Compact Clusters

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [x] Design details are appropriately documented from clear requirements
- [x] Test plan is defined
- [x] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift/docs]

## Summary

OpenShift is intended to run in a wide variety of environments. Over the years
we have refined and reduced the default required footprint: first by removing
the hard requirement for infrastructure nodes in 4.1, and second by preparing
for 3-node clusters in 4.2 with the introduction of the `mastersSchedulable` flag.
OpenShift should continue to take steps to reduce the default footprint and
incentivize improvements that allow it to work with smaller 3-node and, eventually,
single-node clusters.

## Motivation

This proposal covers the majority of near-term improvements to OpenShift to
allow it to fit within smaller footprints.

Our near-term goal is to continue to drive down the control plane and node
footprint and make three-node clusters a viable production deployment strategy
on both cloud and metal, as well as to exert engineering pressure to reduce our
default CPU and memory usage so that OpenShift fits on smaller machines.

At the same time, it should be easy to deploy 3-node clusters on cloud providers,
and since we prefer to test the way our users run, our CI environments should
test that way as well.

### Goals

* Ensure 3-node clusters work correctly in cloud and metal environments in UPI and IPI
* Enforce guard rails via testing that prevent regressions in 3-node environments
* Describe the prerequisites to support single-node clusters in a future release
* Clarify the supported behavior for single-node today
* Identify future improvements that may be required for more resource-constrained environments
* Incentivize end users to leverage our dynamic compute and autoscaling capabilities in cloud environments

### Non-Goals

* Supporting any topology other than three masters for production clusters at this time

## Proposal

### User Stories

#### Story 1

As a user of OpenShift, I should be able to install a 3-node cluster (no workers)
in cloud environments via the IPI or UPI installation paths and have a fully
conformant OpenShift cluster (all features enabled).
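
For illustration, a minimal `install-config.yaml` sketch of this topology, assuming the installer accepts zero compute replicas on the chosen platform; the domain, cluster name, region, and secret values are placeholders:

```yaml
apiVersion: v1
baseDomain: example.com            # placeholder
metadata:
  name: compact                    # placeholder cluster name
controlPlane:
  name: master
  replicas: 3
compute:
- name: worker
  replicas: 0                      # no dedicated worker pool; workloads land on masters
platform:
  aws:
    region: us-east-1              # placeholder
pullSecret: '<redacted>'
sshKey: '<redacted>'
```

With no worker pool defined, the control plane nodes must remain schedulable for regular workloads (see Story 4).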

#### Story 2

As a developer of OpenShift, I should be able to run smaller clusters that are more
resource efficient for iterative testing while still being conformant.

#### Story 3

As a developer of OpenShift, I should be exposed early to bugs and failures caused by
assumptions about cluster size or topology, so that my feature additions do not break
small cluster topologies.

#### Story 4

As an admin of an OpenShift cluster deployed onto three nodes, I should be able to easily
transition to a larger cluster and make the control plane unschedulable.
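
As a sketch of that transition, the cluster `Scheduler` configuration already exposes a `mastersSchedulable` field; a compact cluster runs with it set to `true`, and an admin growing the cluster can flip it once dedicated workers exist. The resource shown is the existing `config.openshift.io/v1` API; only the workflow framing is illustrative:

```yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  # true while the cluster is compact and workloads run on masters;
  # set to false after worker nodes join to make the control plane unschedulable
  mastersSchedulable: false
```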

### Risks and Mitigations

This proposal is low risk because it depends only on already approved upstream
functionality and is already supportable on bare metal.

## Design Details

### Enabling three-node clusters in the cloud

OpenShift 4.3 should pass a full e2e run on a 3-node cluster in the cloud.

The primary limitation blocking this in 4.2 and earlier was a limitation in
Kubernetes that prevented service load balancers from targeting the master nodes
when the node-role for masters was set. This was identified as a bug, and in
Kube 1.16 we [clarified that this should be fixed by moving the functionality to
explicit labels](https://github.com/kubernetes/enhancements/issues/1143) and
[added two new alpha labels](https://github.com/kubernetes/kubernetes/pull/80238)
that keep the old behavior achievable. The new alpha feature gate `LegacyNodeRoleBehavior`
controls the old check; turning it off allows service load balancers to target masters.
The risk of regression is low because the code path is simple, and all usage of
the gate would be internal. We would make this the default in 4.3.
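
For illustration only, a hedged sketch of how such a gate override could be expressed through the `FeatureGate` config API, assuming the `CustomNoUpgrade` feature set is available in the target release; the proposal expects the platform to carry this default internally rather than requiring admins to set it:

```yaml
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: CustomNoUpgrade        # assumption: custom gate overrides are permitted here
  customNoUpgrade:
    disabled:
    - LegacyNodeRoleBehavior         # turn off the legacy node-role exclusion check
```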

The IPI and UPI installation flows and documentation must ensure that service load
balancers can correctly target master machines when a 3-node cluster is created, for
example by making sure security groups are applied to the masters. It may be desirable
in 4.3 to switch the default ingress publishing strategy to NodePort across all
appropriate clouds, removing the need to listen on the host in these environments,
but that may not be strictly required.
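
A sketch of what the NodePort option could look like for the default ingress controller, assuming the `NodePortService` publishing strategy is available in the ingress operator API for the target release:

```yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  replicas: 2
  endpointPublishingStrategy:
    type: NodePortService    # publish router endpoints via node ports instead of host networking
```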

In order to ensure this is successful and testable by default, some of the default
e2e-aws runs should be switched to a 3-node cluster configuration with no worker pool
defined. Tests that fail should be corrected. Resource limits that are insufficient
to handle the workload should be bumped. Any hardcoded logic that breaks in this
configuration should be fixed.

At the end of this work, we should be able to declare a 3-node 4.3 cluster in the
cloud as fully supported.

### Support for one-node control planes

OpenShift 3.x supported single-node "all-in-one" clusters via `oc cluster up`,
`openshift-ansible`, and other more opinionated configurations like pre-baked VMs.
OpenShift 4, with its different control plane strategy, intentionally diverges from
the static preconfiguration of 3.x and is geared towards highly available control
planes and upgrades. To tolerate single control-plane node upgrades we would need
to fix a number of assumptions in core operators. In addition, we wish to introduce
the cluster-etcd-operator to allow for master replacement, and to build on the
capabilities that operator will introduce.

For the 4.3 timeframe the following statements are true for single-node clusters:

* The installer allows creation of single-master control planes
* Some components may not completely roll out because they require separate nodes
* Users may be required to disable certain operators and override their replica counts
* Upgrades may hang or pause because operators assume at least 2 control plane members are available at all times

In general it will remain possible, with some work, to get a single-node cluster in
4.3, but it is not supported for upgrade or production use and will continue to have
a list of known caveats.
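
For illustration, the installer accepts a single-master topology with the same `install-config.yaml` shape shown in Story 1; per the caveats above, this is not supported for production use or upgrades:

```yaml
controlPlane:
  name: master
  replicas: 1     # accepted by the installer, but unsupported for upgrades or production
compute:
- name: worker
  replicas: 0
```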

The current rollout plan for replacement of masters is via the cluster-etcd-operator,
which will dynamically manage a cluster's quorum from a single instance during
bootstrap up to a full HA cluster, updating the quorum as new masters are created.
Because we expect the operator to work in a single etcd member configuration during
bootstrap and then grow to three nodes, we must ensure the single-member configuration
continues to work. That opens the door for future versions of OpenShift to run a larger
and more complete control plane on the bootstrap node, and we anticipate leveraging that
flow to enable support of single-node control planes. We would prefer not to provide
workarounds for single-node clusters in advance of that functionality.

A future release would gradually reduce restrictions on single-node clusters to
ensure all operators function correctly and to allow upgrades, possibly including
but not limited to the following work:

* A pattern for core operators that have components requiring more than one instance to:
  * Only require one instance
  * Colocate both instances on the single machine
  * Survive machine reboot
* Allowing components like ingress that depend on HostNetwork to move to NodePorts
* Ensuring that when outages are required (e.g. restarting the apiserver) the duration is minimal and the component reliably comes back up
* Ensuring any components that check membership can successfully upgrade single instances (a 1-instance machine config pool can upgrade)

In the short term, we recommend that users who want single machines running containers
use RHCoS machines with podman, static pods, and systemd units, with Ignition to
configure those machines. You can also create machine pools and have these remote
machines launch static pods and systemd units without running workload pods by
excluding normal infrastructure containers.
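
A hedged sketch of the machine-pool flavor of this, using the machine-config-operator's `MachineConfig` API with an Ignition-style systemd unit; the object name, pool role, unit name, and container image are placeholders:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-podman-workload              # placeholder name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 2.2.0
    systemd:
      units:
      - name: example-workload.service         # placeholder unit
        enabled: true
        contents: |
          [Unit]
          Description=Run an example workload container with podman
          After=network-online.target

          [Service]
          ExecStart=/usr/bin/podman run --rm --name example quay.io/example/app:latest
          Restart=always

          [Install]
          WantedBy=multi-user.target
```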

### Reducing resource requirements of the core cluster

The OpenShift control plane CPU, memory, and disk requirements are largely dominated
by etcd (I/O and memory), kube-apiserver (CPU and memory), and prometheus (CPU and memory).
Per-node components scale with cluster size `O(node)` in terms of CPU and memory on
the apiservers, and components like the scheduler, controller manager, and other
core controllers scale with `O(workload)` on themselves and the apiservers. Almost all
controller/operator components have some fixed overhead in CPU and memory, and in
requests made to the apiserver.

In small clusters the fixed overhead dominates and is the primary optimization target.
These overheads include:

* memory use by etcd and apiserver objects (caches) for all default objects
* memory use by controllers (caches) for all default objects
* fixed watches for controller config `O(config_types * controllers)`
* kubelet and cri-o maintenance CPU and memory costs monitoring control plane pods
* cluster monitoring CPU and memory scraping the control plane components
* components that are made highly available but which could be rescheduled if one machine goes down, such that true HA may not be necessary

Given the choice between dropping function that is valuable but expensive, or optimizing that function to be more efficient by default, **we should prefer to optimize that valuable function first before making it optional.**

The short term wins are:

* Continue to optimize components to reduce writes per second, memory in use, and excess caching
* Investigate default usage of the top components (kube-apiserver and prometheus) and fix issues
* In small clusters, consider automatically switching to a single instance for prometheus AND ensure prometheus is rescheduled if the machine fails

The following optimizations are unlikely to dramatically improve resource usage or platform health and are non-goals:

* Disabling many semi-optional components (like ingress or image-registry) that have low resource usage
* Running fewer instances of core Kube control plane components and adding complexity to recovery or debuggability

### Test Plan

We should test in cloud environments with 3-node clusters, using machines that represent
our minimum target, and ensure our e2e tests (which stress the platform control plane)
pass reliably.

We should ensure that in 3-node clusters disruptive events (losing a machine, recovering
etcd quorum) do not compromise function of the product.

We should add tests that catch regressions in resource usage.
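
As one hedged illustration of such a guard rail (the rule name, placement, target namespaces, and the 12 GiB budget are hypothetical; the `PrometheusRule` API and the cAdvisor metric are standard), a budget alert over control plane memory might look like:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: control-plane-resource-budget          # hypothetical
  namespace: openshift-monitoring
spec:
  groups:
  - name: compact-cluster-guardrails
    rules:
    - alert: ControlPlaneMemoryBudgetExceeded
      expr: |
        sum(container_memory_working_set_bytes{namespace=~"openshift-etcd|openshift-kube-apiserver|openshift-monitoring"})
          > 12 * 1024 * 1024 * 1024
      for: 30m
      labels:
        severity: warning
```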

### Graduation Criteria

This is an evolving plan; it is considered a core requirement for the product to remain
at shippable GA quality.

### Upgrade / Downgrade Strategy

In the future, this section will be updated to describe how single-master control planes
might upgrade. At the current time this is not supported.

If a user upgrades to 4.3 on a cloud, then migrates workloads to masters, then reverts
back, some function will break. Since that is an opt-in action, that is acceptable.

### Version Skew Strategy

Changing the Kubernetes 1.16 feature gate defaults for master placement will impact
a customer who assumes that masters cannot use service load balancers. However, nodes
that don't run those workloads are excluded from the load balancer automatically, so
the biggest risk is that customers have configured their scheduling policies
inconsistently with their network topology, which is already out of our control. We
believe there is no impact here for master exclusion.

All other components described here have no skew requirements.

## Implementation History

## Drawbacks

We could conceivably never support clusters in the cloud that run like our compact
on-premise clusters (which we have already started supporting for tech preview).

We may be impacted if the Kubernetes work around master placement is reverted, which
is very unlikely. The feature gates would be the only impact.

## Alternatives

We considered and rejected running single masters outside of RHCoS, as this loses
a large chunk of value.
