---
title: Compact Clusters
authors:
  - "@smarterclayton"
reviewers:
  - "@derekwaynecarr"
  - "@ironcladlou"
  - "@crawford"
  - "@hexfusion"
approvers:
  - "@derekwaynecarr"
creation-date: "2019-09-26"
last-updated: "2019-09-26"
status: implementable
see-also:
replaces:
superseded-by:
---

# Compact Clusters

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [x] Design details are appropriately documented from clear requirements
- [x] Test plan is defined
- [x] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift/docs]

## Summary

OpenShift is intended to run in a wide variety of environments. Over the years
we have refined and reduced the default required footprint - first by removing
the hard requirement for infrastructure nodes in 4.1, and second by preparing
for 3-node clusters in 4.2 with the introduction of the `mastersSchedulable` flag.
OpenShift should continue to take steps to reduce the default footprint and
incentivize improvements that allow it to work with smaller 3-node and eventually
single-node clusters.

## Motivation

This proposal covers the majority of near term improvements to OpenShift to
allow it to fit within smaller footprints.

Our near term goal is to continue to drive down the control plane and node
footprint and make three node clusters a viable production deployment strategy
on both cloud and metal, as well as exert engineering pressure on reducing our default
resource usage in terms of CPU and memory to fit on smaller machines.

At the same time, it should be easy to deploy 3-node clusters on cloud providers,
and since we prefer to test as our users will run, our CI environments should test
that way as well.

### Goals

* Ensure 3-node clusters work correctly in cloud and metal environments in UPI and IPI
* Enforce guard rails via testing that prevent regressions in 3-node environments
* Describe the prerequisites to support single-node clusters in a future release
* Clarify the supported behavior for single-node today
* Identify future improvements that may be required for more resource constrained environments
* Incentivize end users to leverage our dynamic compute and autoscaling capabilities in cloud environments

### Non-Goals

* Supporting any topology other than three masters for production clusters at this time

## Proposal

### User Stories

#### Story 1

As a user of OpenShift, I should be able to install a 3-node cluster (no workers)
on cloud environments via the IPI or UPI installation paths and have a fully conformant
OpenShift cluster (all features enabled).

#### Story 2

As a developer of OpenShift, I should be able to run smaller clusters that are more
resource efficient for iterative testing while still being conformant.

#### Story 3

As a developer of OpenShift, I should be exposed early to bugs and failures caused by
assumptions about cluster size or topology, so that my feature additions do not break
small cluster topologies.

#### Story 4

As an admin of an OpenShift cluster deployed onto three nodes, I should be able to easily
transition to a larger cluster and make the control plane unschedulable.
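
For illustration, the transition described here is expected to be a single
configuration change on the cluster-scoped scheduler resource introduced with the
`mastersSchedulable` work in 4.2. A minimal sketch (field names per the
`schedulers.config.openshift.io` API, shown only as an example of the intended flow):

```yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  # Once dedicated workers exist, stop scheduling regular workloads on masters.
  mastersSchedulable: false
```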

### Risks and Mitigations

This proposal is low risk because it depends only on already approved upstream
functionality and is already supportable on bare metal.

## Design Details

### Enabling three-node clusters in the cloud

OpenShift 4.3 should pass a full e2e run on a 3-node cluster in the cloud.

The primary limitation blocking this in 4.2 and earlier was a limitation in
Kubernetes that prevented service load balancers from targeting the master nodes
when the node-role for masters was set. This was identified as a bug and in
Kube 1.16 we [clarified that this should be fixed by moving the functionality to
explicit labels](https://github.com/kubernetes/enhancements/issues/1143) and
[added two new alpha labels](https://github.com/kubernetes/kubernetes/pull/80238)
that made the old behavior still achievable. The new alpha feature gate `LegacyNodeRoleBehavior`
disables the old check, which would allow service load balancers to target masters.
The risk of regression is low because the code path is simple, and all usage of
the gate would be internal. We would enable this gate by default in 4.3.
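
Purely to illustrate the mechanism (the proposal expects the gate to be wired up
internally by the relevant operators rather than set by administrators), a hedged
sketch of how a gate such as `LegacyNodeRoleBehavior` could be expressed through the
cluster `FeatureGate` resource, assuming the `CustomNoUpgrade` feature set is
available in the target release:

```yaml
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: CustomNoUpgrade
  customNoUpgrade:
    # Illustration only: which list the gate belongs in (enabled vs. disabled)
    # to produce the new placement behavior is an implementation detail, and
    # CustomNoUpgrade blocks future upgrades, so this is not a recommended path.
    enabled:
    - LegacyNodeRoleBehavior
```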

The IPI and UPI installation paths and documentation must ensure that service load balancers
can correctly target master machines if a 3-node cluster is created, by ensuring
security groups are correctly targeted. It may be desirable to switch the default
ingress to publish via NodePort to remove the need to listen on the host in
these environments in 4.3 across all appropriate clouds, but that may not be
strictly required.
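
If the default ingress were moved off the host network, a hedged sketch of what that
might look like through the ingress operator's `IngressController` resource, assuming a
`NodePortService` endpoint publishing strategy is available in the target release:

```yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  # Publish the routers through a NodePort service instead of binding host ports.
  endpointPublishingStrategy:
    type: NodePortService
```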

In order to ensure this is successful and testable by default, some of the default
e2e-aws runs should be switched to a 3-node cluster configuration with no worker pool
defined (see the sketch below). Tests that fail should be corrected. Resource limits
that are insufficient to handle the workload should be bumped. Any hardcoded logic
that breaks in this configuration should be fixed.
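
As an illustration, that configuration can be expressed entirely in the install
config by setting the compute pool to zero replicas; a minimal, abbreviated sketch
(base domain, cluster name, and region are placeholders):

```yaml
# install-config.yaml (abbreviated)
apiVersion: v1
baseDomain: example.com      # placeholder
metadata:
  name: compact              # placeholder
controlPlane:
  name: master
  replicas: 3
compute:
- name: worker
  replicas: 0                # no dedicated workers; the masters run workloads
platform:
  aws:
    region: us-east-1        # placeholder
```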

At the end of this work, we should be able to declare a 3-node 4.3 cluster in the
cloud as fully supported.

### Support for one-node control planes

OpenShift 3.x supported single node "all-in-one" clusters via `oc cluster up`,
`openshift-ansible`, and other more opinionated configurations like pre-baked VMs.
OpenShift 4, with its different control plane strategy, intentionally diverges from
the static preconfiguration of 3.x and is instead geared towards highly available control
planes and upgrades. To tolerate single control-plane node upgrades we would need
to fix a number of assumptions in core operators. In addition, we wish to introduce
the cluster-etcd-operator to allow for master replacement and to build on the
capabilities that operator will introduce.

For the 4.3 timeframe the following statements are true for single-node clusters:

* The installer allows creation of single-master control planes
* Some components may not completely roll out because they require separate nodes
  * Users may be required to disable certain operators and override their replica counts (see the sketch after this list)
* Upgrades may hang or pause because operators assume at least 2 control plane members are available at all times
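
For reference, the usual escape hatch for taking a component out of the
cluster-version operator's hands is a `ClusterVersion` override, after which its
replica count can be reduced by hand. A hedged sketch (the deployment named here is
purely illustrative; the component a user actually needs to hand-manage depends on
which operator fails to roll out):

```yaml
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  name: version
spec:
  overrides:
  # Illustrative target only: mark one workload unmanaged so its replica count
  # can be overridden manually on a single-node cluster.
  - kind: Deployment
    group: apps
    namespace: openshift-example   # hypothetical namespace
    name: example-operand          # hypothetical deployment
    unmanaged: true
```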

In general it will remain possible, with some work, to get a single-node cluster running
in 4.3, but it is not supported for upgrade or production use and will continue to have
a list of known caveats.

The current rollout plan for replacement of masters is via the cluster-etcd-operator,
which will dynamically manage the quorum of a cluster, growing it from a single instance
during bootstrap to a full HA cluster and updating the quorum as new masters are created.
Because we expect the operator to work in a single etcd member configuration during
bootstrap and then grow to three nodes, we must ensure the single member configuration
continues to work. That opens the door for future versions of OpenShift to run a larger and
more complete control plane on the bootstrap node, and we anticipate leveraging that
flow to enable support of single-node control planes. We would prefer not to provide
workarounds for single-node clusters in advance of that functionality.

A future release would gradually reduce restrictions on single-node clusters to
ensure all operators function correctly and to allow upgrades, possibly including
but not limited to the following work:

* A pattern for core operators whose components require more than one instance to:
  * Only require one instance
  * Colocate both instances on the single machine
  * Survive machine reboot
* Allowing components like ingress that depend on HostNetwork to move to NodePorts
* Ensuring that when outages are required (for example, restarting the apiserver) the duration is minimal and the component reliably comes back up
* Ensuring any components that check membership can successfully upgrade single instances (a 1-instance machine config pool can upgrade)

In the short term, we recommend users who want single machines running containers
to use RHCOS machines with podman, static pods, and systemd units, using Ignition to
configure those machines. It is also possible to create machine pools whose remote
machines launch static pods and systemd units without running workload pods, by
excluding normal infrastructure containers.
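
A hedged sketch of that pattern, expressed as a MachineConfig that lays down a
systemd unit running a container directly with podman (the pool label, unit name, and
image are placeholders, and the Ignition version shown assumes the spec in use by the
machine-config-operator at the time):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-example-podman-workload
  labels:
    machineconfiguration.openshift.io/role: worker   # placeholder pool
spec:
  config:
    ignition:
      version: 2.2.0
    systemd:
      units:
      - name: example-workload.service               # hypothetical unit
        enabled: true
        contents: |
          [Unit]
          Description=Example workload run directly by podman
          After=network-online.target

          [Service]
          ExecStartPre=-/usr/bin/podman rm -f example
          ExecStart=/usr/bin/podman run --name example quay.io/example/app:latest
          Restart=always

          [Install]
          WantedBy=multi-user.target
```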

### Reducing resource requirements of the core cluster

The OpenShift control plane CPU, memory, and disk requirements are largely dominated
by etcd (I/O and memory), kube-apiserver (CPU and memory), and prometheus (CPU and memory).
Per node components scale with cluster size `O(node)` in terms of CPU and memory on
the apiservers, and components like the scheduler, controller manager, and other
core controllers scale with `O(workload)` on themselves and the apiservers. Almost all
controller/operator components have some fixed overhead in CPU and memory, and in the
requests they make to the apiserver.

In small clusters the fixed overhead dominates and is the primary optimization target.
These overheads include:

* memory use by etcd and apiserver objects (caches) for all default objects
* memory use by controllers (caches) for all default objects
* fixed watches for controller config `O(config_types * controllers)`
* kubelet and cri-o maintenance CPU and memory costs monitoring control plane pods
* cluster monitoring CPU and memory scraping the control plane components
* components that are made highly available but which could instead be rescheduled if one machine goes down, such that true HA may not be necessary

Given the choice between dropping a function that is valuable but expensive and optimizing that function to be more efficient by default, **we should prefer to optimize that valuable function first before making it optional.**

The short term wins are:

* Continue to optimize components to reduce writes per second, memory in use, and excess caching
* Investigate default usage of the top components - kube-apiserver and prometheus - and fix issues
* In small clusters, consider automatically switching to a single instance for prometheus AND ensure prometheus is rescheduled if the machine fails

The following optimizations are unlikely to dramatically improve resource usage or platform health and are non-goals:

* Disabling many semi-optional components (like ingress or image-registry) that have low resource usage
* Running fewer instances of core Kube control plane components and adding complexity to recovery or debuggability

### Test Plan

We should test in cloud environments with 3-node clusters with machines that represent
our minimum target and ensure our e2e tests (which stress the platform control plane)
pass reliably.

We should ensure that in 3-node clusters disruptive events (losing a machine, recovering
etcd quorum) do not compromise function of the product.

We should add tests that catch regressions in resource usage.

### Graduation Criteria

This is an evolving plan; it is considered a core requirement for the product to remain
at shippable GA quality.

### Upgrade / Downgrade Strategy

In the future, this section will be updated to describe how single-master control planes
might upgrade. At the current time this is not supported.

If a user upgrades to 4.3 on a cloud, then migrates workloads to the masters, and then
reverts back, some functionality will break. Since that is an opt-in action, that is acceptable.

### Version Skew Strategy

Enabling the Kubernetes 1.16 feature gates by default for master placement will impact
a customer who assumes that masters cannot use service load balancers. However, nodes
that don't run those workloads are excluded from the load balancer automatically, so
the biggest risk is that customers have configured their scheduling policies inconsistently
with their network topology, which is already out of our control. We believe there is
no impact here for master exclusion.

All other components described here have no skew requirements.

## Implementation History

## Drawbacks

We could conceivably never support clusters in the cloud that run like our compact
on-premise clusters (which we have already started supporting for tech preview).

We may be impacted if the upstream Kubernetes master placement changes are reverted, which
is very unlikely. The feature gates would be the only impact.


## Alternatives

We considered and rejected running single masters outside of RHCOS, but this loses
a large chunk of value.