etcd: cluster-etcd-operator #56

hexfusion · 2019-10-01T12:44:10Z

This design may add additional information but the core is stable. I will hold open the option to make changes until the review process is complete.

etcd/cluster-etcd-operator.md

cgwalters · 2020-02-04T22:58:53Z

Today in OpenShift 4 we have a pretty strong separation between "control plane" and "workers", where "control plane" means apiserver+etcd.

This enhancement is proposing to separate apiserver and etcd (right?) which has a lot of impact to how we think of cluster installation and management. If e.g. an admin wants to scale up etcd to 5 nodes, that would become supported...but would things be left in a state where e.g. the original 3 "control plane" nodes would be in the MCO's master pool, and 2 other etcd instances would be in worker? So 2/5 would be scheduable by default - and in fact, the MCO would then feel free to try and drain/reboot potentially 1 master and 1 "etcd worker" at the same time. This wouldn't necessarily work if the CEO's PDB prevented it...but it's a bit ugly because the MCO will sit there trying to drain and not succeed.

(The way both the MCO and machine API both do drains today in a way that doesn't try to intelligently look at PDBs is a problem)

Further, clusters backed by machineAPI would probably want to at least have a different (e.g. more performant) machineset for etcd workers or so.

beekhof · 2020-02-04T23:34:21Z

Today in OpenShift 4 we have a pretty strong separation between "control plane" and "workers", where "control plane" means apiserver+etcd.

This enhancement is proposing to separate apiserver and etcd (right?)

Maybe things have changed but that wasn't the intent at the time.
Any etcd scaling events were still going to be limited to occurring on master nodes (the creation of which would be made significantly easier by this work).

etcd/cluster-etcd-operator.md

Signed-off-by: Sam Batschelet <[email protected]>

hexfusion · 2020-02-05T22:45:00Z

This enhancement is proposing to separate apiserver and etcd (right?) which has a lot of impact to how we think of cluster installation and management. I

We do not intend to separate etcd from master control-plane nodes. In the case where we would choose to support a 5 etcd cluster. All etcd instances would remain on the master nodes, thus requiring 5 masters. The pairing of etcd to apiserver would remain the same on the master nodes.

In the initial design, we looked to leverage MCO and maintain the static pod spec in machine-config. But the reality is we need to have the ability to adjust our spec in the same manner that the kube-apiserver operator iterates with its static pod resources. This allows us to perform actions such as cert rotation and make on the fly adjustments to our staticpod, for example during an upgrade.

To conclude we are going to be making some changes to the initial design to account for these issues.

etcd/cluster-etcd-operator.md

deads2k · 2020-02-06T14:39:06Z

This document is showing the current state. It's merging late, but we cannot argue with reality.

@hexfusion fix location and formatting, we'll merge this as representing the current state and then manipulate from there as we go.

Signed-off-by: Sam Batschelet <[email protected]>

hexfusion · 2020-02-06T16:59:42Z

@deads2k updated

retroflexer · 2020-02-06T18:21:14Z

enhancements/etcd/cluster-etcd-operator.md

+the notReadyAddress state without further investigation. When `etcd-member` Pod starts the init
+containers will wait for various observations to continue. Because we are waiting in a not Ready
+state we can be simply waiting for direction from the operator. To validate this we ensure that
+the Pods containers are not currently in `CrashLoopBackoff`. If we pass the `isPodCrashLoop`[1


Missing reference for [1]. It should be:
https://github.com/openshift/cluster-etcd-operator/blob/release-4.4/pkg/operator/configobservation/etcd/observe_etcd.go#L339

retroflexer · 2020-02-06T18:21:45Z

enhancements/etcd/cluster-etcd-operator.md

+exited 0, because inits containers including certs are now complete we can begin the process of
+scaling etcd via clustermembercontroller and the status is updated to `MemberReady`.
+
+[1] https://github.com/openshift/cluster-etcd-operator/blob/release-4.4/pkg/operator/configobservation/etcd/observe_etcd.go#L192


This should be reference [2], as it points to isPendingReady.

retroflexer · 2020-02-06T18:22:52Z

enhancements/etcd/cluster-etcd-operator.md

+`Cluster` API. The `staticpod` controller will also respond by stopping the static Pod, removing
+the existing data-dir and then starting the static Pod. The discovery init container then will
+wait for status now observed as `Unknown` to change to Add before it continues
+. `configobservation` mcontroller observes the Pod and confirms etcd-member container is no


s/mcontroller/controller/

cgwalters · 2020-02-06T18:24:31Z

All etcd instances would remain on the master nodes, thus requiring 5 masters.

Hum. But today masters are provisioned via the installer, we don't have machineAPI support for scaling them etc. Is this proposal envisioning a future that supports that, and short term anyone who wants to do it can do it manually by booting new machines using the master Ignition, and what CEO is doing here is just ensuring that etcd is taken care of rather than having that be manual too?

deads2k · 2020-02-06T18:55:15Z

Hum. But today masters are provisioned via the installer, we don't have machineAPI support for scaling them etc. Is this proposal envisioning a future that supports that, and short term anyone who wants to do it can do it manually by booting new machines using the master Ignition, and what CEO is doing here is just ensuring that etcd is taken care of rather than having that be manual too?

That's the goal. It hasn't yet been tested.

deads2k · 2020-02-06T18:55:34Z

gotta start somewhere. This reflects actual state.

/lgtm

openshift-ci-robot · 2020-02-06T18:56:46Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, hexfusion

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [deads2k]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 1, 2019

openshift-ci-robot requested review from chancez and dgoodwin October 1, 2019 12:44

hexfusion force-pushed the add_ceo branch 2 times, most recently from 26db67a to 7dac572 Compare October 3, 2019 01:25

alaypatel07 reviewed Oct 3, 2019

View reviewed changes

etcd/cluster-etcd-operator.md Outdated Show resolved Hide resolved

hexfusion force-pushed the add_ceo branch 3 times, most recently from 1f91764 to ee0bf6c Compare October 3, 2019 13:18

hexfusion changed the title ~~WIP etcd: cluster-etcd-operator~~ etcd: cluster-etcd-operator Oct 3, 2019

openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 3, 2019

hexfusion force-pushed the add_ceo branch 3 times, most recently from aaeb78f to 279f188 Compare October 3, 2019 16:11

hexfusion changed the title ~~etcd: cluster-etcd-operator~~ WIP etcd: cluster-etcd-operator Oct 3, 2019

openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 3, 2019

ericavonb reviewed Oct 3, 2019

View reviewed changes

etcd/cluster-etcd-operator.md Outdated Show resolved Hide resolved

beekhof suggested changes Oct 4, 2019

View reviewed changes

abhinavdahiya mentioned this pull request Oct 10, 2019

oVirt IPI enhancement document #61

Merged

cybertron reviewed Oct 14, 2019

View reviewed changes

etcd/cluster-etcd-operator.md Outdated Show resolved Hide resolved

etcd/cluster-etcd-operator.md Outdated Show resolved Hide resolved

hexfusion mentioned this pull request Oct 25, 2019

*: welcome cluster-etcd-operator openshift/cluster-etcd-operator#21

Merged

hexfusion changed the title ~~WIP etcd: cluster-etcd-operator~~ etcd: cluster-etcd-operator Feb 4, 2020

openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 4, 2020

retroflexer reviewed Feb 4, 2020

View reviewed changes

etcd/cluster-etcd-operator.md Outdated Show resolved Hide resolved

hexfusion force-pushed the add_ceo branch 2 times, most recently from dbcdbff to ab2e906 Compare February 5, 2020 22:29

etcd/cluster-etcd-operator: add design details

f147781

Signed-off-by: Sam Batschelet <[email protected]>

hexfusion force-pushed the add_ceo branch from ab2e906 to f147781 Compare February 5, 2020 22:30

deads2k reviewed Feb 6, 2020

View reviewed changes

etcd/cluster-etcd-operator.md Show resolved Hide resolved

deads2k reviewed Feb 6, 2020

View reviewed changes

etcd/cluster-etcd-operator.md Outdated Show resolved Hide resolved

add hardwrap 100 and move to enhancements/.

2a59214

Signed-off-by: Sam Batschelet <[email protected]>

openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 6, 2020

retroflexer reviewed Feb 6, 2020

View reviewed changes

openshift-ci-robot assigned deads2k Feb 6, 2020

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 6, 2020

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 6, 2020

openshift-merge-robot merged commit 40bf2bf into openshift:master Feb 6, 2020

hexfusion deleted the add_ceo branch March 22, 2020 00:58

etcd: cluster-etcd-operator #56

etcd: cluster-etcd-operator #56

Uh oh!

Conversation

hexfusion commented Oct 1, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cgwalters commented Feb 4, 2020

Uh oh!

beekhof commented Feb 4, 2020

Uh oh!

Uh oh!

hexfusion commented Feb 5, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

deads2k commented Feb 6, 2020

Uh oh!

hexfusion commented Feb 6, 2020

Uh oh!

retroflexer Feb 6, 2020

Choose a reason for hiding this comment

Uh oh!

retroflexer Feb 6, 2020

Choose a reason for hiding this comment

Uh oh!

retroflexer Feb 6, 2020

Choose a reason for hiding this comment

Uh oh!

cgwalters commented Feb 6, 2020

Uh oh!

deads2k commented Feb 6, 2020

Uh oh!

deads2k commented Feb 6, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci-robot commented Feb 6, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

10 participants

hexfusion commented Oct 1, 2019 •

edited

Loading

hexfusion commented Feb 5, 2020 •

edited

Loading

deads2k commented Feb 6, 2020 •

edited

Loading