*: add support for cluster-etcd-operator #2730

hexfusion · 2019-11-27T22:38:51Z

In 4.4, cluster-etcd-operator will take over the process of bootstrapping the etcd cluster. To provide a clear path to disable or revert these changes, we have set up the following conditional logic.

MCO: The MCO render command invoked in bootkube has a new optional flag that passes the cluster-etcd-operator image[1]. Whether this flag's value is set[2] determines how the etcd-member static pod spec is rendered: it either uses the new bootstrapping method via the operator or falls back to the 4.3 SRV method.
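
A minimal sketch of that toggle, assuming the image pullspec arrives as a plain string that is empty when the operator path is disabled. The function name is hypothetical and is not taken from bootkube.sh; only the `--cluster-etcd-operator-image` flag name comes from this thread.

```shell
#!/usr/bin/env bash
# Sketch only: etcd_operator_render_args is a hypothetical helper,
# not the code in bootkube.sh.template.

# Print the extra render argument for a given operator image.
# An empty image prints nothing, so the render command runs without the
# flag and the 4.3 SRV bootstrap path stays in effect.
etcd_operator_render_args() {
    local image="${1:-}"
    if [ -n "${image}" ]; then
        printf '%s\n' "--cluster-etcd-operator-image=${image}"
    fi
}
```

Because the flag is only emitted when an image is supplied, reverting amounts to no longer supplying the image rather than editing the render invocation itself.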

Installer: The installer in 4.4 has a few notable changes introduced by this PR. First, the render command populates a static pod manifest that creates a single-member etcd cluster. Once the single-node cluster is up, we can progress and cluster-operator can be deployed. This shortens the time it takes for the operators to begin reconciling, because we no longer wait for all 3 etcd members to bootstrap before progressing the operators.

cluster-etcd-operator: CEO is currently set as Unmanaged[3]. This allows us to include the CEO in the CVO operator payload while its controllers perform no-ops. This short-term phase allows us to merge this PR, proving that CEO can be included in CVO while we still use the old SRV bootstrap.

Revert Plan: If a case existed where we had a design error and the operator needed to be pulled from 4.4.

[1] https://github.com/openshift/installer/pull/2608/files#diff-ce82c1d8a44f7dfc41dfc024085ccfeeR298
[2] https://github.com/openshift/machine-config-operator/blob/bd846958bc95d049547164046a962054fca093df/templates/master/00-master/_base/files/etc-kubernetes-manifests-etcd-member.yaml#L22
[3] https://github.com/openshift/cluster-etcd-operator/blob/master/manifests/0000_12_etcd-operator_01_operator.cr.yaml#L8

Depends on:

abhinavdahiya · 2019-11-27T22:46:11Z

data/data/bootstrap/files/usr/local/bin/bootkube.sh.template

Any reason why this can't be done in the `if` section above?

We need to be able to manage retry. There is a circumstance where we have already reached etcd-bootstrap.done but still need to perform the checks below. I felt this was more explicit than an elif. I can add a comment about retry, or create a single block with elif if you prefer.

Also, once CEO is stable this will be removed, as we will only deploy in a managed state.

Going to revisit and craft this so that the commit can be reverted.

addressed PTAL
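
For readers following the retry discussion, here is a rough sketch of the marker-file pattern it refers to. `run_stage` is a hypothetical helper and the layout is an assumption; it is not the code in bootkube.sh.template.

```shell
#!/usr/bin/env bash
# Sketch only: run_stage is a hypothetical helper illustrating
# "done" marker files; it is not taken from bootkube.sh.template.

# Run a command once. A marker file records success, so a restarted
# script skips the stage instead of repeating it. A failed command
# leaves no marker, so the stage is attempted again on the next run.
run_stage() {
    local marker="$1"
    shift
    if [ -f "${marker}" ]; then
        return 0
    fi
    "$@" && touch "${marker}"
}
```

When stages are written as separate statements, a rerun that finds a marker such as etcd-bootstrap.done already present still falls through to the checks that follow it. In an `if`/`elif` chain, matching the first branch would skip those later checks.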

hexfusion · 2019-11-27T23:56:11Z

/hold

We want to coordinate this going in.

hexfusion · 2019-11-28T00:46:25Z

Waiting for SSH key to propagate.
ssh: connect to host 34.74.221.244 port 22: Connection timed out
ERROR: (gcloud.compute.scp) Could not SSH into the instance. It is possible that your SSH key has not propagated to the instance yet. Try running this command again. If you still cannot connect, verify that the firewall and instance are set to accept ssh traffic.

rc=1

test 1 -eq 0

touch /home/packer/exit

exit 1

Looks like a flake

/test e2e-libvirt

abhinavdahiya · 2019-11-28T02:13:06Z

/test e2e-gcp

abhinavdahiya · 2019-11-28T02:13:17Z

/test e2e-azure

alaypatel07 · 2019-11-28T03:02:17Z

ssh: connect to host 35.229.48.184 port 22: Connection timed out
ERROR: (gcloud.compute.scp) Could not SSH into the instance.  It is possible that your SSH key has not propagated to the instance yet. Try running this command again.  If you still cannot connect, verify that the firewall and instance are set to accept ssh traffic.

trying libvirt again as suggested by CI

/test e2e-libvirt

metal3ci · 2019-11-28T03:26:44Z

Build SUCCESS, see build http://10.8.144.11:8080/job/dev-tools/1338/

alaypatel07 · 2019-11-28T03:55:58Z

/lgtm

Signed-off-by: Sam Batschelet <[email protected]>

abhinavdahiya · 2019-12-02T16:53:19Z

/test e2e-aws-upi

abhinavdahiya · 2019-12-02T16:53:30Z

/test e2e-gcp-upi

abhinavdahiya · 2019-12-02T16:53:41Z

/test e2e-vpshere

metal3ci · 2019-12-02T17:27:01Z

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/1342/

abhinavdahiya · 2019-12-02T17:53:58Z

LGTM

just want to see if any cluster variants are affected.

hexfusion · 2019-12-02T18:21:14Z

/retest

stbenjam · 2019-12-02T18:26:37Z

LGTM. baremetal IPI looks good with this change, thanks for checking. This probably simplifies a bunch of things for us, I filed #2740 to look at that.

hexfusion · 2019-12-02T19:52:32Z

@abhinavdahiya the last passing e2e-aws-upi run was 9/10[1]. Is it possible the test has regressed?

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2335/pull-ci-openshift-installer-master-e2e-aws-upi/10

hexfusion · 2019-12-02T19:56:47Z

/test e2e-aws-upi

hexfusion · 2019-12-02T20:30:42Z

Although this appears to be etcd-related, we have made no changes to the way etcd bootstrap should be working.

/test e2e-azure

hexfusion · 2019-12-02T20:34:10Z

GCP bootstrapped fine. I see some performance issues with etcd that are unrelated to this PR.

2019-12-02 19:02:42.736808 W | etcdserver: read-only range request "key:"/kubernetes.io/config.openshift.io/proxies/cluster" " with result "range_response_count:1 size:306" took too long (2.012425047s) to execute

/test e2e-gcp-upi

hexfusion · 2019-12-02T20:49:41Z

/test e2e-gcp-upi

abhinavdahiya · 2019-12-02T22:08:07Z

/lgtm

openshift-ci-robot · 2019-12-02T22:08:32Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, alaypatel07, crawford, hexfusion

Needs approval from an approver in each of these files:

~~OWNERS~~ [abhinavdahiya,crawford]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

hexfusion · 2019-12-02T22:21:01Z

Flaky; this has passed before using the same code.

/test e2e-azure

openshift-bot · 2019-12-02T22:24:24Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-ci-robot · 2019-12-03T01:27:50Z

@hexfusion: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-upi | `ba3abf6` | link | `/test e2e-aws-upi` |
| ci/prow/e2e-azure | `ba3abf6` | link | `/test e2e-azure` |
| ci/prow/e2e-aws-scaleup-rhel7 | `ba3abf6` | link | `/test e2e-aws-scaleup-rhel7` |
| ci/prow/e2e-gcp-upi | `ba3abf6` | link | `/test e2e-gcp-upi` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

rodrigc · 2019-12-05T02:10:42Z

@hexfusion @abhinavdahiya @jhixson74 I built openshift-install from source today

openshift-install unreleased-master-2224-gf55a4aa326115cfc8025cb15f9e11dfd4ff4b859
built from commit f55a4aa326115cfc8025cb15f9e11dfd4ff4b859
release image registry.svc.ci.openshift.org/origin/release:4.3

I tried to create a cluster on AWS, and the bootstrap failed.

I unpacked the logs-bundle tar.gz file, and found the following in all the log files in the bootstrap/pods directory:

Error: unknown flag: --cluster-etcd-operator-image
Usage:
  machine-config bootstrap [flags]

Flags:
      --additional-trust-bundle-config-file string   File containing the additional user provided CA bundle manifest. (default "/assets/manifests/user-ca-bundle-config.yaml")
      --baremetal-runtimecfg-image string            Image for baremetal-runtimecfg.
      --cloud-config-file string                     File containing the config map that contains the cloud config for cloudprovider.
      --config-file string                           ClusterConfig ConfigMap file.
      --coredns-image string                         Image for CoreDNS.
      --dest-dir string                              The destination directory where MCO writes the manifests.
      --etcd-ca string                               path to etcd CA certificate (default "/etc/ssl/etcd/ca.crt")
      --etcd-image string                            Image for Etcd.
      --etcd-metric-ca string                        path to etcd metric CA certificate (default "/assets/tls/etcd-metric-ca-bundle.crt")
      --haproxy-image string                         Image for haproxy.
  -h, --help                                         help for bootstrap
      --infra-config-file string                     File containing infrastructure.config.openshift.io manifest. (default "/assets/manifests/cluster-infrastructure-02-config.yml")
      --infra-image string                           Image for Infra Containers.
      --keepalived-image string                      Image for Keepalived.
      --kube-ca string                               path to kube-apiserver serving-ca bundle
      --kube-client-agent-image string               Image for Kube Client Agent.
      --machine-config-operator-image string         Image for Machine Config Operator.
      --machine-config-oscontent-image string        Image for osImageURL
      --mdns-publisher-image string                  Image for mdns-publisher.
      --network-config-file string                   File containing network.config.openshift.io manifest. (default "/assets/manifests/cluster-network-02-config.yml")
      --oauth-proxy-image string                     Image for origin oauth proxy.
      --proxy-config-file string                     File containing proxy.config.openshift.io manifest. (default "/assets/manifests/cluster-proxy-01-config.yaml")
      --pull-secret string                           path to secret manifest that contains pull secret. (default "/assets/manifests/pull.json")
      --root-ca string                               path to root CA certificate (default "/etc/ssl/kubernetes/ca.crt")

Global Flags:
      --alsologtostderr                  log to standard error as well as files
      --log_backtrace_at traceLocation   when logging hits line file:N, emit a stack trace (default :0)
      --log_dir string                   If non-empty, write log files in this directory
      --logtostderr                      log to standard error instead of files
      --stderrthreshold severity         logs at or above this threshold go to stderr (default 2)
  -v, --v Level                          log level for V logs
      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging

ERROR: logging before flag.Parse:

hexfusion · 2019-12-05T02:16:24Z

@rodrigc the master installer (4.4) is still using the 4.3 release payload. Your best bet is to explicitly pin both the release payload and the installer, for example[1]. Otherwise, you can override the installer's release with `OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=<release> <installer> create cluster`.

[1] https://openshift-release.svc.ci.openshift.org/releasestream/4.4.0-0.nightly/release/4.4.0-0.nightly-2019-12-04-171536
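
As a small illustration of the override, the wrapper below sets the variable for a single installer invocation only. The function name is hypothetical, and the release pullspec and installer path are placeholders supplied by the caller.

```shell
#!/usr/bin/env bash
# Sketch only: create_cluster_with_release is a hypothetical wrapper.
# The release pullspec and installer path are supplied by the caller.

# Run "<installer> create cluster" with the release payload pinned via
# OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE for this one invocation.
# Any further arguments are passed through to the installer unchanged.
create_cluster_with_release() {
    local release="$1"
    local installer="$2"
    shift 2
    OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE="${release}" "${installer}" create cluster "$@"
}
```

Setting the variable as a command prefix rather than exporting it keeps the override from leaking into later installer runs in the same shell session.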

rodrigc · 2019-12-05T02:22:30Z

@hexfusion ah ok. I periodically build openshift-install from master as a quick sanity check that my stuff still works on the latest pre-release OpenShift, so I don't get any surprises when the full release comes out.

rodrigc · 2019-12-05T14:19:03Z

@hexfusion I now built openshift-install from the release-4.3.0 branch:

openshift-install unreleased-master-2185-gb1f25627d286e0352f8dcb3e776e9a1b8266c5e6
built from commit b1f25627d286e0352f8dcb3e776e9a1b8266c5e6
release image registry.svc.ci.openshift.org/origin/release:4.3

I tried to create a cluster in AWS. It looks like the VMs are provisioned, but the Kubernetes control plane never fully comes up:

DEBUG error: the server doesn't have a resource type "nodes"
DEBUG error: the server doesn't have a resource type "machineconfigpools"
DEBUG error: the server doesn't have a resource type "pods"
DEBUG error: the server doesn't have a resource type "roles"
DEBUG error: the server doesn't have a resource type "rolebindings"
DEBUG error: the server doesn't have a resource type "secrets"
DEBUG error: the server doesn't have a resource type "secrets"
DEBUG Error from server (NotFound): the server could not find the requested resource
DEBUG error: the server doesn't have a resource type "services"
DEBUG Gather remote logs
DEBUG Collecting info from 10.0.142.100
DEBUG lost connection
DEBUG ssh: connect to host 10.0.142.100 port 22: Connection timed out
DEBUG Collecting info from 10.0.148.46
DEBUG lost connection
DEBUG ssh: connect to host 10.0.148.46 port 22: Connection timed out
DEBUG Collecting info from 10.0.161.41
DEBUG ssh: connect to host 10.0.161.41 port 22: Connection timed out
DEBUG Log bundle written to /var/home/core/log-bundle-20191205055220.tar.gz
INFO Bootstrap gather logs captured here "log-bundle-20191205055220.tar.gz"
FATAL Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded

In bootstrap/journals/bootkube.log

I see:

Dec 05 13:23:08 ip-10-0-7-4 podman[4475]: 2019-12-05 13:23:08.815993324 +0000 UTC m=+0.087935642 container create 91f6b2723510810af2925dfdce39fb960dcf5331a9cca55f55457047f03fc1f5 (i
Dec 05 13:23:09 ip-10-0-7-4 podman[4475]: 2019-12-05 13:23:09.157144216 +0000 UTC m=+0.429086728 container init 91f6b2723510810af2925dfdce39fb960dcf5331a9cca55f55457047f03fc1f5 (ima
Dec 05 13:23:09 ip-10-0-7-4 podman[4475]: 2019-12-05 13:23:09.17167471 +0000 UTC m=+0.443617216 container start 91f6b2723510810af2925dfdce39fb960dcf5331a9cca55f55457047f03fc1f5 (ima
Dec 05 13:23:09 ip-10-0-7-4 podman[4475]: 2019-12-05 13:23:09.171965599 +0000 UTC m=+0.443908297 container attach 91f6b2723510810af2925dfdce39fb960dcf5331a9cca55f55457047f03fc1f5 (i
Dec 05 13:23:14 ip-10-0-7-4 bootkube.sh[2265]: {"level":"warn","ts":"2019-12-05T13:23:14.181Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","
Dec 05 13:23:14 ip-10-0-7-4 bootkube.sh[2265]: {"level":"warn","ts":"2019-12-05T13:23:14.181Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","
Dec 05 13:23:14 ip-10-0-7-4 bootkube.sh[2265]: {"level":"warn","ts":"2019-12-05T13:23:14.181Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","
Dec 05 13:23:14 ip-10-0-7-4 bootkube.sh[2265]: https://etcd-2.craig-test-6.openshift.portworx.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Dec 05 13:23:14 ip-10-0-7-4 bootkube.sh[2265]: https://etcd-0.craig-test-6.openshift.portworx.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Dec 05 13:23:14 ip-10-0-7-4 bootkube.sh[2265]: https://etcd-1.craig-test-6.openshift.portworx.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Dec 05 13:23:14 ip-10-0-7-4 bootkube.sh[2265]: Error: unhealthy cluster
Dec 05 13:23:14 ip-10-0-7-4 podman[4475]: 2019-12-05 13:23:14.207387438 +0000 UTC m=+5.479330088 container died 91f6b2723510810af2925dfdce39fb960dcf5331a9cca55f55457047f03fc1f5 (ima
Dec 05 13:23:14 ip-10-0-7-4 podman[4475]: 2019-12-05 13:23:14.261652284 +0000 UTC m=+5.533595135 container remove 91f6b2723510810af2925dfdce39fb960dcf5331a9cca55f55457047f03fc1f5 (i
Dec 05 13:23:14 ip-10-0-7-4 bootkube.sh[2265]: etcdctl failed. Retrying in 5 seconds...

Can you recommend a value of OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=
to try this out with?

hexfusion mentioned this pull request Nov 27, 2019

[WIP] *: add logic for cluster-etcd-operator toggle #2608
