@hexfusion
Contributor

@hexfusion hexfusion commented Nov 27, 2019

In 4.4, cluster-etcd-operator will take over bootstrapping the etcd cluster. To provide a clear path to disable or revert these changes, we have set up the following conditional logic.

MCO: The MCO render command invoked in bootkube has a new optional flag for passing the cluster-etcd-operator image[1]. Whether this flag's value is set[2] determines how the etcd-member static pod spec is rendered: it either uses the new operator-driven bootstrapping method or falls back to the 4.3 SRV method.
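
A minimal sketch of the conditional described above, written as shell because bootkube is a shell script. The function name `build_render_args` and its variables are hypothetical and not taken from this PR; only the `--etcd-image` and `--cluster-etcd-operator-image` flag names appear in this thread. It illustrates passing the optional flag only when an operator image is available:

```shell
# Hypothetical helper, for illustration only: prints one render argument
# per line. $1 is the etcd image, $2 is the optional operator image.
build_render_args() {
    etcd_image="$1"
    ceo_image="${2:-}"
    printf '%s\n' "--etcd-image=${etcd_image}"
    # Pass the optional flag only when an operator image was supplied.
    # Without it, the rendered pod spec would use the 4.3 SRV method.
    if [ -n "${ceo_image}" ]; then
        printf '%s\n' "--cluster-etcd-operator-image=${ceo_image}"
    fi
}
```

Calling `build_render_args example/etcd` prints only the `--etcd-image` argument, which corresponds to the fallback path.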

Installer: The installer in 4.4 has a few notable changes introduced by this PR. First, the render command populates a static pod manifest that creates a single-member etcd cluster. Once that single-node cluster is up, we can progress and cluster-operator can be deployed. This shortens the time before the operators begin to reconcile, as we no longer wait for all 3 etcd members to bootstrap before progressing the operators.

cluster-etcd-operator: CEO is currently set as Unmanaged[3]. This allows us to include the CEO in the CVO operator payload while its controllers perform a no-op. This short-term phase lets us merge this PR, proving that CEO can be included in the CVO payload while the old SRV bootstrap is still used.

Revert Plan: If a case existed where we had a design error and the operator needed to be pulled from 4.4:

[1] https://github.com/openshift/installer/pull/2608/files#diff-ce82c1d8a44f7dfc41dfc024085ccfeeR298
[2] https://github.com/openshift/machine-config-operator/blob/bd846958bc95d049547164046a962054fca093df/templates/master/00-master/_base/files/etc-kubernetes-manifests-etcd-member.yaml#L22
[3] https://github.com/openshift/cluster-etcd-operator/blob/master/manifests/0000_12_etcd-operator_01_operator.cr.yaml#L8

Depends on:

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 27, 2019
Contributor

Any reason why this can't be done in the if section above?

Contributor Author

We need to be able to manage retry. There is a circumstance where etcd-bootstrap.done already exists but we still need to perform the checks below. I felt a separate block was more explicit than elif. I can add a comment about retry or create a single block with elif if you prefer.
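
The retry concern can be sketched as follows. The actual bootkube code under review is not quoted in this thread, so `bootstrap_step`, `followup_checks`, and `run_pass` are hypothetical stand-ins; only the `etcd-bootstrap.done` marker name comes from the comment above. The point is that two independent `if` blocks let the checks run on a retry where the marker already exists, which an `if`/`elif` pair would skip on the first pass:

```shell
# Hypothetical stand-ins for the real bootstrap and check steps.
bootstrap_step() { echo "bootstrap"; }
followup_checks() { echo "checks"; }

# One pass of the script; $1 is the path to the marker file.
run_pass() {
    marker="$1"
    # Block 1: one-time step, guarded by the marker file.
    if [ ! -f "${marker}" ]; then
        bootstrap_step
        touch "${marker}"
    fi
    # Block 2: a separate `if` rather than `elif`, so the checks run both
    # right after the first bootstrap and on any later retry.
    if [ -f "${marker}" ]; then
        followup_checks
    fi
}
```

With a fresh marker path, the first `run_pass` call prints both steps and a second call prints only the checks.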

Contributor Author

Also, once CEO is stable, this will be removed, as we will only deploy in a managed state.

Contributor Author

Going to revisit and craft this in a manner where the commit can be reverted.

Contributor Author

Addressed, PTAL.

@hexfusion hexfusion changed the title from "WIP *: add support for cluster-etcd-operator" to "*: add support for cluster-etcd-operator" Nov 27, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 27, 2019
@hexfusion
Contributor Author

/hold

We want to coordinate this going in.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 27, 2019
@hexfusion
Contributor Author

Waiting for SSH key to propagate.
ssh: connect to host 34.74.221.244 port 22: Connection timed out
ERROR: (gcloud.compute.scp) Could not SSH into the instance. It is possible that your SSH key has not propagated to the instance yet. Try running this command again. If you still cannot connect, verify that the firewall and instance are set to accept ssh traffic.

+ rc=1
+ test 1 -eq 0
+ touch /home/packer/exit
+ exit 1

Looks like a flake

/test e2e-libvirt

@abhinavdahiya
Contributor

/test e2e-gcp

@abhinavdahiya
Contributor

/test e2e-azure

@abhinavdahiya abhinavdahiya added the platform/baremetal IPI bare metal hosts platform label Nov 28, 2019
@alaypatel07
Contributor

ssh: connect to host 35.229.48.184 port 22: Connection timed out
ERROR: (gcloud.compute.scp) Could not SSH into the instance.  It is possible that your SSH key has not propagated to the instance yet. Try running this command again.  If you still cannot connect, verify that the firewall and instance are set to accept ssh traffic.

trying libvirt again as suggested by CI

/test e2e-libvirt

@metal3ci

Build SUCCESS, see build http://10.8.144.11:8080/job/dev-tools/1338/

@alaypatel07
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 28, 2019
@abhinavdahiya
Contributor

/test e2e-aws-upi

@abhinavdahiya
Contributor

/test e2e-gcp-upi

@abhinavdahiya
Contributor

/test e2e-vsphere

@metal3ci

metal3ci commented Dec 2, 2019

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/1342/

@abhinavdahiya
Contributor

LGTM

just want to see if any cluster variants are affected.

@hexfusion
Contributor Author

/retest

@stbenjam
Member

stbenjam commented Dec 2, 2019

LGTM. baremetal IPI looks good with this change, thanks for checking. This probably simplifies a bunch of things for us, I filed #2740 to look at that.

@hexfusion
Contributor Author

/test e2e-aws-upi

@hexfusion
Contributor Author

hexfusion commented Dec 2, 2019

Although this appears to be etcd-related, we have made no changes to the way etcd bootstrap should be working.

/test e2e-azure

@hexfusion
Contributor Author

GCP bootstrapped fine. I see some performance issues with etcd that are unrelated to this PR.

2019-12-02 19:02:42.736808 W | etcdserver: read-only range request "key:"/kubernetes.io/config.openshift.io/proxies/cluster" " with result "range_response_count:1 size:306" took too long (2.012425047s) to execute

/test e2e-gcp-upi

@hexfusion
Contributor Author

/test e2e-gcp-upi

@abhinavdahiya
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 2, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, alaypatel07, crawford, hexfusion

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,crawford]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hexfusion
Contributor Author

Flaky; this has passed before using the same code.

/test e2e-azure

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 79dae77 into openshift:master Dec 2, 2019
@openshift-ci-robot
Contributor

@hexfusion: The following tests failed, say /retest to rerun them all:

Test name                      Commit   Details  Rerun command
ci/prow/e2e-aws-upi            ba3abf6  link     /test e2e-aws-upi
ci/prow/e2e-azure              ba3abf6  link     /test e2e-azure
ci/prow/e2e-aws-scaleup-rhel7  ba3abf6  link     /test e2e-aws-scaleup-rhel7
ci/prow/e2e-gcp-upi            ba3abf6  link     /test e2e-gcp-upi

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

@rodrigc

rodrigc commented Dec 5, 2019

@hexfusion @abhinavdahiya @jhixson74 I built openshift-install from source today

openshift-install unreleased-master-2224-gf55a4aa326115cfc8025cb15f9e11dfd4ff4b859
built from commit f55a4aa326115cfc8025cb15f9e11dfd4ff4b859
release image registry.svc.ci.openshift.org/origin/release:4.3

I tried to create a cluster on AWS, and the bootstrap failed.

I unpacked the logs-bundle tar.gz file, and found the following in all the log files in the bootstrap/pods directory:

Error: unknown flag: --cluster-etcd-operator-image
Usage:
  machine-config bootstrap [flags]

Flags:
      --additional-trust-bundle-config-file string   File containing the additional user provided CA bundle manifest. (default "/assets/manifests/user-ca-bundle-config.yaml")
      --baremetal-runtimecfg-image string            Image for baremetal-runtimecfg.
      --cloud-config-file string                     File containing the config map that contains the cloud config for cloudprovider.
      --config-file string                           ClusterConfig ConfigMap file.
      --coredns-image string                         Image for CoreDNS.
      --dest-dir string                              The destination directory where MCO writes the manifests.
      --etcd-ca string                               path to etcd CA certificate (default "/etc/ssl/etcd/ca.crt")
      --etcd-image string                            Image for Etcd.
      --etcd-metric-ca string                        path to etcd metric CA certificate (default "/assets/tls/etcd-metric-ca-bundle.crt")
      --haproxy-image string                         Image for haproxy.
  -h, --help                                         help for bootstrap
      --infra-config-file string                     File containing infrastructure.config.openshift.io manifest. (default "/assets/manifests/cluster-infrastructure-02-config.yml")
      --infra-image string                           Image for Infra Containers.
      --keepalived-image string                      Image for Keepalived.
      --kube-ca string                               path to kube-apiserver serving-ca bundle
      --kube-client-agent-image string               Image for Kube Client Agent.
      --machine-config-operator-image string         Image for Machine Config Operator.
      --machine-config-oscontent-image string        Image for osImageURL
      --mdns-publisher-image string                  Image for mdns-publisher.
      --network-config-file string                   File containing network.config.openshift.io manifest. (default "/assets/manifests/cluster-network-02-config.yml")
      --oauth-proxy-image string                     Image for origin oauth proxy.
      --proxy-config-file string                     File containing proxy.config.openshift.io manifest. (default "/assets/manifests/cluster-proxy-01-config.yaml")
      --pull-secret string                           path to secret manifest that contains pull secret. (default "/assets/manifests/pull.json")
      --root-ca string                               path to root CA certificate (default "/etc/ssl/kubernetes/ca.crt")

Global Flags:
      --alsologtostderr                  log to standard error as well as files
      --log_backtrace_at traceLocation   when logging hits line file:N, emit a stack trace (default :0)
      --log_dir string                   If non-empty, write log files in this directory
      --logtostderr                      log to standard error instead of files
      --stderrthreshold severity         logs at or above this threshold go to stderr (default 2)
  -v, --v Level                          log level for V logs
      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging

ERROR: logging before flag.Parse:

@hexfusion
Contributor Author

hexfusion commented Dec 5, 2019

@rodrigc the master installer (4.4) is still using the 4.3 release payload. Your best bet is to explicitly pin both the release payload and the installer, for example [1]. Otherwise, you can override the installer's release image with OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=<release> <installer> create cluster.

[1] https://openshift-release.svc.ci.openshift.org/releasestream/4.4.0-0.nightly/release/4.4.0-0.nightly-2019-12-04-171536
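
The override can be sketched as a small wrapper. The function name `print_install_cmd` and the pull spec in the usage note are placeholders, not values from this thread. The sketch prints the command line instead of running it, so a real release pull spec can be substituted and reviewed first:

```shell
# Hypothetical wrapper, for illustration only: refuses to build the
# command unless a release image is given explicitly as $1.
print_install_cmd() {
    release="${1:-}"
    if [ -z "${release}" ]; then
        echo "a release image pull spec is required" >&2
        return 1
    fi
    # Print rather than execute, so the pairing of installer and
    # release payload can be checked before a cluster is created.
    printf 'OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=%s openshift-install create cluster\n' "${release}"
}
```

For example, `print_install_cmd registry.example.com/release:tag` prints the full command with that placeholder pull spec filled in.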

@hexfusion hexfusion deleted the ceo_v2 branch December 5, 2019 02:16
@rodrigc

rodrigc commented Dec 5, 2019

@hexfusion Ah, OK. I periodically build openshift-install from master as a quick sanity check that my stuff still works on the latest pre-release OpenShift, so I don't get any surprises when the full release comes out.

@rodrigc

rodrigc commented Dec 5, 2019

@hexfusion I now built openshift-install from the release-4.3.0 branch:

openshift-install unreleased-master-2185-gb1f25627d286e0352f8dcb3e776e9a1b8266c5e6
built from commit b1f25627d286e0352f8dcb3e776e9a1b8266c5e6
release image registry.svc.ci.openshift.org/origin/release:4.3

I tried to create a cluster in AWS. It looks like the VMs are provisioned, but the Kubernetes control plane never fully comes up:

DEBUG error: the server doesn't have a resource type "nodes"
DEBUG error: the server doesn't have a resource type "machineconfigpools"
DEBUG error: the server doesn't have a resource type "pods"
DEBUG error: the server doesn't have a resource type "roles"
DEBUG error: the server doesn't have a resource type "rolebindings"
DEBUG error: the server doesn't have a resource type "secrets"
DEBUG error: the server doesn't have a resource type "secrets"
DEBUG Error from server (NotFound): the server could not find the requested resource
DEBUG error: the server doesn't have a resource type "services"
DEBUG Gather remote logs
DEBUG Collecting info from 10.0.142.100
DEBUG lost connection
DEBUG ssh: connect to host 10.0.142.100 port 22: Connection timed out
DEBUG Collecting info from 10.0.148.46
DEBUG lost connection
DEBUG ssh: connect to host 10.0.148.46 port 22: Connection timed out
DEBUG Collecting info from 10.0.161.41
DEBUG ssh: connect to host 10.0.161.41 port 22: Connection timed out
DEBUG Log bundle written to /var/home/core/log-bundle-20191205055220.tar.gz
INFO Bootstrap gather logs captured here "log-bundle-20191205055220.tar.gz"
FATAL Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded

In bootstrap/journals/bootkube.log I see:

Dec 05 13:23:08 ip-10-0-7-4 podman[4475]: 2019-12-05 13:23:08.815993324 +0000 UTC m=+0.087935642 container create 91f6b2723510810af2925dfdce39fb960dcf5331a9cca55f55457047f03fc1f5 (i
Dec 05 13:23:09 ip-10-0-7-4 podman[4475]: 2019-12-05 13:23:09.157144216 +0000 UTC m=+0.429086728 container init 91f6b2723510810af2925dfdce39fb960dcf5331a9cca55f55457047f03fc1f5 (ima
Dec 05 13:23:09 ip-10-0-7-4 podman[4475]: 2019-12-05 13:23:09.17167471 +0000 UTC m=+0.443617216 container start 91f6b2723510810af2925dfdce39fb960dcf5331a9cca55f55457047f03fc1f5 (ima
Dec 05 13:23:09 ip-10-0-7-4 podman[4475]: 2019-12-05 13:23:09.171965599 +0000 UTC m=+0.443908297 container attach 91f6b2723510810af2925dfdce39fb960dcf5331a9cca55f55457047f03fc1f5 (i
Dec 05 13:23:14 ip-10-0-7-4 bootkube.sh[2265]: {"level":"warn","ts":"2019-12-05T13:23:14.181Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","
Dec 05 13:23:14 ip-10-0-7-4 bootkube.sh[2265]: {"level":"warn","ts":"2019-12-05T13:23:14.181Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","
Dec 05 13:23:14 ip-10-0-7-4 bootkube.sh[2265]: {"level":"warn","ts":"2019-12-05T13:23:14.181Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","
Dec 05 13:23:14 ip-10-0-7-4 bootkube.sh[2265]: https://etcd-2.craig-test-6.openshift.portworx.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Dec 05 13:23:14 ip-10-0-7-4 bootkube.sh[2265]: https://etcd-0.craig-test-6.openshift.portworx.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Dec 05 13:23:14 ip-10-0-7-4 bootkube.sh[2265]: https://etcd-1.craig-test-6.openshift.portworx.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Dec 05 13:23:14 ip-10-0-7-4 bootkube.sh[2265]: Error: unhealthy cluster
Dec 05 13:23:14 ip-10-0-7-4 podman[4475]: 2019-12-05 13:23:14.207387438 +0000 UTC m=+5.479330088 container died 91f6b2723510810af2925dfdce39fb960dcf5331a9cca55f55457047f03fc1f5 (ima
Dec 05 13:23:14 ip-10-0-7-4 podman[4475]: 2019-12-05 13:23:14.261652284 +0000 UTC m=+5.533595135 container remove 91f6b2723510810af2925dfdce39fb960dcf5331a9cca55f55457047f03fc1f5 (i
Dec 05 13:23:14 ip-10-0-7-4 bootkube.sh[2265]: etcdctl failed. Retrying in 5 seconds...

Can you recommend a value of OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE= to try this out with?
