@wking
Member

wking commented Dec 11, 2019

We suspect 3c05621 (#2666) made etcd sad, with a jump in leader elections and 'etcdserver: request timed out' errors. Not clear on why yet, but here's trying the older RHCOS to see how it plays.

We suspect 3c05621 (RHCOS: Bump to 43.81.201911192044.0 for CRI-O
bug fix, 2019-11-13, openshift#2666) made etcd sad, with a jump in leader
elections and 'etcdserver: request timed out' [1].  Not clear on why
yet, but here's trying the older RHCOS to see how it plays.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1775878
@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 11, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 11, 2019
@wking
Member Author

wking commented Dec 11, 2019

This effectively reverts #2666 and #2714, which obviously had a lot of good stuff too.

/hold

Because we'll get some CI signal out of this as a PR, and @hexfusion can see if it's helping.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 11, 2019
@openshift-ci-robot
Contributor

@wking: The following tests failed, say /retest to rerun them all:

Test name                      Commit   Details  Rerun command
ci/prow/e2e-openstack          6a18817  link     /test e2e-openstack
ci/prow/e2e-aws                6a18817  link     /test e2e-aws
ci/prow/e2e-aws-scaleup-rhel7  6a18817  link     /test e2e-aws-scaleup-rhel7
ci/prow/e2e-aws-fips           6a18817  link     /test e2e-aws-fips
ci/prow/e2e-libvirt            6a18817  link     /test e2e-libvirt

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@wking
Member Author

wking commented Dec 11, 2019

e2e-aws:

level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-734yyr9k-1d3f3.origin-ci-int-aws.dev.rhcloud.com:6443..."
level=error msg="Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: the server could not find the requested resource (get clusteroperators.config.openshift.io)"
level=info msg="Pulling debug logs from the bootstrap machine"
level=info msg="Bootstrap gather logs captured here \"/tmp/artifacts/installer/log-bundle-20191211225516.tar.gz\""
level=fatal msg="Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded" 

Checking the log bundle's ./bootstrap/journals/bootkube.log:

Dec 11 22:55:11 ip-10-0-10-244 bootkube.sh[2164]: https://etcd-2.ci-op-734yyr9k-1d3f3.origin-ci-int-aws.dev.rhcloud.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Dec 11 22:55:11 ip-10-0-10-244 bootkube.sh[2164]: https://etcd-0.ci-op-734yyr9k-1d3f3.origin-ci-int-aws.dev.rhcloud.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Dec 11 22:55:11 ip-10-0-10-244 bootkube.sh[2164]: https://etcd-1.ci-op-734yyr9k-1d3f3.origin-ci-int-aws.dev.rhcloud.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Dec 11 22:55:11 ip-10-0-10-244 bootkube.sh[2164]: Error: unhealthy cluster

And we have no control-plane logs at all, so the control-plane machines probably failed to launch.
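
For anyone repeating this check on another log bundle, here is a minimal Python sketch that pulls the unhealthy etcd endpoints out of bootkube.log lines like the ones above. The regular expression and the `unhealthy_endpoints` helper are assumptions made for illustration (they are not part of the installer or of bootkube), and the sample lines are copied from the excerpt above.

```python
import re

# Assumed line shape, taken from the bootkube.log excerpt above:
#   "<timestamp> <host> bootkube.sh[<pid>]: <endpoint> is unhealthy: <reason>"
UNHEALTHY = re.compile(r"(https://\S+) is unhealthy: (.*)$")


def unhealthy_endpoints(lines):
    """Return {endpoint: last reported reason} for endpoints reported unhealthy."""
    result = {}
    for line in lines:
        match = UNHEALTHY.search(line.rstrip())
        if match:
            result[match.group(1)] = match.group(2)
    return result


sample = [
    "Dec 11 22:55:11 ip-10-0-10-244 bootkube.sh[2164]: https://etcd-2.ci-op-734yyr9k-1d3f3.origin-ci-int-aws.dev.rhcloud.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded",
    "Dec 11 22:55:11 ip-10-0-10-244 bootkube.sh[2164]: https://etcd-0.ci-op-734yyr9k-1d3f3.origin-ci-int-aws.dev.rhcloud.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded",
    "Dec 11 22:55:11 ip-10-0-10-244 bootkube.sh[2164]: https://etcd-1.ci-op-734yyr9k-1d3f3.origin-ci-int-aws.dev.rhcloud.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded",
    "Dec 11 22:55:11 ip-10-0-10-244 bootkube.sh[2164]: Error: unhealthy cluster",
]

report = unhealthy_endpoints(sample)
for endpoint, reason in sorted(report.items()):
    print(endpoint, "->", reason)
```

On the excerpt above this reports all three etcd endpoints with the same "failed to commit proposal" reason, which matches the "unhealthy cluster" error on the last line.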

@wking
Member Author

wking commented Dec 11, 2019

From ./bootstrap/containers/machine-config-server-a39d445845a7e4816dc03de3290896dcdf1330761c135663ab1667cac409dd40.log:

I1211 22:23:35.161947       1 bootstrap.go:37] Version: machine-config-daemon-4.3.0-201910280117-125-g41e16d84-dirty (41e16d8437b42c30312cc63afb4ab56d9095c642)
I1211 22:23:35.162102       1 api.go:51] Launching server on :22624
I1211 22:23:35.162279       1 api.go:51] Launching server on :22623
I1211 22:23:41.039624       1 api.go:97] Pool master requested by 10.0.129.29:54234
I1211 22:23:41.039656       1 bootstrap_server.go:62] reading file "/etc/mcs/bootstrap/machine-pools/master.yaml"
I1211 22:23:41.041429       1 bootstrap_server.go:82] reading file "/etc/mcs/bootstrap/machine-configs/rendered-master-0577cdace605f963d378dccb143fdc20.yaml"
I1211 22:24:02.476863       1 api.go:97] Pool master requested by 10.0.129.29:14110
I1211 22:24:02.476895       1 bootstrap_server.go:62] reading file "/etc/mcs/bootstrap/machine-pools/master.yaml"
I1211 22:24:02.477341       1 bootstrap_server.go:82] reading file "/etc/mcs/bootstrap/machine-configs/rendered-master-0577cdace605f963d378dccb143fdc20.yaml"
I1211 22:24:31.855168       1 api.go:97] Pool master requested by 10.0.147.117:55305
I1211 22:24:31.855200       1 bootstrap_server.go:62] reading file "/etc/mcs/bootstrap/machine-pools/master.yaml"
I1211 22:24:31.856239       1 bootstrap_server.go:82] reading file "/etc/mcs/bootstrap/machine-configs/rendered-master-0577cdace605f963d378dccb143fdc20.yaml"

so they probably died in pivot or some such.
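
A quick way to see which machines fetched their config is to tally the "Pool ... requested by" lines per client address. The sketch below is illustrative only: the regular expression and the `pool_requests` helper are assumptions based on the log format shown above, not machine-config-server APIs.

```python
import re
from collections import Counter

# Assumed line shape, taken from the machine-config-server log above:
#   "... api.go:97] Pool <pool> requested by <ip>:<port>"
REQUEST = re.compile(r"Pool (\S+) requested by (\d+\.\d+\.\d+\.\d+):\d+")


def pool_requests(lines):
    """Count config requests per (pool, client IP), ignoring the source port."""
    counts = Counter()
    for line in lines:
        match = REQUEST.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


sample = [
    "I1211 22:23:41.039624       1 api.go:97] Pool master requested by 10.0.129.29:54234",
    "I1211 22:24:02.476863       1 api.go:97] Pool master requested by 10.0.129.29:14110",
    "I1211 22:24:31.855168       1 api.go:97] Pool master requested by 10.0.147.117:55305",
]

counts = pool_requests(sample)
for (pool, ip), n in sorted(counts.items()):
    print(f"{pool} {ip} {n}")
```

In this excerpt the master pool was requested twice by 10.0.129.29 and once by 10.0.147.117, so two distinct client addresses show up in the bootstrap machine-config-server log.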

@wking
Member Author

wking commented Dec 16, 2019

RHCOS folks have a solid lead, no need for me to poke things anymore.

/close

@openshift-ci-robot
Contributor

@wking: Closed this PR.


@wking wking deleted the older-rhcos branch December 16, 2019 21:18
@openshift-ci-robot
Contributor

@wking: PR needs rebase.


@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 16, 2019