Conversation

@openshift-bot (Contributor)

Please merge as soon as https://errata.devel.redhat.com/advisory/56069 is shipped live OR if a Cincinnati-first release is approved.

This should provide adequate soak time for candidate channel PR #294

@openshift-bot openshift-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 24, 2020
@wking (Member) commented Jun 29, 2020

Reviewing the initial update CI runs, the failures were:

  • 4.3.22 -> 4.4.10 died in setup with error simulating policy: Throttling: Rate exceeded. Fixed in 4.5+; not backported. Unlikely to affect folks that aren't launching zounds of clusters in the same AWS account each day.
  • 4.3.23 -> 4.4.10 died in setup with failed to initialize the cluster: Cluster operator image-registry has not yet reported success. That seems more serious, but would be a 4.3.23 issue, not a 4.4.10 issue.
  • 4.4.5 -> 4.4.10 hung on Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.4.10: 14% complete. Needs more investigation.
  • 4.4.6 -> 4.4.10 successfully updated, but with (Kubernetes) API was unreachable during disruption for at least 17m55s of 50m1s (36%). Sounds like rhbz#1850057, although that's technically about 4.4 -> 4.5. rhbz#1852056 is POST for master/4.6 to help in this space.

There were also some 4.4.10 -> 4.5 RC CI failures, but nothing that looked like it was worth pulling candidate edges. And no alarming * -> 4.4.10 Telemetry/Insights either.

Digging into the hung 4.4.5 -> 4.4.10 failure:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1275681394676207616/artifacts/launch/pods/openshift-cluster-version_cluster-version-operator-7847b46597-wks4s_cluster-version-operator.log | grep 'Running sync.*in state\|Result of work'
I0624 08:00:07.749594       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release@sha256:0d1ffca302ae55d32574b38438c148d33c2a8a05c8daf97eeb13e9ab948174f7 (force=true) on generation 2 in state Updating at attempt 0
I0624 08:05:52.801425       1 task_graph.go:596] Result of work: [Cluster operator openshift-apiserver is reporting a failure: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable]
I0624 08:06:15.588329       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release@sha256:0d1ffca302ae55d32574b38438c148d33c2a8a05c8daf97eeb13e9ab948174f7 (force=true) on generation 2 in state Updating at attempt 1
I0624 08:12:00.639878       1 task_graph.go:596] Result of work: [Cluster operator kube-apiserver is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 10:
I0624 08:12:46.300005       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release@sha256:0d1ffca302ae55d32574b38438c148d33c2a8a05c8daf97eeb13e9ab948174f7 (force=true) on generation 2 in state Updating at attempt 2
I0624 08:18:31.351840       1 task_graph.go:596] Result of work: [Cluster operator kube-apiserver is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 10:
I0624 08:19:58.985733       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release@sha256:0d1ffca302ae55d32574b38438c148d33c2a8a05c8daf97eeb13e9ab948174f7 (force=true) on generation 2 in state Updating at attempt 3
I0624 08:25:44.037572       1 task_graph.go:596] Result of work: [Cluster operator kube-apiserver is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 10:
I0624 08:28:51.164473       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release@sha256:0d1ffca302ae55d32574b38438c148d33c2a8a05c8daf97eeb13e9ab948174f7 (force=true) on generation 2 in state Updating at attempt 4
I0624 08:34:36.216272       1 task_graph.go:596] Result of work: [Cluster operator kube-apiserver is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 10:
I0624 08:37:37.075758       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release@sha256:0d1ffca302ae55d32574b38438c148d33c2a8a05c8daf97eeb13e9ab948174f7 (force=true) on generation 2 in state Updating at attempt 5

Drilling into the ClusterOperator:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1275681394676207616/artifacts/launch/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "kube-apiserver").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2020-06-24T08:02:17Z Degraded=True NodeInstaller_InstallerPodFailed: NodeInstallerDegraded: 1 nodes are failing on revision 10:
NodeInstallerDegraded: pods "installer-10-ip-10-0-145-36.us-west-1.compute.internal" not found
2020-06-24T07:51:55Z Progressing=True NodeInstaller: NodeInstallerProgressing: 1 nodes are at revision 9; 2 nodes are at revision 10
2020-06-24T07:10:33Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 9; 2 nodes are at revision 10
2020-06-24T07:08:15Z Upgradeable=True AsExpected: -

Indeed, that node-installer pod seems to be missing:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1275681394676207616/artifacts/launch/pods.json | jq -r '.items[] | select(.metadata | ((.name | contains("installer")) and .namespace == "openshift-kube-apiserver")) | .status.phase + " " + .metadata.name'
Succeeded installer-10-ip-10-0-131-183.us-west-1.compute.internal
Succeeded installer-10-ip-10-0-143-5.us-west-1.compute.internal

Node itself seems fine:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1275681394676207616/artifacts/launch/nodes.json | jq -r '.items[] | select(.metadata.name == "ip-10-0-145-36.us-west-1.compute.internal").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2020-06-24T08:04:39Z MemoryPressure=False KubeletHasSufficientMemory: kubelet has sufficient memory available
2020-06-24T08:04:39Z DiskPressure=False KubeletHasNoDiskPressure: kubelet has no disk pressure
2020-06-24T08:04:39Z PIDPressure=False KubeletHasSufficientPID: kubelet has sufficient PID available
2020-06-24T08:04:39Z Ready=True KubeletReady: kubelet is posting ready status

Checking events:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1275681394676207616/artifacts/launch/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-kube-apiserver" and (.involvedObject.name == "installer-10-ip-10-0-145-36.us-west-1.compute.internal")) | .firstTimestamp + " " + (.count | tostring) + " " + .message'
2020-06-24T07:59:47Z 1 Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fd5dcdb63afcda01fc0cb2eba9a642f20948258c536e516ec1f6bd46256cf33f" already present on machine
2020-06-24T07:59:48Z 1 Created container installer
2020-06-24T07:59:48Z 1 Started container installer
2020-06-24T07:59:54Z 1 Successfully installed revision 10

So everything seems fine with the pod itself, but then something deleted it, and the kube-apiserver operator has been freaking out and refusing further progress ever since. I'll see if I can find an existing bug around this...
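
A follow-up sketch (not run here, and it may well come back empty if the deletion was never surfaced as an event) would be to grep the same events.json for the pod name anywhere in the message, in case some other component logged a delete/kill:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1275681394676207616/artifacts/launch/events.json | jq -r '.items[] | select((.message // "") | contains("installer-10-ip-10-0-145-36")) | (.firstTimestamp // "-") + " " + (.reason // "-") + " " + .message'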

@wking (Member) commented Jun 29, 2020

Bug for NodeInstaller_InstallerPodFailed: NodeInstallerDegraded: ...: NodeInstallerDegraded: pods ... not found is rhbz#1817419, which seems to be very rare in CI.
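
To put a rough number on "very rare", the CI search service can grep recent job logs for that signature. A sketch only; the endpoint, query parameters, and response shape (assumed here to be an object keyed by job URL) may differ:

$ curl -sG 'https://search.ci.openshift.org/search' --data-urlencode 'search=NodeInstallerDegraded: pods "installer-.*" not found' --data-urlencode 'maxAge=336h' | jq length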

/lgtm
/retest

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: openshift-bot, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 29, 2020
@wking (Member) commented Jun 29, 2020

HTTP Error 429: Too Many Requests. The fact that we got into label-pushing is enough for me.

/override ci/prow/publish

@openshift-ci-robot

@wking: Overrode contexts on behalf of wking: ci/prow/publish

In response to this:

HTTP Error 429: Too Many Requests. The fact that we got into label-pushing is enough for me.

/override ci/prow/publish

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking (Member) commented Jun 29, 2020

/hold cancel

Errata is public.

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 29, 2020
@openshift-merge-robot openshift-merge-robot merged commit acf4f93 into master Jun 29, 2020
@sdodson sdodson deleted the pr-fast-4.4.10 branch October 12, 2020 16:51