Conversation

@openshift-bot (Contributor)

Please merge as soon as https://errata.devel.redhat.com/advisory/56069 is shipped live OR if a Cincinnati-first release is approved.

This should provide adequate soak time for candidate channel PR #294

@openshift-bot openshift-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 24, 2020
@wking (Member) commented Jun 29, 2020

Reviewing the initial update CI runs, the failures were:

  • 4.3.22 -> 4.4.10 died in setup with error simulating policy: Throttling: Rate exceeded. Fixed in 4.5+; not backported. Unlikely to affect folks that aren't launching zounds of clusters in the same AWS account each day.
  • 4.3.23 -> 4.4.10 died in setup with failed to initialize the cluster: Cluster operator image-registry has not yet reported success. That seems more serious, but would be a 4.3.23 issue, not a 4.4.10 issue.
  • 4.4.5 -> 4.4.10 hung on Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.4.10: 14% complete. Needs more investigation.
  • 4.4.6 -> 4.4.10 successfully updated, but with (Kubernetes) API was unreachable during disruption for at least 17m55s of 50m1s (36%). Sounds like rhbz#1850057, although that's technically about 4.4 -> 4.5. rhbz#1852056 is POST for master/4.6 to help in this space.

There were also some 4.4.10 -> 4.5 RC CI failures, but nothing that looked like it was worth pulling candidate edges. And no alarming * -> 4.4.10 Telemetry/Insights either.

Digging into the hung 4.4.5 -> 4.4.10 failure:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1275681394676207616/artifacts/launch/pods/openshift-cluster-version_cluster-version-operator-7847b46597-wks4s_cluster-version-operator.log | grep 'Running sync.*in state\|Result of work'
I0624 08:00:07.749594       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release@sha256:0d1ffca302ae55d32574b38438c148d33c2a8a05c8daf97eeb13e9ab948174f7 (force=true) on generation 2 in state Updating at attempt 0
I0624 08:05:52.801425       1 task_graph.go:596] Result of work: [Cluster operator openshift-apiserver is reporting a failure: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable]
I0624 08:06:15.588329       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release@sha256:0d1ffca302ae55d32574b38438c148d33c2a8a05c8daf97eeb13e9ab948174f7 (force=true) on generation 2 in state Updating at attempt 1
I0624 08:12:00.639878       1 task_graph.go:596] Result of work: [Cluster operator kube-apiserver is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 10:
I0624 08:12:46.300005       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release@sha256:0d1ffca302ae55d32574b38438c148d33c2a8a05c8daf97eeb13e9ab948174f7 (force=true) on generation 2 in state Updating at attempt 2
I0624 08:18:31.351840       1 task_graph.go:596] Result of work: [Cluster operator kube-apiserver is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 10:
I0624 08:19:58.985733       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release@sha256:0d1ffca302ae55d32574b38438c148d33c2a8a05c8daf97eeb13e9ab948174f7 (force=true) on generation 2 in state Updating at attempt 3
I0624 08:25:44.037572       1 task_graph.go:596] Result of work: [Cluster operator kube-apiserver is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 10:
I0624 08:28:51.164473       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release@sha256:0d1ffca302ae55d32574b38438c148d33c2a8a05c8daf97eeb13e9ab948174f7 (force=true) on generation 2 in state Updating at attempt 4
I0624 08:34:36.216272       1 task_graph.go:596] Result of work: [Cluster operator kube-apiserver is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 10:
I0624 08:37:37.075758       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release@sha256:0d1ffca302ae55d32574b38438c148d33c2a8a05c8daf97eeb13e9ab948174f7 (force=true) on generation 2 in state Updating at attempt 5

Drilling into the ClusterOperator:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1275681394676207616/artifacts/launch/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "kube-apiserver").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2020-06-24T08:02:17Z Degraded=True NodeInstaller_InstallerPodFailed: NodeInstallerDegraded: 1 nodes are failing on revision 10:
NodeInstallerDegraded: pods "installer-10-ip-10-0-145-36.us-west-1.compute.internal" not found
2020-06-24T07:51:55Z Progressing=True NodeInstaller: NodeInstallerProgressing: 1 nodes are at revision 9; 2 nodes are at revision 10
2020-06-24T07:10:33Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 9; 2 nodes are at revision 10
2020-06-24T07:08:15Z Upgradeable=True AsExpected: -

Indeed, that node-installer pod seems to be missing:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1275681394676207616/artifacts/launch/pods.json | jq -r '.items[] | select(.metadata | ((.name | contains("installer")) and .namespace == "openshift-kube-apiserver")) | .status.phase + " " + .metadata.name'
Succeeded installer-10-ip-10-0-131-183.us-west-1.compute.internal
Succeeded installer-10-ip-10-0-143-5.us-west-1.compute.internal

Node itself seems fine:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1275681394676207616/artifacts/launch/nodes.json | jq -r '.items[] | select(.metadata.name == "ip-10-0-145-36.us-west-1.compute.internal").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2020-06-24T08:04:39Z MemoryPressure=False KubeletHasSufficientMemory: kubelet has sufficient memory available
2020-06-24T08:04:39Z DiskPressure=False KubeletHasNoDiskPressure: kubelet has no disk pressure
2020-06-24T08:04:39Z PIDPressure=False KubeletHasSufficientPID: kubelet has sufficient PID available
2020-06-24T08:04:39Z Ready=True KubeletReady: kubelet is posting ready status

Checking events:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1275681394676207616/artifacts/launch/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-kube-apiserver" and (.involvedObject.name == "installer-10-ip-10-0-145-36.us-west-1.compute.internal")) | .firstTimestamp + " " + (.count | tostring) + " " + .message'
2020-06-24T07:59:47Z 1 Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fd5dcdb63afcda01fc0cb2eba9a642f20948258c536e516ec1f6bd46256cf33f" already present on machine
2020-06-24T07:59:48Z 1 Created container installer
2020-06-24T07:59:48Z 1 Started container installer
2020-06-24T07:59:54Z 1 Successfully installed revision 10

So everything seems fine with the pod itself, but then something deleted it, and the kube-apiserver operator has been freaking out and refusing further progress ever since. I'll see if I can find an existing bug around this...
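
A follow-up sketch (not run here, and it may well come back empty if the deletion was never surfaced as an event) would be to grep the same events.json for the pod name anywhere in the message, in case some other component logged a delete/kill:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1275681394676207616/artifacts/launch/events.json | jq -r '.items[] | select((.message // "") | contains("installer-10-ip-10-0-145-36")) | (.firstTimestamp // "-") + " " + (.reason // "-") + " " + .message'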

@wking (Member) commented Jun 29, 2020

Bug for NodeInstaller_InstallerPodFailed: NodeInstallerDegraded: ...: NodeInstallerDegraded: pods ... not found is rhbz#1817419, which seems to be very rare in CI.
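
To put a rough number on "very rare", the CI search service can grep recent job logs for that signature. A sketch only; the endpoint, query parameters, and response shape (assumed here to be an object keyed by job URL) may differ:

$ curl -sG 'https://search.ci.openshift.org/search' --data-urlencode 'search=NodeInstallerDegraded: pods "installer-.*" not found' --data-urlencode 'maxAge=336h' | jq length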

/lgtm
/retest

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: openshift-bot, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 29, 2020
@wking (Member) commented Jun 29, 2020

HTTP Error 429: Too Many Requests. The fact that we got into label-pushing is enough for me.

/override ci/prow/publish

@openshift-ci-robot

@wking: Overrode contexts on behalf of wking: ci/prow/publish

In response to this:

HTTP Error 429: Too Many Requests. The fact that we got into label-pushing is enough for me.

/override ci/prow/publish

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking (Member) commented Jun 29, 2020

/hold cancel

Errata is public.

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 29, 2020
@openshift-merge-robot openshift-merge-robot merged commit acf4f93 into master Jun 29, 2020
@sdodson sdodson deleted the pr-fast-4.4.10 branch October 12, 2020 16:51