@wking (Member) commented Jun 13, 2019

Currently, this test can fail with the not-very-helpful:

fail [k8s.io/kubernetes/test/e2e/upgrades/apps/job.go:58]: Expected
    <bool>: false
to be true

This pull request backports kubernetes/kubernetes@96b04bfeac (kubernetes/kubernetes#77716) to get a more useful error message.

The backport didn't apply cleanly. I think I made the appropriate adjustments, but give it a careful eyeball to make sure ;).

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jun 13, 2019
@openshift-ci-robot

@wking: This pull request references a valid Bugzilla bug.


In response to this:

Bug 1708454: vendor/k8s.io/kubernetes/test/e2e/upgrades/apps/job: List Pods in failure message

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jun 13, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wking
To complete the pull request process, please assign marun
You can assign the PR to them by writing /assign @marun in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@smarterclayton (Contributor)

/hold

Holding only to debug why this might be failing; we would merge to master first before this.

(expect to get data from test runs)

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 13, 2019
…lure message

Currently, this test can fail with the not-very-helpful [1,2]:

  fail [k8s.io/kubernetes/test/e2e/upgrades/apps/job.go:58]: Expected
      <bool>: false
  to be true

Since this test is the only CheckForAllJobPodsRunning consumer, and
has been since CheckForAllJobPodsRunning landed in 116eda0
(Implements an upgrade test for Job, 2017-02-22, #41271), this commit
refactors the function to EnsureJobPodsRunning, dropping the opaque
boolean, and constructing a useful error summarizing the divergence
from the expected parallelism and the status of listed Pods.

Thanks to Maciej Szulik for the fixups [3] :).

Backports kubernetes/kubernetes@96b04bfeac
(test/e2e/upgrades/apps/job: List Pods in failure message, 2019-05-09,
kubernetes/kubernetes#77716).

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1434/build-log.txt
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1708454#c0
[3]: wking/kubernetes#1
---
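For reference, the shape of the backported refactor can be sketched in isolation. This is a simplified, hypothetical stand-in — `Pod`, `ensureJobPodsRunning`, and the phase strings here are illustrative, not the actual Kubernetes API types or the exact upstream code — but it shows how dropping the opaque boolean in favor of an error yields the message format seen in the test logs below:

```go
package main

import (
	"fmt"
	"strings"
)

// Pod is a simplified stand-in for the Kubernetes v1.Pod type.
type Pod struct {
	Name  string
	Phase string
}

// ensureJobPodsRunning mirrors the shape of the backported
// EnsureJobPodsRunning: instead of returning an opaque bool, it
// returns an error summarizing the divergence from the expected
// parallelism and the status of every listed Pod.
func ensureJobPodsRunning(pods []Pod, parallelism int) error {
	running := 0
	var statuses []string
	for _, p := range pods {
		if p.Phase == "Running" {
			running++
		}
		statuses = append(statuses, fmt.Sprintf("%s=%s", p.Name, p.Phase))
	}
	if running == parallelism {
		return nil
	}
	return fmt.Errorf("job has %d of %d expected running pods: %s",
		running, parallelism, strings.Join(statuses, ", "))
}

func main() {
	// No Pods matched at all: note the empty list after the colon.
	fmt.Println(ensureJobPodsRunning(nil, 2))
	// One of two Pods running: each Pod's status is listed.
	fmt.Println(ensureJobPodsRunning([]Pod{
		{Name: "foo-abc", Phase: "Running"},
		{Name: "foo-def", Phase: "Pending"},
	}, 2))
}
```

Either way the caller gets a concrete summary instead of a bare `false`, which is the whole point of the backport.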
@wking wking force-pushed the backport-job-e2e-error-logging branch from d44ff45 to 795c7cb Compare June 13, 2019 16:51
@wking (Member, Author) commented Jun 13, 2019

Fixed the compilation error with d44ff45 -> 795c7cb.

@openshift-ci-robot commented Jun 13, 2019

@wking: The following test failed; say /retest to rerun it:

Test name: ci/prow/e2e-aws-upgrade
Commit: 795c7cb
Rerun command: /test e2e-aws-upgrade

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.



@wking (Member, Author) commented Jun 13, 2019

Hey, first run :).

fail [k8s.io/kubernetes/test/e2e/upgrades/apps/job.go:49]: Expected error:
    <*errors.errorString | 0xc0022b26e0>: {
        s: "job has 0 of 2 expected running pods: ",
    }
    job has 0 of 2 expected running pods: 
not to have occurred

So that's "no Pods", not "Pods waiting to be scheduled and just not running yet". Dunno what to do about that :/
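That reading follows directly from the new message format: the list after the colon is only empty when the Job's selector matched no Pods at all, whereas not-yet-scheduled Pods would still appear with a Pending phase. A hypothetical helper (`hasNoPods` is not part of the test code, just an illustration of the distinction) could be:

```go
package main

import (
	"fmt"
	"strings"
)

// hasNoPods interprets the new failure-message format,
// "job has N of M expected running pods: <list>". An empty list after
// the colon means the Job's selector matched no Pods at all, rather
// than Pods that exist but are not yet Running.
func hasNoPods(msg string) bool {
	i := strings.Index(msg, ": ")
	if i < 0 {
		return false
	}
	return strings.TrimSpace(msg[i+2:]) == ""
}

func main() {
	fmt.Println(hasNoPods("job has 0 of 2 expected running pods: "))                // no Pods exist
	fmt.Println(hasNoPods("job has 0 of 2 expected running pods: foo-abc=Pending")) // Pods exist, not Running
}
```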

@soltysh (Contributor) commented Jun 13, 2019

This failure was caused by problems with the node:

Jun 13 18:15:29.003: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
Jun 13 18:15:29.024: INFO: Condition Ready of node ip-10-0-148-26.ec2.internal is false, but Node is tainted by NodeController with [{node-role.kubernetes.io/master  NoSchedule <nil>} {node.kubernetes.io/unschedulable  NoSchedule 2019-06-13 18:14:29 +0000 UTC} {node.kubernetes.io/unreachable  NoSchedule 2019-06-13 18:15:20 +0000 UTC} {node.kubernetes.io/unreachable  NoExecute 2019-06-13 18:15:26 +0000 UTC}].

Let's keep re-testing it.
/retest

@wking (Member, Author) commented Jun 13, 2019

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/23161/pull-ci-openshift-origin-release-4.1-e2e-aws-upgrade/51/artifacts/e2e-aws-upgrade/nodes.json | jq -r '.items[] | .conditions = ([.status.conditions[] | {key: .type, value: .}] | from_entries) | .conditions.Ready.lastTransitionTime + " " + .conditions.Ready.status + " " + .metadata.name' | sort
2019-06-13T17:35:31Z True ip-10-0-160-149.ec2.internal
2019-06-13T17:40:33Z True ip-10-0-169-80.ec2.internal
2019-06-13T17:41:09Z True ip-10-0-130-244.ec2.internal
2019-06-13T18:13:38Z True ip-10-0-138-20.ec2.internal
2019-06-13T18:15:01Z True ip-10-0-150-197.ec2.internal
2019-06-13T18:15:20Z Unknown ip-10-0-148-26.ec2.internal

Looks like that node was just gone, with no sign of return by that cluster-teardown log collection.

@sjenning (Contributor) commented Jun 13, 2019

I can't find evidence of the pods being created. There are no pods from a foo Job in the events at all, much less that have a FailedScheduling event. Maybe the KCM isn't running or doesn't have a leader?

@wking (Member, Author) commented Jun 13, 2019

Follow-up in openshift/machine-config-operator#855

@deads2k (Contributor) commented Jun 14, 2019

In this case, kubelet ip-10-0-169-80 had two scheduled job pods at 18:12 and there are more kubelet logs going back for a considerable time showing they were available. grep the worker's journals for job-upgrade to see them.

The namespace wasn't removed until 18:11 or so.

@eparis eparis changed the title Bug 1708454: vendor/k8s.io/kubernetes/test/e2e/upgrades/apps/job: List Pods in failure message vendor/k8s.io/kubernetes/test/e2e/upgrades/apps/job: List Pods in failure message Jun 25, 2019
@openshift-ci-robot

@wking: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.


In response to this:

vendor/k8s.io/kubernetes/test/e2e/upgrades/apps/job: List Pods in failure message


@openshift-ci-robot openshift-ci-robot removed the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jun 25, 2019
@eparis (Member) commented Jun 25, 2019

We have a better patch in 4.2 and more jobs, so I'm going to close this.

@eparis eparis closed this Jun 25, 2019