@wking (Member) commented Jun 13, 2019

Currently, this test can fail with the not-very-helpful:

fail [k8s.io/kubernetes/test/e2e/upgrades/apps/job.go:58]: Expected
    <bool>: false
to be true

This pull request backports kubernetes/kubernetes@96b04bfeac (kubernetes/kubernetes#77716) to get a more useful error message.

The backport didn't apply cleanly. I think I made the appropriate adjustments, but give it a careful eyeball to make sure ;).

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jun 13, 2019
@openshift-ci-robot

@wking: This pull request references a valid Bugzilla bug.


In response to this:

Bug 1708454: vendor/k8s.io/kubernetes/test/e2e/upgrades/apps/job: List Pods in failure message

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jun 13, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wking
To complete the pull request process, please assign marun
You can assign the PR to them by writing /assign @marun in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@smarterclayton (Contributor)

/hold

Holding only to debug why this might be failing; we would merge to master first before this.

(expect to get data from test runs)

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 13, 2019
…lure message

Currently, this test can fail with the not-very-helpful [1,2]:

  fail [k8s.io/kubernetes/test/e2e/upgrades/apps/job.go:58]: Expected
      <bool>: false
  to be true

Since this test is the only CheckForAllJobPodsRunning consumer, and
has been since CheckForAllJobPodsRunning landed in 116eda0
(Implements an upgrade test for Job, 2017-02-22, #41271), this commit
refactors the function to EnsureJobPodsRunning, dropping the opaque
boolean, and constructing a useful error summarizing the divergence
from the expected parallelism and the status of listed Pods.

Thanks to Maciej Szulik for the fixups [3] :).

Backports kubernetes/kubernetes@96b04bfeac
(test/e2e/upgrades/apps/job: List Pods in failure message, 2019-05-09,
kubernetes/kubernetes#77716).

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1434/build-log.txt
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1708454#c0
[3]: wking/kubernetes#1
---
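For reference, the shape of the backported refactor can be sketched in isolation. This is a simplified, hypothetical stand-in — `Pod`, `ensureJobPodsRunning`, and the phase strings here are illustrative, not the actual Kubernetes API types or the exact upstream code — but it shows how dropping the opaque boolean in favor of an error yields the message format seen in the test logs below:

```go
package main

import (
	"fmt"
	"strings"
)

// Pod is a simplified stand-in for the Kubernetes v1.Pod type.
type Pod struct {
	Name  string
	Phase string
}

// ensureJobPodsRunning mirrors the shape of the backported
// EnsureJobPodsRunning: instead of returning an opaque bool, it
// returns an error summarizing the divergence from the expected
// parallelism and the status of every listed Pod.
func ensureJobPodsRunning(pods []Pod, parallelism int) error {
	running := 0
	var statuses []string
	for _, p := range pods {
		if p.Phase == "Running" {
			running++
		}
		statuses = append(statuses, fmt.Sprintf("%s=%s", p.Name, p.Phase))
	}
	if running == parallelism {
		return nil
	}
	return fmt.Errorf("job has %d of %d expected running pods: %s",
		running, parallelism, strings.Join(statuses, ", "))
}

func main() {
	// No Pods matched at all: note the empty list after the colon.
	fmt.Println(ensureJobPodsRunning(nil, 2))
	// One of two Pods running: each Pod's status is listed.
	fmt.Println(ensureJobPodsRunning([]Pod{
		{Name: "foo-abc", Phase: "Running"},
		{Name: "foo-def", Phase: "Pending"},
	}, 2))
}
```

Either way the caller gets a concrete summary instead of a bare `false`, which is the whole point of the backport.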
@wking wking force-pushed the backport-job-e2e-error-logging branch from d44ff45 to 795c7cb Compare June 13, 2019 16:51
@wking (Member, Author) commented Jun 13, 2019

Fixed the compilation error with d44ff45 -> 795c7cb.

@openshift-ci-robot commented Jun 13, 2019

@wking: The following test failed; say /retest to rerun it:

Test name: ci/prow/e2e-aws-upgrade
Commit: 795c7cb
Rerun command: /test e2e-aws-upgrade

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.



@wking (Member, Author) commented Jun 13, 2019

Hey, first run :).

fail [k8s.io/kubernetes/test/e2e/upgrades/apps/job.go:49]: Expected error:
    <*errors.errorString | 0xc0022b26e0>: {
        s: "job has 0 of 2 expected running pods: ",
    }
    job has 0 of 2 expected running pods: 
not to have occurred

So that's "no Pods", not "Pods waiting to be scheduled and just not running yet". Dunno what to do about that :/
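That reading follows directly from the new message format: the list after the colon is only empty when the Job's selector matched no Pods at all, whereas not-yet-scheduled Pods would still appear with a Pending phase. A hypothetical helper (`hasNoPods` is not part of the test code, just an illustration of the distinction) could be:

```go
package main

import (
	"fmt"
	"strings"
)

// hasNoPods interprets the new failure-message format,
// "job has N of M expected running pods: <list>". An empty list after
// the colon means the Job's selector matched no Pods at all, rather
// than Pods that exist but are not yet Running.
func hasNoPods(msg string) bool {
	i := strings.Index(msg, ": ")
	if i < 0 {
		return false
	}
	return strings.TrimSpace(msg[i+2:]) == ""
}

func main() {
	fmt.Println(hasNoPods("job has 0 of 2 expected running pods: "))                // no Pods exist
	fmt.Println(hasNoPods("job has 0 of 2 expected running pods: foo-abc=Pending")) // Pods exist, not Running
}
```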

@soltysh (Contributor) commented Jun 13, 2019

This failure was caused by problems with the node:

Jun 13 18:15:29.003: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
Jun 13 18:15:29.024: INFO: Condition Ready of node ip-10-0-148-26.ec2.internal is false, but Node is tainted by NodeController with [{node-role.kubernetes.io/master  NoSchedule <nil>} {node.kubernetes.io/unschedulable  NoSchedule 2019-06-13 18:14:29 +0000 UTC} {node.kubernetes.io/unreachable  NoSchedule 2019-06-13 18:15:20 +0000 UTC} {node.kubernetes.io/unreachable  NoExecute 2019-06-13 18:15:26 +0000 UTC}].

Let's keep re-testing it.
/retest

@wking (Member, Author) commented Jun 13, 2019

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/23161/pull-ci-openshift-origin-release-4.1-e2e-aws-upgrade/51/artifacts/e2e-aws-upgrade/nodes.json | jq -r '.items[] | .conditions = ([.status.conditions[] | {key: .type, value: .}] | from_entries) | .conditions.Ready.lastTransitionTime + " " + .conditions.Ready.status + " " + .metadata.name' | sort
2019-06-13T17:35:31Z True ip-10-0-160-149.ec2.internal
2019-06-13T17:40:33Z True ip-10-0-169-80.ec2.internal
2019-06-13T17:41:09Z True ip-10-0-130-244.ec2.internal
2019-06-13T18:13:38Z True ip-10-0-138-20.ec2.internal
2019-06-13T18:15:01Z True ip-10-0-150-197.ec2.internal
2019-06-13T18:15:20Z Unknown ip-10-0-148-26.ec2.internal

Looks like that node was just gone, with no sign of return by that cluster-teardown log collection.

@sjenning (Contributor) commented Jun 13, 2019

I can't find evidence of the pods being created. There are no pods from a foo Job in the events at all, much less that have a FailedScheduling event. Maybe the KCM isn't running or doesn't have a leader?

@wking (Member, Author) commented Jun 13, 2019

Follow-up in openshift/machine-config-operator#855

@deads2k (Contributor) commented Jun 14, 2019

In this case, kubelet ip-10-0-169-80 had two scheduled job pods at 18:12 and there are more kubelet logs going back for a considerable time showing they were available. grep the worker's journals for job-upgrade to see them.

The namespace wasn't removed until 18:11 or so.

@eparis eparis changed the title Bug 1708454: vendor/k8s.io/kubernetes/test/e2e/upgrades/apps/job: List Pods in failure message vendor/k8s.io/kubernetes/test/e2e/upgrades/apps/job: List Pods in failure message Jun 25, 2019
@openshift-ci-robot

@wking: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.


In response to this:

vendor/k8s.io/kubernetes/test/e2e/upgrades/apps/job: List Pods in failure message


@openshift-ci-robot openshift-ci-robot removed the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jun 25, 2019
@eparis (Member) commented Jun 25, 2019

We have a better patch in 4.2 and more jobs, so I'm going to close this.

@eparis eparis closed this Jun 25, 2019