Skip to content

Conversation

@wking
Copy link
Member

@wking wking commented Dec 6, 2018

Today I saw:

error: watch closed before Until timeout
error openshift-ingress/deploy/router-default did not come up
sleep: invalid option -- '4'
Try 'sleep --help' for more information.

I suspect that the rollout status request took long enough that the fresh date call generated a time larger than wait_expiry_time. This commit rerolls the logic last touched by 7991fd3 (#2004), with an implementation based on one of my suggestions there. And, full disclosure, the buggy implementation from #2004 is also based on one of my suggestions, so don't assume I know what I'm talking about ;).

Now we pick a total wait time (10 minutes), regardless of how many times we need to reconnect the watcher. With this commit, each watcher will try to wait for the full remaining period. So the first watcher tries to wait for 10 minutes. And if the first times out after 2 minutes, the second watcher will try to wait for 8 minutes.

And the cool-off sleep is no longer parameterized, which removes the change of flaking like I saw today.

@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Dec 6, 2018
@stevekuznetsov
Copy link
Contributor

/sigh

Maybe we should have a nice Go binary that handles these sorts of things instead of cobbling together bash? :)

@wking
Copy link
Member Author

wking commented Dec 10, 2018

Saw this again here. @michaelgugino, @mtnbikenc, @sdodson, @vrutkovs, can you take a look? @abhinavdahiya, @crawford, I can also split off the installer change into a separate PR if we don't want to wait for the Andible folks. Thoughts?

Copy link
Contributor

@crawford crawford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 10, 2018
@crawford
Copy link
Contributor

@wking is there any advantage to getting our's in earlier? If not, let's just wait.

@wking
Copy link
Member Author

wking commented Dec 10, 2018

The advantage is that repos blocking CI aren't running the Ansible tests, and fixing this for AWS makes it more likely that the PRs we need to unblock CI can squeak through.

@crawford
Copy link
Contributor

Ah, okay. Maybe split this. I haven't heard back from the Ansible team.

@sdodson
Copy link
Member

sdodson commented Dec 10, 2018

/cc @vrutkovs
Can you ack the change to the ansible related job? The rest has already been merged.

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Dec 10, 2018
Catch up with ac206e7 (ci-operator/templates/openshift: Refactor
router-rollout wait (again), 2018-11-05, openshift#2342) and ff16a01
(ci-operator/templates/openshift: Refactor router-rollout 'oc oc',
2018-12-09, openshift#2343).
@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 10, 2018
@wking
Copy link
Member Author

wking commented Dec 10, 2018

Rebased onto master with a189182->7cc59bc, which also updates to just catch Ansible up now that the installer changes have landed via #2342 and #2343.

@wking
Copy link
Member Author

wking commented Dec 10, 2018

/assign @vrutkovs
/unassign @crawford

@vrutkovs
Copy link
Contributor

/lgtm

Why would we not wait for ingress cluster operator to stop progressing instead?

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 10, 2018
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: crawford, vrutkovs, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 10, 2018
@openshift-merge-robot openshift-merge-robot merged commit 527c69a into openshift:master Dec 10, 2018
@openshift-ci-robot
Copy link
Contributor

@wking: Updated the prow-job-cluster-launch-e2e-40 configmap using the following files:

  • key cluster-launch-e2e-40.yaml using file ci-operator/templates/openshift/openshift-ansible/cluster-launch-e2e-40.yaml
Details

In response to this:

Today I saw:

error: watch closed before Until timeout
error openshift-ingress/deploy/router-default did not come up
sleep: invalid option -- '4'
Try 'sleep --help' for more information.

I suspect that the rollout status request took long enough that the fresh date call generated a time larger than wait_expiry_time. This commit rerolls the logic last touched by 7991fd3 (#2004), with an implementation based on one of my suggestions there. And, full disclosure, the buggy implementation from #2004 is also based on one of my suggestions, so don't assume I know what I'm talking about ;).

Now we pick a total wait time (10 minutes), regardless of how many times we need to reconnect the watcher. With this commit, each watcher will try to wait for the full remaining period. So the first watcher tries to wait for 10 minutes. And if the first times out after 2 minutes, the second watcher will try to wait for 8 minutes.

And the cool-off sleep is no longer parameterized, which removes the change of flaking like I saw today.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking
Copy link
Member Author

wking commented Dec 10, 2018

Why would we not wait for ingress cluster operator to stop progressing instead?

Tradition ;). We're goiint to replace all these hacks with a cluster-version waiter once we have that working.

@wking wking deleted the e2e-sleep-flake branch December 11, 2018 15:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants