ci-operator/templates/openshift: Refactor router-rollout wait (again) #2321

wking · 2018-12-06T07:52:31Z

Today I saw:

error: watch closed before Until timeout
error openshift-ingress/deploy/router-default did not come up
sleep: invalid option -- '4'
Try 'sleep --help' for more information.

I suspect that the rollout status request took long enough that the fresh date call generated a time larger than wait_expiry_time. This commit rerolls the logic last touched by 7991fd3 (#2004), with an implementation based on one of my suggestions there. And, full disclosure, the buggy implementation from #2004 is also based on one of my suggestions, so don't assume I know what I'm talking about ;).
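To make the suspected failure mode concrete, here is a minimal sketch of the arithmetic. It is not the template's actual code: only `wait_expiry_time` comes from the description above, while `remaining_seconds` and the sample timestamps are hypothetical. If the clock has already passed the deadline when the sleep duration is computed, the result is negative, which is consistent with the `sleep: invalid option -- '4'` message.

```shell
#!/bin/bash
# Hypothetical reconstruction of the failure mode. Names other than
# wait_expiry_time are assumptions, not the template's real code.

# remaining_seconds EXPIRY NOW: seconds left before EXPIRY (may be negative).
remaining_seconds() {
  echo $(( $1 - $2 ))
}

wait_expiry_time=1000   # deadline, in seconds since the epoch (sample value)
now=1004                # a fresh date call made after a slow request (sample value)

remaining="$(remaining_seconds "${wait_expiry_time}" "${now}")"
echo "sleep would be called with: ${remaining}"
# Running sleep "${remaining}" here would fail, because sleep reads a
# leading "-" as the start of an option rather than a duration.
```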

Now we pick a total wait time (10 minutes), regardless of how many times we need to reconnect the watcher. With this commit, each watcher will try to wait for the full remaining period. So the first watcher tries to wait for 10 minutes. And if the first times out after 2 minutes, the second watcher will try to wait for 8 minutes.

And the cool-off sleep is no longer parameterized, which removes the chance of flaking like I saw today.
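The approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: `wait_for_rollout` and the watcher command passed to it are hypothetical names, and the one-second cool-off is an arbitrary choice. The point is that one overall deadline is fixed up front, each reconnect gets whatever time is left (10 minutes for the first attempt, 8 minutes for the second if the first gave up after 2), and the cool-off is a constant.

```shell
#!/bin/bash
# Sketch only: wait_for_rollout and its watcher argument are hypothetical.

# wait_for_rollout TOTAL_SECONDS WATCHER...
# Runs WATCHER with the remaining seconds appended as its last argument,
# reconnecting until WATCHER succeeds or the overall deadline passes.
wait_for_rollout() {
  local total="$1"; shift
  local expiry remaining
  expiry="$(( $(date +%s) + total ))"
  while true; do
    remaining="$(( expiry - $(date +%s) ))"
    if (( remaining <= 0 )); then
      echo "error: rollout did not complete within ${total}s" >&2
      return 1
    fi
    # Each attempt gets the full remaining period, not a fixed slice.
    if "$@" "${remaining}"; then
      return 0
    fi
    # Fixed cool-off: never derived from the clock, so it cannot go negative.
    sleep 1
  done
}
```

In the template, the watcher would presumably wrap the `oc rollout status` call. How the remaining time is handed to that command is not shown in this thread, so the sketch leaves it to the caller.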

stevekuznetsov · 2018-12-06T16:31:40Z

/sigh

Maybe we should have a nice Go binary that handles these sorts of things instead of cobbling together bash? :)

wking · 2018-12-10T00:38:38Z

Saw this again here. @michaelgugino, @mtnbikenc, @sdodson, @vrutkovs, can you take a look? @abhinavdahiya, @crawford, I can also split off the installer change into a separate PR if we don't want to wait for the Ansible folks. Thoughts?

crawford

/lgtm

crawford · 2018-12-10T01:38:26Z

@wking is there any advantage to getting ours in earlier? If not, let's just wait.

wking · 2018-12-10T01:43:58Z

The advantage is that repos blocking CI aren't running the Ansible tests, and fixing this for AWS makes it more likely that the PRs we need to unblock CI can squeak through.

crawford · 2018-12-10T01:48:32Z

Ah, okay. Maybe split this. I haven't heard back from the Ansible team.

sdodson · 2018-12-10T02:48:59Z

/cc @vrutkovs
Can you ack the change to the Ansible-related job? The rest has already been merged.

Catch up with ac206e7 (ci-operator/templates/openshift: Refactor router-rollout wait (again), 2018-11-05, openshift#2342) and ff16a01 (ci-operator/templates/openshift: Refactor router-rollout 'oc oc', 2018-12-09, openshift#2343).

wking · 2018-12-10T04:40:42Z

Rebased onto master with a189182->7cc59bc. The commit now only catches Ansible up, since the installer changes have landed via #2342 and #2343.

wking · 2018-12-10T04:41:27Z

/assign @vrutkovs
/unassign @crawford

vrutkovs · 2018-12-10T10:41:03Z

/lgtm

Why would we not wait for ingress cluster operator to stop progressing instead?

openshift-ci-robot · 2018-12-10T10:41:11Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: crawford, vrutkovs, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~ci-operator/templates/openshift/openshift-ansible/OWNERS~~ [vrutkovs]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2018-12-10T10:48:37Z

@wking: Updated the prow-job-cluster-launch-e2e-40 configmap using the following files:

key cluster-launch-e2e-40.yaml using file ci-operator/templates/openshift/openshift-ansible/cluster-launch-e2e-40.yaml

wking · 2018-12-10T10:56:22Z

> Why would we not wait for ingress cluster operator to stop progressing instead?

Tradition ;). We're going to replace all these hacks with a cluster-version waiter once we have that working.

openshift-ci-robot requested review from michaelgugino and mtnbikenc December 6, 2018 07:52

openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Dec 6, 2018

wking force-pushed the e2e-sleep-flake branch from 1771de4 to a189182 Compare December 6, 2018 07:52

wking mentioned this pull request Dec 6, 2018

pkg/asset/installconfig/aws: Ask for AWS access key and secret openshift/installer#798

Merged

openshift-ci-robot assigned crawford Dec 10, 2018

crawford reviewed Dec 10, 2018

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 10, 2018

wking mentioned this pull request Dec 10, 2018

ci-operator/templates/openshift: Refactor router-rollout wait (again) #2342

Merged

sdodson removed request for michaelgugino and mtnbikenc December 10, 2018 02:47

openshift-ci-robot requested a review from vrutkovs December 10, 2018 02:48

wking force-pushed the e2e-sleep-flake branch from a189182 to 4bbde49 Compare December 10, 2018 04:38

openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Dec 10, 2018

openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 10, 2018

wking force-pushed the e2e-sleep-flake branch from 4bbde49 to 7cc59bc Compare December 10, 2018 04:39

openshift-ci-robot assigned vrutkovs and unassigned crawford Dec 10, 2018

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 10, 2018

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 10, 2018

openshift-merge-robot merged commit 527c69a into openshift:master Dec 10, 2018

wking deleted the e2e-sleep-flake branch December 11, 2018 15:17
