
Conversation

@ramr
Contributor

@ramr ramr commented Oct 23, 2018

as the cluster ingress operator uses daemonsets (not supported by `oc wait`).
this could be an alternative to PR #2001

@ironcladlou @smarterclayton PTAL thx

@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Oct 23, 2018
Member

I'm not clear on how -w plays with MAX_RETRIES and the sleep 60 at the tail of the loop. But a max of 10 infinitely-blocking calls seems like something that's going to hang for too long sometimes ;). Maybe something like:

i=0
MAX_RETRIES=10
TARGET="$(date +%s)"
TARGET="$((TARGET + 60))"
until oc --request-timeout=55s rollout status "${ROUTER_DEPLOYMENT}" -n "${ROUTER_NAMESPACE}" -w || [ $i -eq $MAX_RETRIES ]; do
  i=$((i+1))
  ...
  REMAINING="$((TARGET - NOW))"
  sleep "${REMAINING}"
  TARGET="$((TARGET + 60))"
done

Contributor Author

lol - yeah that's true, but I figured it's going to fail anyway beyond this point if we don't get back the rollout info (assuming there is a timeout above this in the caller) ... but fair point. I created this PR as an alternative - I'm not certain it's needed, as @smarterclayton was ok with the other alternative in #2001 - putting this one on hold.

Member

@wking wking Oct 24, 2018

... but figured its going to fail anyway beyond this point if we don't get back the rollout info...

Then we can fail after the loop if [[ $i -eq $MAX_RETRIES ]] (or move the MAX_RETRIES comparison inside the loop and exit 1 if it turns true)? Or just drop MAX_RETRIES, set the sleep to 1 second to recover from network hiccups, and rely on whatever wrapping timeout exists to reap the container (assuming it does)?
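For illustration, here's the first option (fail after the loop) as a runnable sketch, with a stub standing in for the oc call so it can be exercised without a cluster; the stub fails three times and then succeeds, and the stub/variable names are mine, not from the PR:

```shell
#!/bin/sh
# Stub for `oc --request-timeout=55s rollout status ... -w`:
# fails FAIL_COUNT times, then succeeds.
attempt=0
FAIL_COUNT=3
rollout_stub() {
  attempt=$((attempt+1))
  [ "$attempt" -gt "$FAIL_COUNT" ]
}

i=0
MAX_RETRIES=10
until rollout_stub || [ "$i" -eq "$MAX_RETRIES" ]; do
  i=$((i+1))
  sleep 1  # short cool-off; rely on a wrapping timeout to reap the container
done
# Fail after the loop if we exhausted the retries.
if [ "$i" -eq "$MAX_RETRIES" ]; then
  echo "timeout waiting for rollout"
  exit 1
fi
echo "rolled out after $attempt attempts"  # prints: rolled out after 4 attempts
```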

Contributor Author

I think having a few retries might be good to do anyway to handle any other connectivity/spurious issues. I'll fix this up as per your original comments with a timeout. Thanks.

@ramr
Contributor Author

ramr commented Oct 24, 2018

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 24, 2018
@ramr
Contributor Author

ramr commented Oct 24, 2018

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 24, 2018
@ramr
Contributor Author

ramr commented Oct 24, 2018

/retest

@ramr
Contributor Author

ramr commented Oct 24, 2018

@wking PTAL Thx. Made the changes - I put in a 2m wait on the oc rollout command.
Test snippet log: https://gist.github.com/ramr/05fe97fe0e0d4a7929330dc178f33db0

Member

I think we should drop the MAX_RETRIES check here. It's originally from fe24e8b (#1773), but:

$ i=0
$ MAX_RETRIES=3
$ until false || [ $i -eq $MAX_RETRIES ]; do i=$((i+1)); echo "$i"; [ $i -eq $MAX_RETRIES ] && echo "timeout" && break; done
1
2
3
timeout
$ i=0
$ until false; do i=$((i+1)); echo "$i"; [ $i -eq $MAX_RETRIES ] && echo "timeout" && break; done
1
2
3
timeout

You never hit the case where the MAX_RETRIES condition in this line is true, because you always hit the exit 1 timeout inside the loop body first.

CC @abhinavdahiya

Contributor Author

will drop it - it is redundant.

Member

60 seconds seems like a long break for a network or spurious issue. On the other hand, if we only sleep 1, we'd exit 1 here if the network was out for only 10 seconds. So I'd still rather have logic that said "this is the wall time we're willing to wait, we don't care how many network connections you need to cover that window". If you don't like my earlier proposal, how about:

TARGET="$(date -d '10 minutes' +%s)"
NOW="$(date +%s)"
while [[ "${NOW}" -lt "${TARGET}" ]]; do
  REMAINING="$((TARGET - NOW))"
  if oc --request-timeout="${REMAINING}s" rollout status "${ROUTER_DEPLOYMENT}" -n "${ROUTER_NAMESPACE}" -w; then
    break
  fi
  NOW="$(date +%s)"
done
[[ "${NOW}" -ge "${TARGET}" ]] && echo "timeout waiting for ${ROUTER_NAMESPACE}/${ROUTER_DEPLOYMENT} to be available" && exit 1

or:

TARGET="$(date -d '10 minutes' +%s)"
NOW="$(date +%s)"
REMAINING="$((TARGET - NOW))"
until oc --request-timeout="${REMAINING}s" rollout status "${ROUTER_DEPLOYMENT}" -n "${ROUTER_NAMESPACE}" -w; do
  NOW="$(date +%s)"
  REMAINING="$((TARGET - NOW))"
  if [[ "${REMAINING}" -le 0 ]]; then
    echo "timeout waiting for ${ROUTER_NAMESPACE}/${ROUTER_DEPLOYMENT} to be available"
    exit 1
  fi
done

And all of these are untested sketches, if we hit on phrasing you like let me know and I can test it more thoroughly ;).

Contributor Author

Nah, the original one is fine. I'll use that one (and test it) and update this PR. Thanks.

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 25, 2018
uses daemonsets (not supported by `oc wait`).
@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 25, 2018
@ramr
Contributor Author

ramr commented Oct 25, 2018

@wking made the requested changes, PTAL. Thx

  • collapsed a couple of lines from the wait expiry checks you had suggested
  • removed the redundant condition check on the until line
  • increased the wait time to 15 mins (took a while for the cluster to come up, so figured might as well bump that limit up a wee bit).
  • Tested just the snippet via a script albeit with slightly different values for the timeouts to simulate getting into the loop ... https://gist.github.com/ramr/6b585f5b579b7df9ebe034bf896c7832

@wking
Member

wking commented Oct 25, 2018

increased the wait time to 15 mins...

So ideally this would be:

oc --request-timeout=900s rollout status "${ROUTER_DEPLOYMENT}" -n "${ROUTER_NAMESPACE}" -w

But connections break down, so you want to retry. And if the connection breaks after 5 minutes, you want the next call to have --request-timeout=600s. Or maybe you're concerned about flooding requests, so you want to sleep for two seconds and run the next call with --request-timeout=598s. My suggestions here have that sliding timeout.

The constant-timeout, max-retries approach has to balance efficiency and responsiveness (favoring long timeouts and therefore low max-retries) against resiliency (favoring high max-retries and therefore low request timeouts). I want both ;).
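The sliding timeout boils down to computing the remaining budget against a fixed deadline on every reconnect; a minimal sketch of just that arithmetic (names are mine):

```shell
#!/bin/sh
DEADLINE="$(($(date +%s) + 900))"  # 15-minute wall-time budget

# Seconds left in the budget; this is what each retry would pass
# as --request-timeout="${REMAINING}s".
remaining() {
  echo "$((DEADLINE - $(date +%s)))"
}

REMAINING="$(remaining)"
echo "first call gets a ${REMAINING}s timeout"
```

If the first watch breaks after 300 seconds, the next call to `remaining` yields roughly 600, so the second watcher gets a 600s timeout, and so on down to zero.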

But really, this seems like a generic problem that really wants an oc/kubectl-level fix, so you could:

oc --request-timeout=900s --reconnect rollout status "${ROUTER_DEPLOYMENT}" -n "${ROUTER_NAMESPACE}" -w

Without all the wrapping shell business. I'm happy pushing for that, and landing any rollout approach in the meantime. Are you comfortable with 7991fd3?

@wking
Member

wking commented Oct 25, 2018

Ah, I hear you're off today, so I'll just:

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 25, 2018
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ramr, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 25, 2018
@openshift-merge-robot openshift-merge-robot merged commit 083c51a into openshift:master Oct 25, 2018
@openshift-ci-robot
Contributor

@ramr: Updated the prow-job-cluster-launch-installer-e2e configmap using the following files:

  • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml

In response to this:

as the cluster ingress operator uses daemonsets (not supported by oc wait).
this could be an alternative to PR #2001

@ironcladlou @smarterclayton PTAL thx

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ironcladlou
Contributor

Just to correct the record: I mistook today for Friday, so Ram should in fact be around... I'm sure he'll be pleased to see this merged when he gets in 😁

wking added a commit to wking/openshift-release that referenced this pull request Dec 10, 2018
Today I saw [1]:

  error: watch closed before Until timeout
  error openshift-ingress/deploy/router-default did not come up
  sleep: invalid option -- '4'
  Try 'sleep --help' for more information.

I suspect that the 'rollout status' request took long enough that the
fresh 'date' call generated a time larger than wait_expiry_time.  This
commit rerolls the logic last touched by 7991fd3 (Fix how we wait on
router rollout as the new cluster ingress operator, 2018-10-23, openshift#2004).

Now we pick a total wait time (10 minutes), regardless of how many
times we need to reconnect the watcher.  With this commit, each
watcher will try to wait for the full remaining period.  So the first
watcher tries to wait for 10 minutes.  And if the first times out
after 2 minutes, the second watcher will try to wait for 8 minutes.

And the cool-off sleep is no longer parameterized, which removes the
chance of flaking like I saw today.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/688/pull-ci-openshift-installer-master-e2e-aws/1971/build-log.txt
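The rerolled logic described above can be sketched like this, with a stub in place of the `oc ... rollout status ... -w` watcher so the reconnect path is exercised without a cluster (the stub's watch "closes" twice before succeeding; stub and variable names are mine):

```shell
#!/bin/sh
# Stub for the watcher; the real code would run:
#   oc --request-timeout="${REMAINING}s" rollout status "${ROUTER_DEPLOYMENT}" -n "${ROUTER_NAMESPACE}" -w
calls=0
watch_rollout() {
  calls=$((calls+1))
  [ "$calls" -ge 3 ]  # watch "closes" twice, then reports success
}

DEADLINE="$(($(date +%s) + 600))"  # total wait: 10 minutes, however many reconnects
status=0
while true; do
  REMAINING="$((DEADLINE - $(date +%s)))"
  if [ "${REMAINING}" -le 0 ]; then
    echo "error: deployment did not come up"
    status=1
    break
  fi
  if watch_rollout; then  # each watcher gets the full remaining period
    break
  fi
  sleep 2  # fixed cool-off: no computed sleep argument that can go negative
done
```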