Fix how we wait on router becoming available #2004
Conversation
I'm not clear on how `-w` plays with `MAX_RETRIES` and the `sleep 60` at the tail of the loop. But a max of 10 infinitely-blocking calls seems like something that's going to hang for too long sometimes ;). Maybe something like:

```bash
i=0
MAX_RETRIES=10
TARGET="$(date +%s)"
TARGET="$((TARGET + 60))"
until oc --request-timeout=55s rollout status "${ROUTER_DEPLOYMENT}" -n "${ROUTER_NAMESPACE}" -w || [ $i -eq $MAX_RETRIES ]; do
  i=$((i+1))
  ...
  REMAINING="$((TARGET - NOW))"
  sleep "${REMAINING}"
  TARGET="$((TARGET + 60))"
done
```
lol - yeah, that's true, but I figured it's going to fail anyway beyond this point if we don't get back the rollout info (assuming there is a timeout above this in the caller) but ... fair point. I created this PR as an alternative - I'm not certain it's needed, as @smarterclayton was OK with the other alternative in #2001 - putting this one on hold.
> ... but figured it's going to fail anyway beyond this point if we don't get back the rollout info ...
Then we can fail after the loop if `[[ $i -eq $MAX_RETRIES ]]` (or move the `MAX_RETRIES` comparison inside the loop and `exit 1` if it turns true)? Or just drop `MAX_RETRIES` and set the sleep to 1 second to recover from network hiccups and rely on whatever wrapping timeout to reap the container (assuming that exists)?
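For concreteness, an untested sketch of the "move the comparison inside the loop" variant, reusing the names from the snippet above:

```bash
# Sketch only: retry the watch with a short cool-off, and give up with a
# clear error once the retry budget is spent.
i=0
MAX_RETRIES=10
until oc --request-timeout=55s rollout status "${ROUTER_DEPLOYMENT}" -n "${ROUTER_NAMESPACE}" -w; do
  i=$((i+1))
  if [ "$i" -ge "$MAX_RETRIES" ]; then
    echo "timeout waiting for ${ROUTER_NAMESPACE}/${ROUTER_DEPLOYMENT}" >&2
    exit 1
  fi
  sleep 1  # 1-second cool-off, so brief network hiccups are retried quickly
done
```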
I think having a few retries might be good to do anyway to handle any other connectivity/spurious issues. I'll fix this up as per your original comments with a timeout. Thanks.
/hold

/hold cancel

/retest
@wking PTAL Thx. Made the changes - I put in a 2m wait on the
I think we should drop the `MAX_RETRIES` check here. It's originally from fe24e8b (#1773), but:

```console
$ i=0
$ MAX_RETRIES=3
$ until false || [ $i -eq $MAX_RETRIES ]; do i=$((i+1)); echo "$i"; [ $i -eq $MAX_RETRIES ] && echo "timeout" && break; done
1
2
3
timeout
$ i=0
$ until false; do i=$((i+1)); echo "$i"; [ $i -eq $MAX_RETRIES ] && echo "timeout" && break; done
1
2
3
timeout
```

You never hit the case where the `MAX_RETRIES` condition in this line is true, because you always hit the `exit 1` timeout inside the loop body first.
will drop it - it is redundant.
60 seconds seems like a long break for a network or spurious issue. On the other hand, if we only `sleep 1`, we'd `exit 1` here if the network was out for only 10 seconds. So I'd still rather have logic that said "this is the wall time we're willing to wait, we don't care how many network connections you need to cover that window". If you don't like my earlier proposal, how about:

```bash
TARGET="$(date -d '10 minutes' +%s)"
NOW="$(date +%s)"
while [[ "${NOW}" -lt "${TARGET}" ]]; do
  REMAINING="$((TARGET - NOW))"
  if oc --request-timeout="${REMAINING}s" rollout status "${ROUTER_DEPLOYMENT}" -n "${ROUTER_NAMESPACE}" -w; then
    break
  fi
  NOW="$(date +%s)"
done
[[ "${NOW}" -ge "${TARGET}" ]] && echo "timeout waiting for ${ROUTER_NAMESPACE}/${ROUTER_DEPLOYMENT} to be available" && exit 1
```

or:

```bash
TARGET="$(date -d '10 minutes' +%s)"
NOW="$(date +%s)"
REMAINING="$((TARGET - NOW))"
until oc --request-timeout="${REMAINING}s" rollout status "${ROUTER_DEPLOYMENT}" -n "${ROUTER_NAMESPACE}" -w; do
  NOW="$(date +%s)"
  REMAINING="$((TARGET - NOW))"
  if [[ "${REMAINING}" -le 0 ]]; then
    echo "timeout waiting for ${ROUTER_NAMESPACE}/${ROUTER_DEPLOYMENT} to be available"
    exit 1
  fi
done
```

And all of these are untested sketches; if we hit on phrasing you like, let me know and I can test it more thoroughly ;).
Nah, the original one is fine. I'll use that one (and test it) and update this PR. Thanks.
@wking made the requested change. PTAL, thx.
So ideally this would be:

```bash
oc --request-timeout=900s rollout status "${ROUTER_DEPLOYMENT}" -n "${ROUTER_NAMESPACE}" -w
```

But connections break down, so you want to retry. And if the connection breaks after 5 minutes, you want the next call to have a `--request-timeout` covering only the remaining 10 minutes. The constant-timeout, max-retries approach has to balance efficiency and responsiveness (favoring long timeouts and therefore low max-retries) against resiliency (favoring high max-retries and therefore low request timeouts). I want both ;). But really, this seems like a generic problem that really wants an:

```bash
oc --request-timeout=900s --reconnect rollout status "${ROUTER_DEPLOYMENT}" -n "${ROUTER_NAMESPACE}" -w
```

without all the wrapping shell business. I'm happy pushing for that, and landing any
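Until something like that exists, the wrapping shell business could be factored into a helper. A minimal untested sketch, assuming GNU `date`; `watch_with_deadline` is a hypothetical name, not anything from the PR:

```bash
# Hypothetical helper: one wall-clock deadline, reconnect as often as needed.
# $1 = resource (e.g. "deploy/router-default"), $2 = namespace.
watch_with_deadline() {
  local deadline remaining
  deadline="$(( $(date +%s) + 900 ))"  # 15-minute total budget, per the 900s above
  while :; do
    remaining="$(( deadline - $(date +%s) ))"
    if [ "${remaining}" -le 0 ]; then
      echo "timeout waiting for $2/$1" >&2
      return 1
    fi
    # Each attempt only asks for the time that is actually left.
    oc --request-timeout="${remaining}s" rollout status "$1" -n "$2" -w && return 0
    sleep 1  # brief cool-off before reconnecting the watch
  done
}

watch_with_deadline "${ROUTER_DEPLOYMENT}" "${ROUTER_NAMESPACE}"
```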
Ah, I hear you're off today, so I'll just: /lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ramr, wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
@ramr: Updated the

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Just to correct the record: I mistook today for Friday, so Ram should in fact be around... I'm sure he'll be pleased to see this merged when he gets in 😁 |
Today I saw [1]:

```
error: watch closed before Until timeout
error openshift-ingress/deploy/router-default did not come up
sleep: invalid option -- '4'
Try 'sleep --help' for more information.
```

I suspect that the 'rollout status' request took long enough that the fresh 'date' call generated a time larger than wait_expiry_time. This commit rerolls the logic last touched by 7991fd3 (Fix how we wait on router rollout as the new cluster ingress operator, 2018-10-23, openshift#2004). Now we pick a total wait time (10 minutes), regardless of how many times we need to reconnect the watcher. With this commit, each watcher will try to wait for the full remaining period. So the first watcher tries to wait for 10 minutes. And if the first times out after 2 minutes, the second watcher will try to wait for 8 minutes. And the cool-off sleep is no longer parameterized, which removes the chance of flaking like I saw today.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/688/pull-ci-openshift-installer-master-e2e-aws/1971/build-log.txt
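A minimal illustration of the suspected failure mode, assuming the old logic slept for `wait_expiry_time` minus the current time (the variable name comes from the commit message; the snippet is a reconstruction, not the actual installer code):

```bash
# If the watch outlives the deadline, the computed remainder goes negative
# and GNU sleep parses "-4" as a flag:
wait_expiry_time="$(( $(date +%s) - 4 ))"      # deadline passed 4 seconds ago
sleep "$(( wait_expiry_time - $(date +%s) ))"  # expands to: sleep -4
# -> sleep: invalid option -- '4'
```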
as the cluster ingress operator uses daemonsets (not supported by `oc wait`; see the sketch below). This could be an alternative to PR #2001.
@ironcladlou @smarterclayton PTAL thx
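A hedged illustration of that `oc wait` limitation: `oc wait` keys on status conditions, and daemonsets don't publish an available-style condition, while `oc rollout status` tracks daemonset rollouts directly. The resource names below are illustrative, borrowed from elsewhere in this thread:

```bash
# Deployments expose an Available condition, so this works:
oc wait --for=condition=available deployment/router-default -n openshift-ingress

# Daemonsets publish no such condition, so 'oc wait' has nothing to key on,
# but 'rollout status' understands daemonset rollouts:
oc rollout status daemonset/router-default -n openshift-ingress -w
```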