Fix how we wait on routers - cluster ingress operator uses daemonsets #2001
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: ramr. If they are not already assigned, you can assign the PR to them by writing The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files: Approvers can indicate their approval by writing |
|
Here's what I was testing... not attached to any one approach, just throwing it out here in case it's useful:
i=0  # retry counter (assumed not already initialized earlier in the script)
if [[ "${ROUTER_DEPLOYMENT}" == "ds/router-default" ]]; then
  # oc wait has no condition to check on the daemonset, so compare status counts instead.
  while true; do
    [ "$i" -eq "$MAX_RETRIES" ] && echo "timeout waiting for router to be available" && exit 1
    avail="$(oc get -n "${ROUTER_NAMESPACE}" -o go-template='{{ .status.numberAvailable }}' "${ROUTER_DEPLOYMENT}")"
    desired="$(oc get -n "${ROUTER_NAMESPACE}" -o go-template='{{ .status.desiredNumberScheduled }}' "${ROUTER_DEPLOYMENT}")"
    # Quoted so that an empty or non-numeric value fails the test instead of breaking its syntax.
    [ "$avail" -eq "$desired" ] && break
    i=$((i+1))
    echo "error ${ROUTER_NAMESPACE}/${ROUTER_DEPLOYMENT} did not come up"
    sleep 60
  done
else
  until oc wait "${ROUTER_DEPLOYMENT}" -n "${ROUTER_NAMESPACE}" --for condition=available --timeout=10m || [ "$i" -eq "$MAX_RETRIES" ]; do
    i=$((i+1))
    [ "$i" -eq "$MAX_RETRIES" ] && echo "timeout waiting for router to be available" && exit 1
    echo "error ${ROUTER_NAMESPACE}/${ROUTER_DEPLOYMENT} did not come up"
    sleep 60
  done
fi |
Does this --timeout cause each iteration to wait 10 minutes in addition to the 60 seconds in the loop?!
that was the original wait code, it was (oc wait 10 mins + 1 min sleep) * 10 retries.
I wonder why? Seems like a good way to waste like 10 minutes during setup... it takes a matter of seconds for the router to deploy
Or does this have to account for a large portion of the general cluster setup?
Well it would depend on how fast the deployment occurs ... this will return back (fail) if there is no deployment object but once the deployment object exists, it would still need to wait on it becoming available. So I guess rather than loop and retry the oc wait operation, just waiting on it inline makes sense.
And it wouldn't necessarily be 10 minutes (that's the worst case scenario) - if the object is available it would return back prior to the timeout.
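To put numbers on the timing discussed above, here is a small sketch of the worst-case arithmetic for the original loop, using the figures quoted in this thread (10-minute `oc wait` timeout, 60-second sleep, 10 retries). The variable names are illustrative and not taken from the actual script, and the result is an upper bound: as noted, `oc wait` returns early once the object is available.

```shell
#!/usr/bin/env bash
# Worst-case duration of a retry loop that wraps a blocking call.
# Figures are the ones quoted in the thread; variable names are hypothetical.
WAIT_TIMEOUT_SECS=600   # oc wait --timeout=10m
SLEEP_SECS=60           # sleep 60 between attempts
MAX_ATTEMPTS=10         # number of retries

worst_case_secs=$(( (WAIT_TIMEOUT_SECS + SLEEP_SECS) * MAX_ATTEMPTS ))
echo "worst case: $(( worst_case_secs / 60 )) minutes"
```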
We were looping waiting because of a bug in wait.
Here is a sketch of something that may be more robust for looping on blocking calls.
while we're at it, maybe make it "${ROUTER_NAMESPACE}/${ROUTER_DEPLOYMENT}" and add it to the timeout error too?
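One possible shape for looping on a blocking call with a single overall deadline, which also puts the namespace/name label into the timeout error as suggested above. This is not the sketch referenced in the thread, which is not reproduced here; the helper name `retry_until_deadline` and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# retry_until_deadline DEADLINE_SECS SLEEP_SECS LABEL CMD [ARGS...]
# Hypothetical helper: re-runs CMD until it succeeds, or gives up once
# DEADLINE_SECS of wall-clock time have elapsed since the first attempt.
retry_until_deadline() {
  local deadline_secs="$1" sleep_secs="$2" label="$3"
  shift 3
  local start="$SECONDS"   # bash built-in: seconds since the shell started
  until "$@"; do
    if (( SECONDS - start >= deadline_secs )); then
      echo "timeout waiting for ${label} to be available" >&2
      return 1
    fi
    echo "${label} is not available yet, retrying" >&2
    sleep "$sleep_secs"
  done
}

# Example, reusing the variables from the snippet earlier in the thread:
# retry_until_deadline 600 10 "${ROUTER_NAMESPACE}/${ROUTER_DEPLOYMENT}" \
#   oc wait "${ROUTER_DEPLOYMENT}" -n "${ROUTER_NAMESPACE}" \
#     --for condition=available --timeout=1m
```

Bounding the total time rather than the attempt count keeps the worst case predictable even when each attempt blocks for a variable amount of time.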
da95d9c to
48e3629
|
/hold I’d prefer a real condition wait. If we don’t have one, only then can this be considered. If we need to add things to wait, let’s do that. |
|
@smarterclayton so we can't use The other alternative I found was that we could wait on a rollout here. Would that work for you? Here's a test log (commands run via the oc command line): |
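The test log mentioned above is not reproduced here. As a rough illustration of what waiting on a rollout could look like, the sketch below wraps `oc rollout status` in a hard time cap. It assumes GNU coreutils `timeout` is installed; the helper name `wait_for_rollout` and the `OC` override variable are hypothetical and not part of this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: block until a rollout finishes or a hard time cap is hit.
# Assumes GNU coreutils `timeout`; OC is an illustrative override for the
# client binary and does not come from the PR.
wait_for_rollout() {
  local resource="$1" namespace="$2" max_secs="${3:-600}"
  timeout "$max_secs" "${OC:-oc}" rollout status "$resource" -n "$namespace"
}

# Example, reusing the variables from the snippet earlier in the thread:
# wait_for_rollout "${ROUTER_DEPLOYMENT}" "${ROUTER_NAMESPACE}" 600
```

Because `rollout status` blocks until the rollout completes, no retry counter is needed; the outer `timeout` only guards against a rollout that never finishes.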
|
Status check sgtm. Any flakiness or other concerns there, @smarterclayton? This could be the last thing holding up openshift/installer#467. |
|
#2004 is an alternative approach here. |
|
DaemonSets don’t have conditions? Wtf, who is running this goat rodeo?
@derekwaynecarr @mfojtik this is a pretty big gap.
You can remove the hold but we need to fix this. Hacky bash wait
conditions are the thing we don’t need.
On Oct 23, 2018, at 6:46 PM, Dan Mace wrote:
Status check sgtm. Any flakiness or other concerns there, @smarterclayton? This could be the last thing holding up openshift/installer#467.
|
|
The API supports them so maybe the question is why doesn't ours have any? |
|
I think we just never actually did the work to set them. I'll follow up
with Michal and Tomas and see what we can do. Available and Progressing
are both relevant on Daemonsets.
On Tue, Oct 23, 2018 at 7:50 PM Dan Mace wrote:
Not finding the code upstream which actually computes/sets DaemonSet conditions... perhaps the field was added (kubernetes/kubernetes@dc0167b) as part of 51594 (kubernetes/kubernetes#51594) for future use?
|
|
Okay, so last questions:
|
|
@ironcladlou as re: 1 - so the only way I could test this was to start up a new cluster with the installer and then run this new code/commands via a standalone shell script by hand. If there is a better way to test this, please let me know - thanks. |
|
We don't have a way to test this today. In the future I think we will, but
for now this is as good as it gets
On Wed, Oct 24, 2018 at 2:07 PM Ram Ranganathan wrote:
@ironcladlou as re: 1 - so the only way I could test this was to start up a new cluster with the installer and then ran this new code/commands via a standalone shell script by hand. If there is a better way to test this, please let me know - thanks.
|
|
/hold cancel |
|
@smarterclayton anything else needed here? This is gating a couple of other PRs. Thanks. |
|
I thought there was a rollout status variant? |
|
Did #2004 actually work? |
|
I prefer that if it works |
|
Yeah, #2004 works as well. Ok, I'll fix it up to address the comments on it. |
|
closed in favor of #2004 |
Fixes how we wait on routers - cluster-ingress operator uses daemonsets ...
oc wait checks on a status condition, which isn't available for daemonsets, so this does a check on the number of ready instances. Alternatively, we could possibly wait on the number of available instances. @ironcladlou @smarterclayton PTAL Thx
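A minimal sketch of the "number of ready instances" check described above, comparing `.status.numberReady` against `.status.desiredNumberScheduled` in a single `oc get` call. The helper name `daemonset_ready` and the `OC` override variable are hypothetical, and treating a daemonset with zero desired pods as not ready is an assumption made here, not something stated in the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only when every desired daemonset pod is ready.
# OC is an illustrative override for the client binary.
daemonset_ready() {
  local resource="$1" namespace="$2" counts desired ready
  counts="$("${OC:-oc}" get "$resource" -n "$namespace" \
    -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}')" || return 1
  read -r desired ready <<<"$counts"
  # Reject empty or non-numeric values before doing arithmetic on them.
  [[ "$desired" =~ ^[0-9]+$ && "$ready" =~ ^[0-9]+$ ]] || return 1
  # Assumption: zero desired pods means nothing has been scheduled yet.
  (( desired > 0 && ready == desired ))
}

# Example, reusing the variables from the snippet earlier in the thread:
# daemonset_ready "${ROUTER_DEPLOYMENT}" "${ROUTER_NAMESPACE}" && echo "router is ready"
```

Swapping `numberReady` for `numberAvailable` would give the alternative mentioned in the description.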