Conversation

@ramr
Contributor

@ramr ramr commented Oct 23, 2018

Fixes how we wait on routers - the cluster-ingress operator uses daemonsets ...
oc wait checks on a status condition, which isn't available for daemonsets, so this does a check on the number of ready instances instead. Alternatively, we could possibly wait on the number of available instances.

@ironcladlou @smarterclayton PTAL Thx

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ramr
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: crawford

If they are not already assigned, you can assign the PR to them by writing /assign @crawford in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 23, 2018
@ironcladlou
Contributor

Here's what I was testing... not attached to any one approach, just throwing it out here in case it's useful:

i=0
if [[ "${ROUTER_DEPLOYMENT}" == "ds/router-default" ]]; then
  while true; do
    [ "$i" -eq "$MAX_RETRIES" ] && echo "timeout waiting for router to be available" && exit 1
    avail="$(oc get -n "${ROUTER_NAMESPACE}" -o go-template='{{ .status.numberAvailable }}' "${ROUTER_DEPLOYMENT}")"
    desired="$(oc get -n "${ROUTER_NAMESPACE}" -o go-template='{{ .status.desiredNumberScheduled }}' "${ROUTER_DEPLOYMENT}")"
    [ "$avail" = "$desired" ] && break
    i=$((i+1))
    echo "error ${ROUTER_NAMESPACE}/${ROUTER_DEPLOYMENT} did not come up"
    sleep 60
  done
else
  until oc wait "${ROUTER_DEPLOYMENT}" -n "${ROUTER_NAMESPACE}" --for condition=available --timeout=10m; do
      i=$((i+1))
      [ "$i" -eq "$MAX_RETRIES" ] && echo "timeout waiting for router to be available" && exit 1
      echo "error ${ROUTER_NAMESPACE}/${ROUTER_DEPLOYMENT} did not come up"
      sleep 60
  done
fi

Contributor

Does this --timeout cause each iteration to wait 10 minutes in addition to the 60 seconds in the loop?!

Contributor Author

That was the original wait code: (oc wait 10 min + 1 min sleep) * 10 retries.
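For reference, the worst case of that original scheme works out as follows (a sketch, assuming the 10-retry loop described above):

```shell
# Worst-case wall-clock time of the original retry scheme:
# each iteration can block for the full `oc wait` timeout (10 min)
# and then sleeps for 1 min, repeated for MAX_RETRIES iterations.
WAIT_TIMEOUT_MIN=10
SLEEP_MIN=1
MAX_RETRIES=10
echo "$(( (WAIT_TIMEOUT_MIN + SLEEP_MIN) * MAX_RETRIES )) minutes worst case"
```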

Contributor

I wonder why? Seems like a good way to waste like 10 minutes during setup... it takes a matter of seconds for the router to deploy

Contributor

Or does this have to account for a large portion of the general cluster setup?

Contributor Author

Well, it would depend on how fast the deployment occurs. oc wait returns immediately (with a failure) if the deployment object doesn't exist, but once the object exists it still needs to wait for it to become available. So rather than loop and retry the oc wait operation, just waiting on it inline makes sense.
And it wouldn't necessarily take 10 minutes (that's the worst-case scenario) - if the object becomes available, it returns before the timeout.

Contributor

We were looping waiting because of a bug in wait.

Member

Here is a sketch of something that may be more robust for looping on blocking calls.
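(The sketch itself isn't captured in this transcript. One plausible shape for such a wrapper - a hedged illustration, with the `retry` function name and its arguments being hypothetical - is a generic retry around any blocking call:)

```shell
#!/bin/sh
# Hypothetical retry wrapper for a blocking call: run the given command,
# and if it fails, retry up to $1 times with a $2-second delay between
# attempts. Returns 0 on the first success, 1 after exhausting retries.
retry() {
  retries="$1"; delay="$2"; shift 2
  i=0
  while ! "$@"; do
    i=$((i+1))
    [ "$i" -ge "$retries" ] && echo "timeout waiting for: $*" >&2 && return 1
    sleep "$delay"
  done
  return 0
}

# Example usage (not run here): retry the blocking oc wait up to
# 10 times, 60 seconds apart, with a short per-attempt timeout.
# retry 10 60 oc wait "${ROUTER_DEPLOYMENT}" -n "${ROUTER_NAMESPACE}" \
#   --for condition=available --timeout=1m
```

Factoring the loop out this way keeps the retry policy (count, delay) in one place instead of duplicating it per call site.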

Contributor

While we're at it, maybe make it "${ROUTER_NAMESPACE}/${ROUTER_DEPLOYMENT}" and add that to the timeout error message too?

@ramr ramr force-pushed the fix-router-ds-wait branch from da95d9c to 48e3629 Compare October 23, 2018 21:21
@smarterclayton
Contributor

/hold

I’d prefer a real condition wait. If we don’t have one, only then can this be considered.

If we need to add things to wait, let’s do that.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 23, 2018
@ramr
Contributor Author

ramr commented Oct 23, 2018

@smarterclayton so oc|kubectl wait just waits for a status condition to match the expected condition type. The daemonset status doesn't have conditions - just:

status:
  currentNumberScheduled: 3
  desiredNumberScheduled: 3
  numberAvailable: 3
  numberMisscheduled: 0
  numberReady: 3
  observedGeneration: 1
  updatedNumberScheduled: 3

so we can't use oc wait on that.

The other alternative I found was that we could wait on a rollout here.
Example: oc rollout status ds/router-default -n openshift-ingress -w

would that work for you?

Here's a test log (commands run via the oc command line):

$ oc delete $(oc get pods -n openshift-ingress -o name) -n openshift-ingress  ;  oc get pods  -n openshift-ingress ; oc rollout status  ds/router-default -n openshift-ingress -w ; oc get all -n openshift-ingress
pod "router-default-gtvg4" deleted
pod "router-default-h69hg" deleted
pod "router-default-szqxp" deleted
NAME                   READY     STATUS              RESTARTS   AGE
router-default-52mrm   0/1       ContainerCreating   0          1s
router-default-6ckqk   0/1       Running             0          11s
router-default-l6f4b   0/1       ContainerCreating   0          7s
Waiting for daemon set "router-default" rollout to finish: 0 of 3 updated pods are available...
Waiting for daemon set "router-default" rollout to finish: 1 of 3 updated pods are available...
Waiting for daemon set "router-default" rollout to finish: 2 of 3 updated pods are available...
daemon set "router-default" successfully rolled out
NAME                       READY     STATUS    RESTARTS   AGE
pod/router-default-52mrm   1/1       Running   0          31s
pod/router-default-6ckqk   1/1       Running   0          41s
pod/router-default-l6f4b   1/1       Running   0          37s

NAME                     TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE
service/router-default   LoadBalancer   10.3.247.22   <pending>     80:30068/TCP   10m

NAME                            DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
daemonset.apps/router-default   3         3         3         3            3           node-role.kubernetes.io/worker=   10m
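The readiness test behind that rollout status can also be expressed directly as a comparison of the status fields shown above. A minimal sketch - where `get_status` is a hypothetical helper standing in for the `oc get -o go-template='{{ .status.FIELD }}' ds/router-default` calls - might look like:

```shell
#!/bin/sh
# Minimal sketch of "daemonset is ready": numberReady must equal
# desiredNumberScheduled, and at least one pod must be desired.
# `get_status FIELD` is a hypothetical stand-in for:
#   oc get -n openshift-ingress -o go-template="{{ .status.FIELD }}" ds/router-default
ds_ready() {
  ready="$(get_status numberReady)"
  desired="$(get_status desiredNumberScheduled)"
  # Guard against empty or non-numeric output (e.g. "<no value>" from
  # go-template when a field is unset) before the numeric comparison.
  case "$ready$desired" in (*[!0-9]*|"") return 1 ;; esac
  [ "$ready" -eq "$desired" ] && [ "$desired" -gt 0 ]
}
```

This is the same check the first branch of the script above performs, with the unset-field case handled explicitly.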

@ironcladlou
Contributor

Status check sgtm. Any flakiness or other concerns there, @smarterclayton? This could be the last thing holding up openshift/installer#467.

@ramr
Contributor Author

ramr commented Oct 23, 2018

#2004 is an alternative approach here.

@smarterclayton
Contributor

smarterclayton commented Oct 23, 2018 via email

@ironcladlou
Contributor

The API supports them, so maybe the question is: why doesn't ours have any?

@ironcladlou
Contributor

I'm not finding the code upstream that actually computes/sets DaemonSet conditions... perhaps the field was added as part of 51594 for future use?

@smarterclayton
Contributor

smarterclayton commented Oct 24, 2018 via email

@ironcladlou
Contributor

Okay, so last questions:

  1. How did you test this, @ramr
  2. How can we test a job template change like this in CI, @smarterclayton?

@ramr
Contributor Author

ramr commented Oct 24, 2018

@ironcladlou re: 1 - the only way I could test this was to start up a new cluster with the installer and then run the new code/commands by hand via a standalone shell script. If there is a better way to test this, please let me know - thanks.

@smarterclayton
Contributor

smarterclayton commented Oct 24, 2018 via email

@ramr
Contributor Author

ramr commented Oct 24, 2018

/hold cancel

@ramr
Contributor Author

ramr commented Oct 24, 2018

@smarterclayton anything else needed here? This is gating a couple of other PRs. Thanks.

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 24, 2018
@smarterclayton
Contributor

I thought there was a rollout status variant?

@smarterclayton
Contributor

Did #2004 actually work?

@smarterclayton
Contributor

I prefer that if it works

@ramr
Contributor Author

ramr commented Oct 24, 2018

Yeah, #2004 works as well. OK, I'll fix it up to address the comments on it.

@ramr
Contributor Author

ramr commented Oct 24, 2018

Closed in favor of #2004.

@ramr ramr closed this Oct 24, 2018
