Conversation

@ramr
Contributor

@ramr ramr commented Oct 23, 2018

Fixes how we wait on routers - the cluster-ingress operator uses daemonsets ...
oc wait checks on a status condition, which isn't available for daemonsets, so this does a check on the number of ready instances instead. Alternatively, we could possibly wait on the number of available instances.

@ironcladlou @smarterclayton PTAL Thx

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ramr
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: crawford

If they are not already assigned, you can assign the PR to them by writing /assign @crawford in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 23, 2018
@ironcladlou
Contributor

Here's what I was testing... not attached to any one approach, just throwing it out here in case it's useful:

i=0
if [[ "${ROUTER_DEPLOYMENT}" == "ds/router-default" ]]; then
  while true; do
    [ "$i" -eq "$MAX_RETRIES" ] && echo "timeout waiting for router to be available" && exit 1
    avail="$(oc get -n "${ROUTER_NAMESPACE}" -o go-template='{{ .status.numberAvailable }}' "${ROUTER_DEPLOYMENT}")"
    desired="$(oc get -n "${ROUTER_NAMESPACE}" -o go-template='{{ .status.desiredNumberScheduled }}' "${ROUTER_DEPLOYMENT}")"
    [ "$avail" = "$desired" ] && break
    i=$((i+1))
    echo "error ${ROUTER_NAMESPACE}/${ROUTER_DEPLOYMENT} did not come up"
    sleep 60
  done
else
  until oc wait "${ROUTER_DEPLOYMENT}" -n "${ROUTER_NAMESPACE}" --for condition=available --timeout=10m; do
      i=$((i+1))
      [ "$i" -eq "$MAX_RETRIES" ] && echo "timeout waiting for router to be available" && exit 1
      echo "error ${ROUTER_NAMESPACE}/${ROUTER_DEPLOYMENT} did not come up"
      sleep 60
  done
fi

Contributor

Does this --timeout cause each iteration to wait 10 minutes in addition to the 60 seconds in the loop?!

Contributor Author

That was the original wait code: (oc wait 10 min + 1 min sleep) * 10 retries.
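For reference, the worst case of that original scheme works out as follows (a sketch, assuming the 10-retry loop described above):

```shell
# Worst-case wall-clock time of the original retry scheme:
# each iteration can block for the full `oc wait` timeout (10 min)
# and then sleeps for 1 min, repeated for MAX_RETRIES iterations.
WAIT_TIMEOUT_MIN=10
SLEEP_MIN=1
MAX_RETRIES=10
echo "$(( (WAIT_TIMEOUT_MIN + SLEEP_MIN) * MAX_RETRIES )) minutes worst case"
```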

Contributor

I wonder why? Seems like a good way to waste like 10 minutes during setup... it takes a matter of seconds for the router to deploy

Contributor

Or does this have to account for a large portion of the general cluster setup?

Contributor Author

Well, it would depend on how fast the deployment occurs. oc wait returns immediately (with a failure) if the deployment object doesn't exist, but once the object exists it still needs to wait for it to become available. So rather than loop and retry the oc wait operation, just waiting on it inline makes sense.
And it wouldn't necessarily take 10 minutes (that's the worst-case scenario) - if the object becomes available, it returns before the timeout.

Contributor

We were looping waiting because of a bug in wait.

Member

Here is a sketch of something that may be more robust for looping on blocking calls.
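(The sketch itself isn't captured in this transcript. One plausible shape for such a wrapper - a hedged illustration, with the `retry` function name and its arguments being hypothetical - is a generic retry around any blocking call:)

```shell
#!/bin/sh
# Hypothetical retry wrapper for a blocking call: run the given command,
# and if it fails, retry up to $1 times with a $2-second delay between
# attempts. Returns 0 on the first success, 1 after exhausting retries.
retry() {
  retries="$1"; delay="$2"; shift 2
  i=0
  while ! "$@"; do
    i=$((i+1))
    [ "$i" -ge "$retries" ] && echo "timeout waiting for: $*" >&2 && return 1
    sleep "$delay"
  done
  return 0
}

# Example usage (not run here): retry the blocking oc wait up to
# 10 times, 60 seconds apart, with a short per-attempt timeout.
# retry 10 60 oc wait "${ROUTER_DEPLOYMENT}" -n "${ROUTER_NAMESPACE}" \
#   --for condition=available --timeout=1m
```

Factoring the loop out this way keeps the retry policy (count, delay) in one place instead of duplicating it per call site.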

Contributor

While we're at it, maybe make it "${ROUTER_NAMESPACE}/${ROUTER_DEPLOYMENT}" and add that to the timeout error message too?

@ramr ramr force-pushed the fix-router-ds-wait branch from da95d9c to 48e3629 Compare October 23, 2018 21:21
@smarterclayton
Contributor

/hold

I’d prefer a real condition wait. If we don’t have one, only then can this be considered.

If we need to add things to wait, let’s do that.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 23, 2018
@ramr
Contributor Author

ramr commented Oct 23, 2018

@smarterclayton so oc|kubectl wait just waits for a status condition to match the expected condition type. The daemonset status doesn't have conditions - just:

status:
  currentNumberScheduled: 3
  desiredNumberScheduled: 3
  numberAvailable: 3
  numberMisscheduled: 0
  numberReady: 3
  observedGeneration: 1
  updatedNumberScheduled: 3

so we can't use oc wait on that.

The other alternative I found was that we could wait on a rollout here.
Example: oc rollout status ds/router-default -n openshift-ingress -w

would that work for you?

Here's a test log (commands run via the oc command line):

$ oc delete $(oc get pods -n openshift-ingress -o name) -n openshift-ingress  ;  oc get pods  -n openshift-ingress ; oc rollout status  ds/router-default -n openshift-ingress -w ; oc get all -n openshift-ingress
pod "router-default-gtvg4" deleted
pod "router-default-h69hg" deleted
pod "router-default-szqxp" deleted
NAME                   READY     STATUS              RESTARTS   AGE
router-default-52mrm   0/1       ContainerCreating   0          1s
router-default-6ckqk   0/1       Running             0          11s
router-default-l6f4b   0/1       ContainerCreating   0          7s
Waiting for daemon set "router-default" rollout to finish: 0 of 3 updated pods are available...
Waiting for daemon set "router-default" rollout to finish: 1 of 3 updated pods are available...
Waiting for daemon set "router-default" rollout to finish: 2 of 3 updated pods are available...
daemon set "router-default" successfully rolled out
NAME                       READY     STATUS    RESTARTS   AGE
pod/router-default-52mrm   1/1       Running   0          31s
pod/router-default-6ckqk   1/1       Running   0          41s
pod/router-default-l6f4b   1/1       Running   0          37s

NAME                     TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE
service/router-default   LoadBalancer   10.3.247.22   <pending>     80:30068/TCP   10m

NAME                            DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
daemonset.apps/router-default   3         3         3         3            3           node-role.kubernetes.io/worker=   10m
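The readiness test behind that rollout status can also be expressed directly as a comparison of the status fields shown above. A minimal sketch - where `get_status` is a hypothetical helper standing in for the `oc get -o go-template='{{ .status.FIELD }}' ds/router-default` calls - might look like:

```shell
#!/bin/sh
# Minimal sketch of "daemonset is ready": numberReady must equal
# desiredNumberScheduled, and at least one pod must be desired.
# `get_status FIELD` is a hypothetical stand-in for:
#   oc get -n openshift-ingress -o go-template="{{ .status.FIELD }}" ds/router-default
ds_ready() {
  ready="$(get_status numberReady)"
  desired="$(get_status desiredNumberScheduled)"
  # Guard against empty or non-numeric output (e.g. "<no value>" from
  # go-template when a field is unset) before the numeric comparison.
  case "$ready$desired" in (*[!0-9]*|"") return 1 ;; esac
  [ "$ready" -eq "$desired" ] && [ "$desired" -gt 0 ]
}
```

This is the same check the first branch of the script above performs, with the unset-field case handled explicitly.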

@ironcladlou
Contributor

Status check sgtm. Any flakiness or other concerns there, @smarterclayton? This could be the last thing holding up openshift/installer#467.

@ramr
Contributor Author

ramr commented Oct 23, 2018

#2004 is an alternative approach here.

@smarterclayton
Contributor

smarterclayton commented Oct 23, 2018 via email

@ironcladlou
Contributor

The API supports them, so maybe the question is: why doesn't ours have any?

@ironcladlou
Contributor

I'm not finding the code upstream that actually computes/sets DaemonSet conditions... perhaps the field was added as part of 51594 for future use?

@smarterclayton
Contributor

smarterclayton commented Oct 24, 2018 via email

@ironcladlou
Contributor

Okay, so last questions:

  1. How did you test this, @ramr
  2. How can we test a job template change like this in CI, @smarterclayton?

@ramr
Contributor Author

ramr commented Oct 24, 2018

@ironcladlou re: 1 - the only way I could test this was to start up a new cluster with the installer and then run the new code/commands by hand via a standalone shell script. If there is a better way to test this, please let me know - thanks.

@smarterclayton
Contributor

smarterclayton commented Oct 24, 2018 via email

@ramr
Contributor Author

ramr commented Oct 24, 2018

/hold cancel

@ramr
Contributor Author

ramr commented Oct 24, 2018

@smarterclayton anything else needed here? This is gating a couple of other PRs. Thanks.

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 24, 2018
@smarterclayton
Contributor

I thought there was a rollout status variant?

@smarterclayton
Contributor

Did #2004 actually work?

@smarterclayton
Contributor

I prefer that if it works

@ramr
Contributor Author

ramr commented Oct 24, 2018

Yeah, #2004 works as well. OK, I'll fix it up to address the comments on it.

@ramr
Contributor Author

ramr commented Oct 24, 2018

Closed in favor of #2004.

@ramr ramr closed this Oct 24, 2018
