
Conversation

2uasimojo (Member) commented Jun 28, 2021

Add enhancement doc proposing ways to keep Active clusters in a ClusterPool for faster claiming.

HIVE-1576

openshift-ci bot added the `approved` label Jun 28, 2021
petr-muller (Member) commented

/cc

openshift-ci bot requested a review from petr-muller June 28, 2021 21:19
gurnben (Member) commented Jul 30, 2021

Everything looks good to me @2uasimojo - you've done a particularly good job of enumerating the edge cases, reasoning, and alternatives! We're approaching a point where we might even keep a hot spare or two to speed up CI!

2uasimojo force-pushed the HIVE-1576 branch 2 times, most recently from 68b0c01 to 30e9fb2 on August 20, 2021 20:50
abhinavdahiya (Contributor) commented
LGTM

dgoodwin (Contributor) left a comment

This looks pretty good to me

With this feature, we will create with `powerState` unset.
Then in a separate code path:
- Sort unassigned (installing + ready) CDs by `creationTimestamp`
- Update the oldest `runningCount` CDs with `powerState=Running`
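
A minimal sketch of that selection pass, assuming plain illustrative types rather than Hive's actual API (the real controller reads and patches ClusterDeployment objects through the Kubernetes client, which this elides):

```go
// Illustrative sketch only: plain structs stand in for Hive's API types.
package main

import (
	"fmt"
	"sort"
	"time"
)

type clusterDeployment struct {
	name              string
	creationTimestamp time.Time
	powerState        string // "" (unset), "Running", "Hibernating"
}

// setRunningOnOldest sorts the unassigned (installing + ready) CDs by
// creationTimestamp and flips the oldest runningCount of them to
// powerState=Running, leaving the rest untouched.
func setRunningOnOldest(cds []clusterDeployment, runningCount int) {
	sort.Slice(cds, func(i, j int) bool {
		return cds[i].creationTimestamp.Before(cds[j].creationTimestamp)
	})
	for i := 0; i < runningCount && i < len(cds); i++ {
		cds[i].powerState = "Running"
	}
}

func main() {
	now := time.Now()
	cds := []clusterDeployment{
		{name: "cd-new", creationTimestamp: now},
		{name: "cd-old", creationTimestamp: now.Add(-2 * time.Hour)},
		{name: "cd-mid", creationTimestamp: now.Add(-time.Hour)},
	}
	setRunningOnOldest(cds, 2)
	for _, cd := range cds {
		fmt.Printf("%s powerState=%q\n", cd.name, cd.powerState)
	}
	// cd-old and cd-mid end up Running; cd-new stays unset.
}
```
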
Contributor:

Will this conflict with the stale cluster deletion, which is also based on oldest?

2uasimojo (Member, Author):

Yes (which wasn't a thing when I wrote this :) )

Today we don't take staleness into account when assigning: we'll assign the oldest installed cluster even if it's stale; assignment then takes that cluster out of consideration for deletion due to staleness.

If we don't treat stale clusters specially:

  1. Kick the oldest cluster, which happens to be stale, to running.
  2. Delete the oldest stale cluster, which is that same cluster.
  3. Replace that cluster. While it's installing...
  4. Kick the (new) oldest cluster, which again happens to be stale, to running.

We won't delete another stale cluster until the replacement is done installing. As long as a claim comes in before then, we'll assign the now-running stale cluster that was the second-oldest when we started. If not, we'll repeat the above sequence. It's not really thrashy since it's only happening at worst on a ~40m interval.

Is this something you feel we need to solve?

2uasimojo (Member, Author):

I think the important thing to note here is that we're preserving all of these rules separately (because trying to coordinate them would be insanely complicated):

  • Claiming gets the oldest ready cluster -- FIFO
  • runningCount hits the oldest ready clusters -- again for FIFO purposes, so a given claim (per the rule above) is most likely to get a running cluster
  • Delete the oldest stale cluster first -- again for FIFO purposes.
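
To illustrate the shared shape of these three rules (types and names here are illustrative, not Hive's actual API): each rule independently picks oldest-first from its own candidate set, with no coordination between them.

```go
// Sketch: each rule filters its own candidate set, then orders oldest-first.
package main

import (
	"sort"
	"time"
)

type cd struct {
	name    string
	created time.Time
	ready   bool
	stale   bool
}

// oldestFirst filters cds with keep and returns the matches oldest-first.
func oldestFirst(cds []cd, keep func(cd) bool) []cd {
	var out []cd
	for _, c := range cds {
		if keep(c) {
			out = append(out, c)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].created.Before(out[j].created) })
	return out
}

func main() {
	var pool []cd
	// Rule 1: a claim is assigned the oldest ready cluster.
	claimCandidates := oldestFirst(pool, func(c cd) bool { return c.ready })
	// Rule 2: runningCount powers on the oldest ready clusters.
	runCandidates := oldestFirst(pool, func(c cd) bool { return c.ready })
	// Rule 3: stale deletion removes the oldest stale cluster first.
	deleteCandidates := oldestFirst(pool, func(c cd) bool { return c.stale })
	_, _, _ = claimCandidates, runCandidates, deleteCandidates
}
```
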

Add enhancement doc proposing a new field to keep Running clusters in a
ClusterPool for faster claiming.

HIVE-1576
2uasimojo (Member, Author) commented

/assign @joelddiaz

joelddiaz (Contributor) commented

/lgtm

openshift-ci bot added the `lgtm` label Sep 14, 2021
openshift-ci bot commented Sep 14, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2uasimojo, joelddiaz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [2uasimojo,joelddiaz]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-merge-robot merged commit 820ee2d into openshift:master Sep 14, 2021
2uasimojo deleted the HIVE-1576 branch September 14, 2021 19:03
2uasimojo added a commit to 2uasimojo/hive that referenced this pull request Oct 5, 2021
In openshift#1434, we added the paradigm of keeping some number of ClusterPool
clusters in Running state, so they don't have to suffer Resuming time
when claimed. However, that work was narrowly focused on conforming to
the ClusterPool.Spec.RunningCount.

This commit expands the concept to try to make sure we're always running
enough clusters to satisfy claims. In particular, when we are servicing
a number of claims in excess of the pool's capacity, we create that
excess number of clusters... but we previously didn't account for that
number when calculating how many should be Running.

Now we do.

To explain via a corner case: If your pool has a zero Size and you
create a claim, we will create a cluster and immediately assign it. But
because RunningCount is zero, prior to this commit, we would create the
cluster with `PowerState=Hibernating`, and then kick it over to
`Running` when we assigned the claim. Now we'll create it with
`PowerState=Running` in the first place.

HIVE-1651
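
A minimal sketch of the adjusted accounting; `desiredRunning` and its parameters are hypothetical names, not taken from the actual commit:

```go
// Sketch of the adjusted calculation; names are hypothetical.
package main

import "fmt"

// desiredRunning is the number of pool clusters that should have
// powerState=Running: the configured runningCount plus one per pending
// claim beyond the pool's size, since those excess clusters will be
// assigned as soon as they exist and should come up Running rather
// than Hibernating.
func desiredRunning(runningCount, poolSize, pendingClaims int) int {
	excess := pendingClaims - poolSize
	if excess < 0 {
		excess = 0
	}
	return runningCount + excess
}

func main() {
	// The corner case above: Size=0, RunningCount=0, one claim.
	// Before the fix, the new cluster was created Hibernating and only
	// flipped to Running on assignment; now desiredRunning is 1, so it
	// is created with powerState=Running in the first place.
	fmt.Println(desiredRunning(0, 0, 1)) // prints 1
}
```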