RFE: ClusterPool hot spares #1434
Conversation
/cc
Everything looks good to me @2uasimojo - you've done a particularly good job of enumerating the edge cases, reasoning, and alternatives! We're approaching a point where we might even keep a hot spare or two to speed up CI!
Force-pushed from 68b0c01 to 30e9fb2
LGTM
dgoodwin left a comment:
This looks pretty good to me
> With this feature, we will create with `powerState` unset.
> Then in a separate code path:
> - Sort unassigned (installing + ready) CDs by `creationTimestamp`
> - Update the oldest `runningCount` CDs with `powerState=Running`
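For illustration, here is a minimal, self-contained Go sketch of the quoted two-step logic. The type and field names are hypothetical stand-ins, not Hive's actual API:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// clusterDeployment is a hypothetical stand-in for an unassigned
// (installing or ready) ClusterDeployment in a pool.
type clusterDeployment struct {
	Name              string
	CreationTimestamp time.Time
	PowerState        string // "" (unset at creation), "Running", or "Hibernating"
}

// applyRunningCount marks the oldest runningCount CDs as Running,
// mirroring the "separate code path" described in the quoted doc lines.
func applyRunningCount(cds []clusterDeployment, runningCount int) {
	// Sort unassigned CDs oldest-first by creation timestamp (FIFO).
	sort.Slice(cds, func(i, j int) bool {
		return cds[i].CreationTimestamp.Before(cds[j].CreationTimestamp)
	})
	for i := range cds {
		if i < runningCount {
			cds[i].PowerState = "Running"
		}
	}
}

func main() {
	now := time.Now()
	cds := []clusterDeployment{
		{Name: "cd-new", CreationTimestamp: now},
		{Name: "cd-old", CreationTimestamp: now.Add(-2 * time.Hour)},
		{Name: "cd-mid", CreationTimestamp: now.Add(-1 * time.Hour)},
	}
	applyRunningCount(cds, 2)
	for _, cd := range cds {
		fmt.Println(cd.Name, cd.PowerState) // cd-old and cd-mid become Running
	}
}
```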
Will this conflict with the stale cluster deletion, which is also based on oldest?
Yes (which wasn't a thing when I wrote this :) )
Today we don't take staleness into account when assigning: we'll assign the oldest installed cluster even if it's stale; assignment then takes that cluster out of consideration for deletion due to staleness.
If we don't treat stale clusters specially:
- Kick the oldest cluster, which happens to be stale, to running
- Delete the oldest stale cluster, which is ^ that one
- Replace that cluster. While it's installing...
- Kick the now-oldest cluster (originally the second-oldest), which also happens to be stale, to running
We won't delete another stale cluster until the replacement is done installing. As long as a claim comes in before then, we'll assign the now-running stale cluster that was the second-oldest when we started. If not, we'll repeat the above sequence. It's not really thrashy since it's only happening at worst on a ~40m interval.
Is this something you feel we need to solve?
I think the important thing to note here is that we're preserving all of these rules separately (because trying to coordinate them would be insanely complicated):
- Claiming gets the oldest ready cluster -- FIFO
- runningCount hits the oldest ready clusters -- again for FIFO purposes, so a given ⬆️ claim is most likely to get a running cluster
- Delete the oldest stale cluster first -- again for FIFO purposes (see the sketch below)
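A rough Go sketch of how all three rules key off the same oldest-first ordering (again with hypothetical stand-in types, not the actual clusterpool controller code):

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// cd is a hypothetical stand-in for a pool ClusterDeployment.
type cd struct {
	Name    string
	Created time.Time
	Ready   bool // installed and available for claims
	Stale   bool // no longer matches the pool's current spec
}

// oldestFirst returns a copy of cds sorted by creation time, oldest first.
func oldestFirst(cds []cd) []cd {
	out := append([]cd(nil), cds...)
	sort.Slice(out, func(i, j int) bool { return out[i].Created.Before(out[j].Created) })
	return out
}

// pick returns the names of up to n clusters, oldest first, matching keep.
func pick(cds []cd, n int, keep func(cd) bool) []string {
	var names []string
	for _, c := range oldestFirst(cds) {
		if keep(c) && len(names) < n {
			names = append(names, c.Name)
		}
	}
	return names
}

func main() {
	now := time.Now()
	pool := []cd{
		{Name: "a", Created: now.Add(-3 * time.Hour), Ready: true, Stale: true},
		{Name: "b", Created: now.Add(-2 * time.Hour), Ready: true},
		{Name: "c", Created: now.Add(-1 * time.Hour), Ready: false},
	}
	// Rule 1: a claim gets the oldest ready cluster.
	fmt.Println("claim:", pick(pool, 1, func(c cd) bool { return c.Ready }))
	// Rule 2: runningCount keeps the oldest ready clusters Running.
	fmt.Println("running:", pick(pool, 2, func(c cd) bool { return c.Ready }))
	// Rule 3: stale deletion removes the oldest stale cluster first.
	fmt.Println("delete:", pick(pool, 1, func(c cd) bool { return c.Stale }))
}
```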
Add enhancement doc proposing a new field to keep Running clusters in a ClusterPool for faster claiming. HIVE-1576
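In spec terms, the proposed knob would sit alongside the pool's Size. A sketch of the rough shape only; the field placement, type, and defaults here are illustrative, not the enhancement's final API:

```go
// Package sketch illustrates (not verbatim) where the proposed field
// would live on the ClusterPool spec.
package sketch

type clusterPoolSpecSketch struct {
	// Size is the default number of unclaimed clusters the pool maintains.
	Size int32 `json:"size"`
	// RunningCount is how many of those unclaimed clusters to keep Running
	// ("hot spares"); the remainder stay hibernated until claimed.
	RunningCount int32 `json:"runningCount,omitempty"`
}
```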
Force-pushed from 30e9fb2 to 9eb7e10
/assign @joelddiaz
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2uasimojo, joelddiaz

The full list of commands accepted by this bot can be found here. The pull request process is described here.
In openshift#1434, we added the paradigm of keeping some number of ClusterPool clusters in Running state, so they don't have to suffer Resuming time when claimed. However, that work was narrowly focused on conforming to the ClusterPool.Spec.RunningCount.

This commit expands the concept to try to make sure we're always running enough clusters to satisfy claims. In particular, when we are servicing a number of claims in excess of the pool's capacity, we create that excess number of clusters... but we previously didn't account for that number when calculating how many should be Running. Now we do.

To explain via a corner case: if your pool has a zero Size and you create a claim, we will create a cluster and immediately assign it. But because RunningCount is zero, prior to this commit we would create the cluster with `PowerState=Hibernating`, and then kick it over to `Running` when we assigned the claim. Now we'll create it with `PowerState=Running` in the first place.

HIVE-1651
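A back-of-the-envelope Go sketch of that accounting (a hypothetical helper, not the controller's actual code): the number of clusters to keep Running is the spec'd RunningCount plus however many pending claims exceed what the pool can already satisfy.

```go
package main

import "fmt"

// desiredRunning illustrates the accounting described above: keep
// runningCount clusters Running, plus one for each pending claim
// beyond what the pool's existing clusters can satisfy.
func desiredRunning(runningCount, pendingClaims, availableClusters int) int {
	excess := pendingClaims - availableClusters
	if excess < 0 {
		excess = 0
	}
	return runningCount + excess
}

func main() {
	// Corner case from the commit message: Size=0, runningCount=0, one claim.
	// The cluster created to satisfy that claim should be Running from the start.
	fmt.Println(desiredRunning(0, 1, 0)) // 1
}
```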
Add enhancement doc proposing ways to keep Active clusters in a ClusterPool for faster claiming.
HIVE-1576