ClusterPool: Delete broken (ProvisionStopped) clusters #1524
Conversation
In preparation for adding more logic to two near-identical code paths, factor them into a common local function. No functional change; refactor only. Prep for HIVE-1615
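For readers unfamiliar with the pattern, here is a minimal self-contained sketch (not the hive code; `reconcileConditions`, `setCond`, and the map representation are purely illustrative) of factoring two near-identical code paths into a common local function:

```go
package example

// Illustrative sketch only: two blocks that previously duplicated the same
// condition-setting logic now share one closure, so logic added later (for
// HIVE-1615) only has to be written once.
func reconcileConditions(conds map[string]bool) (updated bool) {
	// The closure captures `updated`, so both call sites below share it.
	setCond := func(name string, value bool) {
		if conds[name] != value {
			conds[name] = value
			updated = true
		}
	}

	// Formerly two copy-pasted blocks; now both funnel through setCond.
	setCond("ProvisionFailed", true)
	setCond("ProvisionStopped", true)
	return updated
}
```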
Add a metric, `hive_cluster_deployments_provision_failed_terminal_total`, labeled by clusterpool namespace/name (blank if not a pool CD), keeping track of the total number of ClusterDeployments for which we declare provisioning failed for the last time. Part of HIVE-1615
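For context, a self-contained sketch of how a counter like this is typically declared with client_golang. The metric name, help text, label, and the variable name `metricProvisionFailedTerminal` come from the diff hunks later in this conversation; the package name and the `prometheus.MustRegister` call in `init` are assumptions made so the sketch stands alone (hive wires metrics up through its own registration path).

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// metricProvisionFailedTerminal counts ClusterDeployments whose provision has
// failed for the last time, labeled by the owning clusterpool's
// namespace/name (blank for non-pool CDs).
var metricProvisionFailedTerminal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "hive_cluster_deployments_provision_failed_terminal_total",
		Help: "Counter incremented when a cluster provision has failed and won't be retried.",
	},
	[]string{"clusterpool_namespacedname"},
)

func init() {
	// Registration is assumed here so the sketch is runnable; hive registers
	// its metrics through its own metrics setup.
	prometheus.MustRegister(metricProvisionFailedTerminal)
}
```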
Codecov Report
@@            Coverage Diff             @@
##           master    #1524      +/-   ##
==========================================
+ Coverage   41.51%   41.59%   +0.08%
==========================================
  Files         336      336
  Lines       30569    30601      +32
==========================================
+ Hits        12691    12729      +38
+ Misses      16794    16785       -9
- Partials     1084     1087       +3
/test e2e-pool
With this commit, the clusterpool controller detects "broken" ClusterDeployments as those whose ProvisionStopped condition is True* and deletes them so they can be replaced and do not continue to consume capacity in the pool. When adding or deleting clusters to satisfy capacity requirements, we prioritize as follows:

- If we're under capacity, add new clusters before deleting broken clusters. This is in case we hit `maxConcurrent`: we would prefer to use that quota to add clusters to the pool.
- If we're over capacity, delete broken clusters before deleting viable (installing or assignable) clusters.

*In the future, we may expand this definition to include other definitions of "broken".

HIVE-1615
// If too many, delete some.
case drift > 0:
	toDel := minIntVarible(drift, availableCurrent)
	if err := r.deleteExcessClusters(cds, toDel, logger); err != nil {
		return reconcile.Result{}, err
	}
Note to reviewers: this block moved down unchanged. Since drift > 0 and drift < 0 are disjoint, this didn't affect existing logic; but it was necessary to be able to inject the new case in the right spot logically.
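To tie the commit message and the note above together, here is a rough sketch of how the drift-based switch could look once the new broken-cluster case is injected ahead of the relocated `drift > 0` block. `drift`, `availableCurrent`, `toDel`, `minIntVarible`, and `deleteExcessClusters` appear in the hunk; `addClusters`, `deleteBrokenClusters`, and `clp` are hypothetical names used only for illustration.

```go
// Sketch only: drift is assumed to be current pool size minus desired size,
// and cds.broken is assumed to hold CDs whose ProvisionStopped condition is True.
switch {
// Under capacity: add new clusters first, so any maxConcurrent quota is
// spent growing the pool rather than on deletions.
case drift < 0:
	if err := r.addClusters(clp, -drift, logger); err != nil { // addClusters/clp are hypothetical
		return reconcile.Result{}, err
	}
// New case: delete broken clusters so they can be replaced and stop
// consuming pool capacity.
case len(cds.broken) > 0:
	if err := r.deleteBrokenClusters(cds, logger); err != nil { // deleteBrokenClusters is hypothetical
		return reconcile.Result{}, err
	}
// Over capacity: delete some viable clusters (the block that moved down unchanged).
case drift > 0:
	toDel := minIntVarible(drift, availableCurrent)
	if err := r.deleteExcessClusters(cds, toDel, logger); err != nil {
		return reconcile.Result{}, err
	}
}
```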
 assert.Equal(t, test.expectedTotalClusters-test.expectedAssignedCDs, actualUnassignedCDs, "unexpected number of unassigned CDs")
 assert.Equal(t, test.expectedRunning, actualRunning, "unexpected number of running CDs")
-assert.Equal(t, test.expectedTotalClusters-test.expectedRunning, actualHibernating, "unexpected number of assigned CDs")
+assert.Equal(t, test.expectedTotalClusters-test.expectedRunning, actualHibernating, "unexpected number of hibernating CDs")
Latent copypasta
removeCDsFromSlice(&cds.assignable, cdName)
removeCDsFromSlice(&cds.installing, cdName)
removeCDsFromSlice(&cds.assignable, cdName)
removeCDsFromSlice(&cds.broken, cdName)
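The helper itself isn't shown in this conversation; as a hypothetical sketch, a `removeCDsFromSlice`-style function could look like the following (the element type, package name, and import path are assumptions):

```go
package clusterpool

import hivev1 "github.com/openshift/hive/apis/hive/v1"

// removeCDsFromSlice drops the ClusterDeployment with the given name from the
// slice, in place via the pointer. Sketch only; the real hive helper may differ.
func removeCDsFromSlice(cdList *[]*hivev1.ClusterDeployment, cdName string) {
	filtered := (*cdList)[:0]
	for _, cd := range *cdList {
		if cd.Name != cdName {
			filtered = append(filtered, cd)
		}
	}
	*cdList = filtered
}
```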
/test e2e
		Name: "hive_cluster_deployments_provision_failed_terminal_total",
		Help: "Counter incremented when a cluster provision has failed and won't be retried.",
	},
	[]string{"clusterpool_namespacedname"},
Wouldn't we want to clear the metric when provision succeeds?
The provision will not succeed. We set this when we've failed for the last time.
Given how you're logging this metric, it would be deleted only when the hive controller restarts, since it is not attached to anything else. So it should be cleared in the clusterpool controller for the relevant clusterpool.
I'm still not following this. I don't see why we should ever clear this metric. I definitely don't see why controller restart should be a significant event. Why should this metric behave any differently from e.g. hive_cluster_deployments_installed_total?
updated = false
// Fun extra variable to keep track of whether we should increment metricProvisionFailedTerminal
// later; because we only want to do that if (we change that status and) the status update succeeds.
provisionFailedTerminal := false
What is the drawback of not having this flag as a gating condition to report the metric?
Having trouble parsing your sentence (English is hard). Are you suggesting we don't need this variable? How else would we know when to increment the counter?
So I had a thorough look at the code, and I think you don't actually need this boolean. Simply observe the metric where you're setting this boolean to true.
I don't want to observe the metric unless we actually changed the ProvisionStopped condition to True due to a final ProvisionFailed. If I don't use this boolean then, for example, we'll increment it any time we change any of the ClusterInstall* conditions. We don't want that.
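To make the gating concrete, here is a sketch of the pattern being defended (not the exact hive code): the boolean is set only when ProvisionStopped flips to True because of a terminal provision failure, and the counter is bumped only after the status update is persisted. `provisionFailedForLastTime` is a hypothetical stand-in for the real check.

```go
// Sketch only: updated, provisionFailedTerminal, and incProvisionFailedTerminal
// appear in the diff hunks; provisionFailedForLastTime is hypothetical.
provisionFailedTerminal := false

if provisionFailedForLastTime {
	// ...set the ProvisionStopped condition to True on cd.Status...
	updated = true
	provisionFailedTerminal = true
}

if updated {
	if err := r.Status().Update(context.TODO(), cd); err != nil {
		// The status change didn't stick, so the metric is not incremented;
		// we'll come back through this path on the next reconcile.
		return err
	}
	// Only now do we know the terminal state was actually recorded.
	if provisionFailedTerminal {
		incProvisionFailedTerminal(cd)
	}
}
```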
}
// If we declared the provision terminally failed, bump our metric
if provisionFailedTerminal {
	incProvisionFailedTerminal(cd)
On a related note, you don't actually need to observe it in metrics reconcile, you can fire the metric.observe right here (make it a global variable and set it here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made incProvisionFailedTerminal a func because it was more than 1LOC and I needed to call it from multiple places.
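A plausible shape for `incProvisionFailedTerminal`, consistent with the labeling described in the commit message (clusterpool namespace/name, blank for non-pool CDs). This is a sketch; the exact label formatting and field handling in hive may differ.

```go
// Sketch only: label the counter with the owning ClusterPool's namespace/name,
// or leave the label blank when the CD does not belong to a pool.
func incProvisionFailedTerminal(cd *hivev1.ClusterDeployment) {
	poolNSName := ""
	if poolRef := cd.Spec.ClusterPoolRef; poolRef != nil {
		poolNSName = poolRef.Namespace + "/" + poolRef.PoolName
	}
	metricProvisionFailedTerminal.WithLabelValues(poolNSName).Inc()
}
```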
	)
)

func incProvisionFailedTerminal(cd *hivev1.ClusterDeployment) {
If for every CD the counter is incremented, then it should also be decremented if the relevant ClusterDeployment is deleted.
I think you should make this metric a gauge and report, every time, how many provisions have failed.
Note that prometheus doesn't fire a change in a metric until it is actually changed, so firing metrics.Set multiple times with the same value is fine.
> If for every CD the counter is incremented, then it should also be decremented if the relevant ClusterDeployment is deleted.

We're using a counter because we want to track how many times pool CDs failed to provision, ever. We'll use prometheus to report things like the rate at which these failures are occurring.

> I think you should make this metric a gauge

We talked about metrics when doing stale CD replacement, which is very similar to this. We discussed using a (separate) gauge metric to indicate the number of stale CDs in a pool at any given time, but ended up deciding YAGNI.
Ah! This clarifies things - I had misjudged what the metric was used for
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2uasimojo, suhanime

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/retest
With this PR, the clusterpool controller detects "broken" ClusterDeployments as those whose ProvisionStopped condition is True* and deletes them so they can be replaced and do not continue to consume capacity in the pool.

When adding or deleting clusters to satisfy capacity requirements, we prioritize as follows:

- If we're under capacity, add new clusters before deleting broken clusters. This is in case we hit `maxConcurrent`: we would prefer to use that quota to add clusters to the pool.
- If we're over capacity, delete broken clusters before deleting viable (installing or assignable) clusters.

We also add a metric, `hive_cluster_deployments_provision_failed_terminal_total`, labeled by clusterpool namespace/name (blank if not a pool CD), keeping track of the total number of ClusterDeployments for which we declare provisioning failed for the last time.

*In the future, we may expand this definition to include other definitions of "broken".

HIVE-1615