
Conversation

@2uasimojo
Member

@2uasimojo 2uasimojo commented Sep 1, 2021

With this PR, the clusterpool controller detects "broken" ClusterDeployments as those whose ProvisionStopped condition is True* and deletes them so they can be replaced and no longer consume capacity in the pool.

When adding or deleting clusters to satisfy capacity requirements, we prioritize as follows:

  • If we're under capacity, add new clusters before deleting broken clusters. This is in case we hit `maxConcurrent`: we would prefer to use that quota to add clusters to the pool.
  • If we're over capacity, delete broken clusters before deleting viable (installing or assignable) clusters.

We also add a metric, hive_cluster_deployments_provision_failed_terminal_total, labeled by clusterpool namespace/name (blank if not a pool CD), keeping track of the total number of ClusterDeployments for which we declare provisioning failed for the last time.
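
For reference, a minimal sketch of what such a counter looks like with plain client_golang. The Name, Help, and label are taken from the diff quoted in the review threads below; the registration detail (controller-runtime's registry) is an assumption and may differ from the PR's metrics.go:

```go
import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

var metricProvisionFailedTerminal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "hive_cluster_deployments_provision_failed_terminal_total",
		Help: "Counter incremented when a cluster provision has failed and won't be retried.",
	},
	[]string{"clusterpool_namespacedname"},
)

func init() {
	// Assumption: registered with the controller-runtime metrics registry,
	// as hive controllers typically do; prometheus.MustRegister would also
	// work for a standalone sketch.
	metrics.Registry.MustRegister(metricProvisionFailedTerminal)
}
```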

*In the future, we may expand this definition to include other definitions of "broken".
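
For clarity, a minimal sketch of the "broken" check itself, assuming the hive v1 API types; the helper name and its exact home in the PR are illustrative, not the merged code:

```go
import (
	hivev1 "github.com/openshift/hive/apis/hive/v1"
	corev1 "k8s.io/api/core/v1"
)

// isBroken reports whether a pool ClusterDeployment should be treated as
// "broken": its ProvisionStopped condition is True.
func isBroken(cd *hivev1.ClusterDeployment) bool {
	for _, cond := range cd.Status.Conditions {
		if cond.Type == hivev1.ProvisionStoppedCondition {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}
```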

HIVE-1615

In preparation for adding more logic to two near-identical code paths,
factor them into a common local function. No functional change; refactor
only.

Prep for HIVE-1615
@openshift-ci openshift-ci bot requested review from abutcher and twiest September 1, 2021 19:00
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 1, 2021
Add a metric, `hive_cluster_deployments_provision_failed_terminal_total`,
labeled by clusterpool namespace/name (blank if not a pool CD), keeping
track of the total number of ClusterDeployments for which we declare
provisioning failed for the last time.

Part of HIVE-1615
@codecov

codecov bot commented Sep 1, 2021

Codecov Report

Merging #1524 (f5a92de) into master (b5a401d) will increase coverage by 0.08%.
The diff coverage is 80.00%.


@@            Coverage Diff             @@
##           master    #1524      +/-   ##
==========================================
+ Coverage   41.51%   41.59%   +0.08%     
==========================================
  Files         336      336              
  Lines       30569    30601      +32     
==========================================
+ Hits        12691    12729      +38     
+ Misses      16794    16785       -9     
- Partials     1084     1087       +3     
| Impacted Files | Coverage | Δ |
| --- | --- | --- |
| pkg/controller/clusterdeployment/metrics.go | 86.66% <66.66%> | -13.34% ⬇️ |
| ...g/controller/clusterpool/clusterpool_controller.go | 54.83% <66.66%> | +0.23% ⬆️ |
| .../controller/clusterdeployment/clusterprovisions.go | 59.42% <75.00%> | +2.64% ⬆️ |
| pkg/controller/clusterpool/collections.go | 76.60% <93.75%> | +0.69% ⬆️ |
| ...kg/controller/clusterdeployment/clusterinstalls.go | 69.82% <100.00%> | +0.73% ⬆️ |
| pkg/test/clusterdeployment/clusterdeployment.go | 96.59% <100.00%> | +0.20% ⬆️ |

@2uasimojo
Member Author

/test e2e-pool

With this commit, the clusterpool controller detects "broken"
ClusterDeployments as those whose ProvisionStopped condition is True*
and deletes them so they can be replaced and do not continue to consume
capacity in the pool.

When adding or deleting clusters to satisfy capacity requirements, we
prioritize as follows:
- If we're under capacity, add new clusters before deleting broken
clusters. This is in case we hit `maxConcurrent`: we would prefer to use
that quota to add clusters to the pool.
- If we're over capacity, delete broken clusters before deleting viable
(installing or assignable) clusters.

*In the future, we may expand this definition to include other
definitions of "broken".

HIVE-1615
@2uasimojo 2uasimojo changed the title Refactor: consolidate setting ProvisionStopped=True ClusterPool: Delete broken (ProvisionStopped) clusters Sep 1, 2021
Comment on lines +387 to +392
// If too many, delete some.
case drift > 0:
toDel := minIntVarible(drift, availableCurrent)
if err := r.deleteExcessClusters(cds, toDel, logger); err != nil {
return reconcile.Result{}, err
}
Member Author

Note to reviewers: this block moved down unchanged. Since `drift > 0` and `drift < 0` are disjoint, this didn't affect existing logic; but it was necessary to be able to inject the new case in the right spot logically.
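
For orientation, one plausible shape of the resulting switch, reflecting the priority ordering described in the PR body. The new-case condition and the helper names (addClusters, deleteBrokenClusters) are illustrative, not the literal diff:

```go
switch {
// Under capacity: add new clusters first, so any maxConcurrent quota is
// spent growing the pool rather than on replacements.
case drift < 0:
	if err := r.addClusters(pool, -drift, logger); err != nil {
		return reconcile.Result{}, err
	}
// Broken (ProvisionStopped=True) clusters are deleted next so they can be
// replaced and stop consuming pool capacity.
case len(cds.broken) > 0:
	if err := r.deleteBrokenClusters(cds, logger); err != nil {
		return reconcile.Result{}, err
	}
// Over capacity: only then delete viable (installing or assignable) clusters.
case drift > 0:
	toDel := minIntVarible(drift, availableCurrent)
	if err := r.deleteExcessClusters(cds, toDel, logger); err != nil {
		return reconcile.Result{}, err
	}
}
```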

assert.Equal(t, test.expectedTotalClusters-test.expectedAssignedCDs, actualUnassignedCDs, "unexpected number of unassigned CDs")
assert.Equal(t, test.expectedRunning, actualRunning, "unexpected number of running CDs")
assert.Equal(t, test.expectedTotalClusters-test.expectedRunning, actualHibernating, "unexpected number of assigned CDs")
assert.Equal(t, test.expectedTotalClusters-test.expectedRunning, actualHibernating, "unexpected number of hibernating CDs")
Member Author

Latent copypasta: the message on the `actualHibernating` assertion said "assigned CDs" where it should have said "hibernating CDs".

removeCDsFromSlice(&cds.assignable, cdName)
removeCDsFromSlice(&cds.installing, cdName)
removeCDsFromSlice(&cds.assignable, cdName)
removeCDsFromSlice(&cds.broken, cdName)
Member Author

@2uasimojo 2uasimojo Sep 1, 2021


Note: this was previously a dup of L482, mistake made in 3dc3c3c / #1484.

@2uasimojo
Member Author

/assign @suhanime
/cc @dgoodwin

@2uasimojo
Member Author

/test e2e

Name: "hive_cluster_deployments_provision_failed_terminal_total",
Help: "Counter incremented when a cluster provision has failed and won't be retried.",
},
[]string{"clusterpool_namespacedname"},
Contributor

Wouldn't we want to clear the metric when provision succeeds?

Member Author

The provision will not succeed. We set this when we've failed for the last time.

Contributor

Given how you're logging this metric, it would be deleted only when the hive controller restarts, since it is not attached to anything else. So it should be cleared in the clusterpool controller for the relevant clusterpool.

Member Author

I'm still not following this. I don't see why we should ever clear this metric. I definitely don't see why controller restart should be a significant event. Why should this metric behave any differently from e.g. `hive_cluster_deployments_installed_total`?

updated = false
// Fun extra variable to keep track of whether we should increment metricProvisionFailedTerminal
// later; because we only want to do that if (we change that status and) the status update succeeds.
provisionFailedTerminal := false
Contributor

What is the drawback of not having this flag as a gating condition to report the metric?

Member Author

Having trouble parsing your sentence (English is hard). Are you suggesting we don't need this variable? How else would we know when to increment the counter?

Contributor

So I had a thorough look at the code, and I think you don't actually need this boolean. Simply observe the metric where you're setting this boolean to true.

Member Author

I don't want to observe the metric unless we actually changed the ProvisionStopped condition to True due to a final ProvisionFailed. If I don't use this boolean then, for example, we'll increment it any time we change any of the ClusterInstall* conditions. We don't want that.
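
In other words, the gating looks roughly like this. This is a simplified fragment with hypothetical helper names (provisionFinallyFailed, setProvisionStoppedTrue), not the exact controller code:

```go
updated := false
provisionFailedTerminal := false

// Only the transition we care about sets the extra flag: ProvisionStopped
// flipping to True because the provision failed and won't be retried.
if provisionFinallyFailed {
	if setProvisionStoppedTrue(cd) {
		updated = true
		provisionFailedTerminal = true
	}
}

// Other ClusterInstall* condition changes may also set `updated`, but they
// must not bump the counter.

if updated {
	if err := r.Status().Update(ctx, cd); err != nil {
		return reconcile.Result{}, err
	}
	// Increment only after the status update succeeds.
	if provisionFailedTerminal {
		incProvisionFailedTerminal(cd)
	}
}
```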

}
// If we declared the provision terminally failed, bump our metric
if provisionFailedTerminal {
incProvisionFailedTerminal(cd)
Contributor

On a related note, you don't actually need to observe it in the metrics reconcile; you can fire the metric observation right here (make it a global variable and set it here).

Member Author

I made `incProvisionFailedTerminal` a func because it was more than one line of code and I needed to call it from multiple places.
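
Roughly, the helper looks like this. A sketch that assumes hive's ClusterPoolRef field and the counter defined earlier; the merged code may differ in detail:

```go
// incProvisionFailedTerminal bumps the terminal-provision-failure counter,
// labeled with the owning clusterpool's namespace/name, or "" for a CD that
// does not belong to a pool.
func incProvisionFailedTerminal(cd *hivev1.ClusterDeployment) {
	poolNSName := ""
	if ref := cd.Spec.ClusterPoolRef; ref != nil {
		poolNSName = ref.Namespace + "/" + ref.PoolName
	}
	metricProvisionFailedTerminal.WithLabelValues(poolNSName).Inc()
}
```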

Name: "hive_cluster_deployments_provision_failed_terminal_total",
Help: "Counter incremented when a cluster provision has failed and won't be retried.",
},
[]string{"clusterpool_namespacedname"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given how you're logging this metric, it would be deleted only when hive controller restarts - since it is not attached to anything else. So it should be cleared in the clusterpool controller for the relevant clusterpool

)
)

func incProvisionFailedTerminal(cd *hivev1.ClusterDeployment) {
Contributor

If the counter is incremented for every CD, then it should also be decremented when the relevant ClusterDeployment is deleted.
I think you should make this metric a gauge and report, every time, how many provisions have failed.
Note that Prometheus doesn't fire a change in the metric until it is actually changed, so calling metrics.Set multiple times with the same value is fine.

Member Author

> If the counter is incremented for every CD, then it should also be decremented when the relevant ClusterDeployment is deleted.

We're using a counter because we want to track how many times pool CDs failed to provision, ever. We'll use prometheus to report things like the rate at which these failures are occurring.

> I think you should make this metric a gauge

We talked about metrics when doing stale CD replacement, which is very similar to this. We discussed using a (separate) gauge metric to indicate the number of stale CDs in a pool at any given time, but ended up deciding YAGNI.

Contributor

Ah! This clarifies things - I had misjudged what the metric was used for.

@suhanime
Contributor

suhanime commented Sep 3, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 3, 2021
@openshift-ci
Contributor

openshift-ci bot commented Sep 3, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2uasimojo, suhanime

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@2uasimojo
Member Author

/retest

@openshift-merge-robot openshift-merge-robot merged commit 162432a into openshift:master Sep 3, 2021
@2uasimojo 2uasimojo deleted the HIVE-1615 branch September 3, 2021 22:27