
Conversation

@2uasimojo
Member

@2uasimojo 2uasimojo commented Sep 1, 2021

With this PR, the clusterpool controller detects "broken" ClusterDeployments as those whose ProvisionStopped condition is True* and deletes them so they can be replaced and no longer consume capacity in the pool.

When adding or deleting clusters to satisfy capacity requirements, we prioritize as follows:

  • If we're under capacity, add new clusters before deleting broken clusters. This is in case we hit `maxConcurrent`: we would prefer to use that quota to add clusters to the pool.
  • If we're over capacity, delete broken clusters before deleting viable (installing or assignable) clusters.

We also add a metric, hive_cluster_deployments_provision_failed_terminal_total, labeled by clusterpool namespace/name (blank if not a pool CD), keeping track of the total number of ClusterDeployments for which we declare provisioning failed for the last time.
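
For reference, a minimal sketch of what such a counter looks like with plain client_golang. The Name, Help, and label are taken from the diff quoted in the review threads below; the registration detail (controller-runtime's registry) is an assumption and may differ from the PR's metrics.go:

```go
import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

var metricProvisionFailedTerminal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "hive_cluster_deployments_provision_failed_terminal_total",
		Help: "Counter incremented when a cluster provision has failed and won't be retried.",
	},
	[]string{"clusterpool_namespacedname"},
)

func init() {
	// Assumption: registered with the controller-runtime metrics registry,
	// as hive controllers typically do; prometheus.MustRegister would also
	// work for a standalone sketch.
	metrics.Registry.MustRegister(metricProvisionFailedTerminal)
}
```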

*In the future, we may expand this definition to include other definitions of "broken".
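
For clarity, a minimal sketch of the "broken" check itself, assuming the hive v1 API types; the helper name and its exact home in the PR are illustrative, not the merged code:

```go
import (
	hivev1 "github.com/openshift/hive/apis/hive/v1"
	corev1 "k8s.io/api/core/v1"
)

// isBroken reports whether a pool ClusterDeployment should be treated as
// "broken": its ProvisionStopped condition is True.
func isBroken(cd *hivev1.ClusterDeployment) bool {
	for _, cond := range cd.Status.Conditions {
		if cond.Type == hivev1.ProvisionStoppedCondition {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}
```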

HIVE-1615

In preparation for adding more logic to two near-identical code paths,
factor them into a common local function. No functional change; refactor
only.

Prep for HIVE-1615
@openshift-ci openshift-ci bot requested review from abutcher and twiest September 1, 2021 19:00
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 1, 2021
Add a metric, `hive_cluster_deployments_provision_failed_terminal_total`,
labeled by clusterpool namespace/name (blank if not a pool CD), keeping
track of the total number of ClusterDeployments for which we declare
provisioning failed for the last time.

Part of HIVE-1615
@codecov

codecov bot commented Sep 1, 2021

Codecov Report

Merging #1524 (f5a92de) into master (b5a401d) will increase coverage by 0.08%.
The diff coverage is 80.00%.


@@            Coverage Diff             @@
##           master    #1524      +/-   ##
==========================================
+ Coverage   41.51%   41.59%   +0.08%     
==========================================
  Files         336      336              
  Lines       30569    30601      +32     
==========================================
+ Hits        12691    12729      +38     
+ Misses      16794    16785       -9     
- Partials     1084     1087       +3     
| Impacted Files | Coverage | Δ |
| --- | --- | --- |
| pkg/controller/clusterdeployment/metrics.go | 86.66% <66.66%> | -13.34% ⬇️ |
| ...g/controller/clusterpool/clusterpool_controller.go | 54.83% <66.66%> | +0.23% ⬆️ |
| .../controller/clusterdeployment/clusterprovisions.go | 59.42% <75.00%> | +2.64% ⬆️ |
| pkg/controller/clusterpool/collections.go | 76.60% <93.75%> | +0.69% ⬆️ |
| ...kg/controller/clusterdeployment/clusterinstalls.go | 69.82% <100.00%> | +0.73% ⬆️ |
| pkg/test/clusterdeployment/clusterdeployment.go | 96.59% <100.00%> | +0.20% ⬆️ |

@2uasimojo
Member Author

/test e2e-pool

With this commit, the clusterpool controller detects "broken"
ClusterDeployments as those whose ProvisionStopped condition is True*
and deletes them so they can be replaced and do not continue to consume
capacity in the pool.

When adding or deleting clusters to satisfy capacity requirements, we
prioritize as follows:
- If we're under capacity, add new clusters before deleting broken
clusters. This is in case we hit `maxConcurrent`: we would prefer to use
that quota to add clusters to the pool.
- If we're over capacity, delete broken clusters before deleting viable
(installing or assignable) clusters.

*In the future, we may expand this definition to include other
definitions of "broken".

HIVE-1615
@2uasimojo 2uasimojo changed the title Refactor: consolidate setting ProvisionStopped=True ClusterPool: Delete broken (ProvisionStopped) clusters Sep 1, 2021
Comment on lines +387 to +392
// If too many, delete some.
case drift > 0:
toDel := minIntVarible(drift, availableCurrent)
if err := r.deleteExcessClusters(cds, toDel, logger); err != nil {
return reconcile.Result{}, err
}
Member Author

Note to reviewers: this block moved down unchanged. Since `drift > 0` and `drift < 0` are disjoint, this didn't affect existing logic; but it was necessary to be able to inject the new case in the right spot logically.
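
For orientation, one plausible shape of the resulting switch, reflecting the priority ordering described in the PR body. The new-case condition and the helper names (addClusters, deleteBrokenClusters) are illustrative, not the literal diff:

```go
switch {
// Under capacity: add new clusters first, so any maxConcurrent quota is
// spent growing the pool rather than on replacements.
case drift < 0:
	if err := r.addClusters(pool, -drift, logger); err != nil {
		return reconcile.Result{}, err
	}
// Broken (ProvisionStopped=True) clusters are deleted next so they can be
// replaced and stop consuming pool capacity.
case len(cds.broken) > 0:
	if err := r.deleteBrokenClusters(cds, logger); err != nil {
		return reconcile.Result{}, err
	}
// Over capacity: only then delete viable (installing or assignable) clusters.
case drift > 0:
	toDel := minIntVarible(drift, availableCurrent)
	if err := r.deleteExcessClusters(cds, toDel, logger); err != nil {
		return reconcile.Result{}, err
	}
}
```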

assert.Equal(t, test.expectedTotalClusters-test.expectedAssignedCDs, actualUnassignedCDs, "unexpected number of unassigned CDs")
assert.Equal(t, test.expectedRunning, actualRunning, "unexpected number of running CDs")
assert.Equal(t, test.expectedTotalClusters-test.expectedRunning, actualHibernating, "unexpected number of assigned CDs")
assert.Equal(t, test.expectedTotalClusters-test.expectedRunning, actualHibernating, "unexpected number of hibernating CDs")
Member Author

Latent copypasta: the message on the `actualHibernating` assertion said "assigned CDs" where it should have said "hibernating CDs".

removeCDsFromSlice(&cds.assignable, cdName)
removeCDsFromSlice(&cds.installing, cdName)
removeCDsFromSlice(&cds.assignable, cdName)
removeCDsFromSlice(&cds.broken, cdName)
Member Author

@2uasimojo 2uasimojo Sep 1, 2021


Note: this was previously a dup of L482, mistake made in 3dc3c3c / #1484.

@2uasimojo
Member Author

/assign @suhanime
/cc @dgoodwin

@2uasimojo
Member Author

/test e2e

Name: "hive_cluster_deployments_provision_failed_terminal_total",
Help: "Counter incremented when a cluster provision has failed and won't be retried.",
},
[]string{"clusterpool_namespacedname"},
Contributor

Wouldn't we want to clear the metric when provision succeeds?

Member Author

The provision will not succeed. We set this when we've failed for the last time.

Contributor

Given how you're logging this metric, it would be deleted only when the hive controller restarts, since it is not attached to anything else. So it should be cleared in the clusterpool controller for the relevant clusterpool.

Member Author

I'm still not following this. I don't see why we should ever clear this metric. I definitely don't see why controller restart should be a significant event. Why should this metric behave any differently from e.g. `hive_cluster_deployments_installed_total`?

updated = false
// Fun extra variable to keep track of whether we should increment metricProvisionFailedTerminal
// later; because we only want to do that if (we change that status and) the status update succeeds.
provisionFailedTerminal := false
Contributor

What is the drawback of not having this flag as a gating condition to report the metric?

Member Author

Having trouble parsing your sentence (English is hard). Are you suggesting we don't need this variable? How else would we know when to increment the counter?

Contributor

So I had a thorough look at the code, and I think you don't actually need this boolean. Simply observe the metric where you're setting this boolean to true.

Member Author

I don't want to observe the metric unless we actually changed the ProvisionStopped condition to True due to a final ProvisionFailed. If I don't use this boolean then, for example, we'll increment it any time we change any of the ClusterInstall* conditions. We don't want that.
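
In other words, the gating looks roughly like this. This is a simplified fragment with hypothetical helper names (provisionFinallyFailed, setProvisionStoppedTrue), not the exact controller code:

```go
updated := false
provisionFailedTerminal := false

// Only the transition we care about sets the extra flag: ProvisionStopped
// flipping to True because the provision failed and won't be retried.
if provisionFinallyFailed {
	if setProvisionStoppedTrue(cd) {
		updated = true
		provisionFailedTerminal = true
	}
}

// Other ClusterInstall* condition changes may also set `updated`, but they
// must not bump the counter.

if updated {
	if err := r.Status().Update(ctx, cd); err != nil {
		return reconcile.Result{}, err
	}
	// Increment only after the status update succeeds.
	if provisionFailedTerminal {
		incProvisionFailedTerminal(cd)
	}
}
```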

}
// If we declared the provision terminally failed, bump our metric
if provisionFailedTerminal {
incProvisionFailedTerminal(cd)
Contributor

On a related note, you don't actually need to observe it in the metrics reconcile; you can fire the metric observation right here (make it a global variable and set it here).

Member Author

I made `incProvisionFailedTerminal` a func because it was more than one line of code and I needed to call it from multiple places.
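
Roughly, the helper looks like this. A sketch that assumes hive's ClusterPoolRef field and the counter defined earlier; the merged code may differ in detail:

```go
// incProvisionFailedTerminal bumps the terminal-provision-failure counter,
// labeled with the owning clusterpool's namespace/name, or "" for a CD that
// does not belong to a pool.
func incProvisionFailedTerminal(cd *hivev1.ClusterDeployment) {
	poolNSName := ""
	if ref := cd.Spec.ClusterPoolRef; ref != nil {
		poolNSName = ref.Namespace + "/" + ref.PoolName
	}
	metricProvisionFailedTerminal.WithLabelValues(poolNSName).Inc()
}
```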

Name: "hive_cluster_deployments_provision_failed_terminal_total",
Help: "Counter incremented when a cluster provision has failed and won't be retried.",
},
[]string{"clusterpool_namespacedname"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given how you're logging this metric, it would be deleted only when hive controller restarts - since it is not attached to anything else. So it should be cleared in the clusterpool controller for the relevant clusterpool

)
)

func incProvisionFailedTerminal(cd *hivev1.ClusterDeployment) {
Contributor

If the counter is incremented for every CD, then it should also be decremented when the relevant ClusterDeployment is deleted.
I think you should make this metric a gauge and report, every time, how many provisions have failed.
Note that Prometheus doesn't fire a change in the metric until it is actually changed, so calling metrics.Set multiple times with the same value is fine.

Member Author

> If the counter is incremented for every CD, then it should also be decremented when the relevant ClusterDeployment is deleted.

We're using a counter because we want to track how many times pool CDs failed to provision, ever. We'll use prometheus to report things like the rate at which these failures are occurring.

> I think you should make this metric a gauge

We talked about metrics when doing stale CD replacement, which is very similar to this. We discussed using a (separate) gauge metric to indicate the number of stale CDs in a pool at any given time, but ended up deciding YAGNI.

Contributor

Ah! This clarifies things - I had misjudged what the metric was used for.

@suhanime
Contributor

suhanime commented Sep 3, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 3, 2021
@openshift-ci
Contributor

openshift-ci bot commented Sep 3, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2uasimojo, suhanime

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@2uasimojo
Member Author

/retest

@openshift-merge-robot openshift-merge-robot merged commit 162432a into openshift:master Sep 3, 2021
@2uasimojo 2uasimojo deleted the HIVE-1615 branch September 3, 2021 22:27