ClusterPool: Delete broken (ProvisionStopped) clusters #1524
Changes from 2 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -140,6 +140,9 @@ func (r *ReconcileClusterDeployment) reconcileExistingInstallingClusterInstall(c | |
| } | ||
|
|
||
| updated = false | ||
| // Fun extra variable to keep track of whether we should increment metricProvisionFailedTerminal | ||
| // later; because we only want to do that if (we change that status and) the status update succeeds. | ||
| provisionFailedTerminal := false | ||
| conditions, updated = controllerutils.SetClusterDeploymentConditionWithChangeCheck(conditions, | ||
| hivev1.ProvisionStoppedCondition, | ||
| stopped.Status, | ||
|
|
@@ -149,6 +152,7 @@ func (r *ReconcileClusterDeployment) reconcileExistingInstallingClusterInstall(c | |
| ) | ||
| if updated { | ||
| statusModified = true | ||
| provisionFailedTerminal = true | ||
| } | ||
|
|
||
| completed = controllerutils.FindClusterDeploymentCondition(conditions, hivev1.ClusterInstallCompletedClusterDeploymentCondition) | ||
|
|
@@ -196,6 +200,10 @@ func (r *ReconcileClusterDeployment) reconcileExistingInstallingClusterInstall(c | |
| logger.WithError(err).Error("failed to update the status of clusterdeployment") | ||
| return reconcile.Result{}, err | ||
| } | ||
| // If we declared the provision terminally failed, bump our metric | ||
| if provisionFailedTerminal { | ||
| incProvisionFailedTerminal(cd) | ||
|
Contributor
On a related note, you don't actually need to observe it in the metrics reconcile; you can fire the metric.observe right here (make it a global variable and set it here).
Member
Author
I made
||
| } | ||
| } | ||
|
|
||
| return reconcile.Result{}, nil | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4,6 +4,8 @@ import ( | |
| "github.com/prometheus/client_golang/prometheus" | ||
|
|
||
| "sigs.k8s.io/controller-runtime/pkg/metrics" | ||
|
|
||
| hivev1 "github.com/openshift/hive/apis/hive/v1" | ||
| ) | ||
|
|
||
| var ( | ||
|
|
@@ -61,8 +63,22 @@ var ( | |
| Buckets: []float64{10, 30, 60, 300, 600, 1200, 1800}, | ||
| }, | ||
| ) | ||
| metricProvisionFailedTerminal = prometheus.NewCounterVec(prometheus.CounterOpts{ | ||
| Name: "hive_cluster_deployments_provision_failed_terminal_total", | ||
| Help: "Counter incremented when a cluster provision has failed and won't be retried.", | ||
| }, | ||
| []string{"clusterpool_namespacedname"}, | ||
|
Contributor
Wouldn't we want to clear the metric when provision succeeds?
Member
Author
The provision will not succeed. We set this when we've failed for the last time.
Contributor
Given how you're logging this metric, it would be deleted only when the hive controller restarts, since it is not attached to anything else. So it should be cleared in the clusterpool controller for the relevant clusterpool.
Member
Author
I'm still not following this. I don't see why we should ever clear this metric. I definitely don't see why controller restart should be a significant event. Why should this metric behave any differently from e.g. hive_cluster_deployments_installed_total?
||
| ) | ||
| ) | ||
|
|
||
| func incProvisionFailedTerminal(cd *hivev1.ClusterDeployment) { | ||
|
Contributor
If the counter is incremented for every CD, then it should also be decremented when the relevant clusterdeployment is deleted.
Member
Author
We're using a counter because we want to track how many times pool CDs failed to provision, ever. We'll use prometheus to report things like the rate at which these failures are occurring.
We talked about metrics when doing stale CD replacement, which is very similar to this. We discussed using a (separate) gauge metric to indicate the number of stale CDs in a pool at any given time, but ended up deciding YAGNI.
Contributor
Ah! This clarifies things - I had misjudged what the metric was used for.
||
| poolNSName := "" | ||
| if poolRef := cd.Spec.ClusterPoolRef; poolRef != nil { | ||
| poolNSName = poolRef.Namespace + "/" + poolRef.PoolName | ||
| } | ||
| metricProvisionFailedTerminal.WithLabelValues(poolNSName).Inc() | ||
| } | ||
|
|
||
| func init() { | ||
| metrics.Registry.MustRegister(metricInstallJobDuration) | ||
| metrics.Registry.MustRegister(metricCompletedInstallJobRestarts) | ||
|
|
@@ -72,4 +88,5 @@ func init() { | |
| metrics.Registry.MustRegister(metricClustersInstalled) | ||
| metrics.Registry.MustRegister(metricClustersDeleted) | ||
| metrics.Registry.MustRegister(metricDNSDelaySeconds) | ||
| metrics.Registry.MustRegister(metricProvisionFailedTerminal) | ||
| } | ||
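The counter-versus-gauge point from the review thread on `incProvisionFailedTerminal` can be illustrated with a minimal, dependency-free sketch. It is not the PR's code: `clusterPoolRef` and `failureCounter` are hypothetical stand-ins for `cd.Spec.ClusterPoolRef` and the Prometheus `CounterVec`, chosen so the example runs without external packages. It shows the label being built as `namespace/name` (empty for non-pool CDs) and a total that only ever goes up.

```go
package main

import "fmt"

// clusterPoolRef is a hypothetical stand-in for the pool reference on a
// ClusterDeployment (cd.Spec.ClusterPoolRef in the diff).
type clusterPoolRef struct {
	Namespace string
	PoolName  string
}

// poolLabel mirrors the label logic in incProvisionFailedTerminal:
// "" for a CD that does not belong to a pool, otherwise "namespace/name".
func poolLabel(ref *clusterPoolRef) string {
	if ref == nil {
		return ""
	}
	return ref.Namespace + "/" + ref.PoolName
}

// failureCounter is a plain-map stand-in for a one-label counter vector.
// Like a counter, it exposes only an increment; nothing decrements it.
type failureCounter map[string]int

func (c failureCounter) inc(label string) { c[label]++ }

func main() {
	failures := failureCounter{}
	pool := &clusterPoolRef{Namespace: "ns1", PoolName: "pool-a"}

	failures.inc(poolLabel(pool)) // first terminal failure in the pool
	failures.inc(poolLabel(pool)) // second terminal failure in the pool
	failures.inc(poolLabel(nil))  // terminal failure of a non-pool CD

	// Deleting a failed CD later would not change these totals: the value
	// records how many failures have happened, not how many exist now.
	fmt.Println(failures["ns1/pool-a"], failures[""]) // prints "2 1"
}
```

A gauge of currently broken CDs per pool would need a matching decrement on deletion, which is the separate metric the thread decided not to add.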
What is the drawback of not having this flag as a gating condition to report the metric?
Having trouble parsing your sentence (English is hard). Are you suggesting we don't need this variable? How else would we know when to increment the counter?
So I had a thorough look at the code, and I think you don't actually need this boolean. Simply observe the metric where you're setting this boolean to true.
I don't want to observe the metric unless we actually changed the ProvisionStopped condition to True due to a final ProvisionFailed. If I don't use this boolean then, for example, we'll increment it any time we change any of the ClusterInstall* conditions. We don't want that.