Skip to content
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions pkg/controller/clusterdeployment/clusterinstalls.go
Original file line number Diff line number Diff line change
Expand Up @@ -140,6 +140,9 @@ func (r *ReconcileClusterDeployment) reconcileExistingInstallingClusterInstall(c
}

updated = false
// Fun extra variable to keep track of whether we should increment metricProvisionFailedTerminal
// later; because we only want to do that if (we change that status and) the status update succeeds.
provisionFailedTerminal := false

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the drawback of not having this flag as a gating condition to report the metric?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having trouble parsing your sentence (English is hard). Are you suggesting we don't need this variable? How else would we know when to increment the counter?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I had a thorough look at the code, and I think you don't actually need this boolean. Simply observe the metric where you're setting this metric to true.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to observe the metric unless we actually changed the ProvisionStopped condition to True due to a final ProvisionFailed. If I don't use this boolean then, for example, we'll increment it any time we change any of the ClusterInstall* conditions. We don't want that.

conditions, updated = controllerutils.SetClusterDeploymentConditionWithChangeCheck(conditions,
hivev1.ProvisionStoppedCondition,
stopped.Status,
Expand All @@ -149,6 +152,7 @@ func (r *ReconcileClusterDeployment) reconcileExistingInstallingClusterInstall(c
)
if updated {
statusModified = true
provisionFailedTerminal = true
}

completed = controllerutils.FindClusterDeploymentCondition(conditions, hivev1.ClusterInstallCompletedClusterDeploymentCondition)
Expand Down Expand Up @@ -196,6 +200,10 @@ func (r *ReconcileClusterDeployment) reconcileExistingInstallingClusterInstall(c
logger.WithError(err).Error("failed to update the status of clusterdeployment")
return reconcile.Result{}, err
}
// If we declared the provision terminally failed, bump our metric
if provisionFailedTerminal {
incProvisionFailedTerminal(cd)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a related note, you don't actually need to observe it in metrics reconcile, you can fire the metric.observe right here (make it a global variable and set it here)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made incProvisionFailedTerminal a func because it was more than 1LOC and I needed to call it from multiple places.

}
}

return reconcile.Result{}, nil
Expand Down
31 changes: 10 additions & 21 deletions pkg/controller/clusterdeployment/clusterprovisions.go
Original file line number Diff line number Diff line change
Expand Up @@ -52,14 +52,14 @@ func (r *ReconcileClusterDeployment) startNewProvision(

r.deleteStaleProvisions(existingProvisions, logger)

if cd.Status.InstallRestarts > 0 && cd.Annotations[tryInstallOnceAnnotation] == "true" {
logger.Debug("not creating new provision since the deployment is set to try install only once")
setProvisionStoppedTrue := func(reason, message string) (reconcile.Result, error) {
logger.Debug("not creating new provision: %s", message)
conditions, changed := controllerutils.SetClusterDeploymentConditionWithChangeCheck(
cd.Status.Conditions,
hivev1.ProvisionStoppedCondition,
corev1.ConditionTrue,
installOnlyOnceSetReason,
"Deployment is set to try install only once",
reason,
message,
controllerutils.UpdateConditionIfReasonOrMessageChange)
if changed {
cd.Status.Conditions = conditions
Expand All @@ -68,27 +68,16 @@ func (r *ReconcileClusterDeployment) startNewProvision(
logger.WithError(err).Log(controllerutils.LogLevel(err), "failed to update cluster deployment status")
return reconcile.Result{}, err
}
incProvisionFailedTerminal(cd)
}
return reconcile.Result{}, nil
}

if cd.Status.InstallRestarts > 0 && cd.Annotations[tryInstallOnceAnnotation] == "true" {
return setProvisionStoppedTrue(installOnlyOnceSetReason, "Deployment is set to try install only once")
}
if cd.Spec.InstallAttemptsLimit != nil && cd.Status.InstallRestarts >= int(*cd.Spec.InstallAttemptsLimit) {
logger.Debug("not creating new provision since the install attempts limit has been reached")
conditions, changed := controllerutils.SetClusterDeploymentConditionWithChangeCheck(
cd.Status.Conditions,
hivev1.ProvisionStoppedCondition,
corev1.ConditionTrue,
installAttemptsLimitReachedReason,
"Install attempts limit reached",
controllerutils.UpdateConditionIfReasonOrMessageChange)
if changed {
cd.Status.Conditions = conditions
logger.Debugf("setting ProvisionStoppedCondition to %v", corev1.ConditionTrue)
if err := r.Status().Update(context.TODO(), cd); err != nil {
logger.WithError(err).Log(controllerutils.LogLevel(err), "failed to update cluster deployment status")
return reconcile.Result{}, err
}
}
return reconcile.Result{}, nil
return setProvisionStoppedTrue(installAttemptsLimitReachedReason, "Install attempts limit reached")
}

conditions, changed := controllerutils.SetClusterDeploymentConditionWithChangeCheck(
Expand Down
17 changes: 17 additions & 0 deletions pkg/controller/clusterdeployment/metrics.go
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,8 @@ import (
"github.com/prometheus/client_golang/prometheus"

"sigs.k8s.io/controller-runtime/pkg/metrics"

hivev1 "github.com/openshift/hive/apis/hive/v1"
)

var (
Expand Down Expand Up @@ -61,8 +63,22 @@ var (
Buckets: []float64{10, 30, 60, 300, 600, 1200, 1800},
},
)
metricProvisionFailedTerminal = prometheus.NewCounterVec(prometheus.CounterOpts{
Name: "hive_cluster_deployments_provision_failed_terminal_total",
Help: "Counter incremented when a cluster provision has failed and won't be retried.",
},
[]string{"clusterpool_namespacedname"},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't we want to clear the metric when provision succeeds?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The provision will not succeed. We set this when we've failed for the last time.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given how you're logging this metric, it would be deleted only when hive controller restarts - since it is not attached to anything else. So it should be cleared in the clusterpool controller for the relevant clusterpool

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not following this. I don't see why we should ever clear this metric. I definitely don't see why controller restart should be a significant event. Why should this metric behave any differently from e.g. hive_cluster_deployments_installed_total?

)
)

func incProvisionFailedTerminal(cd *hivev1.ClusterDeployment) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If for every CD the counter is incremented, then it should also be decremented if the relevant clusterdeployment is deleted.
I think you should make this metric a guage and report it every time for how many provisions failed.
Note that prometheus doesn't fire a change in metric until it is actually changed. So firing the metrics.Set multiple times for the same value is fine.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If for every CD the counter is incremented, then it should also be decremented if the relevant clusterdeployment is deleted.

We're using a counter because we want to track how many times pool CDs failed to provision, ever. We'll use prometheus to report things like the rate at which these failures are occurring.

I think you should make this metric a guage

We talked about metrics when doing stale CD replacement, which is very similar to this. We discussed using a (separate) gauge metric to indicate the number of stale CDs in a pool at any given time, but ended up deciding YAGNI.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! This clarifies things - I had misjudged what the metric was used for

poolNSName := ""
if poolRef := cd.Spec.ClusterPoolRef; poolRef != nil {
poolNSName = poolRef.Namespace + "/" + poolRef.PoolName
}
metricProvisionFailedTerminal.WithLabelValues(poolNSName).Inc()
}

func init() {
metrics.Registry.MustRegister(metricInstallJobDuration)
metrics.Registry.MustRegister(metricCompletedInstallJobRestarts)
Expand All @@ -72,4 +88,5 @@ func init() {
metrics.Registry.MustRegister(metricClustersInstalled)
metrics.Registry.MustRegister(metricClustersDeleted)
metrics.Registry.MustRegister(metricDNSDelaySeconds)
metrics.Registry.MustRegister(metricProvisionFailedTerminal)
}