
helm/metrics: add a new gauge metric to monitor helm releases #579

Closed
zonggen wants to merge 4 commits into openshift:master from zonggen:helm-metrics

Conversation


@zonggen zonggen commented Sep 1, 2021

Adds a `helm_chart_release_health_status` gauge metric to monitor the status of the current Helm releases.

To query the total number of healthy releases: `count(helm_chart_release_health_status == 1) by (chartName, chartVersion, releaseName)`.
To query the total number of unhealthy releases: `count(helm_chart_release_health_status == 0) by (chartName, chartVersion, releaseName)`.

From console UI: (screenshot omitted)

From Prometheus UI: (screenshot omitted)

Closes: https://issues.redhat.com/browse/HELM-224
Signed-off-by: Allen Bai <abai@redhat.com>
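The two queries above partition releases by health status. The bookkeeping behind them can be sketched in plain Go; the `Release` struct and `countByHealth` helper below are simplified stand-ins for illustration, not the PR's actual code:

```go
package main

import "fmt"

// Release is a simplified stand-in for a Helm release record.
type Release struct {
	Name, Chart, Version, Status string
}

// countByHealth splits releases into healthy (status "deployed") and
// unhealthy, mirroring the two PromQL queries above.
func countByHealth(releases []Release) (healthy, unhealthy int) {
	for _, r := range releases {
		if r.Status == "deployed" {
			healthy++
		} else {
			unhealthy++
		}
	}
	return
}

func main() {
	rs := []Release{
		{Name: "web", Chart: "nginx", Version: "1.0.0", Status: "deployed"},
		{Name: "db", Chart: "postgres", Version: "2.1.0", Status: "failed"},
	}
	h, u := countByHealth(rs)
	fmt.Println(h, u) // prints "1 1"
}
```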

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 1, 2021
@openshift-ci openshift-ci Bot requested review from jhadvig and spadgett September 1, 2021 19:59

openshift-ci Bot commented Sep 1, 2021

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: zonggen
To complete the pull request process, please assign bparees after the PR has been reviewed.
You can assign the PR to them by writing /assign @bparees in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment thread: `manifests/03-rbac-role-cluster.yaml` (outdated)

```yaml
- create
- update
- delete
- apiGroups: [""]
```
Contributor Author
This is required to run `helm list --all-namespaces` with the Helm SDK.

Error without it:

```
metric helm_chart_release_health_status unhandled: list: failed to list: secrets is forbidden: User "system:serviceaccount:openshift-console-operator:console-operator" cannot list resource "secrets" in API group "" at the cluster scope
```

Adds a `helm_chart_release_health_status` gauge metric to monitor the
status of the current helm releases.

Closes: https://issues.redhat.com/browse/HELM-224
Signed-off-by: Allen Bai <abai@redhat.com>
@zonggen zonggen changed the title [WIP] helm/metrics: add a new gauge metric to monitor helm releases helm/metrics: add a new gauge metric to monitor helm releases Sep 1, 2021
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 1, 2021
Contributor

@dperaza4dustbit dperaza4dustbit left a comment
Did all those vendor files come from the Helm SDK dependency?

Comment thread: `pkg/helm/metrics/metrics.go` (outdated)

```go
	}

	for _, release := range releases {
		if status := release.Info.Status.String(); status == "deployed" {
```
Contributor
So only "deployed" is healthy; what about the other states? I would say if it is unknown or failed then it's 0, and 1 for all other states.

Contributor Author
Updated

Comment thread: `pkg/helm/metrics/metrics.go` (outdated)

```go
	for _, release := range releases {
		if status := release.Info.Status.String(); status == "deployed" {
			klog.V(4).Infof("metric helm_chart_release_health_status 1: %s %s %s", release.Name, release.Chart.Metadata.Name, release.Chart.Metadata.Version)
			helmChartReleaseHealthStatus.WithLabelValues(release.Name, release.Chart.Metadata.Name, release.Chart.Metadata.Version).Set(1)
```
Contributor
The only difference between this and the line below is the 0 or 1. I would do `healthStatus := 1`, then change it to 0 if the status is unknown or failed. Then make a single call to klog and to `helmChartReleaseHealthStatus`, passing `healthStatus`.

Contributor Author
Updated
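The suggested refactor can be sketched as follows; this is an illustration of the shape only, where `fmt` stands in for `klog` and a plain map stands in for the Prometheus gauge, not the PR's actual code:

```go
package main

import "fmt"

// Release is a simplified stand-in for a Helm SDK release record.
type Release struct {
	Name, ChartName, ChartVersion, Status string
}

// gauge stands in for the helm_chart_release_health_status Prometheus gauge.
var gauge = map[string]float64{}

// observeReleases applies the refactor suggested above: compute the health
// value once, then make a single log call and a single gauge update.
func observeReleases(releases []Release) {
	for _, r := range releases {
		healthStatus := 1.0
		if r.Status == "unknown" || r.Status == "failed" {
			healthStatus = 0.0
		}
		fmt.Printf("metric helm_chart_release_health_status %v: %s %s %s\n",
			healthStatus, r.Name, r.ChartName, r.ChartVersion)
		gauge[r.Name+"/"+r.ChartName+"/"+r.ChartVersion] = healthStatus
	}
}

func main() {
	observeReleases([]Release{
		{"web", "nginx", "1.0.0", "deployed"},
		{"db", "postgres", "2.1.0", "failed"},
	})
	fmt.Println(gauge["web/nginx/1.0.0"], gauge["db/postgres/2.1.0"]) // prints "1 0"
}
```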


zonggen commented Sep 2, 2021

The vendor files came from running `go mod vendor` followed by `go mod tidy`.

Signed-off-by: Allen Bai <abai@redhat.com>

zonggen commented Sep 8, 2021

@jhadvig @spadgett Gentle reminder: could you take a look at this PR?


jhadvig commented Sep 9, 2021

@zonggen I see that the e2e `TestMetricsEndpoint` test case is failing:

```
=== RUN   TestMetricsEndpoint
    ....
    metrics_test.go:71: searching for helm_chart_release_health_status in metrics data...
    metrics_test.go:74: did not find helm_chart_release_health_status
    metrics_test.go:47: waiting for cleanup to reach settled state...
--- FAIL: TestMetricsEndpoint (6.27s)
```

This should be fixed in the PR so the test passes.

@zonggen zonggen force-pushed the helm-metrics branch 2 times, most recently from a928d49 to c6bfe9a Compare September 9, 2021 15:35
This mitigates the case where no Helm release is present in the
cluster and we still want the metric to be initialized.

Reference: https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics

Signed-off-by: Allen Bai <abai@redhat.com>

zonggen commented Sep 10, 2021

/retest


zonggen commented Sep 10, 2021

@jhadvig Could you take a second look? The test case failed because `helm_chart_release_health_status` was not initialized until there's a Helm release in the cluster. I added logic to initialize it to 0 with no labels.

Reference: https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics
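The fix described above can be sketched like this, with a plain map standing in for the Prometheus gauge's series (a hypothetical simplification; the real code uses the Prometheus client library):

```go
package main

import "fmt"

// series stands in for the label-value series of the
// helm_chart_release_health_status gauge.
var series = map[string]float64{}

// initMetric pre-populates the gauge with a zero value (empty labels) so the
// metric name appears in scrapes even when the cluster has no Helm releases,
// per the "avoid missing metrics" guidance linked above.
func initMetric() {
	series[""] = 0
}

func main() {
	initMetric()
	v, ok := series[""]
	fmt.Println(ok, v) // prints "true 0"
}
```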

Member

@jhadvig jhadvig left a comment
@zonggen thanks for the PR. Adding a couple of comments.

```yaml
- apiGroups:
  - ""
  resources:
  - secrets
```
Member
Why does the operator need to be able to fetch any secret from the cluster? It looks like a potential security issue. If we need it, we should rather create a Role for a specific namespace. On the other hand, I don't see anywhere in the added code where we fetch any secret for the metrics.

Contributor Author
The `helm list --all-namespaces` action from metrics.go requires this particular permission, since Helm stores its release records as Secrets in the release namespaces. If I remove it, the operator logs:

```
metrics.go:40] metric helm_chart_release_health_status unhandled: list: failed to list: secrets is forbidden: User "system:serviceaccount:openshift-console-operator:console-operator" cannot list resource "secrets" in API group "" at the cluster scope
```

Member
This is pretty concerning because console-operator is effectively running as root on the cluster with this change. Is there any way to add the metrics without needing this permission?

Member
Yeah this is a hard no. We won't escalate the console operator to have read on Secrets cluster-wide.

Contributor
Understood, we are looking for alternative options to mitigate this. It won't be a full solution if we have to use the user's identity on the console side, but that is what we will pivot towards. For future reference @spadgett and @jwforres: if we had a way to filter allow rules by resource type, for example `"type": "helm.sh/release.v1"`, would that work to reduce the access escalation? I'm trying to figure out whether to propose a change to Kubernetes RBAC upstream for this.

Comment thread: `pkg/console/operator/sync_v400.go` (outdated)

```go
func (co *consoleOperator) SyncConsoleConfig(ctx context.Context, consoleConfig *configv1.Console, consoleURL string) (*configv1.Console, error) {
	oldURL := consoleConfig.Status.ConsoleURL
	metrics.HandleConsoleURL(oldURL, consoleURL)
	helmmetrics.HandleHelmChartReleaseHealthStatus()
```
Member
Why do we call `helmmetrics.HandleHelmChartReleaseHealthStatus()` inside the `SyncConsoleConfig` method? It's not using any of its arguments.
I would suggest either putting it directly in the main sync_v400 method or creating a new controller.

Contributor Author
Agreed, I will move it to the main sync_v400. The original intent was to piggyback on the console_url metric. I'm also willing to add a new controller, but that controller would only consist of the single call `helmmetrics.HandleHelmChartReleaseHealthStatus()`.

```diff
@@ -0,0 +1,90 @@
+package metrics
```
Member
Let's put this into pkg/helm/metrics/helm_metrics.go so we can reuse functions like `recoverMetricPanic()` instead of duplicating them.

Contributor Author
Moved to pkg/console/metrics/helm_metrics.go

Comment thread: `pkg/helm/metrics/metrics.go` (outdated)

```go
// Reference: https://github.com/helm/helm/issues/7430#issuecomment-620489002
func getActionConfig() (*action.Configuration, error) {
	actionConfig := new(action.Configuration)
	var kubeConfig *genericclioptions.ConfigFlags
```
Member
I don't think kubeConfig is the right name.

Suggested change:

```diff
-var kubeConfig *genericclioptions.ConfigFlags
+var configFlags *genericclioptions.ConfigFlags
```

Member
Also, why do we need to initialize configFlags, since you are setting it on line 72? Can't we just set it directly?

Contributor Author
Updated :)

Signed-off-by: Allen Bai <abai@redhat.com>

zonggen commented Sep 13, 2021

Thanks for reviewing! Waiting for the tests before another look.


zonggen commented Sep 15, 2021

@jhadvig The PR is ready for another look, but the ci/prow/e2e-aws-operator test kept failing for unrelated reasons. Should I keep retesting until it succeeds? Thanks :)

@spadgett (Member)

/hold

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 17, 2021

openshift-ci Bot commented Sep 17, 2021

@zonggen: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-operator | 696a192 | link | `/test e2e-aws-operator` |
| ci/prow/e2e-agnostic-upgrade | 696a192 | link | `/test e2e-agnostic-upgrade` |

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.


zonggen commented Oct 5, 2021

Closing in favor of #601.

For the record, we are shifting away from calling cluster-wide `helm list` actions in console-operator, and instead using a counter metric in the console that increments whenever a user installs a Helm chart. This way we do not depend on cluster-wide access to Secrets and only bump the counter when the user installs a chart.
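The counter approach adopted in #601 can be sketched in plain Go; a map stands in for the Prometheus counter, and the names below are illustrative, not the actual implementation:

```go
package main

import "fmt"

// installCounter stands in for the console-side counter metric: it is
// incremented each time a user installs a Helm chart, so no cluster-wide
// Secret access is needed.
var installCounter = map[string]int{}

// recordChartInstall increments the counter for the installed chart.
func recordChartInstall(chartName, chartVersion string) {
	installCounter[chartName+"@"+chartVersion]++
}

func main() {
	recordChartInstall("nginx", "1.0.0")
	recordChartInstall("nginx", "1.0.0")
	fmt.Println(installCounter["nginx@1.0.0"]) // prints "2"
}
```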

@zonggen zonggen closed this Oct 5, 2021

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.


5 participants