
helm/metrics: add a new gauge metric to monitor helm releases #579

Closed
zonggen wants to merge 4 commits into openshift:master from zonggen:helm-metrics

Conversation


@zonggen zonggen commented Sep 1, 2021

Adds a `helm_chart_release_health_status` gauge metric to monitor the status of the current Helm releases.

To query the total number of healthy releases: `count(helm_chart_release_health_status == 1) by (chartName, chartVersion, releaseName)`.
To query the total number of unhealthy releases: `count(helm_chart_release_health_status == 0) by (chartName, chartVersion, releaseName)`.

From console UI: (screenshot omitted)

From Prometheus UI: (screenshot omitted)

Closes: https://issues.redhat.com/browse/HELM-224
Signed-off-by: Allen Bai <abai@redhat.com>
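The two queries above partition releases by health status. The bookkeeping behind them can be sketched in plain Go; the `Release` struct and `countByHealth` helper below are simplified stand-ins for illustration, not the PR's actual code:

```go
package main

import "fmt"

// Release is a simplified stand-in for a Helm release record.
type Release struct {
	Name, Chart, Version, Status string
}

// countByHealth splits releases into healthy (status "deployed") and
// unhealthy, mirroring the two PromQL queries above.
func countByHealth(releases []Release) (healthy, unhealthy int) {
	for _, r := range releases {
		if r.Status == "deployed" {
			healthy++
		} else {
			unhealthy++
		}
	}
	return
}

func main() {
	rs := []Release{
		{Name: "web", Chart: "nginx", Version: "1.0.0", Status: "deployed"},
		{Name: "db", Chart: "postgres", Version: "2.1.0", Status: "failed"},
	}
	h, u := countByHealth(rs)
	fmt.Println(h, u) // prints "1 1"
}
```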

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 1, 2021
@openshift-ci openshift-ci Bot requested review from jhadvig and spadgett September 1, 2021 19:59

openshift-ci Bot commented Sep 1, 2021

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: zonggen
To complete the pull request process, please assign bparees after the PR has been reviewed.
You can assign the PR to them by writing /assign @bparees in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment thread: `manifests/03-rbac-role-cluster.yaml` (outdated)

```yaml
- create
- update
- delete
- apiGroups: [""]
```
Contributor Author
This is required to run `helm list --all-namespaces` with the Helm SDK.

Error without it:

```
metric helm_chart_release_health_status unhandled: list: failed to list: secrets is forbidden: User "system:serviceaccount:openshift-console-operator:console-operator" cannot list resource "secrets" in API group "" at the cluster scope
```

Adds a `helm_chart_release_health_status` gauge metric to monitor the
status of the current helm releases.

Closes: https://issues.redhat.com/browse/HELM-224
Signed-off-by: Allen Bai <abai@redhat.com>
@zonggen zonggen changed the title [WIP] helm/metrics: add a new gauge metric to monitor helm releases helm/metrics: add a new gauge metric to monitor helm releases Sep 1, 2021
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 1, 2021
Contributor

@dperaza4dustbit dperaza4dustbit left a comment
Did all those vendor files come from the Helm SDK dependency?

Comment thread: `pkg/helm/metrics/metrics.go` (outdated)

```go
	}

	for _, release := range releases {
		if status := release.Info.Status.String(); status == "deployed" {
```
Contributor
So only "deployed" is healthy; what about the other states? I would say if it is unknown or failed then it's 0, and 1 for all other states.

Contributor Author
Updated

Comment thread: `pkg/helm/metrics/metrics.go` (outdated)

```go
	for _, release := range releases {
		if status := release.Info.Status.String(); status == "deployed" {
			klog.V(4).Infof("metric helm_chart_release_health_status 1: %s %s %s", release.Name, release.Chart.Metadata.Name, release.Chart.Metadata.Version)
			helmChartReleaseHealthStatus.WithLabelValues(release.Name, release.Chart.Metadata.Name, release.Chart.Metadata.Version).Set(1)
```
Contributor
The only difference between this and the line below is the 0 or 1. I would do `healthStatus := 1`, then change it to 0 if the status is unknown or failed. Then make a single call to klog and to `helmChartReleaseHealthStatus`, passing `healthStatus`.

Contributor Author
Updated
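The suggested refactor can be sketched as follows; this is an illustration of the shape only, where `fmt` stands in for `klog` and a plain map stands in for the Prometheus gauge, not the PR's actual code:

```go
package main

import "fmt"

// Release is a simplified stand-in for a Helm SDK release record.
type Release struct {
	Name, ChartName, ChartVersion, Status string
}

// gauge stands in for the helm_chart_release_health_status Prometheus gauge.
var gauge = map[string]float64{}

// observeReleases applies the refactor suggested above: compute the health
// value once, then make a single log call and a single gauge update.
func observeReleases(releases []Release) {
	for _, r := range releases {
		healthStatus := 1.0
		if r.Status == "unknown" || r.Status == "failed" {
			healthStatus = 0.0
		}
		fmt.Printf("metric helm_chart_release_health_status %v: %s %s %s\n",
			healthStatus, r.Name, r.ChartName, r.ChartVersion)
		gauge[r.Name+"/"+r.ChartName+"/"+r.ChartVersion] = healthStatus
	}
}

func main() {
	observeReleases([]Release{
		{"web", "nginx", "1.0.0", "deployed"},
		{"db", "postgres", "2.1.0", "failed"},
	})
	fmt.Println(gauge["web/nginx/1.0.0"], gauge["db/postgres/2.1.0"]) // prints "1 0"
}
```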


zonggen commented Sep 2, 2021

The vendor files came from running `go mod vendor` followed by `go mod tidy`.

Signed-off-by: Allen Bai <abai@redhat.com>

zonggen commented Sep 8, 2021

@jhadvig @spadgett Gentle reminder: could you take a look at this PR?


jhadvig commented Sep 9, 2021

@zonggen I see that the e2e `TestMetricsEndpoint` test case is failing:

```
=== RUN   TestMetricsEndpoint
    ....
    metrics_test.go:71: searching for helm_chart_release_health_status in metrics data...
    metrics_test.go:74: did not find helm_chart_release_health_status
    metrics_test.go:47: waiting for cleanup to reach settled state...
--- FAIL: TestMetricsEndpoint (6.27s)
```

This should be fixed in the PR so the test passes.

@zonggen zonggen force-pushed the helm-metrics branch 2 times, most recently from a928d49 to c6bfe9a Compare September 9, 2021 15:35
This mitigates the case where no Helm release is present in the
cluster and we still want the metric to be initialized.

Reference: https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics

Signed-off-by: Allen Bai <abai@redhat.com>

zonggen commented Sep 10, 2021

/retest


zonggen commented Sep 10, 2021

@jhadvig Could you take a second look? The test case failed because `helm_chart_release_health_status` was not initialized until there's a Helm release in the cluster. I added logic to initialize it to 0 with no labels.

Reference: https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics
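The fix described above can be sketched like this, with a plain map standing in for the Prometheus gauge's series (a hypothetical simplification; the real code uses the Prometheus client library):

```go
package main

import "fmt"

// series stands in for the label-value series of the
// helm_chart_release_health_status gauge.
var series = map[string]float64{}

// initMetric pre-populates the gauge with a zero value (empty labels) so the
// metric name appears in scrapes even when the cluster has no Helm releases,
// per the "avoid missing metrics" guidance linked above.
func initMetric() {
	series[""] = 0
}

func main() {
	initMetric()
	v, ok := series[""]
	fmt.Println(ok, v) // prints "true 0"
}
```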

Member

@jhadvig jhadvig left a comment
@zonggen thanks for the PR. Adding a couple of comments.

```yaml
- apiGroups:
  - ""
  resources:
  - secrets
```
Member
Why does the operator need to be able to fetch any secret from the cluster? It looks like a potential security issue. If we need it, we should rather create a Role for a specific namespace. On the other hand, I don't see anywhere in the added code where we fetch any secret for the metrics.

Contributor Author
The `helm list --all-namespaces` action from metrics.go requires this particular permission, since Helm stores its release records as Secrets in the release namespaces. If I remove it, the operator logs:

```
metrics.go:40] metric helm_chart_release_health_status unhandled: list: failed to list: secrets is forbidden: User "system:serviceaccount:openshift-console-operator:console-operator" cannot list resource "secrets" in API group "" at the cluster scope
```

Member
This is pretty concerning because console-operator is effectively running as root on the cluster with this change. Is there any way to add the metrics without needing this permission?

Member
Yeah this is a hard no. We won't escalate the console operator to have read on Secrets cluster-wide.

Contributor
Understood, we are looking for alternative options to mitigate this. It won't be a full solution if we have to use the user's identity on the console side, but that is what we will pivot towards. For future reference @spadgett and @jwforres: if we had a way to filter allow rules by resource type, for example `"type": "helm.sh/release.v1"`, would that work to reduce the access escalation? I'm trying to figure out whether to propose a change to Kubernetes RBAC upstream for this.

Comment thread: `pkg/console/operator/sync_v400.go` (outdated)

```go
func (co *consoleOperator) SyncConsoleConfig(ctx context.Context, consoleConfig *configv1.Console, consoleURL string) (*configv1.Console, error) {
	oldURL := consoleConfig.Status.ConsoleURL
	metrics.HandleConsoleURL(oldURL, consoleURL)
	helmmetrics.HandleHelmChartReleaseHealthStatus()
```
Member
Why do we call `helmmetrics.HandleHelmChartReleaseHealthStatus()` inside the `SyncConsoleConfig` method? It's not using any of its arguments.
I would suggest either putting it directly in the main sync_v400 method or creating a new controller.

Contributor Author
Agreed, I will move it to the main sync_v400. The original intent was to piggyback on the console_url metric. I'm also willing to add a new controller, but that controller would only consist of the single call `helmmetrics.HandleHelmChartReleaseHealthStatus()`.

```diff
@@ -0,0 +1,90 @@
+package metrics
```
Member
Let's put this into pkg/helm/metrics/helm_metrics.go so we can reuse functions like `recoverMetricPanic()` instead of duplicating them.

Contributor Author
Moved to pkg/console/metrics/helm_metrics.go

Comment thread: `pkg/helm/metrics/metrics.go` (outdated)

```go
// Reference: https://github.com/helm/helm/issues/7430#issuecomment-620489002
func getActionConfig() (*action.Configuration, error) {
	actionConfig := new(action.Configuration)
	var kubeConfig *genericclioptions.ConfigFlags
```
Member
I don't think kubeConfig is the right name.

Suggested change:

```diff
-var kubeConfig *genericclioptions.ConfigFlags
+var configFlags *genericclioptions.ConfigFlags
```

Member
Also, why do we need to initialize configFlags, since you are setting it on line 72? Can't we just set it directly?

Contributor Author
Updated :)

Signed-off-by: Allen Bai <abai@redhat.com>

zonggen commented Sep 13, 2021

Thanks for reviewing! Waiting for the tests before another look.


zonggen commented Sep 15, 2021

@jhadvig The PR is ready for another look, but the ci/prow/e2e-aws-operator test kept failing for unrelated reasons. Should I keep retesting until it succeeds? Thanks :)

@spadgett (Member)

/hold

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 17, 2021

openshift-ci Bot commented Sep 17, 2021

@zonggen: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-operator | 696a192 | link | `/test e2e-aws-operator` |
| ci/prow/e2e-agnostic-upgrade | 696a192 | link | `/test e2e-agnostic-upgrade` |

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.


zonggen commented Oct 5, 2021

Closing in favor of #601.

For the record, we are shifting away from calling cluster-wide `helm list` actions in console-operator, and instead using a counter metric in the console that increments whenever a user installs a Helm chart. This way we do not depend on cluster-wide access to Secrets and only bump the counter when the user installs a chart.
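The counter approach adopted in #601 can be sketched in plain Go; a map stands in for the Prometheus counter, and the names below are illustrative, not the actual implementation:

```go
package main

import "fmt"

// installCounter stands in for the console-side counter metric: it is
// incremented each time a user installs a Helm chart, so no cluster-wide
// Secret access is needed.
var installCounter = map[string]int{}

// recordChartInstall increments the counter for the installed chart.
func recordChartInstall(chartName, chartVersion string) {
	installCounter[chartName+"@"+chartVersion]++
}

func main() {
	recordChartInstall("nginx", "1.0.0")
	recordChartInstall("nginx", "1.0.0")
	fmt.Println(installCounter["nginx@1.0.0"]) // prints "2"
}
```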

@zonggen zonggen closed this Oct 5, 2021

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.


5 participants