-
Notifications
You must be signed in to change notification settings - Fork 231
Re-enable metrics, add metrics for vSphere #590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Re-enable metrics, add metrics for vSphere #590
Conversation
|
Skipping CI for Draft Pull Request. |
c4165e0 to
9e1f8af
Compare
JoelSpeed
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments as I read through, let me know if you have any questions about them
pkg/metrics/metrics.go
Outdated
| prometheus.CounterOpts{ | ||
| Name: "failed_machine_sets_total", | ||
| Help: "Number of times machine set provisioning has failed.", | ||
| }, []string{"name", "namespace", "timestamp", "reason"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we should include the timestamp, does it add any extra information for the user do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does not look like, if only having additional unique key for tracking successes or failures for same named machine sets, which sounds too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prometheus will allow you to see this anyway, as it will track the value over time. So you can use a query to work out when the value changed and get a spike on the graph when it was incremented. Good thought though! Definitely the right line of thinking
pkg/metrics/metrics.go
Outdated
| prometheus.CounterOpts{ | ||
| Name: "succeeded_machine_sets_total", | ||
| Help: "Number of times machine set provisioning has succeded.", | ||
| }, []string{"name", "namespace", "timestamp"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, don't think the timestamp adds anything?
pkg/metrics/metrics.go
Outdated
| }, []string{"name", "namespace", "timestamp", "reason"}, | ||
| ) | ||
|
|
||
| // SucceededMachineSetProvisionCount calculates the number of success provisioning for a machine |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // SucceededMachineSetProvisionCount calculates the number of success provisioning for a machine | |
| // SucceededMachineSetProvisionCount calculates the number of success provisioning for a MachineSet |
9e1f8af to
077974a
Compare
077974a to
5df6dd4
Compare
|
/retest |
1 similar comment
|
/retest |
5df6dd4 to
ffe4a58
Compare
8ad2615 to
d85718a
Compare
JoelSpeed
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realise there's some tidy up to do, added comments where there is so we don't forget them before merging. Added a couple of suggestions, but otherwise looking pretty good
d85718a to
0a0b736
Compare
9430cb2 to
37c97ca
Compare
|
/approve |
pkg/metrics/metrics.go
Outdated
| } | ||
|
|
||
| const ( | ||
| VSphereProvider string = "vsphere" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this? The platform is implicit in any running cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was assuming the failure rates will be collected across clusters, and aggregated by that value in a longer term?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These metrics are only exposed at each individual cluster scope. If anything were to aggregate them in the future we will consider introduce any further suitable label by then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would we not want to have this label if we were bring telemetry back to RH about our controllers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's my idea too. May be a better idea to embed the provider name resolving into MAO, so it will identify it from infrastructure config and set it before sending metrics, @enxebre @JoelSpeed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would we not want to have this label if we were bring telemetry back to RH about our controllers?
If we do we, we can discuss it when there's a proposal for exposing machine API metrics for telemetry. As for today this property does not provide any value for the consumers of these metrics, i.e users owning the cluster. Most of the information around machine usage it's already expose as metrics by mao, you can get lists of failed machines and we could possibly group by error message. This is complementary.
| k8s-app: machine-api-operator | ||
| sessionAffinity: None | ||
| --- | ||
| apiVersion: v1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is the serviceMonitor that goes with this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can’t be merged until provider integration PRs get into repos, or the CI would fail as the machine metrics won’t be accessible. The serviceMonitor i therefore in #609
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or the CI would fail as the machine metrics won’t be accessible
can you elaborate? can you point me to the origin test that will fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have a look into ´Alerts shouldn't report any alerts in firing state...´ test in every provider: sample https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/590/pull-ci-openshift-machine-api-operator-master-e2e-aws/2496
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why would this fire an alert?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not the service which is triggering it, but the serviceMonitor from the second PR #609 I included the merge order in Jira issue, posted it in slack too. While provider PRs are not merged, the metrics port is not served from the code, and while Prometheus is trying to connect, it causes machine metrics respond with 502. That results in alert in openshift-monitoring namespace. Joel and me already discussed the issue in slack, and decided to split the PR, instead of revendoring MAO PR branch in every provider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought prometheus was no triggering alert for that scenario. This would need to account for openstack and baremetal.
There's nothing stopping us from including the serviceMonitor for machineset and mhc in this PR and watching working right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’ve opened issues in both repos to make sure this won’t be left unanswered.
- Enable metrics to support updated MAO metrics integration cluster-api-provider-openstack#100
- Metrics port change to support updated MAO metrics integration cluster-api-provider-baremetal#75
including machineset and mhc in the metrics should not be disruptive.
72ea3b8 to
08bd94a
Compare
|
Can we please update the description of the PR to reflect that this is enabling the plumbing for metrics for machineSet and MHC controllers and the structured being used with sevice/serviceMonitor and rbac-proxies? This is looking great though generally speaking the smallest the PR the less friction you'll find. We could have get this working e2e for MHC and machineSet and then enabling the providers. |
| } | ||
| if moTask.Info.State == types.TaskInfoStateError { | ||
| metrics.RegisterFailedInstanceCreate(&metrics.MachineLabels{ | ||
| Name: r.machine.Name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how are this metrics actually being exposed through this controller metrics server?
wouldn't this need to metrics.Registry.MustRegister(failedInstanceCreateCount) or anything?
https://github.com/kubernetes-sigs/controller-runtime/blob/c0438568a706ec61de31b92f4d76e7fb7e1007b9/pkg/internal/controller/metrics/metrics.go#L50
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This happens inside the init in https://github.com/openshift/machine-api-operator/pull/590/files#diff-7cbe8e056d62a2de30c7066e359bd9c9R68
71dc77b to
028f0e9
Compare
elmiko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Danil-Grigorev we just merged the metrics doc for mao, would you mind adding something about these new metrics to that doc as well?
028f0e9 to
b1e4070
Compare
- Added vSphere api metrics
b1e4070 to
eb4b161
Compare
|
/test unit |
1 similar comment
|
/test unit |
|
/test unit |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
1 similar comment
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
@Danil-Grigorev: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
Working on https://issues.redhat.com/browse/OCPCLOUD-784
This PR is adding support for reporting following prometheus metrics and also starting controller-runtime metrics server to make these metrics available for prometheus servers:
mapi_instance_create_failed: Total count of "create" cloud api errorsmapi_instance_update_failed: Total count of "update" cloud api errorsmapi_instance_delete_failed: Total count of "delete" cloud api errorsLabels on these metrics (for vSphere):