DFBUGS-789: Fix 'ceph_disk_occupation' query expressions #2812

Contributor

@aruniiird aruniiird commented Sep 19, 2024

Need to address changes in the 'ceph_disk_occupation' metric labels.

What is the change in the 'ceph_disk_occupation' metric?
The 'ceph_disk_occupation' result no longer has the 'exported_instance' label; it has the 'instance' label instead.

What is the issue we are facing because of it?
We are hitting 'PrometheusRuleFailures' in our alerts / rules because of this label change.
A second issue is that some of the query expressions return no results.

What is the solution?
Update the query expressions to use 'instance' instead of 'exported_instance'. Any 'label_replace' call that maps the 'exported_instance' label to the 'instance' label is no longer required, as the 'instance' label is now directly available.
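
To illustrate the rewrite described above, here is a minimal Python sketch. It is not code from this PR: the helper names (`rename_label`, `has_redundant_label_replace`) and any expressions passed to them are hypothetical, and only the metric and label names come from the description. It renames the label inside a PromQL expression string and flags a 'label_replace' that would become a no-op after the rename.

```python
import re

# Label names taken from the PR description.
OLD_LABEL = "exported_instance"
NEW_LABEL = "instance"

# Match the old label only as a whole identifier, so that longer names
# which merely contain it (e.g. "my_exported_instance_total") are untouched.
_OLD_LABEL_RE = re.compile(r"\b" + re.escape(OLD_LABEL) + r"\b")

# After the rename, a label_replace that copies the full value of "instance"
# back into "instance" changes nothing. This pattern detects only that exact
# argument shape; other regexes or replacements are deliberately not matched.
_NOOP_REPLACE_RE = re.compile(
    r'"instance"\s*,\s*"\$1"\s*,\s*"instance"\s*,\s*"\(\.\*\)"'
)


def rename_label(expr: str) -> str:
    """Return expr with every whole-word 'exported_instance' renamed to 'instance'."""
    return _OLD_LABEL_RE.sub(NEW_LABEL, expr)


def has_redundant_label_replace(expr: str) -> bool:
    """Report whether expr, once renamed, contains a no-op label_replace call."""
    return bool(_NOOP_REPLACE_RE.search(rename_label(expr)))
```

For example, `rename_label('sum by (exported_instance) (ceph_disk_occupation)')` yields an expression grouped by 'instance'. Expressions flagged by `has_redundant_label_replace` still need the wrapper removed by hand, because this sketch does not parse PromQL.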

@aruniiird
Contributor Author

@weirdwiz, @jmolmo, @umangachapagain, please take a look.
Tested all the changed expressions in a live cluster with the expected results (whereas the previous queries either returned no results or raised Prometheus expression evaluation errors).

@aruniiird aruniiird force-pushed the fix-ceph_disk_occupation-query branch from 0b43c92 to 7f6fa87 Compare September 19, 2024 14:19
@jmolmo
Member

jmolmo commented Sep 24, 2024

I think that the change is ok.
Just to note, the 'ceph_disk_occupation' metric comes from the "disk_occupation" metric generated by the Prometheus manager module.

As you can see, this metric never had the "exported_instance" label, so the change in the label name probably comes from the ODF side. You will probably need to check and understand when and why this label changed, and then verify that it does not affect other metrics.

@aruniiird aruniiird force-pushed the fix-ceph_disk_occupation-query branch from 7f6fa87 to 2d544db Compare September 24, 2024 11:34
@malayparida2000
Contributor

@aruniiird Would this be a blocker in 4.17?

@aruniiird
Contributor Author

> I think that the change is ok. Just to note, the 'ceph_disk_occupation' metric comes from the "disk_occupation" metric generated by the Prometheus manager module.
>
> As you can see, this metric never had the "exported_instance" label, so the change in the label name probably comes from the ODF side. You will probably need to check and understand when and why this label changed, and then verify that it does not affect other metrics.

Correct, @jmolmo. I checked on the ODF / OCS side and couldn't find much. There is a chance that these records / alerts have not been working for a long time. With this PR the changed expressions work, so those named records and alerts are enabled from now on.

@aruniiird
Contributor Author

> @aruniiird Would this be a blocker in 4.17?

@malayparida2000, this won't be a blocker (as the query may not have worked for some time), but it is a good candidate for a 4.17 z-stream release and for newer (4.18) releases.

@aruniiird
Contributor Author

We have a customer BZ, DFBUGS-789, related to this. Can we prioritize it? @malayparida2000, @umangachapagain, please take a look.

Contributor

@weirdwiz weirdwiz left a comment

We should probably also move these changes to the ceph-mixin repo, if we're keeping that up to date.

@weirdwiz
Contributor

/cherry-pick release-4.18

@openshift-cherrypick-robot

@weirdwiz: once the present PR merges, I will cherry-pick it on top of release-4.18 in a new PR and assign it to you.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@weirdwiz
Contributor

/cherry-pick release-4.17

@openshift-cherrypick-robot

@weirdwiz: once the present PR merges, I will cherry-pick it on top of release-4.17 in a new PR and assign it to you.

@weirdwiz
Contributor

/cherry-pick release-4.16

@openshift-cherrypick-robot

@weirdwiz: once the present PR merges, I will cherry-pick it on top of release-4.16 in a new PR and assign it to you.

@aruniiird aruniiird changed the title from "Fix 'ceph_disk_occupation' query expressions" to "DFBUGS-789: Fix 'ceph_disk_occupation' query expressions" Nov 28, 2024
@openshift-ci-robot openshift-ci-robot added the jira/severity-important, jira/valid-reference, and jira/invalid-bug labels Nov 28, 2024
@openshift-ci-robot

openshift-ci-robot commented Nov 28, 2024

@aruniiird: This pull request references [Jira Issue DFBUGS-789](https://issues.redhat.com//browse/DFBUGS-789), which is invalid:

  • expected the bug to target the "odf-4.18" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Need to address changes in 'ceph_disk_occupation' metric labels.

What is the change in 'ceph_disk_occupation' metric?
'ceph_disk_occupation' result no longer has 'exported_instance' label,
instead it has 'instance' label.

What is the issue we are facing because of it?
We are hitting 'PrometheusRuleFailures' due to this label change
in our alerts / rules, where this metric is used.
A second issue is that we are not seeing any results for some of the
query expressions.

What is the solution?
Update the query expressions, change 'exported_instance' to 'instance'.
Any 'label_replace' action which changes 'exported_instance' label to
'instance' label is no longer required (as the 'instance' label is
directly available now)

Signed-off-by: Arun Kumar Mohan <[email protected]>
@aruniiird aruniiird force-pushed the fix-ceph_disk_occupation-query branch from 2d544db to a81e357 Compare December 13, 2024 12:16
@aruniiird
Contributor Author

@umangachapagain, @malayparida2000, please take a look. Customers are asking for a solution...
PS: rebased on top of the latest master.

@aruniiird
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug label and removed the jira/invalid-bug label Dec 13, 2024
@openshift-ci-robot

openshift-ci-robot commented Dec 13, 2024

@aruniiird: This pull request references [Jira Issue DFBUGS-789](https://issues.redhat.com//browse/DFBUGS-789), which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (odf-4.18) matches configured target version for branch (odf-4.18)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

@openshift-ci openshift-ci bot added the lgtm label Dec 17, 2024
Contributor

openshift-ci bot commented Dec 17, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aruniiird, umangachapagain

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Dec 17, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit d62b2a2 into red-hat-storage:main Dec 17, 2024
11 checks passed
@openshift-ci-robot

openshift-ci-robot commented Dec 17, 2024

@aruniiird: [Jira Issue DFBUGS-789](https://issues.redhat.com//browse/DFBUGS-789): All pull requests linked via external trackers have merged:

[Jira Issue DFBUGS-789](https://issues.redhat.com//browse/DFBUGS-789) has been moved to the MODIFIED state.

@openshift-cherrypick-robot

@weirdwiz: new pull request created: #2933

@openshift-cherrypick-robot

@weirdwiz: new pull request created: #2934

@openshift-cherrypick-robot

@weirdwiz: new pull request created: #2935

Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/severity-important
  • jira/valid-bug: Indicates that the referenced jira bug is valid for the branch this PR is targeting
  • jira/valid-reference: Indicates that this PR references a valid jira ticket of any type
  • lgtm: Indicates that a PR is ready to be merged.
7 participants