@petr-muller
Member

@petr-muller petr-muller commented Aug 31, 2023

Add a new metric, cluster_version_conditional_updates_recommended_conditions_seconds, which, for each conditional update known to the CVO, reports how long (in seconds) the Recommended condition on that update has been in its current state (that is, the time since the condition's lastTransitionTime). The metric is labelled with version, reason, and status. Note that lastTransitionTime was not correctly maintained by the CVO until this was fixed in #964.

Using this metric, we create an alert that fires when an update has been in the Unknown status for more than 50 minutes and that state then persists for a further 10 minutes.

The metric is slightly more generic than the alert strictly needs. Technically, the CVO could compute the "bad" state itself and just toggle a 0/1 metric, but I believe metrics are more useful when they are slightly generic.
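
To make the arithmetic in this description concrete, here is a minimal Python sketch (not the CVO's Go code) of how such a duration could be derived from lastTransitionTime and how the alert condition could be approximated. The function names are hypothetical, and folding the 10-minute hold into a single 60-minute threshold is a simplifying assumption: a real Prometheus rule would express the hold separately and would also depend on the rule evaluation interval.

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the PR description; the names are illustrative only.
UNKNOWN_THRESHOLD = timedelta(minutes=50)
HOLD_FOR = timedelta(minutes=10)


def seconds_in_state(last_transition_time: datetime, now: datetime) -> float:
    """Value a duration-style metric would report: time since lastTransitionTime."""
    return (now - last_transition_time).total_seconds()


def alert_fires(status: str, last_transition_time: datetime, now: datetime) -> bool:
    """Approximate the alert: Unknown for more than 50 minutes, held for 10 more."""
    if status != "Unknown":
        return False
    age = timedelta(seconds=seconds_in_state(last_transition_time, now))
    # Assumption: evaluation is continuous, so the hold simply adds to the threshold.
    return age > UNKNOWN_THRESHOLD + HOLD_FOR


if __name__ == "__main__":
    now = datetime(2023, 10, 26, 13, 0, 0, tzinfo=timezone.utc)
    print(alert_fires("Unknown", now - timedelta(minutes=61), now))
```

Keeping the duration as a plain number and leaving the threshold to the alert is what makes the metric reusable for other conditions or other thresholds.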

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Aug 31, 2023
@openshift-ci-robot
Contributor

@petr-muller: This pull request references Jira Issue OCPBUGS-9050, which is invalid:

  • expected the bug to target the "4.14.0" version, but it targets "4.12.z" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Aug 31, 2023
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 31, 2023
@openshift-ci
Contributor

openshift-ci bot commented Aug 31, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@petr-muller
Member Author

/test all

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 31, 2023
@petr-muller
Member Author

Haha, we do not respect the lastTransitionTime contract and reset it to zero even without a status change...
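
For readers unfamiliar with the contract mentioned here, the following is a small Python sketch of the expected behaviour: the transition time moves only when the status actually changes. The `Condition` class and `set_condition` helper are hypothetical and are not the CVO's implementation.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Condition:
    type: str
    status: str
    reason: str
    last_transition_time: datetime


def set_condition(existing: Optional[Condition], type_: str, status: str,
                  reason: str, now: datetime) -> Condition:
    """Return the updated condition, bumping the timestamp only on a status change."""
    if existing is not None and existing.type == type_ and existing.status == status:
        # Same status: keep the original transition time, refresh the reason.
        return replace(existing, reason=reason)
    # New condition or changed status: this is a real transition.
    return Condition(type_, status, reason, now)
```

With this rule, re-evaluating an update and getting the same result leaves the timestamp alone, so a duration derived from it keeps growing instead of dropping back to zero.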

image

@petr-muller petr-muller force-pushed the ocpbugs-9050-cannot-evaluate-conditional-updates branch from 384f24b to 4261ee7 Compare September 4, 2023 10:21
@petr-muller
Member Author

/test all

@petr-muller petr-muller force-pushed the ocpbugs-9050-cannot-evaluate-conditional-updates branch from 4261ee7 to d032eeb Compare September 4, 2023 15:16
@petr-muller
Member Author

/test all

@petr-muller petr-muller force-pushed the ocpbugs-9050-cannot-evaluate-conditional-updates branch from d032eeb to bc442fd Compare September 4, 2023 15:18
@petr-muller
Member Author

/test all

@petr-muller
Member Author

/retest

@petr-muller petr-muller force-pushed the ocpbugs-9050-cannot-evaluate-conditional-updates branch from bc442fd to 14199a3 Compare September 5, 2023 12:10
@petr-muller
Member Author

/test all

@petr-muller
Member Author

/jira refresh

@openshift-ci-robot
Contributor

@petr-muller: This pull request references Jira Issue OCPBUGS-9050, which is invalid:

  • expected the bug to target the "4.14.0" version, but it targets "4.15.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


@petr-muller
Member Author

petr-muller commented Sep 5, 2023

Needs #964 first

@petr-muller
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Sep 8, 2023
@openshift-ci-robot
Contributor

@petr-muller: This pull request references Jira Issue OCPBUGS-9050, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.15.0) matches configured target version for branch (4.15.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @shellyyang1989


@petr-muller petr-muller force-pushed the ocpbugs-9050-cannot-evaluate-conditional-updates branch from 14199a3 to db1a301 Compare September 12, 2023 13:51
@petr-muller petr-muller marked this pull request as ready for review September 12, 2023 13:52
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 12, 2023
- Used `max by` in the alert to drop irrelevant labels from the alert
- Replaced further `"Recommended"` literals with new constant
- Do not `ToLower` booleanish string labels
- Report metrics labelled with `condition` and `status` instead of a
  specialized `recommended` label. We still export only the `Recommended`
  condition.
- Naming and code structure tweaks
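
As an illustration of what `max by` does to label sets, here is a small Python sketch that mimics the aggregation on in-memory samples. The kept labels mirror those visible on the fired alert later in this thread; the `pod` label, the version string, and the sample values are made up for the example, and `max_by` is a hypothetical helper rather than any Prometheus API.

```python
def max_by(samples, keep):
    """Aggregate (labels, value) samples, keeping only the `keep` labels.

    Mimics the effect of a PromQL `max by (...)`: samples that agree on the
    kept labels collapse into one, carrying the largest value.
    """
    groups = {}
    for labels, value in samples:
        # A label missing from a sample is treated as the empty string.
        key = tuple((name, labels.get(name, "")) for name in keep)
        groups[key] = max(value, groups.get(key, float("-inf")))
    return [(dict(key), value) for key, value in groups.items()]


# Hypothetical scrape result: the same condition reported with two pod labels.
samples = [
    ({"version": "4.15.1", "condition": "Recommended", "status": "Unknown",
      "reason": "EvaluationFailed", "pod": "cvo-a"}, 3606.0),
    ({"version": "4.15.1", "condition": "Recommended", "status": "Unknown",
      "reason": "EvaluationFailed", "pod": "cvo-b"}, 3590.0),
]
aggregated = max_by(samples, ["version", "condition", "status", "reason"])
```

Dropping labels that vary for unrelated reasons means a restart or reschedule would not produce a second, separately tracked alert for the same update.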
@petr-muller petr-muller force-pushed the ocpbugs-9050-cannot-evaluate-conditional-updates branch 3 times, most recently from ed62379 to e30c47c Compare October 24, 2023 16:02
If the CVO is just starting up, it should populate its "known state" of
available updates from the `ClusterVersion` status, if it contains
them. A previous CVO may have evaluated the same graph data that the
current one is about to evaluate, and if it did, there are likely existing
conditions in the status that we need to respect (for example, do not
bump the `lastTransitionTime` field of a condition on a conditional update
that was already evaluated with the same result).
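
The idea in this commit message can be sketched in Python as follows. This is a hedged illustration only: the dictionary layout follows the `conditionalUpdates` YAML shown in the testing comment in this thread, while `seed_known_conditions` and `evaluate` are hypothetical helpers, not functions from the CVO.

```python
def seed_known_conditions(status_conditional_updates):
    """Index conditions already recorded in the status by release version."""
    known = {}
    for update in status_conditional_updates:
        version = update["release"]["version"]
        known[version] = {c["type"]: c for c in update.get("conditions", [])}
    return known


def evaluate(known, version, status, reason, now):
    """Record an evaluation result, reusing lastTransitionTime if the status is unchanged."""
    previous = known.get(version, {}).get("Recommended")
    if previous is not None and previous["status"] == status:
        # Same result as the previous CVO recorded: do not bump the timestamp.
        transition = previous["lastTransitionTime"]
    else:
        transition = now
    condition = {
        "type": "Recommended",
        "status": status,
        "reason": reason,
        "lastTransitionTime": transition,
    }
    known.setdefault(version, {})["Recommended"] = condition
    return condition
```

Without the seeding step, the first evaluation after a restart would see no previous condition and would treat every result as a fresh transition.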
@petr-muller petr-muller force-pushed the ocpbugs-9050-cannot-evaluate-conditional-updates branch from e30c47c to 983b3d0 Compare October 24, 2023 16:05
@petr-muller
Member Author

This should now be ready for another round of review and testing. The underlying metric was changed to be timestamp-based rather than duration-based (a follow-up to the Slack conversation where we discovered this upstream recommendation).

The new commit 983b3d0 makes the CVO carry over the conditions already recorded for conditional updates in the ClusterVersion status when it starts and evaluates available updates for the first time.
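
A rough Python sketch of the timestamp-based approach: the exporter publishes the Unix time of the last transition, and the consumer derives the age at query time. Both function names are hypothetical, and the parser assumes the `YYYY-MM-DDTHH:MM:SSZ` form seen in the ClusterVersion status in this thread.

```python
from datetime import datetime, timezone


def condition_timestamp(last_transition_time: str) -> float:
    """Parse a UTC timestamp with a trailing "Z" into Unix seconds."""
    parsed = datetime.strptime(last_transition_time, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc).timestamp()


def age_seconds(metric_value: float, now: float) -> float:
    """What a query-side subtraction of the metric from the current time would yield."""
    return now - metric_value


if __name__ == "__main__":
    exported = condition_timestamp("2023-10-26T11:54:21Z")
    evaluated_at = condition_timestamp("2023-10-26T12:54:27Z")
    print(age_seconds(exported, evaluated_at))
```

One benefit of this shape is that the exported value changes only when the condition transitions, rather than on every scrape.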

}
}

// Collect collects metrics from the operator into the channel ch
Member

nit: truncated comment

Member

@wking wking left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 24, 2023
@openshift-ci
Contributor

openshift-ci bot commented Oct 24, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking


@LalatenduMohanty
Member

/label backport-risk-assessed

@openshift-ci openshift-ci bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label Oct 25, 2023
@petr-muller
Member Author

/hold cancel
/test e2e-hypershift

Lots of failures, but nothing points to the CVO.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 26, 2023
@petr-muller
Member Author

/hold

This was held because it is waiting for testing (the backport-risk-assessed label confused me; it is not needed here but in #985).

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 26, 2023
@shellyyang1989
Contributor

shellyyang1989 commented Oct 26, 2023

Pre-merge testing

# oc adm upgrade 
Cluster version is 4.15.0-0.test-2023-10-26-103451-ci-ln-59brt82-latest

// Patch dummy cincy

# oc patch clusterversion/version --patch '{"spec":{"upstream":"https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy-conditional-edge-invalid-promql.json"}}' --type=merge
clusterversion.config.openshift.io/version patched

# oc adm upgrade channel stable-4.15
warning: No channels known to be compatible with the current version "4.15.0-0.test-2023-10-26-103451-ci-ln-59brt82-latest"; unable to validate "stable-4.15". Setting the update channel to "stable-4.15" anyway.

// Recommended=Unknown update condition is present

conditionalUpdates:
    - conditions:
      - lastTransitionTime: "2023-10-26T11:54:21Z"
        message: |-
          Could not evaluate exposure to update risk InvalidPromQL (executing PromQL query: bad_data: 1:49: parse error: unexpected identifier "buggy" in label matching, expected string)
            InvalidPromQL description: Invalid Promql
            InvalidPromQL URL: https://invalid.com/a
        reason: EvaluationFailed
        status: Unknown
        type: Recommended
      release:
        image: registry.ci.openshift.org/ocp/release@sha256:d9759e7c8ec5e2555419d84ff36aff2a4c8f9367236c18e722a3fe4d7c4f6dee
        version: 4.15.0-0.nightly-2023-11-11-065245
      risks:
      - matchingRules:
        - promql:
            promql: group(cluster_version_available_updates{channel=buggy})
          type: PromQL
        message: Invalid Promql
        name: InvalidPromQL
        url: https://invalid.com/a

// After 1 hour, the alert CannotEvaluateConditionalUpdates fires

# curl -s -k -H "Authorization: Bearer $token"  https://$url/api/v1/alerts | jq -r '.data.alerts[]|select(.labels.alertname == "CannotEvaluateConditionalUpdates")'
{
  "labels": {
    "alertname": "CannotEvaluateConditionalUpdates",
    "condition": "Recommended",
    "reason": "EvaluationFailed",
    "severity": "warning",
    "status": "Unknown",
    "version": "4.15.0-0.nightly-2023-11-11-065245"
  },
  "annotations": {
    "description": "Failure to evaluate conditional update matches means that Cluster Version Operator cannot decide whether an update path is recommended or not.",
    "summary": "Cluster Version Operator cannot evaluate conditional update matches for 1h 0m 6s."
  },
  "state": "firing",
  "activeAt": "2023-10-26T12:54:27.319838523Z",
  "value": "3.606319000005722e+03"
}

// Delete CVO pod

# oc delete pod cluster-version-operator-5f94759dfb-hprlp -n openshift-cluster-version
pod "cluster-version-operator-5f94759dfb-hprlp" deleted

// The alert survives

# curl -s -k -H "Authorization: Bearer $token"  https://$url/api/v1/alerts | jq -r '.data.alerts[]|select(.labels.alertname == "CannotEvaluateConditionalUpdates")'
{
  "labels": {
    "alertname": "CannotEvaluateConditionalUpdates",
    "condition": "Recommended",
    "reason": "EvaluationFailed",
    "severity": "warning",
    "status": "Unknown",
    "version": "4.15.0-0.nightly-2023-11-11-065245"
  },
  "annotations": {
    "description": "Failure to evaluate conditional update matches means that Cluster Version Operator cannot decide whether an update path is recommended or not.",
    "summary": "Cluster Version Operator cannot evaluate conditional update matches for 1h 1m 6s."
  },
  "state": "firing",
  "activeAt": "2023-10-26T12:54:27.319838523Z",
  "value": "3.666319000005722e+03"
}

Looks good.
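
For anyone repeating this check without jq, the selection step and the duration shown in the alert summary can be sketched in Python. This is an illustrative sketch: `select_alerts` mirrors the jq filter used in this comment, and `humanize_duration` is a simplified, hypothetical formatter that reproduces the hours/minutes/seconds style seen in the summaries, not Prometheus's own template function.

```python
def select_alerts(payload, alertname):
    """Return alerts from an /api/v1/alerts response that match the given name."""
    return [
        alert
        for alert in payload["data"]["alerts"]
        if alert["labels"].get("alertname") == alertname
    ]


def humanize_duration(seconds: float) -> str:
    """Format whole seconds as days/hours/minutes/seconds, dropping leading zero units."""
    total = int(seconds)
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    if days:
        return f"{days}d {hours}h {minutes}m {secs}s"
    if hours:
        return f"{hours}h {minutes}m {secs}s"
    if minutes:
        return f"{minutes}m {secs}s"
    return f"{secs}s"
```

Applied to the two `value` fields above, the formatter yields the same durations as the two summary annotations, which is a quick way to confirm that the value kept growing across the CVO pod restart.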

@shellyyang1989
Contributor

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Oct 26, 2023
@petr-muller
Member Author

/hold cancel
/test e2e-hypershift

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 26, 2023
@openshift-ci
Contributor

openshift-ci bot commented Oct 26, 2023

@petr-muller: all tests passed!


@openshift-ci openshift-ci bot merged commit 00d0940 into openshift:master Oct 26, 2023
@openshift-ci-robot
Contributor

@petr-muller: Jira Issue OCPBUGS-9050: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-9050 has been moved to the MODIFIED state.


@openshift-merge-robot
Contributor

Fix included in accepted release 4.15.0-0.nightly-2023-10-27-135451

@petr-muller petr-muller deleted the ocpbugs-9050-cannot-evaluate-conditional-updates branch October 30, 2023 16:06

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
