
Conversation

@wking
Member

@wking wking commented Dec 20, 2023

965bfb2 (#939) pivoted from "every syncAvailableUpdates round that does anything useful has a fresh Cincinnati pull" to "some syncAvailableUpdates rounds have a fresh Cincinnati pull, but others just re-eval some Recommended=Unknown conditional updates". Then syncAvailableUpdates calls setAvailableUpdates.

However, until this commit, setAvailableUpdates had been bumping LastAttempt every time, even in the "just-re-eval conditional updates" case. That meant we never tripped the:

        } else if !optrAvailableUpdates.RecentlyChanged(optr.minimumUpdateCheckInterval) {
                klog.V(2).Infof("Retrieving available updates again, because more than %s has elapsed since %s", optr.minimumUpdateCheckInterval, optrAvailableUpdates.LastAttempt.Format(time.RFC3339))

condition to trigger a fresh Cincinnati pull, which could lead to deadlocks like:

  1. Cincinnati serves vulnerable PromQL, like "MCO-958: Blocking edges to 4.14.2+ and 4.13.25+" (cincinnati-graph-data#4524).
  2. Clusters pick up that broken PromQL, try to evaluate, and fail. Re-eval-and-fail loop continues.
  3. Cincinnati PromQL fixed, like "Fix AROBrokenDNSMasq" (cincinnati-graph-data#4528).
  4. Cases:
    • (a) Before 965bfb2, and also after this commit, clusters pick up the fixed PromQL, try to evaluate, and start succeeding. Hooray!
    • (b) Clusters with 965bfb2 but without this commit say "it's been a long time since we pulled fresh Cincinnati information, but it has not been long since my last attempt to eval this broken PromQL, so let me skip the Cincinnati pull and re-eval that old PromQL", which fails. Re-eval-and-fail loop continues.

To break out of 4.b, clusters on impacted releases can roll their CVO pod:

$ oc -n openshift-cluster-version delete -l k8s-app=cluster-version-operator pod

which will clear out LastAttempt and trigger a fresh Cincinnati pull. I'm not sure if there's another recovery method...
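For illustration, here is a minimal, self-contained Go sketch of the idea behind the fix — not the actual patch: only LastAttempt and RecentlyChanged match real CVO names, while the freshCincinnatiPull flag and the surrounding types are hypothetical:

package main

import (
	"fmt"
	"time"
)

// availableUpdates stands in for the CVO's cached Cincinnati state.
// Only LastAttempt and RecentlyChanged match the real names; the rest
// is illustrative.
type availableUpdates struct {
	LastAttempt time.Time
}

// RecentlyChanged reports whether the last Cincinnati attempt happened
// within the given interval.
func (u *availableUpdates) RecentlyChanged(interval time.Duration) bool {
	return u.LastAttempt.After(time.Now().Add(-interval))
}

// setAvailableUpdates sketches the fix: bump LastAttempt only on rounds
// that actually pulled a fresh graph from Cincinnati, so re-eval-only
// rounds can no longer suppress the next pull indefinitely.
func (u *availableUpdates) setAvailableUpdates(freshCincinnatiPull bool) {
	if freshCincinnatiPull {
		u.LastAttempt = time.Now()
	}
}

func main() {
	u := &availableUpdates{LastAttempt: time.Now().Add(-2 * time.Hour)}
	u.setAvailableUpdates(false)              // re-eval-only round: no bump
	fmt.Println(u.RecentlyChanged(time.Hour)) // false: next round pulls Cincinnati
	u.setAvailableUpdates(true)               // fresh pull: bump
	fmt.Println(u.RecentlyChanged(time.Hour)) // true: pulls throttled again
}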

@openshift-ci openshift-ci bot added the approved label Dec 20, 2023
965bfb2 (pkg/cvo/availableupdates: Requeue risk evaluation on
failure, 2023-09-18, openshift#939) pivoted from "every syncAvailableUpdates
round that does anything useful has a fresh Cincinnati pull" to "some
syncAvailableUpdates rounds have a fresh Cincinnati pull, but others
just re-eval some Recommended=Unknown conditional updates".  Then
syncAvailableUpdates calls setAvailableUpdates.

However, until this commit, setAvailableUpdates had been bumping
LastAttempt every time, even in the "just-re-eval conditional updates"
case.  That meant we never tripped the:

        } else if !optrAvailableUpdates.RecentlyChanged(optr.minimumUpdateCheckInterval) {
                klog.V(2).Infof("Retrieving available updates again, because more than %s has elapsed since %s", optr.minimumUpdateCheckInterval, optrAvailableUpdates.LastAttempt.Format(time.RFC3339))

condition to trigger a fresh Cincinnati pull, which could lead to
deadlocks like:

1. Cincinnati serves vulnerable PromQL, like [1].
2. Clusters pick up that broken PromQL, try to evaluate, and fail.
   Re-eval-and-fail loop continues.
3. Cincinnati PromQL fixed, like [2].
4. Cases:
   a. Before 965bfb2, and also after this commit, clusters pick up
      the fixed PromQL, try to evaluate, and start succeeding.  Hooray!
   b. Clusters with 965bfb2 but without this commit say "it's been
      a long time since we pulled fresh Cincinnati information, but it
      has not been long since my last attempt to eval this broken
      PromQL, so let me skip the Cincinnati pull and re-eval that old
      PromQL", which fails.  Re-eval-and-fail loop continues.

To break out of 4.b, clusters on impacted releases can roll their CVO
pod:

  $ oc -n openshift-cluster-version delete -l k8s-app=cluster-version-operator pod

which will clear out LastAttempt and trigger a fresh Cincinnati pull.
I'm not sure if there's another recovery method...

[1]: openshift/cincinnati-graph-data#4524
[2]: openshift/cincinnati-graph-data#4528
@wking wking force-pushed the only-bump-last-attempt-on-fresh-cincinnati-pulls branch from eaea152 to e2d6af5 on December 20, 2023 07:38
@wking wking changed the title pkg/cvo/availableupdates: Only bump LastAttempt on Cincinnati pulls OCPBUGS-25708: pkg/cvo/availableupdates: Only bump LastAttempt on Cincinnati pulls Dec 20, 2023
@openshift-ci-robot openshift-ci-robot added the jira/severity-important, jira/valid-reference, and jira/valid-bug labels Dec 20, 2023
@openshift-ci-robot
Contributor

@wking: This pull request references Jira Issue OCPBUGS-25708, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jiajliu

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from jiajliu December 20, 2023 07:48
@petr-muller
Member

/cc

@openshift-ci openshift-ci bot requested a review from petr-muller December 20, 2023 15:18
@wking
Member Author

wking commented Dec 20, 2023

Ok, Cluster Bot testing with launch 4.16,openshift/cluster-version-operator#1009 gcp (logs):

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/upstream", "value": "https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo/cincinnati-graph.json"}]'
$ oc adm upgrade channel demo
$ oc get -o json clusterversion version | jq '.status.conditionalUpdates[] | {conditions, risks: ([.risks[] | {name, promql: .matchingRules[0].promql.promql}])}'
{
  "conditions": [
    {
      "lastTransitionTime": "2023-12-20T19:03:57Z",
      "message": "The update is recommended, because none of the conditional update risks apply to this cluster.",
      "reason": "AsExpected",
      "status": "True",
      "type": "Recommended"
    }
  ],
  "risks": [
    {
      "name": "A",
      "promql": "cluster_operator_conditions"
    },
    {
      "name": "B",
      "promql": "group(cluster_version_available_updates{channel=\"buggy\"})\nor\n0 * group(cluster_version_available_updates{channel!=\"buggy\"})"
    },
    {
      "name": "C",
      "promql": "group(csv_succeeded{name=~\"local-storage-operator[.].*\"}) or 0 * group(csv_count)"
    },
    {
      "name": "D",
      "promql": "0 * max(cluster_version)"
    },
    {
      "name": "E",
      "promql": "0 * 0 * max(cluster_version)"
    }
  ]
}

I dunno how the cluster_operator_conditions eval isn't failing on multiple matching time-series. Let me add more debugging and try again...

@wking
Member Author

wking commented Dec 20, 2023

Ok, new Cluster Bot round with launch 4.16,openshift/cluster-version-operator#1009,openshift/cluster-version-operator#1010 gcp (logs):

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/upstream", "value": "https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo/cincinnati-graph.json"}]'
$ oc adm upgrade channel demo
$ oc adm upgrade --include-not-recommended
Cluster version is 4.16.0-0.test-2023-12-20-194115-ci-ln-hyd2jpt-latest

Upstream: https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo/cincinnati-graph.json
Channel: demo
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.

Supported but not recommended updates:

  Version: 4.16.1
  Image: quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000
  Recommended: Unknown
  Reason: EvaluationFailed
  Message: Could not evaluate exposure to update risk A (invalid PromQL result length must be one, but is 147)
    A description: A.
    A URL: https://bug.example.com/a

So hooray, we're sticking on multiple matches now. Not sure why we didn't last time... Continuing with the OCPBUGS-25708 reproducer procedure:

$ sed -i 's/cluster_operator_conditions/group(cluster_operator_conditions)/' cincinnati-graph.json
$ git commit -am 'Fix broken PromQL'
$ git push wking demo
$ sleep 600
$ oc adm upgrade --include-not-recommended
Cluster version is 4.16.0-0.test-2023-12-20-194115-ci-ln-hyd2jpt-latest

Upstream: https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo/cincinnati-graph.json
Channel: demo
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.

Supported but not recommended updates:

  Version: 4.16.1
  Image: quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000
  Recommended: False
  Reason: A
  Message: A. https://bug.example.com/a

So it noticed the PromQL expression fix, and went from Unknown (many matching time-series) to False (group(...) evaluates to "1, I'm exposed"). Looks good to me :)
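To make the contract this demo exercises concrete, here's a hedged Go sketch — not the CVO's code, and evaluateRisk with its signature is made up: a risk's PromQL must return exactly one sample, and that sample's value decides exposure:

package main

import "fmt"

// evaluateRisk illustrates the behavior observed above: a risk's PromQL
// result must contain exactly one sample; a nonzero value means the
// cluster is exposed (Recommended=False), zero means not exposed. This
// is a sketch of the observed contract, not the CVO implementation.
func evaluateRisk(samples []float64) (exposed bool, err error) {
	if len(samples) != 1 {
		// The Recommended=Unknown / EvaluationFailed case from the first attempt.
		return false, fmt.Errorf("invalid PromQL result length must be one, but is %d", len(samples))
	}
	return samples[0] != 0, nil
}

func main() {
	_, err := evaluateRisk(make([]float64, 147)) // bare cluster_operator_conditions: 147 series
	fmt.Println(err)
	exposed, _ := evaluateRisk([]float64{1}) // group(cluster_operator_conditions): one series
	fmt.Println(exposed)                     // true: Recommended=False, reason A
}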

@LalatenduMohanty
Member

/assign @LalatenduMohanty

@LalatenduMohanty
Member

So hooray, we're sticking on multiple matches now.

@wking Not sure what you meant here. Can you please explain?

@LalatenduMohanty
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Dec 21, 2023
@wking
Member Author

wking commented Dec 21, 2023

So hooray, we're sticking on multiple matches now.

@wking Not sure what you meant here. Can you please explain?

In my first verification attempt, when I fed the cluster cluster_operator_conditions as a PromQL expression, it picked up the expression, and apparently evaluated it, but still had Recommended=True for the update. I'd have expected Recommended=Unknown with "this PromQL matches >1 time-series". I still don't understand what was going on in that run.

In my second verification attempt, I mixed in #1010 for debugging, but didn't end up needing it, because we got Recommended=Unknown with Could not evaluate exposure to update risk A (invalid PromQL result length must be one, but is 147) immediately. And the "hooray..." was celebrating that difference from my first verification attempt.

Member

@petr-muller petr-muller left a comment

Looks good! Pretty nasty failure case - it's quite easy in hindsight, but hard to figure out before you know it's there ;)

@openshift-ci
Contributor

openshift-ci bot commented Dec 21, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LalatenduMohanty, petr-muller, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [LalatenduMohanty,petr-muller,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shellyyang1989
Contributor

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved label Dec 29, 2023
@openshift-ci-robot
Contributor

@wking: This pull request references Jira Issue OCPBUGS-25708, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @shellyyang1989


@openshift-bot
Contributor

/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.

@openshift-ci-robot
Contributor

@openshift-bot: This pull request references Jira Issue OCPBUGS-25708, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @shellyyang1989


@wking
Member Author

wking commented Jan 2, 2024

cluster-baremetal-operator crash-looping is getting addressed in openshift/cluster-baremetal-operator#395. I dunno about the RequiredInstallerResourcesMissing issues from kube-apiserver-operator, but that's also unrelated to this pull.

/override ci/prow/e2e-agnostic-ovn-upgrade-out-of-change

@openshift-ci
Contributor

openshift-ci bot commented Jan 2, 2024

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-ovn-upgrade-out-of-change


@openshift-ci
Contributor

openshift-ci bot commented Jan 2, 2024

@wking: all tests passed!



@openshift-merge-bot openshift-merge-bot bot merged commit 5d3c08b into openshift:master Jan 2, 2024
@openshift-ci-robot
Contributor

@wking: Jira Issue OCPBUGS-25708: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-25708 has been moved to the MODIFIED state.


@wking wking deleted the only-bump-last-attempt-on-fresh-cincinnati-pulls branch January 2, 2024 19:24
@wking
Member Author

wking commented Jan 2, 2024

/cherrypick release-4.15

@openshift-cherrypick-robot

@wking: new pull request created: #1013


@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-version-operator-container-v4.16.0-202401022050.p0.g5d3c08b.assembly.stream for distgit cluster-version-operator.
All builds following this will include this PR.
