
Conversation

@wking
Member

@wking wking commented Dec 20, 2023

965bfb2 (#939) pivoted from "every syncAvailableUpdates round that does anything useful has a fresh Cincinnati pull" to "some syncAvailableUpdates rounds have a fresh Cincinnati pull, but others just re-eval some Recommended=Unknown conditional updates". Then syncAvailableUpdates calls setAvailableUpdates.

However, until this commit, setAvailableUpdates had been bumping LastAttempt every time, even in the "just-re-eval conditional updates" case. That meant we never tripped the:

        } else if !optrAvailableUpdates.RecentlyChanged(optr.minimumUpdateCheckInterval) {
                klog.V(2).Infof("Retrieving available updates again, because more than %s has elapsed since %s", optr.minimumUpdateCheckInterval, optrAvailableUpdates.LastAttempt.Format(time.RFC3339))

condition to trigger a fresh Cincinnati pull, which could lead to deadlocks like:

  1. Cincinnati serves vulnerable PromQL, like "MCO-958: Blocking edges to 4.14.2+ and 4.13.25+" (cincinnati-graph-data#4524).
  2. Clusters pick up that broken PromQL, try to evaluate, and fail. Re-eval-and-fail loop continues.
  3. Cincinnati PromQL fixed, like "Fix AROBrokenDNSMasq" (cincinnati-graph-data#4528).
  4. Cases:
    • (a) Before 965bfb2, and also after this commit, clusters pick up the fixed PromQL, try to evaluate, and start succeeding. Hooray!
    • (b) Clusters with 965bfb2 but without this commit say "it's been a long time since we pulled fresh Cincinnati information, but it has not been long since my last attempt to eval this broken PromQL, so let me skip the Cincinnati pull and re-eval that old PromQL", which fails. Re-eval-and-fail loop continues.

To break out of 4.b, clusters on impacted releases can roll their CVO pod:

$ oc -n openshift-cluster-version delete -l k8s-app=cluster-version-operator pod

which will clear out LastAttempt and trigger a fresh Cincinnati pull. I'm not sure if there's another recovery method...
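For illustration, here is a minimal, self-contained Go sketch of the idea behind the fix — not the actual patch: only LastAttempt and RecentlyChanged match real CVO names, while the freshCincinnatiPull flag and the surrounding types are hypothetical:

package main

import (
	"fmt"
	"time"
)

// availableUpdates stands in for the CVO's cached Cincinnati state.
// Only LastAttempt and RecentlyChanged match the real names; the rest
// is illustrative.
type availableUpdates struct {
	LastAttempt time.Time
}

// RecentlyChanged reports whether the last Cincinnati attempt happened
// within the given interval.
func (u *availableUpdates) RecentlyChanged(interval time.Duration) bool {
	return u.LastAttempt.After(time.Now().Add(-interval))
}

// setAvailableUpdates sketches the fix: bump LastAttempt only on rounds
// that actually pulled a fresh graph from Cincinnati, so re-eval-only
// rounds can no longer suppress the next pull indefinitely.
func (u *availableUpdates) setAvailableUpdates(freshCincinnatiPull bool) {
	if freshCincinnatiPull {
		u.LastAttempt = time.Now()
	}
}

func main() {
	u := &availableUpdates{LastAttempt: time.Now().Add(-2 * time.Hour)}
	u.setAvailableUpdates(false)              // re-eval-only round: no bump
	fmt.Println(u.RecentlyChanged(time.Hour)) // false: next round pulls Cincinnati
	u.setAvailableUpdates(true)               // fresh pull: bump
	fmt.Println(u.RecentlyChanged(time.Hour)) // true: pulls throttled again
}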

@openshift-ci openshift-ci bot added the approved label Dec 20, 2023
965bfb2 (pkg/cvo/availableupdates: Requeue risk evaluation on
failure, 2023-09-18, openshift#939) pivoted from "every syncAvailableUpdates
round that does anything useful has a fresh Cincinnati pull" to "some
syncAvailableUpdates rounds have a fresh Cincinnati pull, but others
just re-eval some Recommended=Unknown conditional updates".  Then
syncAvailableUpdates calls setAvailableUpdates.

However, until this commit, setAvailableUpdates had been bumping
LastAttempt every time, even in the "just-re-eval conditional updates"
case.  That meant we never tripped the:

        } else if !optrAvailableUpdates.RecentlyChanged(optr.minimumUpdateCheckInterval) {
                klog.V(2).Infof("Retrieving available updates again, because more than %s has elapsed since %s", optr.minimumUpdateCheckInterval, optrAvailableUpdates.LastAttempt.Format(time.RFC3339))

condition to trigger a fresh Cincinnati pull, which could lead to
deadlocks like:

1. Cincinnati serves vulnerable PromQL, like [1].
2. Clusters pick up that broken PromQL, try to evaluate, and fail.
   Re-eval-and-fail loop continues.
3. Cincinnati PromQL fixed, like [2].
4. Cases:
   a. Before 965bfb2, and also after this commit, clusters pick up
      the fixed PromQL, try to evaluate, and start succeeding.  Hooray!
   b. Clusters with 965bfb2 but without this commit say "it's been
      a long time since we pulled fresh Cincinnati information, but it
      has not been long since my last attempt to eval this broken
      PromQL, so let me skip the Cincinnati pull and re-eval that old
      PromQL", which fails.  Re-eval-and-fail loop continues.

To break out of 4.b, clusters on impacted releases can roll their CVO
pod:

  $ oc -n openshift-cluster-version delete -l k8s-app=cluster-version-operator pod

which will clear out LastAttempt and trigger a fresh Cincinnati pull.
I'm not sure if there's another recovery method...

[1]: openshift/cincinnati-graph-data#4524
[2]: openshift/cincinnati-graph-data#4528
@wking wking force-pushed the only-bump-last-attempt-on-fresh-cincinnati-pulls branch from eaea152 to e2d6af5 on December 20, 2023 07:38
@wking wking changed the title pkg/cvo/availableupdates: Only bump LastAttempt on Cincinnati pulls OCPBUGS-25708: pkg/cvo/availableupdates: Only bump LastAttempt on Cincinnati pulls Dec 20, 2023
@openshift-ci-robot openshift-ci-robot added the jira/severity-important, jira/valid-reference, and jira/valid-bug labels Dec 20, 2023
@openshift-ci-robot
Contributor

@wking: This pull request references Jira Issue OCPBUGS-25708, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jiajliu

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from jiajliu December 20, 2023 07:48
@petr-muller
Member

/cc

@openshift-ci openshift-ci bot requested a review from petr-muller December 20, 2023 15:18
@wking
Member Author

wking commented Dec 20, 2023

Ok, Cluster Bot testing with launch 4.16,openshift/cluster-version-operator#1009 gcp (logs):

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/upstream", "value": "https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo/cincinnati-graph.json"}]'
$ oc adm upgrade channel demo
$ oc get -o json clusterversion version | jq '.status.conditionalUpdates[] | {conditions, risks: ([.risks[] | {name, promql: .matchingRules[0].promql.promql}])}'
{
  "conditions": [
    {
      "lastTransitionTime": "2023-12-20T19:03:57Z",
      "message": "The update is recommended, because none of the conditional update risks apply to this cluster.",
      "reason": "AsExpected",
      "status": "True",
      "type": "Recommended"
    }
  ],
  "risks": [
    {
      "name": "A",
      "promql": "cluster_operator_conditions"
    },
    {
      "name": "B",
      "promql": "group(cluster_version_available_updates{channel=\"buggy\"})\nor\n0 * group(cluster_version_available_updates{channel!=\"buggy\"})"
    },
    {
      "name": "C",
      "promql": "group(csv_succeeded{name=~\"local-storage-operator[.].*\"}) or 0 * group(csv_count)"
    },
    {
      "name": "D",
      "promql": "0 * max(cluster_version)"
    },
    {
      "name": "E",
      "promql": "0 * 0 * max(cluster_version)"
    }
  ]
}

I dunno how the cluster_operator_conditions eval isn't failing on multiple matching time-series. Let me add more debugging and try again...

@wking
Member Author

wking commented Dec 20, 2023

Ok, new Cluster Bot round with launch 4.16,openshift/cluster-version-operator#1009,openshift/cluster-version-operator#1010 gcp (logs):

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/upstream", "value": "https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo/cincinnati-graph.json"}]'
$ oc adm upgrade channel demo
$ oc adm upgrade --include-not-recommended
Cluster version is 4.16.0-0.test-2023-12-20-194115-ci-ln-hyd2jpt-latest

Upstream: https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo/cincinnati-graph.json
Channel: demo
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.

Supported but not recommended updates:

  Version: 4.16.1
  Image: quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000
  Recommended: Unknown
  Reason: EvaluationFailed
  Message: Could not evaluate exposure to update risk A (invalid PromQL result length must be one, but is 147)
    A description: A.
    A URL: https://bug.example.com/a

So hooray, we're sticking on multiple matches now. Not sure why we didn't last time... Continuing with the OCPBUGS-25708 reproducer procedure:

$ sed -i 's/cluster_operator_conditions/group(cluster_operator_conditions)/' cincinnati-graph.json
$ git commit -am 'Fix broken PromQL'
$ git push wking demo
$ sleep 600
$ oc adm upgrade --include-not-recommended
Cluster version is 4.16.0-0.test-2023-12-20-194115-ci-ln-hyd2jpt-latest

Upstream: https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo/cincinnati-graph.json
Channel: demo
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.

Supported but not recommended updates:

  Version: 4.16.1
  Image: quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000
  Recommended: False
  Reason: A
  Message: A. https://bug.example.com/a

So it noticed the PromQL expression fix, and went from Unknown (many matching time-series) to False (group(...) evaluates to "1, I'm exposed"). Looks good to me :)
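To make the contract this demo exercises concrete, here's a hedged Go sketch — not the CVO's code, and evaluateRisk with its signature is made up: a risk's PromQL must return exactly one sample, and that sample's value decides exposure:

package main

import "fmt"

// evaluateRisk illustrates the behavior observed above: a risk's PromQL
// result must contain exactly one sample; a nonzero value means the
// cluster is exposed (Recommended=False), zero means not exposed. This
// is a sketch of the observed contract, not the CVO implementation.
func evaluateRisk(samples []float64) (exposed bool, err error) {
	if len(samples) != 1 {
		// The Recommended=Unknown / EvaluationFailed case from the first attempt.
		return false, fmt.Errorf("invalid PromQL result length must be one, but is %d", len(samples))
	}
	return samples[0] != 0, nil
}

func main() {
	_, err := evaluateRisk(make([]float64, 147)) // bare cluster_operator_conditions: 147 series
	fmt.Println(err)
	exposed, _ := evaluateRisk([]float64{1}) // group(cluster_operator_conditions): one series
	fmt.Println(exposed)                     // true: Recommended=False, reason A
}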

@LalatenduMohanty
Member

/assign @LalatenduMohanty

@LalatenduMohanty
Member

So hooray, we're sticking on multiple matches now.

@wking Not sure what you meant here. Can you please explain?

@LalatenduMohanty
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Dec 21, 2023
@wking
Member Author

wking commented Dec 21, 2023

So hooray, we're sticking on multiple matches now.

@wking Not sure what you meant here. Can you please explain?

In my first verification attempt, when I fed the cluster cluster_operator_conditions as a PromQL expression, it picked up the expression, and apparently evaluated it, but still had Recommended=True for the update. I'd have expected Recommended=Unknown with "this PromQL matches >1 time-series". I still don't understand what was going on in that run.

In my second verification attempt, I mixed in #1010 for debugging, but didn't end up needing it, because we got Recommended=Unknown with Could not evaluate exposure to update risk A (invalid PromQL result length must be one, but is 147) immediately. And the "hooray..." was celebrating that difference from my first verification attempt.

Member

@petr-muller petr-muller left a comment

Looks good! Pretty nasty failure case - it's quite easy in hindsight, but hard to figure out before you know it's there ;)

@openshift-ci
Contributor

openshift-ci bot commented Dec 21, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LalatenduMohanty, petr-muller, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [LalatenduMohanty,petr-muller,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shellyyang1989
Contributor

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved label Dec 29, 2023
@openshift-ci-robot
Contributor

@wking: This pull request references Jira Issue OCPBUGS-25708, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @shellyyang1989


@openshift-bot
Contributor

/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.

@openshift-ci-robot
Contributor

@openshift-bot: This pull request references Jira Issue OCPBUGS-25708, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @shellyyang1989


@wking
Member Author

wking commented Jan 2, 2024

cluster-baremetal-operator crash-looping is getting addressed in openshift/cluster-baremetal-operator#395. I dunno about the RequiredInstallerResourcesMissing issues from kube-apiserver-operator, but that's also unrelated to this pull.

/override ci/prow/e2e-agnostic-ovn-upgrade-out-of-change

@openshift-ci
Contributor

openshift-ci bot commented Jan 2, 2024

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-ovn-upgrade-out-of-change


@openshift-ci
Contributor

openshift-ci bot commented Jan 2, 2024

@wking: all tests passed!



@openshift-merge-bot openshift-merge-bot bot merged commit 5d3c08b into openshift:master Jan 2, 2024
@openshift-ci-robot
Contributor

@wking: Jira Issue OCPBUGS-25708: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-25708 has been moved to the MODIFIED state.


@wking wking deleted the only-bump-last-attempt-on-fresh-cincinnati-pulls branch January 2, 2024 19:24
@wking
Member Author

wking commented Jan 2, 2024

/cherrypick release-4.15

@openshift-cherrypick-robot

@wking: new pull request created: #1013


@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-version-operator-container-v4.16.0-202401022050.p0.g5d3c08b.assembly.stream for distgit cluster-version-operator.
All builds following this will include this PR.
