
Conversation

@wking wking (Member) commented Oct 25, 2023

Picking #974 and #980 back to 4.12.

LalatenduMohanty and others added 30 commits November 7, 2022 18:07
With this change we will not fail precondition checks when a version is not in available updates.
So users should be able to upgrade to any version with appropriate client-side overrides,
for example using the --allow-explicit-upgrade and --to-image flags in "oc adm upgrade".

Signed-off-by: Lalatendu Mohanty <lmohanty@redhat.com>
…ry-pick-856-to-release-4.12

[release-4.12] OCPBUGS-3352: Do not fail precondition check for UnknownUpdate
…ddress

The problem was identified by Trevor and David to be a broken substitution of the internal load
balancer address into `KUBERNETES_SERVICE_HOST` (see my [JIRA comment](https://issues.redhat.com/browse/OCPBUGS-1458?focusedCommentId=21090756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21090756)
and related [Slack thread](https://coreos.slack.com/archives/C011CSSPBLK/p1664925995946479?thread_ts=1661182025.992649&cid=C011CSSPBLK)).

CVO injects the LB hostname in the
[`ModifyDeployment`](https://github.com/openshift/cluster-version-operator/blob/dc1ad0aef5f3e1b88074448d21445a5bddb6b05b/lib/resourcebuilder/apps.go#L19)
fine, but then the deployment gets applied in
[`ApplyDeployment`](https://github.com/openshift/cluster-version-operator/blob/dc1ad0aef5f3e1b88074448d21445a5bddb6b05b/lib/resourceapply/apps.go#L17)
and the
`EnsureDeployment`->`ensurePodTemplateSpec`->`ensurePodSpec`->`ensureContainers`->`ensureContainer`->`ensureEnvVar`
chain overwrites the updated value in `required` with the old value from
`existing`, reverting the injection.

This behavior was added intentionally in openshift#559
as a part of a fix for various hot-looping issues. The substitution
apparently caused some hot-looping issues in the past ([slack thread](https://coreos.slack.com/archives/CEGKQ43CP/p1620934857402200?thread_ts=1620895567.367100&cid=CEGKQ43CP)).
I have thoroughly tested removing the special handling of `KUBERNETES_SERVICE_HOST`
and saw no problematic behavior. After fixing other
hot-looping problems in openshift#855
to eliminate noise, no new hot-loops occur with
`KUBERNETES_SERVICE_HOST` handling removed.
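
To illustrate the stomp (a minimal sketch; the helper name, signature, and values are assumptions, not the repository's actual `ensureEnvVar`), the old in-cluster value wins over the freshly injected one:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// stompingEnsureEnvVar mimics the pre-fix merge behavior described above:
// when the variable already exists on the running Deployment, its old value
// is copied back into the desired spec, undoing the substitution that
// ModifyDeployment just performed.
func stompingEnsureEnvVar(required *corev1.EnvVar, existing corev1.EnvVar) {
	if existing.Name == required.Name && existing.Value != "" {
		required.Value = existing.Value
	}
}

func main() {
	// Hypothetical values for illustration only.
	required := corev1.EnvVar{Name: "KUBERNETES_SERVICE_HOST", Value: "api-int.example.com"} // freshly injected LB hostname
	existing := corev1.EnvVar{Name: "KUBERNETES_SERVICE_HOST", Value: "172.30.0.1"}          // stale value from the live Deployment

	stompingEnsureEnvVar(&required, existing)
	fmt.Println(required.Value) // prints the stale 172.30.0.1: the injection was reverted
}
```
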
The client-go code retries a subset of network errors on GET for 30s,
but we saw occurrences of other short disruptions, like DNS ones, that
make us abort and restart unnecessarily soon.

Make CVO retry all errors for 25s and only abort when we do not succeed
in this time frame. This helps CVO survive short disruptions on startup,
leading to less noise, mostly during installation.
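
A minimal sketch of that startup behavior (the helper and the 1s pause are assumptions; only the 25s budget comes from the change): retry every failure, whatever its type, until the budget runs out.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// getWithRetry retries the given call on any error, DNS failures included,
// until the overall budget expires, instead of relying on client-go's
// narrower 30s retry of selected network errors on GET.
func getWithRetry(ctx context.Context, budget time.Duration, get func(context.Context) error) error {
	deadline := time.Now().Add(budget)
	for {
		err := get(ctx)
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) || ctx.Err() != nil {
			return fmt.Errorf("giving up after %s: %w", budget, err)
		}
		time.Sleep(time.Second) // brief pause, then retry the same call
	}
}

func main() {
	err := getWithRetry(context.Background(), 25*time.Second, func(context.Context) error {
		return fmt.Errorf("transient DNS failure") // stand-in for a short disruption
	})
	fmt.Println(err)
}
```
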
…ry-pick-851-to-release-4.12

[release-4.12] OCPBUGS-3770: Allow CVO to update `KUBERNETES_SERVICE_HOST` with LB address
Avoid errors like:

  $ oc get -o json clusterversion version | jq -r '.status.history[0].acceptedRisks'
  Forced through blocking failures: Precondition "ClusterVersionRecommendedUpdate" failed because of "UnknownUpdate": RetrievedUpdates=True (), so the update from 4.13.0-0.okd-2022-12-11-064650 to 4.13.0-0.okd-2022-12-13-052859 is probably neither recommended nor supported.

Instead, tweak the logic from 481bcde (OCPBUGS-2727: Do not fail
precondition check for UnknownUpdate, 2022-10-21, openshift#856), and only
append the "Forced through blocking failures:" prefix when the forcing
was required.

Also fail closed and treat non-*Error errors as blocking too; previously only
!pferr.NonBlockingWarning errors were treated as blocking.
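
A hedged sketch of that decision (the error type and field names are stand-ins, not the CVO's actual precondition types): fail closed on unknown error types, and only prepend the prefix when something actually required forcing.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// preconditionError is a stand-in for the CVO's precondition error type.
type preconditionError struct {
	Name               string
	Message            string
	NonBlockingWarning bool
}

func (e *preconditionError) Error() string { return e.Name + ": " + e.Message }

// acceptedRisks joins the precondition failures and only adds the
// "Forced through blocking failures:" prefix when forcing was actually
// required: anything that is not a *preconditionError, and any
// *preconditionError that is not a non-blocking warning, counts as blocking.
func acceptedRisks(errs []error) string {
	blocking := false
	var msgs []string
	for _, err := range errs {
		var pferr *preconditionError
		if !errors.As(err, &pferr) || !pferr.NonBlockingWarning {
			blocking = true // fail closed
		}
		msgs = append(msgs, err.Error())
	}
	summary := strings.Join(msgs, "; ")
	if blocking {
		return "Forced through blocking failures: " + summary
	}
	return summary
}

func main() {
	fmt.Println(acceptedRisks([]error{&preconditionError{
		Name:               "ClusterVersionRecommendedUpdate",
		Message:            "the update is not in the list of recommended updates",
		NonBlockingWarning: true,
	}})) // no "Forced through blocking failures:" prefix
}
```
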
…ry-pick-876-to-release-4.12

OCPBUGS-5083: pkg/payload/precondition: Do not claim warnings would have blocked
Previously, the throttling reused the `minimumUpdateCheckInterval` value
which is derived from the full CVO minimum sync period. This value is
set between 2m and 4m at CVO startup. This period is unnecessarily long
and bad for UX: things happen with a delay, and our own test case expects
upgradeability to be propagated within 3 minutes at worst.

Hardcode the throttling to 2m (the lower bound of the previous behavior) to
prevent flapping on flurries of changes but allow changes to propagate
deterministically faster. We will still get a bit of non-determinism
from sync periods and requeueing, so this change should not cause any
periodic API-hammering.
Refactor the code that handles throttling upgradeability checks. Create
a new method that computes the duration for which the existing
`Upgradeable` status is considered recent enough to not be synced, and
simply pass this duration to the `RecentlyChanged` method. The new
method is now unit tested, too. Upgradeable-related intervals are now
decoupled from unrelated sync intervals and are grouped in a new struct.
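
A small sketch of the shape described (names and the exact struct are assumptions; only the 2m value comes from the change): the throttle period is its own constant and is simply compared against the last status change.

```go
package main

import (
	"fmt"
	"time"
)

// upgradeableCheckIntervals groups the Upgradeable-related intervals,
// decoupled from the general CVO sync period.
type upgradeableCheckIntervals struct {
	// throttlePeriod is how long an existing Upgradeable status is considered
	// recent enough to skip re-checking; hard-coded to 2m instead of being
	// derived from the 2m-4m minimum sync period.
	throttlePeriod time.Duration
}

// recentlyChanged reports whether the status changed within the given period
// and can therefore be reused without a fresh check.
func recentlyChanged(lastChange time.Time, period time.Duration) bool {
	return time.Since(lastChange) < period
}

func main() {
	intervals := upgradeableCheckIntervals{throttlePeriod: 2 * time.Minute}
	fmt.Println(recentlyChanged(time.Now().Add(-30*time.Second), intervals.throttlePeriod)) // true: reuse the status
	fmt.Println(recentlyChanged(time.Now().Add(-3*time.Minute), intervals.throttlePeriod))  // false: re-check now
}
```
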
…ry-pick-882-to-release-4.12

[release-4.12] OCPBUGS-5879: Set upgradeability check throttling period to 2m
Signed-off-by: Lalatendu Mohanty <lmohanty@redhat.com>
OCPBUGS-8304: Adding admin-gate ack-4.12-kube-1.26-api-removals-in-4.13
…rget version

When changing channels it's possible that multiple new conditional
update risks will need to be evaluated.  For instance, a cluster
running 4.10.34 in a 4.10 channel might only have to evaluate
OpenStackNodeCreationFails.  But when the channel is changed to a 4.11
channel multiple new risks require evaluation and the evaluation of
new risks was throttled at one every 10 minutes.  This means if there
are three new risks it may take up to 20 minutes after the channel has
changed for the full set of conditional updates to be computed.

With this commit, I'm sorting the conditional updates in
version-descending order, which is the order we've used in the
ClusterVersion status since c9dd479 (pkg/cvo/availableupdates:
Sort (conditional)updates, 2021-09-29, openshift#663).  This prioritizes the
longest-hop risks.  For example, 4.10.34 currently has the following
updates:

* 4.10.(z!=38): no risks
* 4.10.38: OpenStackNodeCreationFails
* 4.11.(z<10): ARM64SecCompError524, AWSOldBootImagesLackAfterburn, MachineConfigRenderingChurn, OVNNetworkPolicyLongName
* 4.11.(10<=z<26): ARM64SecCompError524, AWSOldBootImagesLackAfterburn, MachineConfigRenderingChurn
* 4.11.26: ARM64SecCompError524, AWSOldBootImagesLackAfterburn
* 4.11.(27<=z<...): AWSOldBootImagesLackAfterburn

By focusing on the largest target (say 4.11.30), we'd evaluate
AWSOldBootImagesLackAfterburn first.  If it did not match the current
cluster, 4.11.27 and later would be quickly recommended.  It would
take another 10m before the self-throttling allowed us to evaluate
ARM64SecCompError524, and once we had, that would unblock 4.11.26.
Ten minutes after that, we'd evaluate MachineConfigRenderingChurn, and
unblock 4.11.(10<=z<26).  And so on.  But folks on 4.10.34 today are
much more likely to be interested in 4.11.30 and other tip releases
than they are to care about 4.11.10 and other relatively old releases.
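
A minimal sketch of the sort itself (whether the blang/semver package matches the CVO's vendored dependency is an assumption): order the conditional-update targets version-descending so the longest-hop risks get evaluated first.

```go
package main

import (
	"fmt"
	"sort"

	"github.com/blang/semver/v4"
)

func main() {
	// Conditional-update target versions, in arbitrary order.
	targets := []string{"4.10.38", "4.11.26", "4.11.30", "4.11.9"}

	// Sort version-descending, so risks guarding the largest target (here
	// 4.11.30) are evaluated before risks that only guard older targets.
	sort.Slice(targets, func(i, j int) bool {
		return semver.MustParse(targets[i]).GT(semver.MustParse(targets[j]))
	})

	fmt.Println(targets) // [4.11.30 4.11.26 4.11.9 4.10.38]
}
```
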
When we separated payload load from payload apply (openshift#683), the context
used for the retrieval changed as well. It went from one that was
constrained by syncTimeout (2-4 minutes) [1] to the unconstrained
shutdownContext [2]. However, if "force" is specified, we explicitly set a
2-minute timeout in RetrievePayload. This commit creates a new context
with a reasonable timeout for RetrievePayload regardless of "force".

[1]
https://github.com/openshift/cluster-version-operator/blob/57ffa7c610fb92ef4ccd9e9c49e75915e86e9296/pkg/cvo/sync_worker.go#L605

[2]
https://github.com/openshift/cluster-version-operator/blob/57ffa7c610fb92ef4ccd9e9c49e75915e86e9296/pkg/cvo/cvo.go#L413
`RetrievePayload` performs two operations: verification and
download. Both can take a non-trivial amount of time to terminate, up to
"hanging", where the CVO needs to abort the operation. The verification
result can be ignored when the upgrade is forced. The CVO calls
`RetrievePayload` with a context that does not set a deadline, so
`RetrievePayload` previously set its own internal deadline, common to
both operations. This led to suboptimal behavior on forced upgrades,
where a "hanging" verification could eat the whole timeout budget before
being cancelled, only to have its result ignored (because of force). The
code then tried to proceed with the download, but that immediately
aborted because of the expired context.

Improve timeouts in `RetrievePayload` for both input context states:
with and without deadline. If the input context sets a deadline, it is
respected. If it does not, default, separate deadlines are applied
for each operation. In both cases, the code makes sure a hanging
verification never spends the whole budget. When verification terminates
fast, the rest of its allotted time is provided to the download
operation.
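
A hedged sketch of the described timeout handling (the default durations and the exact shape of `RetrievePayload` are assumptions): respect a caller-supplied deadline, otherwise apply a default overall budget, and cap verification separately so a hanging verifier leaves time for the download.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Illustrative defaults; the CVO's real values may differ.
const (
	defaultOverallTimeout = 4 * time.Minute
	defaultVerifyTimeout  = 2 * time.Minute
)

func retrievePayload(ctx context.Context, force bool, verify, download func(context.Context) error) error {
	// If the caller did not set a deadline, apply a default overall budget.
	if _, ok := ctx.Deadline(); !ok {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, defaultOverallTimeout)
		defer cancel()
	}

	// Cap verification separately so a hanging verifier can never spend the
	// whole budget; whatever it does not use stays available below.
	verifyCtx, cancelVerify := context.WithTimeout(ctx, defaultVerifyTimeout)
	err := verify(verifyCtx)
	cancelVerify()
	if err != nil && !force {
		return fmt.Errorf("verification failed: %w", err)
	} // with force, a failed or cancelled verification is ignored

	// The download runs against the outer context, so it inherits the budget
	// verification left unused instead of an already-expired deadline.
	return download(ctx)
}

func main() {
	err := retrievePayload(context.Background(), true,
		func(context.Context) error { return fmt.Errorf("signature not verified") }, // failing verification, ignored under force
		func(context.Context) error { return nil },                                  // download still has budget left
	)
	fmt.Println(err) // <nil>
}
```
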
…ry-pick-912-to-release-4.12

OCPBUGS-10514: pkg/cvo/availableupdates: Prioritize conditional risks for largest target version
…ry-pick-896-to-release-4.12

[release-4.12] OCPBUGS-10565: RetrievePayload: Improve timeouts and cover behavior with tests
The kubelet consistently resolves the name.  This change allows the CVO
to use the kubelet's DNS configuration.
Globals cause problems for unit tests (demonstrated in some of the unit
tests) and prevent general reuse.  The anti-pattern is avoided in
Kubernetes code and should be avoided in OCP as well.  This change provides a
configured condition registry without init blocks, along with a
standard cluster-condition registry for each consumer.
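
A small sketch of the constructor-based pattern (interface and function names here are illustrative, not the CVO's actual API): the registry is built explicitly and handed to consumers, so tests can construct their own with mocks instead of fighting package-level globals.

```go
package main

import (
	"context"
	"fmt"
)

// ClusterCondition is an illustrative stand-in for the CVO's condition interface.
type ClusterCondition interface {
	Match(ctx context.Context) (bool, error)
}

// ConditionRegistry maps condition type names to their implementations.
type ConditionRegistry map[string]ClusterCondition

type alwaysMatch struct{}

func (alwaysMatch) Match(context.Context) (bool, error) { return true, nil }

// NewStandardConditionRegistry builds the standard registry explicitly, with
// no init() side effects; unit tests can build a registry of mocks instead.
func NewStandardConditionRegistry() ConditionRegistry {
	return ConditionRegistry{
		"Always": alwaysMatch{},
		// "PromQL": ... (wired up with a real PromQL client in the operator)
	}
}

func main() {
	registry := NewStandardConditionRegistry()
	ok, err := registry["Always"].Match(context.Background())
	fmt.Println(ok, err)
}
```
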
We do this so that our host-network pod can use the node's resolv.conf to resolve the internal load balancer name
from the pod before DNS pods are available and before the service network is available.  The side effect is that
the CVO cannot resolve service DNS names.
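
For illustration (this is not the CVO's actual manifest, just the relevant pod-spec fields): with host networking and `dnsPolicy: Default`, the kubelet hands the pod the node's resolv.conf.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	spec := corev1.PodSpec{
		HostNetwork: true,
		// "Default" makes the pod inherit the node's DNS configuration, so the
		// internal load balancer name resolves before cluster DNS exists; the
		// trade-off is that service DNS names no longer resolve from this pod.
		DNSPolicy: corev1.DNSDefault,
	}
	fmt.Println(spec.DNSPolicy)
}
```
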
…release-4.12

OCPBUGS-12182: Update dnsPolicy to allow consistent resolution of the internal LB
…ry-pick-904-to-release-4.12

[release-4.12] OCPBUGS-14096: Trigger new sync round on ClusterOperator Available changes
When:

* It has been longer than MinBetweenMatches since our last wrapped
  evaluation,
* There is no alternative key stale enough (MinForCondition) to steal
  the evaluation, and
* The currently requested key is not cached yet, e.g. because it was
  recently declared as a new risk to a cluster-version operator which
  had recently run all the other relevant risks from the current
  Cincinnati response.

avoid crashing with:

  $ grep -B1 -A15 'too fresh' previous.log
  I0927 12:07:55.594222       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy-conditional-edge-invalid-promql.json?arch=amd64&channel=stable-4.15&id=dc628f75-7778-457a-bb69-6a31a243c3a9&version=4.15.0-0.test-2023-09-27-091926-ci-ln-01zw7kk-latest
  I0927 12:07:55.726463       1 cache.go:118] {"type":"PromQL","promql":{"promql":"0 * group(cluster_version)"}} is the most stale cached cluster-condition match entry, but it is too fresh (last evaluated on 2023-09-27 11:37:25.876804482 +0000 UTC m=+175.082381015).  However, we don't have a cached evaluation for {"type":"PromQL","promql":{"promql":"group(cluster_version_available_updates{channel=buggy})"}}, so attempt to evaluate that now.
  I0927 12:07:55.726602       1 cache.go:129] {"type":"PromQL","promql":{"promql":"0 * group(cluster_version)"}} is stealing this cluster-condition match call for {"type":"PromQL","promql":{"promql":"group(cluster_version_available_updates{channel=buggy})"}}, because its last evaluation completed 30m29.849594461s ago
  I0927 12:07:55.758573       1 cvo.go:703] Finished syncing available updates "openshift-cluster-version/version" (170.074319ms)
  E0927 12:07:55.758847       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
  goroutine 194 [running]:
  k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1c4df00?, 0x32abc60})
          /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
  k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc001489d40?})
          /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
  panic({0x1c4df00, 0x32abc60})
          /usr/lib/golang/src/runtime/panic.go:884 +0x213
  github.com/openshift/cluster-version-operator/pkg/clusterconditions/promql.(*PromQL).Match(0xc0004860e0, {0x220ded8, 0xc00041e550}, 0x0)
          /go/src/github.com/openshift/cluster-version-operator/pkg/clusterconditions/promql/promql.go:134 +0x419
  github.com/openshift/cluster-version-operator/pkg/clusterconditions/cache.(*Cache).Match(0xc0002d3ae0, {0x220ded8, 0xc00041e550}, 0xc0033948d0)
          /go/src/github.com/openshift/cluster-version-operator/pkg/clusterconditions/cache/cache.go:132 +0x982
  github.com/openshift/cluster-version-operator/pkg/clusterconditions.(*conditionRegistry).Match(0xc000016760, {0x220ded8, 0xc00041e550}, {0xc0033948a0, 0x1, 0x0?})
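
Purely as an illustration of the guarded path (the real cache in pkg/clusterconditions/cache tracks per-condition and cross-condition throttles and is more involved): when the requested key has no cached entry, evaluate it directly rather than dereferencing a missing entry.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// entry records a cached cluster-condition evaluation.
type entry struct {
	result        bool
	lastEvaluated time.Time
}

// matchCache is a simplified stand-in for the CVO's cluster-condition cache.
type matchCache struct {
	entries         map[string]*entry
	minForCondition time.Duration
	evaluate        func(ctx context.Context, key string) (bool, error)
}

// Match reuses a fresh cached result when one exists; when the requested key
// has never been evaluated (the crashing case above), it evaluates it now
// instead of dereferencing an entry that does not exist.
func (c *matchCache) Match(ctx context.Context, key string) (bool, error) {
	if cached, ok := c.entries[key]; ok && time.Since(cached.lastEvaluated) < c.minForCondition {
		return cached.result, nil
	}
	result, err := c.evaluate(ctx, key)
	if err != nil {
		return false, err
	}
	c.entries[key] = &entry{result: result, lastEvaluated: time.Now()}
	return result, nil
}

func main() {
	c := &matchCache{
		entries:         map[string]*entry{},
		minForCondition: time.Hour,
		evaluate:        func(context.Context, string) (bool, error) { return false, nil },
	}
	fmt.Println(c.Match(context.Background(), `{"type":"PromQL"}`))
}
```
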
…openshift#964)

* availableupdates: do not reset lastTransitionTime on unchanged status

The code in `evaluateConditionalUpdates` correctly uses
`SetStatusCondition` to set conditions, which only updates the
`LastTransitionTime` field when `Status` differs between the original
and updated state. Previously though, the original state always
contained empty conditions, because conditional updates are always
obtained from OSUS and the fresh structure was never updated with
existing conditions from the in-cluster status.

* review: use existing mock condition instead of new code

* review: use real queue instead of a mock
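
A hedged sketch of the fix (field and variable names are illustrative): seed the freshly built conditions with what is already on the in-cluster status, so `meta.SetStatusCondition` only moves `LastTransitionTime` when `Status` really changes.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Conditions already recorded on the in-cluster ClusterVersion status.
	existing := []metav1.Condition{{
		Type:               "Recommended",
		Status:             metav1.ConditionTrue,
		Reason:             "AsExpected",
		LastTransitionTime: metav1.NewTime(time.Date(2023, 9, 21, 9, 29, 30, 0, time.UTC)),
	}}

	// The conditional update freshly built from the OSUS response starts with a
	// copy of the existing conditions (the previously missing step) rather than
	// an empty slice.
	fresh := append([]metav1.Condition{}, existing...)

	// Re-asserting the same Recommended=True state keeps the original
	// LastTransitionTime, because SetStatusCondition only bumps it when Status
	// actually transitions.
	meta.SetStatusCondition(&fresh, metav1.Condition{
		Type:   "Recommended",
		Status: metav1.ConditionTrue,
		Reason: "AsExpected",
	})
	fmt.Println(fresh[0].LastTransitionTime) // unchanged: 2023-09-21 09:29:30 +0000 UTC
}
```
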
The function had returned the original pointer since it landed in
db150e6 (cvo: Perform status updates in a single thread,
2018-11-03, openshift#45).  But locking the operator structure to return a
pointer reference is a bit risky, because after the lock is released
you're still holding a pointer into that data, but lack easy access to
the lock to guard against simultaneous access.  For example, you could
have setAvailableUpdates updating the structure, while simultaneously
operatorMetrics.Collect, Operator.syncStatus, or
Operator.mergeReleaseMetadata is looking at their pointer reference to
the old data.

There wasn't actually much exposure, because writes all happened to
flow through setAvailableUpdates, and setAvailableUpdates's only
changes were:

* Bumping the u.LastSyncOrConfigChange Time.
* Replacing the availableUpdates pointer with a new pointer.

and neither of those should significantly disrupt any of the consumers.

But switching to a copy doesn't cost much resource wise, and it
protects us from a number of possible ways that this could break in
the future if setAvailableUpdates does less full-pointer-replacement
or one of the consumers starts to care about LastSyncOrConfigChange
reliably lining up with the rest of the availableUpdates content.

It does mean we need to update the copy logic as we add new properties
to the structure, but we'd need to do that even if we used
deepcopy-gen or similar to automate the copy generation.
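
A sketch of the copy-under-lock pattern being adopted (the struct and method are simplified stand-ins for the operator's real fields):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// availableUpdates is a simplified stand-in; each field added to the real
// structure also has to be added to the copy below.
type availableUpdates struct {
	LastSyncOrConfigChange time.Time
	Updates                []string
}

type operator struct {
	mu               sync.Mutex
	availableUpdates *availableUpdates
}

// getAvailableUpdates returns a copy made while holding the lock, so callers
// such as metrics collection or status syncing never hold a pointer into data
// that setAvailableUpdates may be mutating or replacing concurrently.
func (o *operator) getAvailableUpdates() *availableUpdates {
	o.mu.Lock()
	defer o.mu.Unlock()
	if o.availableUpdates == nil {
		return nil
	}
	copied := *o.availableUpdates
	copied.Updates = append([]string(nil), o.availableUpdates.Updates...) // copy slices too
	return &copied
}

func main() {
	o := &operator{availableUpdates: &availableUpdates{
		LastSyncOrConfigChange: time.Now(),
		Updates:                []string{"4.12.z"}, // illustrative entry
	}}
	fmt.Println(o.getAvailableUpdates().Updates)
}
```
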
Instead of waiting for the next round of evaluation, which might take
minutes.  For example, in 4.14.0-rc.1 testing [1]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep availableupdate
  I0915 18:21:24.184272       1 availableupdates.go:50] First attempt to retrieve available updates
  I0915 18:21:24.845512       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:39.836566       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:39.843398       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:21:54.835464       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:16.769850       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:16.784421       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:39.842269       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:23:39.862590       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:21:24Z, will try later.
  I0915 18:24:09.837669       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T18:21:24Z
  I0915 18:24:24.843569       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later.
  I0915 18:25:24.839869       1 availableupdates.go:58] Available updates were recently retrieved, with less than 2m28.489200644s elapsed since 2023-09-15T18:24:09Z, will try later.
  ...
  I0915 20:26:07.109093       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:22:23Z
  I0915 20:29:50.769739       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:26:07Z
  I0915 20:33:34.432215       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:29:50Z
  I0915 20:37:18.093261       1 availableupdates.go:52] Retrieving available updates again, because more than 2m28.489200644s has elapsed since 2023-09-15T20:33:34Z

I'm not entirely clear on what the triggers were there, with 3m44s
between those final entries.  Operator.Run sets up:

  wait.UntilWithContext(runContext, func(runContext context.Context) {
    optr.worker(runContext, optr.availableUpdatesQueue, optr.availableUpdatesSync)
  }, time.Second)

and [2] docs UntilWithContext:

  UntilWithContext loops until context is done, running f every period.

  UntilWithContext is syntactic sugar on top of JitterUntilWithContext
  with zero jitter factor and with sliding = true (which means the
  timer for period starts after the f completes).

So that should be waking up, draining the queue, sleeping a second,
waking back up, draining the queue again, and on forever.  Perhaps we
are just backing off to the slowest DefaultControllerRateLimiter
period [3], but I expect processNextWorkItem's calling handleErr is
calling Forget on the queue, because I don't see any of its
error-branch logging:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78644f4679-q8sdm_cluster-version-operator.log | grep 'Error handling\|out of the queue'
  ...no hits...

That suggests nothing is slowing down our queue processing from
once-per-second (plus evaluation time).  But what's feeding the queue
items to process?  The only Add calls seem to be in
clusterVersionEventHandler, but checking audit logs:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472/artifacts/e2e-aws-ovn-serial/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
  $ zgrep -h clusterversion kube-apiserver/*audit*.log.gz | jq -r 'select(.verb != "get" and .verb != "list" and .verb != "watch") | .stageTimestamp + " " + (.responseStatus.code | tostring) + " " + .verb + " " + .objectRef.subresource + " " + .user.username' | sort
  ...
  2023-09-15T18:26:24.841812Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:26:24.858507Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:29:39.835307Z 200 update status system:serviceaccount:openshift-cluster-version:default
  2023-09-15T18:37:39.836698Z 200 update status system:serviceaccount:openshift-cluster-version:default

which are all hours before these 20:26 and similar update retrievals.
I suspect this is due to resyncPeriod(o.ResyncInterval) being passed to
NewFilteredSharedInformerFactory when generating the ClusterVersion
informer, putting a lower bound on the UpdateFunc event-handler
frequency.

My goal is to set the stage for faster cache-warming after receiving a
batch of new PromQL update risks, as described in 530a509
(pkg/cvo/availableupdates: Prioritize conditional risks for largest
target version, 2023-03-06, openshift#909).  I still have not adjusted the
caching logic, so at the moment, it only gives us faster updates on
the "that PromQL is still throttled" loop.  The AddAfter avoids
hot-looping on:

1. Can I evaluate the risks?
2. No? Requeue and return to 1 right now.

and instead gives us:

1. Can I evaluate the risks?
2. No? Requeue and return to 1 around a second from now (see the sketch below).
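
A minimal sketch of that delayed requeue with the client-go workqueue (the key string is illustrative):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	defer queue.ShutDown()

	// When the risks cannot be evaluated yet (for example, PromQL evaluation is
	// still throttled), re-enqueue the key about a second later instead of
	// immediately, so the worker does not spin in a hot loop.
	key := "openshift-cluster-version/version"
	queue.AddAfter(key, time.Second)

	item, _ := queue.Get() // blocks until the delayed item becomes available
	fmt.Println(item)
	queue.Done(item)
}
```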

The new addUpdate avoids injecting the same Recommended=True target
into availableUpdates multiple times while trying to evaluate another
conditional update, now that we have the !needFreshFetch case, where
we recycle the previous structure data without the fresh Cincinnati
fetch to clear earlier additions.  Without the addUpdate pivot, we get
[5]:

  status:
    availableUpdates:
    - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    - image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
      version: 4.15.0-0.nightly-2023-10-19-222222
    ...
    conditionalUpdates:
    - conditions:
      - lastTransitionTime: "2023-09-21T09:29:30Z"
        message: The update is recommended, because none of the conditional update
          risks apply to this cluster.
        reason: AsExpected
        status: "True"
        type: Recommended
      release:
        image: registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28
        version: 4.15.0-0.nightly-2023-10-19-222222
      risks:
      - matchingRules:
        - promql:
            promql: |-
              cluster_infrastructure_provider{type=~"nonexist"}
              or
              0 * cluster_infrastructure_provider
          type: PromQL
        message: Clusters on nonexist provider, this imaginary bug can happen.
        name: SomeInfrastructureThing
        url: https://bug.example.com/c
     ...
    - conditions:
      - lastTransitionTime: "2023-09-21T09:29:31Z"
        message: |-
          On clusters on default invoker user, this imaginary bug can happen. https://bug.example.com/a

          Could not evaluate exposure to update risk SomeChannelThing (evaluation is throttled until 09:29:32Z)
            SomeChannelThing description: On clusters with the channel set to 'buggy', this imaginary bug can happen.
            SomeChannelThing URL: https://bug.example.com/b
        reason: MultipleReasons
        status: "False"
        type: Recommended
      release:
        image: registry.ci.openshift.org/ocp/release@sha256:66c753e8b75d172f2a3f7ba13363383a76ecbc7ecdc00f3a423bef4ea8560405
        version: 4.15.0-0.nightly-2023-10-17-000000
      risks:
      - matchingRules:
        - promql:
            promql: cluster_installer
          type: PromQL
        message: On clusters on default invoker user, this imaginary bug can happen.
        name: SomeInvokerThing
        url: https://bug.example.com/a
      - matchingRules:
        - promql:
            promql: |-
              group(cluster_version_available_updates{channel="buggy"})
              or
              0 * group(cluster_version_available_updates{channel!="buggy"})
          type: PromQL
        message: On clusters with the channel set to 'buggy', this imaginary bug can
          happen.
        name: SomeChannelThing
        url: https://bug.example.com/b

lasting until the next Cincinnati fetch cleared out the
availableUpdates redundancy.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1702743868887273472
[2]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#UntilWithContext
[3]: https://pkg.go.dev/k8s.io/client-go/util/workqueue#DefaultControllerRateLimiter
[4]: https://github.com/kubernetes/client-go/blob/v0.28.2/util/workqueue/default_rate_limiters.go#L39
[5]: openshift#939 (comment)
530a509 (pkg/cvo/availableupdates: Prioritize conditional risks for
largest target version, 2023-03-06, openshift#909) prioritized the order in
which risks were evaluated.  But we were still waiting 10 minutes
between different PromQL evaluations while evaluating conditional
update risks.  The original 10m requirement is from the enhancement
[1], and was implemented in ca186ed (pkg/clusterconditions/cache:
Add a cache wrapper for client-side throttling, 2021-11-10, openshift#663).
But discussing with Lala, Scott, and Ben, we feel that addressing
the demonstrated user-experience need for low-latency risk evaluation
[2] is worth reducing the throttling to 1s per expression evaluation.
We still have MinForCondition set to an hour, so with this commit, a
cluster-version operator evaluating three risks will move from a
timeline like:

1. 0s, hear about risks that depend on PromQL A, B, and C.  Evaluate A for the first time.
2. 10m, evaluate B for the first time (MinBetweenMatches after 1).
3. 20m, evaluate C for the first time (MinBetweenMatches after 2).
4. 1h, evaluate A again (MinForCondition after 1, also well past MinBetweenMatches after 3).
5. 1h10m, evaluate B again (MinForCondition after 2 and MinBetweenMatches after 4).
6. 1h20m, evaluate C again (MinForCondition after 3 and MinBetweenMatches after 5).
7. 2h, evaluate A again (MinForCondition after 4, also well past MinBetweenMatches after 6).
8. 2h10m, evaluate B again (MinForCondition after 5 and MinBetweenMatches after 7).
9. 2h20m, evaluate C again (MinForCondition after 6 and MinBetweenMatches after 8).

to a timeline like:

1. 0s, hear about risks that depend on PromQL A, B, and C.  Evaluate A for the first time.
2. 1s, evaluate B for the first time (MinBetweenMatches after 1).
3. 2s, evaluate C for the first time (MinBetweenMatches after 2).
4. 1h, evaluate A again (MinForCondition after 1, also well past MinBetweenMatches after 3).
5. 1h1s, evaluate B again (MinForCondition after 2 and MinBetweenMatches after 4).
6. 1h2s, evaluate C again (MinForCondition after 3 and MinBetweenMatches after 5).
7. 2h, evaluate A again (MinForCondition after 4, also well past MinBetweenMatches after 6).
8. 2h1s, evaluate B again (MinForCondition after 5 and MinBetweenMatches after 7).
9. 2h2s, evaluate C again (MinForCondition after 6 and MinBetweenMatches after 8).

We could deliver faster cache warming while preserving spaced out
refresh evaluation by splitting MinBetweenMatches into a 1s
MinBetweenMatchesInitial and 10m MinBetweenMatchesWhenCached, which
would produce timelines like:

1. 0s, hear about risks that depend on PromQL A, B, and C.  Evaluate A for the first time.
2. 1s, evaluate B for the first time (MinBetweenMatchesInitial after 1).
3. 2s, evaluate C for the first time (MinBetweenMatchesInitial after 2).
4. 1h, evaluate A again (MinForCondition after 1, also well past MinBetweenMatchesWhenCached after 3).
5. 1h10m, evaluate B again (MinForCondition after 2 and MinBetweenMatchesWhenCached after 4).
6. 1h20m, evaluate C again (MinForCondition after 3 and MinBetweenMatchesWhenCached after 5).
7. 2h, evaluate A again (MinForCondition after 4, also well past MinBetweenMatchesWhenCached after 6).
8. 2h10m, evaluate B again (MinForCondition after 5 and MinBetweenMatchesWhenCached after 7).
9. 2h20m, evaluate C again (MinForCondition after 6 and MinBetweenMatchesWhenCached after 8).

but again discussing with Lala, Scott, and Ben, the code complexity to
deliver that distinction does not seem to be worth the protection it
delivers to the PromQL engine.  And really, PromQL engines concerned
about load should harden themselves, including via Retry-After [3],
which allows clients to back off gracefully when the service needs that,
instead of relying on clients to guess how much load the service
can handle and to back off without insight into actual server capacity.

[1]: https://github.com/openshift/enhancements/blame/158111ce156aac7fa6063a47c00e129c13033aec/enhancements/update/targeted-update-edge-blocking.md#L323-L325
[2]: https://issues.redhat.com/browse/OCPBUGS-19512
[3]: https://www.rfc-editor.org/rfc/rfc9110#name-retry-after
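
A tiny sketch of the throttle check behind those timelines (constant names are spelled as in the prose above; the real code in pkg/clusterconditions/cache differs): the per-expression interval stays at an hour, while the gap between different expressions drops to a second.

```go
package main

import (
	"fmt"
	"time"
)

const (
	minForCondition   = time.Hour   // re-evaluate any single PromQL expression at most hourly
	minBetweenMatches = time.Second // wait only 1s between evaluating different expressions
)

// canEvaluate reports whether a given expression may be evaluated now, given
// when it was last evaluated and when any expression was last evaluated.
func canEvaluate(now, lastForThisCondition, lastForAnyCondition time.Time) bool {
	return now.Sub(lastForThisCondition) >= minForCondition &&
		now.Sub(lastForAnyCondition) >= minBetweenMatches
}

func main() {
	now := time.Now()
	// Risks A and B arrive together; A was just evaluated, B never has been.
	fmt.Println(canEvaluate(now, time.Time{}, now))                  // false: B must wait ~1s after A
	fmt.Println(canEvaluate(now.Add(time.Second), time.Time{}, now)) // true: B can run a second later
}
```
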
@openshift-merge-robot (Contributor)

PR needs rebase.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-merge-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) on Oct 25, 2023
@wking wking (Member, Author) commented Oct 25, 2023

oops, wrong base branch 🤦

@wking wking closed this Oct 25, 2023
@wking wking (Member, Author) commented Oct 25, 2023

replaced with #988.

@openshift-ci openshift-ci bot (Contributor) commented Oct 25, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Oct 25, 2023
@openshift-ci openshift-ci bot (Contributor) commented Oct 25, 2023

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/gofmt | db85bf1 | link | true | /test gofmt |
| ci/prow/images | db85bf1 | link | true | /test images |
| ci/prow/e2e-agnostic-ovn | db85bf1 | link | true | /test e2e-agnostic-ovn |
| ci/prow/e2e-hypershift | db85bf1 | link | true | /test e2e-hypershift |
| ci/prow/unit | db85bf1 | link | true | /test unit |
| ci/prow/e2e-agnostic-ovn-upgrade-out-of-change | db85bf1 | link | true | /test e2e-agnostic-ovn-upgrade-out-of-change |
| ci/prow/e2e-agnostic-operator | db85bf1 | link | true | /test e2e-agnostic-operator |
| ci/prow/e2e-hypershift-conformance | db85bf1 | link | true | /test e2e-hypershift-conformance |
| ci/prow/lint | db85bf1 | link | true | /test lint |
| ci/prow/e2e-agnostic-ovn-upgrade-into-change | db85bf1 | link | true | /test e2e-agnostic-ovn-upgrade-into-change |

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

