enhancements/update/targeted-update-edge-blocking: Propose a new enhancement #426
Force-pushed from ea8b2ad to 21d882c.
Force-pushed from 228fa57 to d2f0889.
Adding @bwplotka for a once-over.

I have concerns around how the impacted _ids are stored, and how they reconcile with AMS - tollbooth needs to remain the source of truth here, but if it's a second-layer reconcile operation, we are potentially drifting. cc @abhgupta

Is this a worry about cached PromQL results going stale, or something more subtle?
Force-pushed from b4b81a8 to f21bfb3.
+1 to enable this ability to manage clusters with this kind of granularity. One thought: […]

There's a section in the enhancement discussing this. A presubmit CI job would be hard, because there are many releases with no update paths on any given day, and you'd need a one-shot graph-builder and policy-engine. But it's possible. And a poller that hits the production service with no […]
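For concreteness, a minimal sketch of such a poller in Go, fetching the public update-service graph and listing the currently served edges. The endpoint and channel are assumptions, and a real check would diff these edges against the expected blocks:

```go
// Hypothetical poller sketch: fetch the Cincinnati graph for one channel
// and print the edges currently being served. Endpoint and channel are
// assumptions; error handling is minimal.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type graph struct {
	Nodes []struct {
		Version string `json:"version"`
	} `json:"nodes"`
	Edges [][2]int `json:"edges"` // [from-index, to-index] pairs
}

func main() {
	req, err := http.NewRequest("GET",
		"https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.6", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Accept", "application/json") // ask the service for JSON

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var g graph
	if err := json.NewDecoder(resp.Body).Decode(&g); err != nil {
		panic(err)
	}
	for _, e := range g.Edges {
		fmt.Printf("%s -> %s\n", g.Nodes[e[0]].Version, g.Nodes[e[1]].Version)
	}
}
```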
sdodson left a comment:
I'm fine with this in general. I'm not a fan of the OSD `managed=true` example, because I believe that if they choose to pursue that, they should maintain their own policy engine and upstream. But if that gets our foot in the door, that may be worth accepting, as long as we're only setting that for a small portion of edges for specific, identified reasons.
> ## Design Details
>
> ### Test Plan
Either here or in Risks and Mitigations, should we add something about profiling of the proposed PromQL? E.g., at the time that we add a rule, it must complete in under 100ms or the rule will not merge? (100ms is just an example.)
> Either here or in Risks and Mitigations, should we add something about profiling of the proposed PromQL?

Would we have a presubmit job with creds for our internal Telemetry service? I dunno if that even has a public endpoint... Or did you see this as "if we get a new PromQL string, we will manually profile it using our personal Telemetry creds before approving the PR", or something?
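To make the profiling idea concrete, here is a rough Go sketch that times a candidate rule against a Prometheus-compatible query API. The endpoint, the query, and the 100ms budget are all assumptions from the comment above, and a real job would also need credentials for the Telemetry service:

```go
// Hypothetical profiling sketch: time one PromQL query and flag it if it
// exceeds an example latency budget. Endpoint, query, and budget are
// placeholders, not real project settings.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

func main() {
	endpoint := "https://prometheus.example.com/api/v1/query" // placeholder
	params := url.Values{"query": {`cluster_version{type="failure"}`}}

	start := time.Now()
	resp, err := http.Get(endpoint + "?" + params.Encode())
	if err != nil {
		panic(err)
	}
	io.Copy(io.Discard, resp.Body) // drain the body so we time the full response
	resp.Body.Close()
	elapsed := time.Since(start)

	fmt.Printf("query completed in %s\n", elapsed)
	if elapsed > 100*time.Millisecond { // example budget from the comment
		fmt.Println("over budget: this rule should not merge")
	}
}
```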
> Clusters which opt out of Telemetry are likely still able to connect to the update service.
> When they do, they will be [blocked as cache-misses](#enhanced-graph-data-schema-for-blocking-edges).
> I think that conservative approach is acceptable, because we cannot tell if the non-reporting cluster is vulnerable to the known issues or not.
I wonder if this is acceptable.
Who needs to sign off on this vs. the possibility of breaking affected clusters because they weren't reporting Telemetry and the update service said "maybe ok?" and recommended the edge?
> #### Exposing blocking reasons
>
> This enhancement provides no mechanism for telling clients if or why a recommended update edge has been blocked, because [the Cincinnati graph format][cincinnati-spec] provides no mechanism for sharing metadata about recommended edges, or even the existence of not-recommended edges.
> Clients might be tempted to want that information to second-guess a graph-data decision to block the edge, but I am [not convinced the information is actionable](overriding-blocked-edges/README.md).
I agree with @vrutkovs, the enhancement gives that idea for sure. Hopefully we won't be in that situation.
Force-pushed from 600f8b4 to 0d8c588.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting `/remove-lifecycle stale`. If this issue is safe to close now please do so with `/close`. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting `/remove-lifecycle rotten`. If this issue is safe to close now please do so with `/close`. /lifecycle rotten
/lifecycle frozen
We want this; review is just slow.
@wking asked me to comment here: I'm new to performing a cluster upgrade, and I was confused by the existence of 4.7 cluster versions/channels but no valid paths from 4.6.x clusters. I attempted to upgrade via: […] My assumption was that I'd done something wrong or needed to manually change channels: […] It wasn't clear to me that there wasn't a path from 4.6 to 4.7 in this case.

Batting around some ideas with @wking, I think it would be clearer if the upgrade version were listed but marked as unavailable, though I'm not sure how to accomplish that (nor am I suggesting that's the only way). So, just as a datapoint: the "no news is good news" philosophy is great, but only as long as you're confident in your process. As someone doing this for the first time, I was sure I'd done something wrong, or was missing a step, and had to ask for assistance. Hope this newbie perspective helps!
…rm parameter

I still prefer the flexibility of a PromQL approach [1]. But that is a bigger update-service shift, and I haven't been able to convince folks it's worth doing. This alternative proposal guesses that "platform" will help with targeting most of our target-able blockers, and be quicker to implement.

[1]: openshift#426
I've floated #737 with a scoped-down version of this proposal that only supports targeting by platform using a client-provided query parameter (so no service-side PromQL). Hopefully that's small enough to unstick discussion while still being a useful limit on blocked-edge impact.
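For illustration, assuming the scoped-down proposal keeps the existing Cincinnati query parameters and adds a client-reported platform, a request might be built like this; the `platform` parameter name, its value, and the host are my assumptions, not a confirmed #737 API:

```go
// Hypothetical request shape for the scoped-down proposal: the client
// self-reports its platform alongside the existing channel and id
// parameters. All names here are illustrative only.
package main

import (
	"fmt"
	"net/url"
)

func main() {
	u, err := url.Parse("https://updates.example.com/api/upgrades_info/v1/graph")
	if err != nil {
		panic(err)
	}
	q := u.Query()
	q.Set("channel", "stable-4.6")
	q.Set("id", "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee") // placeholder cluster id
	q.Set("platform", "aws")                            // assumed new parameter
	u.RawQuery = q.Encode()
	fmt.Println(u.String())
}
```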
> @@ -0,0 +1,25 @@
> digraph updateDecisions {
>   new [ label="New edge available" ];
>   test [ label="Do local tests pass?" ];
I expect the majority of customers to run tests before upgrading. However, I am confident that not all customers or use cases need to run tests before upgrading. So having another node where customers go directly to the upgrade (by setting up maintenance windows) would be a better representation of what our customers are doing.
I've pushed:
$ git --no-pager diff 0d8c588..f606f41 -- enhancements/update/overriding-blocked-edges/
diff --git a/enhancements/update/overriding-blocked-edges/README.md b/enhancements/update/overriding-blocked-edges/README.md
index b011337..775ef0b 100644
--- a/enhancements/update/overriding-blocked-edges/README.md
+++ b/enhancements/update/overriding-blocked-edges/README.md
@@ -38,11 +38,12 @@ Between those two, and similar efforts, I don't think that broadcasting edge-pul
Because do we support them if they hit some new bug?
Our docs around this are [currently wiggly][off-graph-support], but I expect there will be a lot of pain if we offer blanket support for folks taking any blocked edge we have ever served in a fast/stable channel.
-Also note that the flow is always going through "Do local tests pass?".
+Also note that the flow is always going through "Are there failing local tests?".
We want users testing in candidate and fast to help turn up issues with their particular cluster flavor that we might not be catching in CI.
We also want them reporting Telemetry/Insights when possible too, so we can pick out lower-signal feedback that they might miss.
Or at least filing bugs if they find an issue and aren't reporting it via Telemetry/Insights.
Phased roll-outs ([when we get them][phased-rollouts]) allow us to minimize the impact of "nobody explicitly tested this flavor/workflow (enough times, for flaky failures)", but is ideally not the first line of defense.
+But having no local tests and relying on the rest of the fleet reporting Telemetry/Insights is still supported, so users could work through the flow with "We have no failing local tests, because we don't run local tests" if they want.
If I was administering a production cluster, I'd personally be conservative and avoid both dashed and orange "yes" paths.
diff --git a/enhancements/update/overriding-blocked-edges/flow.dot b/enhancements/update/overriding-blocked-edges/flow.dot
index bd27504..42c8926 100644
--- a/enhancements/update/overriding-blocked-edges/flow.dot
+++ b/enhancements/update/overriding-blocked-edges/flow.dot
@@ -1,6 +1,6 @@
digraph updateDecisions {
new [ label="New edge available" ];
- test [ label="Do local tests pass?" ];
+ test [ label="Are there failing local tests?" ];
pulled [ label="Does Red Hat pull the edge?" ];
localConfidence [ label="Are we confident in\nour local test coverage?" ];
redHatConfidence [ label="Are we confident Red Hat's\nreasoning is exhaustive?" ];
@@ -10,9 +10,9 @@ digraph updateDecisions {
wait [ label="Wait for a new edge" ];
new -> test;
- test -> wait [ label="no" ];
+ test -> wait [ label="yes" ];
wait -> new;
- test -> pulled [ label="yes" ];
+ test -> pulled [ label="no" ];
pulled -> update [ label="no" ];
pulled -> localConfidence [ label="yes" ];
localConfidence -> update [ label="yes"; color="orange"; style="dashed" ];
diff --git a/enhancements/update/overriding-blocked-edges/flow.svg b/enhancements/update/overriding-blocked-edges/flow.svg
...to try to address this use-case.
> But having no local tests and relying on the rest of the fleet reporting Telemetry/Insights is still supported, so users could work through the flow with "We have no failing local tests, because we don't run local tests" if they want.
I can live with the current flow, but I would have made it explicit.
> pulled [ label="Does Red Hat pull the edge?" ];
> localConfidence [ label="Are we confident in\nour local test coverage?" ];
> redHatConfidence [ label="Are we confident Red Hat's\nreasoning is exhaustive?" ];
> bug [ label="What bug(s) did Red Hat link?"; color="orange"; style="dashed" ];
Maybe we should mention that the communication about the bug will happen through the text-only errata.
I'm agnostic about if/how this information gets pushed out, because I don't think customers should use it for anything. My understanding of the text-only errata was that customers would use it to say "ahh, good to know that RH is not pulling the edge for no reason", and that the intention was not to get customers to decide whether they were impacted by the edge or not. Do you see a use-case for users deciding to ignore our lack-of-recommendation once we have targeted edge blocking?
> Do you see a use-case for users deciding to ignore our lack-of-recommendation once we have targeted edge blocking?

Nope
> localConfidence [ label="Are we confident in\nour local test coverage?" ];
> redHatConfidence [ label="Are we confident Red Hat's\nreasoning is exhaustive?" ];
> bug [ label="What bug(s) did Red Hat link?"; color="orange"; style="dashed" ];
> apply [ label="Do the bug(s) apply to my clusters?" ];
Also, I do not think this is correct. When RH removes an edge, the only way for customers to update is to force the update, and customers are not supposed to do that.
No need to force. You can `oc adm upgrade --allow-explicit-upgrade --to-image $PULLSPEC`, regardless of whether your configured upstream update service recommends an update or not. And with the support pivot from openshift/openshift-docs#32091, we will still support folks who update over once-supported-but-no-longer-recommended edges. Although personally, I can't think of a situation where I would recommend that. But still, the node is in the flow graph, because some users have historically been tempted to do this, so I'm including it so I can explain why I think it's not a useful idea.
> And [alerting on available updates][alert-on-available-updates] will inform folks who have had an edge restored that they have been ignoring the (new) update opportunity.
> Between those two, and similar efforts, I don't think that broadcasting edge-pulled motivations is a useful activity, and I think encouraging users to actively consume our blocked-edge reasoning in order to green-light their own off-graph updates is actively harmful.
> Because do we support them if they hit some new bug?
> Our docs around this are [currently wiggly][off-graph-support], but I expect there will be a lot of pain if we offer blanket support for folks taking any blocked edge we have ever served in a fast/stable channel.
We should make it clear in the docs. I do not think RH's support is waived if they have Cincinnati running locally which does not have the edge blocked. But using `--allow-explicit-upgrade` or `--force` will definitely make CEE uncomfortable. But even if they end up using `--force` and then call CEE for some issue, we will help them.
openshift/openshift-docs#32091 is still in flight, but it replaces the wiggle section with a clearer support statement.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sdodson, vrutkovs. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files: […]

Approvers can indicate their approval by writing `/approve` in a comment.
> Incoming bugs are evaluated to determine an impact statement based on [a generic template][rhbz-1858026-impact-statement-request].
> Some bugs only impact specific platforms, or clusters with other specific features.
> For example, rhbz#1858026 [only impacted][rhbz-1858026-impact-statement] clusters with the `None` platform which were created as 4.1 clusters and subsequently updated via 4.2, 4.3, and 4.4 to reach 4.5.
If you have more examples (which I think you have), please add them here.
Do we need an exhaustive list? I thought a single example would be sufficient to show that this sort of thing happens. What would additional examples add beyond that?
> Some bugs only impact specific platforms, or clusters with other specific features.
> For example, rhbz#1858026 [only impacted][rhbz-1858026-impact-statement] clusters with the `None` platform which were created as 4.1 clusters and subsequently updated via 4.2, 4.3, and 4.4 to reach 4.5.
> In those cases there is currently tension between wanting to protect vulnerable clusters by blocking the edge vs. wanting to avoid inconveniencing clusters which we know are not vulnerable and whose administrators may have been planning on taking the recommended update.
> This enhancement aims to reduce that tension.
We should also add that we will publish text-only errata when we remove edges to the latest Y-stream release from the stable and fast channels. Customers will get the bug reference from the errata and will find out if the bug is specific to a platform. We can expect customers to ask us not to block them when a bug is specific to another platform, and it is logical to reduce the impact radius of the edge removal.
Can we roll discussion of the text-only errata into the earlier thread? I think it's orthogonal to this PR, but don't want to argue that out in two parallel threads ;)
Basically I am trying to pitch that by releasing text-only errata we are making the case for targeted edge blocking. It's the next logical step.
> This query will be submitted to a configurable Prometheus service and should return a set of matching records with `_id` labels.
> Clusters whose [submitted `id` query parameter][cincinnati-for-openshift-request] is in the set of returned IDs are allowed if the value of the matching record is 1 and blocked if the value of the matching record is 0.
> Clusters whose submitted `id` is not in the result set, or which provide no `id` parameter, are also blocked.
> Blocking too many edges is better than blocking too few, because you can recover from the former, but possibly brick clusters on the latter.
I do not think this is a good idea. IMO we should do the opposite, i.e. we should not block the cluster if the request provides no `id` parameter. Because I am afraid that we will block clusters without knowing which clusters are blocked. When we block a cluster, we should know which ones are blocked, so that we can unblock them in the future. CEE can reach out to them if required.
I think it is very unlikely that the CVO will ever submit a request without an id parameter. We can monitor this in Telemetry/Insights by looking for clusters which do not have expected recommended updates. And failing closed sets up "I'm surprised I'm blocked -> open support ticket -> we fix Cincinnati to restore the recommendation", which is a lot safer than failing open's "I'm not blocked, but I should have been -> update -> boom".
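A minimal sketch of the fail-closed rule this thread is debating, with hypothetical names: a cluster is allowed only when the PromQL result set contains its `_id` with value 1; value-0 records, cache misses, and missing `id` parameters all block.

```go
// Fail-closed decision sketch, per the quoted enhancement text. The names
// here are illustrative, not the update service's real API.
package main

import "fmt"

// allowed reports whether a conditional edge may be served to clusterID,
// given the cached PromQL result set (_id label -> sample value).
func allowed(result map[string]float64, clusterID string) bool {
	if clusterID == "" {
		return false // no id parameter submitted: block
	}
	v, ok := result[clusterID]
	return ok && v == 1 // cache miss or value 0: block
}

func main() {
	result := map[string]float64{"cluster-a": 1, "cluster-b": 0}
	for _, id := range []string{"cluster-a", "cluster-b", "cluster-c", ""} {
		fmt.Printf("id=%q allowed=%v\n", id, allowed(result, id))
	}
}
```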
I keep repeating this in various internal discussions. Collect it all up with a bow so I can just drop links ;).
Generated with:

$ (cd enhancements/update/overriding-blocked-edges; dot -Tsvg flow.dot >flow.svg)

using:

$ dot -V
dot - graphviz version 2.46.0 (0)
Force-pushed from 0d8c588 to f606f41.
@wking: The following test failed, say `/retest` to rerun all failed tests: […]

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
> When a request is received, the submitted [`channel` query parameter][cincinnati-for-openshift-request] limits the set of remaining edges.
> If any of the remaining edges have `clusters.promql` entries, a new, targeted-edge-blocking policy engine plugin will exclude the edge if the [`id` query parameter][cincinnati-for-openshift-request] is a member of the query's resulting ID set.
> If the request does not set the `id` parameter, the plugin should block all conditional edges, and does not need to check a cache or make PromQL queries.
> To perform the PromQL request, the update service will be extended with configuration for the new policy enging plugin.
s/enging/engine/
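Putting the quoted flow together, a hedged sketch with illustrative types: the `channel` parameter limits the edge set, conditional edges are excluded when the requesting `id` is in the query's resulting ID set, and requests without an `id` drop every conditional edge without touching the PromQL cache.

```go
// Hypothetical plugin-flow sketch for the quoted text; types and names are
// illustrative, not the update service's real implementation.
package main

import "fmt"

type edge struct {
	From, To string
	Channels []string
	PromQL   string // empty for unconditional edges
}

// filterEdges applies the quoted flow. blockedIDs stands in for the cached
// PromQL result set: the _ids matched by an edge's clusters.promql query.
func filterEdges(edges []edge, channel, clusterID string,
	blockedIDs func(promql string) map[string]bool) []edge {
	var kept []edge
	for _, e := range edges {
		if !contains(e.Channels, channel) {
			continue // the channel parameter limits the remaining edges
		}
		if e.PromQL != "" {
			if clusterID == "" {
				continue // no id parameter: block all conditional edges
			}
			if blockedIDs(e.PromQL)[clusterID] {
				continue // id is in the query's resulting ID set
			}
		}
		kept = append(kept, e)
	}
	return kept
}

func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

func main() {
	edges := []edge{
		{From: "4.5.2", To: "4.5.3", Channels: []string{"stable-4.5"}},
		{From: "4.4.12", To: "4.5.3", Channels: []string{"stable-4.5"},
			PromQL: `count by (_id) (cluster_infrastructure_provider{type="None"})`},
	}
	blocked := func(promql string) map[string]bool {
		return map[string]bool{"cluster-b": true} // stub cache lookup
	}
	fmt.Println(len(filterEdges(edges, "stable-4.5", "cluster-a", blocked))) // 2
	fmt.Println(len(filterEdges(edges, "stable-4.5", "cluster-b", blocked))) // 1
	fmt.Println(len(filterEdges(edges, "stable-4.5", "", blocked)))          // 1
}
```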
Replaced by the v3 #821, where I'm now recommending cluster-side decisions, with the update service declaring conditional edges and telling the cluster-side tools how to decide if the cluster is vulnerable to the known bugs impacting its update decisions.
CC @dofinn, @LalatenduMohanty, @sdodson, @steveej, @vrutkovs