Conversation

Member

@wking wking commented Aug 7, 2020

@wking wking force-pushed the targeted-edge-blocking branch from ea8b2ad to 21d882c Compare August 7, 2020 19:08
@wking wking changed the title enhancements/update/targeted-update-edge-blocking: Propose a new enhancement WIP: enhancements/update/targeted-update-edge-blocking: Propose a new enhancement Aug 7, 2020
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 7, 2020
@wking wking force-pushed the targeted-edge-blocking branch 2 times, most recently from 228fa57 to d2f0889 Compare August 11, 2020 16:26
Member Author

wking commented Aug 11, 2020

Pulled the WIP, now that I've pushed 228fa57 -> d2f0889, filling in the drawbacks and alternatives sections.

@wking wking changed the title WIP: enhancements/update/targeted-update-edge-blocking: Propose a new enhancement enhancements/update/targeted-update-edge-blocking: Propose a new enhancement Aug 11, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 11, 2020

kbsingh commented Aug 12, 2020

adding @bwplotka for a once over


kbsingh commented Aug 12, 2020

I have concerns around how the impacted _id's are stored, and how they reconcile with AMS - tollbooth needs to remain the source of truth here, but if it's a 2nd-layer reconcile operation, we are potentially drifting.

cc @abhgupta

Member Author

wking commented Aug 12, 2020

I have concerns around how the impacted _id's are stored, and how they reconcile with AMS - tollbooth needs to remain the source of truth here...

Is this a worry about cached PromQL results going stale, or something more subtle?

@wking wking force-pushed the targeted-edge-blocking branch 6 times, most recently from b4b81a8 to f21bfb3 Compare August 12, 2020 18:47

dofinn commented Aug 12, 2020

+1 to enable this ability to manage clusters with this kind of granularity.

One Thought:
Would Cincinnati-graph-data require a monitoring job to review generated graphs every time an edge is blocked? Something to logically check that we never leave a cluster with no available upgrade path? I would like to be assured that, if this ever occurs, Red Hat and/or the customer can be notified.

Member Author

wking commented Aug 13, 2020

Something to logically check that we never ever leave a cluster with no available upgrade path ?

There's a section in the enhancement discussing this. A presubmit CI job would be hard, because there are many releases with no update paths on any given day and you'd need a one-shot graph-builder and policy-engine. But it's possible. And a poller that hits the production service with no id set and alerts on old-enough dead ends in fast channels seems pretty straightforward.
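For what it's worth, that kind of poller's dead-end check could be sketched roughly as below, assuming the usual Cincinnati graph JSON shape (`nodes` array plus `edges` as index pairs); the function name and sample data are illustrative, not part of the proposal:

```python
# Illustrative sketch: find releases with no outgoing update edges in a
# Cincinnati graph payload, assuming the JSON shape:
#   {"nodes": [{"version": ...}, ...], "edges": [[from_idx, to_idx], ...]}
def find_dead_ends(graph):
    """Return the versions that have no recommended outgoing edges."""
    sources = {src for src, _dst in graph["edges"]}
    return sorted(
        node["version"]
        for idx, node in enumerate(graph["nodes"])
        if idx not in sources
    )

graph = {
    "nodes": [{"version": "4.5.1"}, {"version": "4.5.2"}, {"version": "4.5.3"}],
    "edges": [[0, 1], [1, 2]],
}
print(find_dead_ends(graph))  # ['4.5.3'] - the newest release is always a dead end
```

A real poller would fetch this JSON from the production service (with no `id` set) and alert only on dead ends that are old enough, as described above.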

Member

@sdodson sdodson left a comment


I'm fine with this in general, but I'm not a fan of the OSD `managed=true` example, because I believe that if they choose to pursue that, they should maintain their own policy engine and upstream. But if that gets our foot in the door, it may be worth accepting, as long as we're only setting that for a small portion of edges for specific, identified reasons.


## Design Details

### Test Plan
Member

Either here or in Risks and Mitigations, should we add something about profiling of the proposed promql? I.e., at the time we add a rule, it should complete in < 100ms or that rule will not merge? (100ms is just an example.)

Member Author

Either here or in Risks and Mitigations, should we add something about profiling of the proposed promql?

Would we have a presubmit job with creds for our internal Telemetry service? I dunno if that even has a public endpoint... Or did you see this as "if we get a new promql string, we will manually profile it using our personal Telemetry creds before approving the PR", or something?
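To make the discussion concrete, a profiling gate could look something like the sketch below, timing the candidate query against a Prometheus-compatible HTTP API. The endpoint and 100ms budget are placeholders from the comment above, and how the check would get Telemetry credentials is still the open question:

```python
import json
import time
import urllib.parse
import urllib.request


def within_budget(elapsed_ms, budget_ms=100.0):
    """Gate decision: pass only when the measured query time fits the budget."""
    return elapsed_ms <= budget_ms


def profile_query(promql, endpoint, budget_ms=100.0):
    """Time one PromQL query against a Prometheus-compatible HTTP API.

    `endpoint` is a placeholder; a real presubmit would also need
    credentials for the internal Telemetry service, which this sketch
    does not address.
    """
    url = endpoint + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    start = time.perf_counter()
    with urllib.request.urlopen(url) as response:
        json.load(response)  # force the full response to be read and parsed
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, within_budget(elapsed_ms, budget_ms)
```

Whether this runs as a presubmit job or as a manual pre-approval step is exactly the question raised above.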

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 17, 2020

Clusters which opt out of Telemetry are likely still able to connect to the update service.
When they do, they will be [blocked as cache-misses](#enhanced-graph-data-schema-for-blocking-edges).
I think that conservative approach is acceptable, because we cannot tell if the non-reporting cluster is vulnerable to the known issues or not.
Member

I wonder if this is acceptable.

Member Author

@wking wking Aug 19, 2020


Who needs to sign off on this vs. the possibility of breaking affected clusters because they weren't reporting Telemetry and the update service said "maybe ok?" and recommended the edge?

#### Exposing blocking reasons

This enhancement provides no mechanism for telling clients if or why a recommended update edge has been blocked, because [the Cincinnati graph format][cincinnati-spec] provides no mechanism for sharing metadata about recommended edges, or even the existence of not-recommended edges.
Clients might be tempted to want that information to second-guess a graph-data decision to block the edge, but I am [not convinced the information is actionable](overriding-blocked-edges/README.md).
Member

I agree with @vrutkovs; the enhancement gives that idea for sure. Hopefully we won't be in that situation.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 18, 2020
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 18, 2020
Member Author

wking commented Dec 19, 2020

/lifecycle frozen

We want this; review is just slow

@openshift-ci-robot openshift-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Dec 19, 2020
@clcollins
Member

@wking asked me to comment here:

I'm new to performing cluster upgrades, and I was confused by the existence of 4.7 cluster versions/channels but no valid paths from 4.6.x clusters.

I attempted to upgrade via:

ocm create upgradepolicy --cluster=<CLUSTERID>
? Select policy type manual
There are no available upgrades

My assumption was that I'd done something wrong or needed to manually change channels:

echo '{"version": {"channel_group": "candidate-4.7"}}' | ocm patch /api/clusters_mgmt/v1/clusters/<CLUSTERID>
{
  "kind": "Error",
  "id": "400",
  "href": "/api/clusters_mgmt/v1/errors/400",
  "code": "CLUSTERS-MGMT-400",
  "reason": "The cluster's current version '4.6.21' is not available for the desired channel group 'candidate-4.7'",
  "operation_id": "<CLUSTERUUID>"
}

It wasn't clear to me that there wasn't a path from 4.6 to 4.7 in this case.

Batting around some ideas with @wking, I think it would be more clear if the upgrade version was listed, but marked as unavailable, though I'm not sure how to accomplish that (nor am I suggesting that's the only way).

So, just as a datapoint, the "no news is good news" philosophy is great, but only as long as you're confident in your process. As someone doing this for the first time, I was sure I'd done something wrong, or was missing a step, and had to ask for assistance.

Hope this newbie perspective helps!

wking added a commit to wking/openshift-enhancements that referenced this pull request Apr 16, 2021
…rm parameter

I still prefer the flexibility of a PromQL approach [1].  But that is
a bigger update-service shift, and I haven't been able to convince
folks it's worth doing.  This alternative proposal guesses that
"platform" will help with targeting most of our target-able blockers,
and be quicker to implement.

[1]: openshift#426
Member Author

wking commented Apr 17, 2021

I've floated #737 with a scoped-down version of this proposal that only supports targeting by platform using a client-provided query parameter (so no service-side PromQL). Hopefully that's small enough to unstick discussion while still being a useful limit on blocked-edge impact.

wking added a commit to wking/openshift-enhancements that referenced this pull request Apr 17, 2021
…rm parameter

I still prefer the flexibility of a PromQL approach [1].  But that is
a bigger update-service shift, and I haven't been able to convince
folks it's worth doing.  This alternative proposal guesses that
"platform" will help with targeting most of our target-able blockers,
and be quicker to implement.

[1]: openshift#426
@@ -0,0 +1,25 @@
digraph updateDecisions {
new [ label="New edge available" ];
test [ label="Do local tests pass?" ];
Member

I expect the majority of customers to run tests before upgrading. However, I am confident that not all customers or use cases need to run tests before upgrading. So having another node where customers go directly to upgrades (by setting up maintenance windows) would be a better representation of what our customers are doing.

Member Author

I've pushed:

$ git --no-pager diff 0d8c588..f606f41 -- enhancements/update/overriding-blocked-edges/
diff --git a/enhancements/update/overriding-blocked-edges/README.md b/enhancements/update/overriding-blocked-edges/README.md
index b011337..775ef0b 100644
--- a/enhancements/update/overriding-blocked-edges/README.md
+++ b/enhancements/update/overriding-blocked-edges/README.md
@@ -38,11 +38,12 @@ Between those two, and similar efforts, I don't think that broadcasting edge-pul
 Because do we support them if they hit some new bug?
 Our docs around this are [currently wiggly][off-graph-support], but I expect there will be a lot of pain if we offer blanket support for folks taking any blocked edge we have ever served in a fast/stable channel.
 
-Also note that the flow is always going through "Do local tests pass?".
+Also note that the flow is always going through "Are there failing local tests?".
 We want users testing in candidate and fast to help turn up issues with their particular cluster flavor that we might not be catching in CI.
 We also want them reporting Telemetry/Insights when possible too, so we can pick out lower-signal feedback that they might miss.
 Or at least filing bugs if they find an issue and aren't reporting it via Telemetry/Insights.
 Phased roll-outs ([when we get them][phased-rollouts]) allow us to minimize the impact of "nobody explicitly tested this flavor/workflow (enough times, for flaky failures)", but is ideally not the first line of defense.
+But having no local tests and relying on the rest of the fleet reporting Telemetry/Insights is still supported, so users could work through the flow with "We have no failing local tests, because we don't run local tests" if they want.
 
 If I was administering a production cluster, I'd personally be conservative and avoid both dashed and orange "yes" paths.
 
diff --git a/enhancements/update/overriding-blocked-edges/flow.dot b/enhancements/update/overriding-blocked-edges/flow.dot
index bd27504..42c8926 100644
--- a/enhancements/update/overriding-blocked-edges/flow.dot
+++ b/enhancements/update/overriding-blocked-edges/flow.dot
@@ -1,6 +1,6 @@
 digraph updateDecisions {
        new [ label="New edge available" ];
-       test [ label="Do local tests pass?" ];
+       test [ label="Are there failing local tests?" ];
        pulled [ label="Does Red Hat pull the edge?" ];
        localConfidence [ label="Are we confident in\nour local test coverage?" ];
        redHatConfidence [ label="Are we confident Red Hat's\nreasoning is exhaustive?" ];
@@ -10,9 +10,9 @@ digraph updateDecisions {
        wait [ label="Wait for a new edge" ];
 
        new -> test;
-       test -> wait [ label="no" ];
+       test -> wait [ label="yes" ];
        wait -> new;
-       test -> pulled [ label="yes" ];
+       test -> pulled [ label="no" ];
        pulled -> update [ label="no" ];
        pulled -> localConfidence [ label="yes" ];
        localConfidence -> update [ label="yes"; color="orange"; style="dashed" ];
diff --git a/enhancements/update/overriding-blocked-edges/flow.svg b/enhancements/update/overriding-blocked-edges/flow.svg
...

to try to address this use-case.

Member

But having no local tests and relying on the rest of the fleet reporting Telemetry/Insights is still supported, so users could work through the flow with "We have no failing local tests, because we don't run local tests" if they want.

I can live with the current flow, but I would have made it explicit.

pulled [ label="Does Red Hat pull the edge?" ];
localConfidence [ label="Are we confident in\nour local test coverage?" ];
redHatConfidence [ label="Are we confident Red Hat's\nreasoning is exhaustive?" ];
bug [ label="What bug(s) did Red Hat link?"; color="orange"; style="dashed" ];
Member

Maybe we should mention that communication about the bug will happen through the text-only errata.

Member Author

I'm agnostic about if/how this information gets pushed out, because I don't think customers should use it for anything. My understanding of the text-only errata was that customers would use it to say "ahh, good to know that RH is not pulling the edge for no reason" and that the intention was not trying to get customers to decide if they were impacted by the edge or not. Do you see a use-case for users deciding to ignore our lack-of-recommendation once we have targeted edge blocking?

Member

Do you see a use-case for users deciding to ignore our lack-of-recommendation once we have targeted edge blocking?

Nope

localConfidence [ label="Are we confident in\nour local test coverage?" ];
redHatConfidence [ label="Are we confident Red Hat's\nreasoning is exhaustive?" ];
bug [ label="What bug(s) did Red Hat link?"; color="orange"; style="dashed" ];
apply [ label="Do the bug(s) apply to my clusters?" ];
Member

Also, I do not think this is correct. When RH removes an edge, the only way for customers to update is to force the update, and customers are not supposed to do that.

Member Author

No need to force. You can `oc adm upgrade --allow-explicit-upgrade --to-image $PULLSPEC`, regardless of whether your configured upstream update service recommends an update or not. And with the support pivot from openshift/openshift-docs#32091, we will still support folks who update over once-supported-but-no-longer-recommended edges. Although personally, I can't think of a situation where I would recommend that. But still, the node is in the flow graph because some users have historically been tempted to do this, so I'm including it so I can explain why I think it's not a useful idea.

And [alerting on available updates][alert-on-available-updates] will inform folks who have had an edge restored that they have been ignoring the (new) update opportunity.
Between those two, and similar efforts, I don't think that broadcasting edge-pulled motivations is a useful activity, and I think encouraging users to actively consume our blocked-edge reasoning in order to green-light their own off-graph updates is actively harmful.
Because do we support them if they hit some new bug?
Our docs around this are [currently wiggly][off-graph-support], but I expect there will be a lot of pain if we offer blanket support for folks taking any blocked edge we have ever served in a fast/stable channel.
Member

We should make this clear in the docs. I do not think RH's support is waived if they have Cincinnati running locally without the edge blocked. But using `--allow-explicit-upgrade` or `--force` will definitely make CEE uncomfortable. Even so, if they end up using `--force` and then call CEE for some issue, we will help them.

Member Author

openshift/openshift-docs#32091 is still in flight, but it replaces the wiggle section with a clearer support statement.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sdodson, vrutkovs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Incoming bugs are evaluated to determine an impact statement based on [a generic template][rhbz-1858026-impact-statement-request].
Some bugs only impact specific platforms, or clusters with other specific features.
For example, rhbz#1858026 [only impacted][rhbz-1858026-impact-statement] clusters with the `None` platform which were created as 4.1 clusters and subsequently updated via 4.2, 4.3, and 4.4 to reach 4.5.
Member

If you have more examples (which I think you have), please add them here.

Member Author

Do we need an exhaustive list? I thought a single example would be sufficient to show that this sort of thing happens. What would additional examples add beyond that?

Some bugs only impact specific platforms, or clusters with other specific features.
For example, rhbz#1858026 [only impacted][rhbz-1858026-impact-statement] clusters with the `None` platform which were created as 4.1 clusters and subsequently updated via 4.2, 4.3, and 4.4 to reach 4.5.
In those cases there is currently tension between wanting to protect vulnerable clusters by blocking the edge vs. wanting to avoid inconveniencing clusters which we know are not vulnerable and whose administrators may have been planning on taking the recommended update.
This enhancement aims to reduce that tension.
Member

We should also add that we will publish a text-only errata when we remove edges from the stable and fast channels to the latest Y-stream release. Customers will get the bug reference from the errata and find out whether the bug is specific to a platform. We can expect customers to ask us not to block them when a bug is specific to a platform they are not on. It is also logical to reduce the impact radius of the edge removal.

Member Author

Can we roll discussion of the text-only errata into the earlier thread? I think it's orthogonal to this PR, but don't want to argue that out in two parallel threads ;)

Member

Basically I am trying to pitch that, by releasing text-only errata, we are making the case for targeted edge blocking. It's the next logical step.

This query will be submitted to a configurable Prometheus service and should return a set of matching records with `_id` labels.
Clusters whose [submitted `id` query parameter][cincinnati-for-openshift-request] is in the set of returned IDs are allowed if the value of the matching record is 1 and blocked if the value of the matching record is 0.
Clusters whose submitted `id` is not in the result set, or which provide no `id` parameter, are also blocked.
Blocking too many edges is better than blocking too few, because you can recover from the former, but possibly brick clusters on the latter.
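The allow/block semantics quoted above amount to a small fail-closed decision, which could be sketched like this (names are hypothetical; the enhancement does not prescribe an implementation):

```python
def edge_allowed(cluster_id, promql_results):
    """Decide whether a conditional edge is recommended for one cluster.

    `promql_results` maps the `_id` label of each returned record to its
    value (1 = allowed, 0 = blocked).  Cache misses and requests with no
    `id` parameter fail closed, i.e. the edge is blocked.
    """
    if cluster_id is None:  # no `id` query parameter on the request
        return False
    if cluster_id not in promql_results:  # not in the result set: fail closed
        return False
    return promql_results[cluster_id] == 1


results = {"aaa": 1, "bbb": 0}
assert edge_allowed("aaa", results)      # explicitly allowed
assert not edge_allowed("bbb", results)  # explicitly blocked
assert not edge_allowed("ccc", results)  # cache miss: blocked
assert not edge_allowed(None, results)   # no id parameter: blocked
```

The last two branches are the "blocking too many is better than blocking too few" trade-off under debate in this thread.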
Member

I do not think this is a good idea. IMO we should do the opposite, i.e. we should not block the cluster if the request does not provide an id parameter, because I am afraid that we will block clusters without knowing which clusters are blocked. When we block a cluster, we should know which ones are blocked so that we can unblock them in the future. CEE can reach out to them if required.

Member Author

I think it is very unlikely that the CVO will ever submit a request without an id parameter. We can monitor this in Telemetry/Insights by looking for clusters which do not have expected recommended updates. And failing closed sets up "I'm surprised I'm blocked -> open support ticket -> we fix Cincinnati to restore the recommendation", which is a lot safer than failing open's "I'm not blocked, but I should have been -> update -> boom".

wking added 3 commits May 11, 2021 11:50
I keep repeating this in various internal discussions.  Collect it all
up with a bow so I can just drop links ;).
Generated with:

  $ (cd enhancements/update/overriding-blocked-edges; dot -Tsvg flow.dot >flow.svg)

using:

  $ dot -V
  dot - graphviz version 2.46.0 (0)
@wking wking force-pushed the targeted-edge-blocking branch from 0d8c588 to f606f41 Compare May 11, 2021 18:51
@openshift-ci
Contributor

openshift-ci bot commented May 11, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sdodson, vrutkovs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Contributor

openshift-ci bot commented May 11, 2021

@wking: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/markdownlint f606f41 link /test markdownlint

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

When a request is received, the submitted [`channel` query parameter][cincinnati-for-openshift-request] limits the set of remaining edges.
If any of the remaining edges have `clusters.promql` entries, a new, targeted-edge-blocking policy engine plugin will exclude the edge if the [`id` query parameter][cincinnati-for-openshift-request] is a member of the query's resulting ID set.
If the request does not set the `id` parameter, the plugin should block all conditional edges, and does not need to check a cache or make PromQL queries.
To perform the PromQL request, the update service will be extended with configuration for the new policy enging plugin.
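A rough sketch of that plugin pass, with hypothetical names (the PromQL results are assumed to be pre-resolved into sets of allowed cluster IDs):

```python
def filter_edges(edges, cluster_id, allowed_ids_by_query):
    """Hypothetical sketch of the conditional-edge plugin pass described above.

    `edges`: list of dicts; a conditional edge carries a "clusters.promql"
    key naming its query.  `allowed_ids_by_query`: query -> set of cluster
    IDs whose PromQL record had value 1.  Requests without an `id`
    parameter drop every conditional edge, with no cache or PromQL lookup.
    """
    kept = []
    for edge in edges:
        query = edge.get("clusters.promql")
        if query is None:
            kept.append(edge)  # unconditional edge: always recommended
        elif cluster_id is not None and cluster_id in allowed_ids_by_query.get(query, set()):
            kept.append(edge)  # conditional edge, explicitly allowed for this cluster
    return kept


edges = [
    {"from": "4.5.2", "to": "4.5.3"},
    {"from": "4.5.2", "to": "4.6.0", "clusters.promql": "q1"},
]
allowed = {"q1": {"aaa"}}
print(len(filter_edges(edges, "aaa", allowed)))  # 2: conditional edge allowed
print(len(filter_edges(edges, None, allowed)))   # 1: conditional edge dropped
```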
Member

s/enging/engine/

Member Author

wking commented Jul 2, 2021

Replaced by the v3 #821, where I'm now recommending cluster-side decisions, with the update service declaring conditional edges and telling the cluster-side tools how to decide if the cluster is vulnerable to the known bugs impacting its update decisions.

@wking wking closed this Jul 2, 2021
@wking wking deleted the targeted-edge-blocking branch July 2, 2021 04:04