61 changes: 61 additions & 0 deletions enhancements/update/overriding-blocked-edges/README.md
@@ -0,0 +1,61 @@
# Overriding Blocked Edges

We work hard to ensure update and release stability before even cutting candidate release images, but sometimes bugs slip through, as you can see [from our blocked edges][blocked-edges].
This can make users uncomfortable.
For example, a user might:

1. See a new A -> B edge appear in their channel.
2. Begin sending stage clusters over the edge to test its stability for their cluster flavor.
3. Become confident in the update's stability for their flavor.
4. Schedule a maintenance window to update their production cluster(s).
5. Have the edge pulled from their channel, causing distress when the maintenance window arrives and the A -> B edge is no longer recommended.

There have been many requests for rendering update graphs or pushing notifications about edge-pulls to mitigate these concerns, but I'm still not clear on how useful those would be.
I understand the user's decision tree for production clusters to flow something like:

<div style="text-align:center">
<img src="flow.svg" width="100%" />
</div>
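
The figure is rendered from the `flow.dot` source next to this document.
If you tweak the graph, the SVG can be regenerated with Graphviz, roughly like this (a sketch, assuming the `dot` tool is installed):

```console
$ dot -Tsvg flow.dot > flow.svg
```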

I've marked the risky flows dashed and orange and the conservative flows thick and blue.

For me the sticking point is: who would ever answer "yes" to the "Are we confident Red Hat's reasoning is exhaustive?" question?
If you feel like there are bugs you missed in your local testing but that there are no bugs which Red Hat has missed, why are you even doing local testing?
I've marked the "What bug(s) did Red Hat link?" node dashed and orange to show that I consider it a risky pathway that we do not need to enable with a convenient, structured response.
Member

Assuming customers look into the cincinnati-graph-data repository to see which BZ caused the edge removal is not right. Currently we have not gone down that path with customers, i.e. we have not explicitly communicated about cincinnati-graph-data to customers, AFAIK.

Member Author

Assuming customers look into the cincinnati-graph-data repository to see which BZ caused the edge removal is not right. Currently we have not gone down that path with customers, i.e. we have not explicitly communicated about cincinnati-graph-data to customers, AFAIK.

I am not recommending customers look at cincinnati-graph-data. I am pointing out that knowing which bugs RH considered when blocking an edge doesn't seem like particularly useful information. If you buy that argument, do we care about if/how users can discover the list of bugs? I am happy they are public, but also happy they are in free-form comments and not exposed via a structured API. I would not be too put out if they were private. I am against exposing them in a public, structured API because that gives the impression that folks should care about them, and I don't see a workflow where that would make sense.


We try to update the edge-blocking comments as we discover new reasons (see the sketch after this list), but:

1. There are probably additional edge-blocking reasons that we are not aware of at the moment.
2. We don't look too hard at edges after we pull them from the graph.
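
For anyone not familiar with the mechanism, a blocked edge is declared with a small YAML file in [cincinnati-graph-data][blocked-edges], and the reasoning lives in free-form comments in that file.
Here is a sketch; the filename, bug link, and versions are made up for illustration, so see the repository for real entries:

```console
$ cat blocked-edges/4.5.3-example.yaml
# https://bugzilla.redhat.com/show_bug.cgi?id=0000000
# Hypothetical regression that makes updates into 4.5.3 risky.
to: 4.5.3
from: .*
```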

For pulling edges, we just need to find *one* sufficiently severe bug on the edge.
For initiating an update, users need confidence that there are no likely, severe bugs affecting their clusters.
These are two different things.

Getting [targeted edge blocking](../targeted-update-edge-blocking.md) in place will allow us to decrease the amount of collateral damage where we currently have to pull an edge for all clusters to protect a known, vulnerable subset of clusters.
And [alerting on available updates][alert-on-available-updates] will inform folks who have had an edge restored that they have been ignoring the (new) update opportunity.
Between those two, and similar efforts, I don't think that broadcasting edge-pulled motivations is a useful activity, and I think encouraging users to actively consume our blocked-edge reasoning in order to green-light their own off-graph updates is actively harmful.
After all, do we support them if they hit some new bug?
Our docs around this are [currently wiggly][off-graph-support], but I expect there will be a lot of pain if we offer blanket support for folks taking any blocked edge we have ever served in a fast/stable channel.
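
To make the targeted-blocking idea concrete: the goal is to scope a block to the subset of clusters we believe are exposed, instead of pulling the edge for everyone.
A purely hypothetical sketch of what such a declaration might grow into (the targeting fields below are illustrative assumptions, not a final schema; see the targeted edge blocking enhancement for the actual proposal):

```console
$ cat blocked-edges/4.5.3-example.yaml
# https://bugzilla.redhat.com/show_bug.cgi?id=0000000
to: 4.5.3
from: .*
# Hypothetical targeting stanza: only block the edge for clusters that
# match some rule, e.g. a particular platform or configuration.
matchingRules:
- type: PromQL
  promql: cluster_infrastructure_provider{type="ExampleProvider"}
```
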
Member

We should make it clear in the docs. I do not think RH's support is waived if they have Cincinnati running locally which does not have the edge blocked. But using `--allow-explicit-upgrade` or `--force` will definitely make CEE uncomfortable. But even if they end up using `--force` and then call CEE for some issue, we will help them.

Member Author

openshift/openshift-docs#32091 is still in flight, but it replaces the wiggle section with a clearer support statement.


Also note that the flow always goes through "Are there failing local tests?".
We want users testing in candidate and fast to help turn up issues with their particular cluster flavor that we might not be catching in CI.
We also want them reporting Telemetry/Insights when possible, so we can pick out lower-signal feedback that they might miss.
Or at least filing bugs if they find an issue and aren't reporting it via Telemetry/Insights.
Phased roll-outs ([when we get them][phased-rollouts]) allow us to minimize the impact of "nobody explicitly tested this flavor/workflow (enough times, for flaky failures)", but they are ideally not the first line of defense.
But having no local tests and relying on the rest of the fleet reporting Telemetry/Insights is still supported, so users could work through the flow with "We have no failing local tests, because we don't run local tests" if they want.

If I were administering a production cluster, I'd personally be conservative and avoid both of the dashed, orange "yes" paths.

If a user came to me claiming "yes" to "Are we confident in our local test coverage?", I'd mention the existence of `oc adm upgrade --allow-explicit-upgrade --to-image ...`, and then immediately start trying to talk them into "no" (for both the local confidence and the "Are we confident Red Hat's reasoning is exhaustive?" nodes) to get them over to the "Wait for a new edge" endpoint.
But if they stick with "yes" and waive support, it's their cluster and their decision to make.
Testing clusters and others that are easily replaceable can attempt risky updates without going through this decision tree, because if they have issues during the update or on the target release, the administrators can just tear down the impacted cluster and provision a replacement on the original release.
This sort of testing with expendable clusters is how administrators would address the "local test" portions of the production cluster decision tree.
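
For completeness, the override mentioned above looks roughly like this, with the pullspec standing in for whatever by-digest release image you have vetted:

```console
$ oc adm upgrade --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:<digest>
```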

And hopefully updates soon become boring, reliable details, and folks just set [an auto-update switch][auto-update] and forget about scheduling updates entirely.

[alert-on-available-updates]: https://github.com/openshift/cluster-version-operator/pull/415
[auto-update]: https://github.com/openshift/enhancements/pull/124
[blocked-edges]: https://github.com/openshift/cincinnati-graph-data/tree/master/blocked-edges
[off-graph-support]: https://github.com/openshift/openshift-docs/blame/0a4d88729eccc2323ff319346e7824ca2f964b9e/modules/understanding-upgrade-channels.adoc#L101-L102
[phased-rollouts]: https://github.com/openshift/enhancements/pull/427
25 changes: 25 additions & 0 deletions enhancements/update/overriding-blocked-edges/flow.dot
@@ -0,0 +1,25 @@
digraph updateDecisions {
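  // Decision tree for production-cluster updates: dashed/orange marks risky flows, thick/blue marks conservative flows.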
new [ label="New edge available" ];
test [ label="Are there failing local tests?" ];
pulled [ label="Does Red Hat pull the edge?" ];
localConfidence [ label="Are we confident in\nour local test coverage?" ];
redHatConfidence [ label="Are we confident Red Hat's\nreasoning is exhaustive?" ];
bug [ label="What bug(s) did Red Hat link?"; color="orange"; style="dashed" ];
Member

Maybe we should mention that communication about the bug will happen through the text-only errata.

Member Author

I'm agnostic about if/how this information gets pushed out, because I don't think customers should use it for anything. My understanding of the text-only errata was that customers would use it to say "ahh, good to know that RH is not pulling the edge for no reason", and that the intention was not to get customers to decide whether they were impacted by the edge or not. Do you see a use-case for users deciding to ignore our lack-of-recommendation once we have targeted edge blocking?

Member

Do you see a use-case for users deciding to ignore our lack-of-recommendation once we have targeted edge blocking?

Nope

apply [ label="Do the bug(s) apply to my clusters?" ];
Member

Also, I do not think this is correct. When RH removes an edge, the only way for customers to update is to force the update, and customers are not supposed to do that.

Member Author

No need to force. You can `oc adm upgrade --allow-explicit-upgrade --to-image $PULLSPEC`, regardless of whether your configured upstream update service recommends an update or not. And with the support pivot from openshift/openshift-docs#32091, we will still support folks who update over once-supported-but-no-longer-recommended edges. Although personally, I can't think of a situation where I would recommend that. But the node is still in the flow graph, because some users have historically been tempted to do this, so I'm including it so I can explain why I think it's not a useful idea.

update [ label="Production update" ];
wait [ label="Wait for a new edge" ];

new -> test;
test -> wait [ label="yes" ];
wait -> new;
test -> pulled [ label="no" ];
pulled -> update [ label="no" ];
pulled -> localConfidence [ label="yes" ];
localConfidence -> update [ label="yes"; color="orange"; style="dashed" ];
localConfidence -> redHatConfidence [ label="no"; color="blue"; penwidth="3" ];
redHatConfidence -> wait [ label="no"; color="blue"; penwidth="3" ];
redHatConfidence -> bug [ label="yes"; color="orange"; style="dashed" ];
bug -> apply;
apply -> wait [ label="yes" ];
apply -> update [ label="no" ];
}
157 changes: 157 additions & 0 deletions enhancements/update/overriding-blocked-edges/flow.svg