From 22cb5507379a25eda13cedadd2f27dea6459ee40 Mon Sep 17 00:00:00 2001 From: "W. Trevor King" Date: Wed, 12 Aug 2020 09:53:46 -0700 Subject: [PATCH 1/3] enhancements/update/overriding-blocked-edges: Discuss this decision tree I keep repeating this in various internal discussions. Collect it all up with a bow so I can just drop links ;). --- .../update/overriding-blocked-edges/README.md | 62 +++++++++++++++++++ .../update/overriding-blocked-edges/flow.dot | 25 ++++++++ 2 files changed, 87 insertions(+) create mode 100644 enhancements/update/overriding-blocked-edges/README.md create mode 100644 enhancements/update/overriding-blocked-edges/flow.dot diff --git a/enhancements/update/overriding-blocked-edges/README.md b/enhancements/update/overriding-blocked-edges/README.md new file mode 100644 index 0000000000..d4789f3608 --- /dev/null +++ b/enhancements/update/overriding-blocked-edges/README.md @@ -0,0 +1,62 @@ +# Overriding Blocked Edges + +We work hard to ensure update and release stability before even cutting candidate release images, but sometimes bugs slip through, as you can see [by our blocked edges][blocked-edges]. +This can make users uncomfortable. +For example, a user might: + +1. See a new A -> B edge appear in their channel. +2. Begin sending stage clusters over the edge to test its stability for their cluster flavor. +3. Become confident in the update's stability for their flavor. +4. Schedule a maintenance window to update their production cluster(s). +5. Have the edge pulled from their channel, causing distress when the maintenance window arrives and the A -> B edge is no longer recommended. + +There have been many requests about rendering update graphs or pushing notifications about edge-pulls to mitigate these concerns, but I'm still not clear on how useful those would be. +I understand the user's decision tree for production clusters to flow something like: + +
+![Decision flow](flow.svg)
+
+I've marked the risky flows dashed and orange and the conservative flows thick and blue.
+
+For me the sticking point is: who would ever say "yes" to "Are we confident Red Hat's reasoning is exhaustive?"?
+If you feel like there may be bugs you missed in your local testing, but that there are no bugs which Red Hat has missed, why are you even running local tests?
+I've marked the "What bug(s) did Red Hat link?" node dashed and orange to show that I consider it a risky pathway that we do not need to enable with a convenient, structured response.
+
+We try to update the edge-blocking comments as we discover new reasons, but:
+
+1. There are probably additional edge-blocking reasons that we are not aware of at the moment.
+2. We don't look too hard at edges after we pull them from the graph.
+
+For pulling edges, we just need to find *one* sufficiently severe bug on the edge.
+For initiating an update, users need confidence that there are no likely, severe bugs affecting their clusters.
+These are two different things.
+
+Getting [targeted edge blocking][targeted-edge-blocking] in place will allow us to decrease the amount of collateral damage where we currently have to pull an edge for all clusters to protect a known, vulnerable subset of clusters.
+And [alerting on available updates][alert-on-available-updates] will inform folks who have had an edge restored that they have been ignoring the (new) update opportunity.
+Between those two, and similar efforts, I don't think that broadcasting edge-pulled motivations is a useful activity, and I think encouraging users to actively consume our blocked-edge reasoning in order to green-light their own off-graph updates is actively harmful.
+Because do we support them if they hit some new bug?
+Our docs around this are [currently wiggly][off-graph-support], but I expect there will be a lot of pain if we offer blanket support for folks taking any blocked edge we have ever served in a fast/stable channel.
+
+Also note that the flow always goes through "Are there failing local tests?".
+We want users testing in candidate and fast to help turn up issues with their particular cluster flavor that we might not be catching in CI.
+We also want them reporting via Telemetry/Insights when possible, so we can pick out lower-signal feedback that they might miss.
+Or at least filing bugs if they find an issue and aren't reporting it via Telemetry/Insights.
+Phased roll-outs ([when we get them][phased-rollouts]) allow us to minimize the impact of "nobody explicitly tested this flavor/workflow (enough times, for flaky failures)", but they are ideally not the first line of defense.
+But having no local tests and relying on the rest of the fleet reporting Telemetry/Insights is still supported, so users could work through the flow with "We have no failing local tests, because we don't run local tests" if they want.
+
+If I were administering a production cluster, I'd personally be conservative and avoid both of the dashed, orange "yes" paths.
+
+If a user came to me claiming "yes" to "Are we confident in our local test coverage?", I'd mention the existence of `oc adm upgrade --allow-explicit-upgrade --to-image ...`, and then immediately start trying to talk them into "no" (for both the local confidence and the "Are we confident Red Hat's reasoning is exhaustive?" nodes) to get them over to the "Wait for a new edge" endpoint.
+But if they stick to "yes" and waive support, it's their cluster and their decision to make.
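+
+Spelled out, that override looks something like this (the release-image digest here is a placeholder):
+
+```console
+$ oc adm upgrade --allow-explicit-upgrade \
+    --to-image quay.io/openshift-release-dev/ocp-release@sha256:<target-digest>
+```
+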
+Testing clusters and others that are easily replaceable can attempt risky updates without going through this decision tree, because if they have issues during the update or on the target release, the administrators can just tear down the impacted cluster and provision a replacement on the original release.
+This sort of testing with expendable clusters is how administrators would address the "local test" portions of the production cluster decision tree.
+
+And hopefully updates soon become boring, reliable details, and folks just set [an auto-update switch][auto-update] and forget about scheduling updates entirely.
+
+[alert-on-available-updates]: https://github.com/openshift/cluster-version-operator/pull/415
+[auto-update]: https://github.com/openshift/enhancements/pull/124
+[blocked-edges]: https://github.com/openshift/cincinnati-graph-data/tree/master/blocked-edges
+[off-graph-support]: https://github.com/openshift/openshift-docs/blame/0a4d88729eccc2323ff319346e7824ca2f964b9e/modules/understanding-upgrade-channels.adoc#L101-L102
+[phased-rollouts]: https://github.com/openshift/enhancements/pull/427
+[targeted-edge-blocking]: https://github.com/openshift/enhancements/pull/426
diff --git a/enhancements/update/overriding-blocked-edges/flow.dot b/enhancements/update/overriding-blocked-edges/flow.dot
new file mode 100644
index 0000000000..42c892670c
--- /dev/null
+++ b/enhancements/update/overriding-blocked-edges/flow.dot
@@ -0,0 +1,25 @@
+digraph updateDecisions {
+  new [ label="New edge available" ];
+  test [ label="Are there failing local tests?" ];
+  pulled [ label="Does Red Hat pull the edge?" ];
+  localConfidence [ label="Are we confident in\nour local test coverage?" ];
+  redHatConfidence [ label="Are we confident Red Hat's\nreasoning is exhaustive?" ];
+  bug [ label="What bug(s) did Red Hat link?"; color="orange"; style="dashed" ];
+  apply [ label="Do the bug(s) apply to my clusters?" ];
+  update [ label="Production update" ];
+  wait [ label="Wait for a new edge" ];
+
+  new -> test;
+  test -> wait [ label="yes" ];
+  wait -> new;
+  test -> pulled [ label="no" ];
+  pulled -> update [ label="no" ];
+  pulled -> localConfidence [ label="yes" ];
+  localConfidence -> update [ label="yes"; color="orange"; style="dashed" ];
+  localConfidence -> redHatConfidence [ label="no"; color="blue"; penwidth="3" ];
+  redHatConfidence -> wait [ label="no"; color="blue"; penwidth="3" ];
+  redHatConfidence -> bug [ label="yes"; color="orange"; style="dashed" ];
+  bug -> apply;
+  apply -> wait [ label="yes" ];
+  apply -> update [ label="no" ];
+}

From 585b81abb4fcc33591e52874a235118cf4dd3e97 Mon Sep 17 00:00:00 2001
From: "W. Trevor King"
Date: Wed, 12 Aug 2020 09:56:14 -0700
Subject: [PATCH 2/3] enhancements/update/overriding-blocked-edges/flow: Render DOT to SVG

Generated with:

  $ (cd enhancements/update/overriding-blocked-edges; dot -Tsvg flow.dot >flow.svg)

using:

  $ dot -V
  dot - graphviz version 2.46.0 (0)
---
 .../update/overriding-blocked-edges/flow.svg | 157 ++++++++++++++++++
 1 file changed, 157 insertions(+)
 create mode 100644 enhancements/update/overriding-blocked-edges/flow.svg

diff --git a/enhancements/update/overriding-blocked-edges/flow.svg b/enhancements/update/overriding-blocked-edges/flow.svg
new file mode 100644
index 0000000000..b5164f53dd
--- /dev/null
+++ b/enhancements/update/overriding-blocked-edges/flow.svg
@@ -0,0 +1,157 @@
+<!-- 157 lines of SVG markup elided: Graphviz rendering of flow.dot, with the same node and edge labels. -->

From f606f4146a6de7fcf4f2fb2567f849094b7c463f Mon Sep 17 00:00:00 2001
From: "W. Trevor King"
Date: Fri, 7 Aug 2020 12:07:00 -0700
Subject: [PATCH 3/3] enhancements/update/targeted-update-edge-blocking: Propose a new enhancement

---
 .../update/overriding-blocked-edges/README.md |   3 +-
 .../update/targeted-update-edge-blocking.md   | 248 ++++++++++++++++++
 2 files changed, 249 insertions(+), 2 deletions(-)
 create mode 100644 enhancements/update/targeted-update-edge-blocking.md

diff --git a/enhancements/update/overriding-blocked-edges/README.md b/enhancements/update/overriding-blocked-edges/README.md
index d4789f3608..775ef0be90 100644
--- a/enhancements/update/overriding-blocked-edges/README.md
+++ b/enhancements/update/overriding-blocked-edges/README.md
@@ -32,7 +32,7 @@ For pulling edges, we just need to find *one* sufficiently severe bug on the edg
 For initiating an update, users need confidence that there are no likely, severe bugs affecting their clusters.
 These are two different things.
 
-Getting [targeted edge blocking][targeted-edge-blocking] in place will allow us to decrease the amount of collateral damage where we currently have to pull an edge for all clusters to protect a known, vulnerable subset of clusters.
+Getting [targeted edge blocking](../targeted-update-edge-blocking.md) in place will allow us to decrease the amount of collateral damage where we currently have to pull an edge for all clusters to protect a known, vulnerable subset of clusters.
 And [alerting on available updates][alert-on-available-updates] will inform folks who have had an edge restored that they have been ignoring the (new) update opportunity.
 Between those two, and similar efforts, I don't think that broadcasting edge-pulled motivations is a useful activity, and I think encouraging users to actively consume our blocked-edge reasoning in order to green-light their own off-graph updates is actively harmful.
 Because do we support them if they hit some new bug?
@@ -59,4 +59,3 @@ And hopefully updates soon become boring, reliable details, and folks just set [ [blocked-edges]: https://github.com/openshift/cincinnati-graph-data/tree/master/blocked-edges [off-graph-support]: https://github.com/openshift/openshift-docs/blame/0a4d88729eccc2323ff319346e7824ca2f964b9e/modules/understanding-upgrade-channels.adoc#L101-L102 [phased-rollouts]: https://github.com/openshift/enhancements/pull/427 -[targeted-edge-blocking]: https://github.com/openshift/enhancements/pull/426 diff --git a/enhancements/update/targeted-update-edge-blocking.md b/enhancements/update/targeted-update-edge-blocking.md new file mode 100644 index 0000000000..723d3e4145 --- /dev/null +++ b/enhancements/update/targeted-update-edge-blocking.md @@ -0,0 +1,248 @@ +--- +title: targeted-update-edge-blocking +authors: + - "@wking" +reviewers: + - "@dofinn" + - "@LalatenduMohanty" + - "@sdodson" + - "@steveeJ" + - "@vrutkovs" +approvers: + - TBD +creation-date: 2020-07-07 +last-updated: 2020-08-12 +status: implementable +--- + +# Targeted Update Edge Blocking + +## Release Signoff Checklist + +- [x] Enhancement is `implementable` +- [x] Design details are appropriately documented from clear requirements +- [x] Test plan is defined +- [x] Graduation criteria for dev preview, tech preview, GA +- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/) + +## Summary + +This enhancement proposes a mechanism for blocking edges for the subset of clusters considered vulnerable to known issues with a particular update or target release. + +## Motivation + +When managing the [Cincinnati][cincinnati-spec] update graph [for OpenShift][cincinnati-for-openshift-design], we sometimes discover issues with particular release images or updates between them. +Once an issue is discovered, we [block edges][block-edges] so we no longer recommend risky updates, or updates to risky releases. + +Note: as described [in the documentation][support-documentation], supported updates are still supported even if incoming edges are blocked, and Red Hat will eventually provide supported update paths from any supported release to the latest supported release in its z-stream. + +Incoming bugs are evaluated to determine an impact statement based on [a generic template][rhbz-1858026-impact-statement-request]. +Some bugs only impact specific platforms, or clusters with other specific features. +For example, rhbz#1858026 [only impacted][rhbz-1858026-impact-statement] clusters with the `None` platform which were created as 4.1 clusters and subsequently updated via 4.2, 4.3, and 4.4 to reach 4.5. +In those cases there is currently tension between wanting to protect vulnerable clusters by blocking the edge vs. wanting to avoid inconveniencing clusters which we know are not vulnerable and whose administrators may have been planning on taking the recommended update. +This enhancement aims to reduce that tension. + +### Goals + +* [Cincinnati graph-data][graph-data] maintainers will have the ability to block edges for the subset of clusters matching a particular Prometheus query. + +### Non-Goals + +* Blocking edges based on data that is not [uploaded to Telemetry][uploaded-telemetry]. +* Blocking edges for subsets of clusters when the update service is unable to reach or authenticate with an aggregating Prometheus server. + For example, users running a local update service will not be able to access Red Hat's internal Telemetry to determine the requesting cluster's platform, etc. 
+* Exactly scoping the set of blocked clusters to those which would have been impacted by the issue.
+  For example, some issues may be races where the impacted cluster set is a random subset of the vulnerable cluster set.
+  Any targeting of the blocked edges will reduce the number of blocked clusters which would not have been impacted, and thus reduce the contention between protecting vulnerable clusters and inconveniencing invulnerable clusters.
+* Specifying a particular update service implementation.
+  This enhancement floats some ideas, but the details of the chosen approach are up to each update service's maintainers.
+
+## Proposal
+
+### Enhanced graph-data schema for blocking edges
+
+[The blocked-edges schema][block-edges] will be extended with the following new properties:
+
+* `clusters` (optional, [object][json-object]), defining the subset of affected clusters.
+  If any `clusters` property matches a given cluster, the edge should be blocked for that cluster.
+  * `promql` (optional, [string][json-string]), with a [PromQL][] query describing affected clusters.
+    This query will be submitted to a configurable Prometheus service and should return a set of matching records with `_id` labels.
+    Clusters whose [submitted `id` query parameter][cincinnati-for-openshift-request] is in the set of returned IDs are allowed if the value of the matching record is 1 and blocked if the value of the matching record is 0.
+    Clusters whose submitted `id` is not in the result set, or which provide no `id` parameter, are also blocked.
+    Blocking too many edges is better than blocking too few, because you can recover from the former, while the latter may brick clusters.
+
+    The result of the query may be cached by the update service, so queries which generate unstable result sets should be avoided.
+    Clusters whose submitted `id` is not in the result set may be considered cache misses and, if a cache refresh still fails to include them, may be cached as `blocked`.
+
+[The schema version][graph-data-schema-version] would also be bumped to 1.1.0, because this is a backwards-compatible change.
+Consumers who only understand graph-data schema 1.0.0 would ignore the `clusters` property and block the edge for all clusters.
+While blocking the edge for all clusters is more aggressive than the graph-data maintainers intended, it errs on the side of safety.
+Blocking these edges for all clusters is also not as aggressive as complaining about an unrecognized 2.0.0 version and failing to serve the entire graph.
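+
+For illustration, a hypothetical 1.1.0-schema entry along the lines of [the rhbz#1858026 case](#motivation) might look like the following; the version pin is arbitrary (compare [the unconditional 4.5.3 blocking][blocking-4.5.3]), and the Telemetry series and query are an untested sketch:
+
+```yaml
+to: 4.5.3
+from: .*
+clusters:
+  # 1 allows, 0 blocks: block only clusters installed as 4.1 on the None platform
+  promql: |
+    max by (_id) (
+      0 * cluster_version{type="initial",version=~"4[.]1[.].*"}
+      and on (_id) cluster_infrastructure_provider{type="None"}
+    )
+    or
+    max by (_id) (0 * cluster_version{type="current"} + 1)
+```
+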
+### Update service support for the enhanced schema
+
+The following recommendations are geared towards the [openshift/cincinnati][cincinnati] implementation.
+Maintainers of other update service implementations may or may not be able to apply them to their own implementation.
+
+The graph-builder's graph-data scraper should learn about [the new 1.1.0 schema](#enhanced-graph-data-schema-for-blocking-edges), and record any `clusters` properties in edge metadata.
+Blocked edges with `clusters` containing a conditional rule like `promql` are "conditional edges" in the following discussion, because the presence of an edge in the recommended update graph returned to a client depends on whether the rule conditions apply to the client cluster or not.
+
+When a request is received, the submitted [`channel` query parameter][cincinnati-for-openshift-request] limits the set of remaining edges.
+If any of the remaining edges have `clusters.promql` entries, a new targeted-edge-blocking policy engine plugin will exclude those edges unless the [`id` query parameter][cincinnati-for-openshift-request] matches a record with value 1 in the query's result set.
+If the request does not set the `id` parameter, the plugin should block all conditional edges, and does not need to check a cache or make PromQL queries.
+To perform the PromQL request, the update service will be extended with configuration for the new policy engine plugin.
+The new plugin will have a configurable Prometheus connection, including a URI, authentication credentials, and an optional set of trusted X.509 certificate authorities.
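+
+As a sketch of the shape only, with hypothetical key names (the actual format is up to the update service's maintainers), that configuration might look like:
+
+```yaml
+# Hypothetical settings for the targeted-edge-blocking policy engine plugin.
+policy_engine_plugins:
+  - name: targeted-edge-blocking
+    prometheus:
+      url: https://prometheus.example.com
+      token_file: /etc/cincinnati/prometheus-token  # authentication credentials
+      ca_file: /etc/cincinnati/prometheus-ca.crt    # optional trusted certificate authorities
+    cache_ttl: 24h  # see the caching discussion below
+```
+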
+To reliably and efficiently return cluster-specific responses, the ID set returned by configured queries may be cached (though see [the *ID flooding* section](#denial-of-service-via-id-flooding), which applies regardless of caching).
+There are a few possibilities for caching, although selecting a particular caching implementation is up to the update service's maintainers.
+The policy engine should refresh the cache on cache misses, as discussed in [the schema section](#enhanced-graph-data-schema-for-blocking-edges).
+
+For every graph-builder response, the policy engine may aggregate queries for all edges and warm the cache of new queries.
+This will make the initial client request faster, at the expense of maintaining a cache that may not be needed before it goes stale.
+
+The policy engine should also be prepared for, and alert on, misconfigured PromQL that results in request failures.
+In the event of such a failure, the fallback behavior should be blocking the relevant edge for all clusters until the misconfiguration is fixed.
+
+### User Stories
+
+#### Bugs which impact a subset of clusters
+
+As described in [the *Motivation* section](#motivation), enabling things like "we'd like to block this edge for clusters born in 4.1 with the `None` platform".
+
+#### OpenShift Dedicated
+
+[OpenShift Dedicated][dedicated] (OSD) restricts the recommended update sets for managed clusters more aggressively than the default service.
+For example, OSD blocked edges into 4.3.19 temporarily while [rhbz#1838007][rhbz-1838007] was investigated, while the default service did not.
+With this enhancement, OSD-specific edge blocking could be accomplished with entries like:
+
+```yaml
+to: 4.3.19
+from: .*
+clusters:
+  # show unmanaged clusters the edge, but block it from managed clusters
+  promql: |
+    max by (_id) (0*subscription_labels{managed="true"})
+    or
+    max by (_id) (subscription_labels{managed="false"})
+```
+
+The query returns `{_id="..."}=0` results for `managed="true"` clusters (blocking the edge) and `{_id="..."}=1` results for `managed="false"` clusters (allowing the edge).
+Clusters with both `managed="true"` and `managed="false"` records in the current time window (because the cluster transitioned into or out of the managed state) will fall into the safer "block" bucket (the `or` semantics are defined [here][PromQL-or]).
+
+### Risks and Mitigations
+
+#### Divergent Prometheus services
+
+There is a risk that [published graph-data information](distribute-secondary-metadata-as-container-image.md) which assumes access to Red Hat's internal Telemetry will block edges when consumed by update services without access to the internal Telemetry.
+The risk is small initially, when we have no such edges.
+If targeted blocking becomes widespread, a simple but tedious mitigation would be asking users to maintain their own graph-data (e.g. removing all targeted edge-blocking references) and build their own graph-data image.
+
+A more powerful but complicated fix would be documenting a process for running a local Telemetry aggregator and configuring the local update service to run queries against that aggregator.
+This documentation would need to cover mechanisms for adding any expected metrics that Red Hat currently adds in internal Telemetry which come up in targeted edge queries, such as [`subscription_labels`](#openshift-dedicated).
+
+Users who do not want to maintain their own graph-data or a local Telemetry aggregator could also deploy their local update service with the targeted edge blocking plugin disabled.
+This will leave all conditional edges *enabled*, with the local administrators taking responsibility for otherwise blocking or avoiding conditional edges to which their client cluster set may be vulnerable.
+
+#### Clusters not reporting data
+
+Clusters which have [opted out of uploading Telemetry][uploaded-telemetry-opt-out] or which run on restricted networks and so are unable to report Telemetry will default to the blocked state (more discussion on unknown ID handling in [the *query coverage* section](#query-coverage)).
+Being unable to report Telemetry while still being able to connect to Red Hat's hosted update service seems unlikely; if a cluster can do one (e.g. via a proxy), it should be able to do both.
+Clusters which connect to neither Telemetry nor Red Hat's hosted update service may run a local update service, and that is discussed in [the *divergent Prometheus services* section](#divergent-prometheus-services).
+
+Clusters which opt out of Telemetry are likely still able to connect to the update service.
+When they do, they will be [blocked as cache-misses](#enhanced-graph-data-schema-for-blocking-edges).
+I think that conservative approach is acceptable, because we cannot tell if the non-reporting cluster is vulnerable to the known issues or not.
+Local administrators should instead test their updates locally and make their own decisions about update safety, as discussed in [the *exposing blocking reasons* section](#exposing-blocking-reasons).
+
+#### Exposing blocking reasons
+
+This enhancement provides no mechanism for telling clients if or why a recommended update edge has been blocked, because [the Cincinnati graph format][cincinnati-spec] provides no mechanism for sharing metadata about recommended edges, or even the existence of not-recommended edges.
+Clients might be tempted to want that information to second-guess a graph-data decision to block the edge, but I am [not convinced the information is actionable](overriding-blocked-edges/README.md).
+
+#### Stranding supported clusters
+
+As described [in the documentation][support-documentation], supported updates are still supported even if incoming edges are blocked, and Red Hat will eventually provide supported update paths from any supported release to the latest supported release in its z-stream.
+There is a risk, with the dynamic, per-cluster graph, that targeted edge blocking removes all outgoing update recommendations for some clusters on supported releases.
+The risk is highest for [clusters which are not reporting data](#clusters-not-reporting-data), so an easy way to audit would be to request a graph without specifying an `id` parameter (which [blocks all conditional edges](#enhanced-graph-data-schema-for-blocking-edges)) and to look for old, dead-end releases.
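+
+For example, an auditor could fetch the anonymous, and therefore maximally-blocked, graph with something like:
+
+```console
+$ curl -sH 'Accept: application/json' 'https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.5'
+```
+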
+#### Denial of service via ID flooding
+
+Malicious clients may flood an update service with graph requests rotating through a series of nominal `id` parameters in an attempt to consume the update service's cache capacity and/or PromQL volume.
+Update service maintainers should consider rate-limiting requests from a single source, requiring token-based authentication, and other mechanisms for limiting the denial-of-service exposure of expensive query requests.
+
+## Design Details
+
+### Test Plan
+
+[The graph-data repository][graph-data] should grow a presubmit test to enforce as much of the new schema as is practical.
+Validating any PromQL beyond "it's a string" is probably more trouble than it's worth, because consuming update services should [alert on and safely handle](#update-service-support-for-the-enhanced-schema) such misconfiguration.
+
+Unit-testing the behavior in [openshift/cincinnati][cincinnati] with a mock Prometheus should be sufficient for [update service support](#update-service-support-for-the-enhanced-schema).
+
+### Graduation Criteria
+
+This will be released directly to GA.
+
+## Implementation History
+
+Major milestones in the life cycle of a proposal should be tracked in `Implementation History`.
+
+## Drawbacks
+
+Dynamic edge status that is dependent on Prometheus queries makes [the graph-data repository][graph-data] a less authoritative view of the graph served to a given client at a given time, as discussed in [the *risks and mitigations* section](#risks-and-mitigations).
+This is mitigated by the recommendation to avoid unstable queries, so while we may not be able to reconstruct historical graphs perfectly, we will still be able to make fairly accurate guesses.
+
+## Alternatives
+
+### Additional data sources
+
+The update service could switch on data scraped from Insights tarballs or other sources instead of Prometheus.
+And we could extend `clusters` in future work to allow for that.
+With this initial enhancement, I focused on Prometheus because it already exposes an API and would thus be the easiest source to integrate.
+
+#### Client-provided identifiers
+
+As a subset of possible additional data sources, clients could provide more identifying features in [their requests][cincinnati-for-openshift-request].
+While things like `platform=azure` would be fairly straightforward, we are unlikely to foresee all of the parameters we would need to match issues we discover in the future.
+Using Prometheus queries and Telemetry increases the likelihood of being able to narrowly scope edge blocking to the set of vulnerable clusters.
+However, as discussed in [the *non-goals* section](#non-goals), scoping doesn't have to be perfect.
+And client-provided identifiers remove the need for service-side queries and caching, avoiding [some *denial-of-service concerns*](#denial-of-service-via-id-flooding).
+So, like other additional data sources, we may still add support for targeting based on client-provided identifiers in the future.
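+
+For example, a request might grow a hypothetical `platform` parameter alongside the existing `channel` and `id` parameters:
+
+```console
+$ curl -sH 'Accept: application/json' 'https://updates.example.com/api/upgrades_info/v1/graph?channel=stable-4.5&id=<cluster-uuid>&platform=azure'
+```
+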
+### Query coverage
+
+[The `promql` proposal](#enhanced-graph-data-schema-for-blocking-edges) specifies a single query that allows the update service to distinguish allowed cluster IDs (matching records with the value 1), blocked cluster IDs (matching records with the value 0), and unrecognized IDs (matching records with other values, or no matching records at all).
+This allows the update service to, if it wants, select different cache expiration times for unrecognized IDs.
+For example, the update service might say:
+
+> I haven't heard about ID 123... This is the first time a client has asked about that cluster; maybe it's new and the submitted metrics have not made it through the Telemetry pipeline. I will check again if they call back after 30 minutes, to see if it has shown up by then. Falling back to the "block" default for now.
+
+Or:
+
+> Ah, I see ID 123... is explicitly in the block list. Blocking, and no need to refresh for this cluster for the next day, because queries should not return unstable sets.
+
+This distinction is why `promql` returns both sets, instead of having it only return the blocked IDs, or only the allowed IDs.
+
+You could get a similar distinction with separate queries for allowed and blocked clusters, but you'd need to run at least the blocked query on each cache miss and refresh.
+You would also be exposed to confusion about "why is my allowed-query-matching cluster excluded?" for clusters which ended up matching both the allowed and blocked queries, because matching the blocked query would take precedence for the safety reasons discussed in [the proposal](#enhanced-graph-data-schema-for-blocking-edges).
+
+[block-edges]: https://github.com/openshift/cincinnati-graph-data/tree/29e2d0bc2bf1dbdbe07d0d7dd91ee97e11d62f28#block-edges
+[blocking-4.5.3]: https://github.com/openshift/cincinnati-graph-data/commit/8e965b65e2974d0628ea775c96694f797cd02b1e#diff-72977867226ea437c178e5a90d5d7ba8
+[cincinnati]: https://github.com/openshift/cincinnati
+[cincinnati-for-openshift-design]: https://github.com/openshift/cincinnati/blob/master/docs/design/openshift.md
+[cincinnati-for-openshift-request]: https://github.com/openshift/cincinnati/blob/master/docs/design/openshift.md#request
+[cincinnati-spec]: https://github.com/openshift/cincinnati/blob/master/docs/design/cincinnati.md
+[dedicated]: https://www.openshift.com/products/dedicated/
+[graph-data]: https://github.com/openshift/cincinnati-graph-data
+[graph-data-schema-version]: https://github.com/openshift/cincinnati-graph-data/tree/29e2d0bc2bf1dbdbe07d0d7dd91ee97e11d62f28#schema-version
+[json-object]: https://tools.ietf.org/html/rfc8259#section-4
+[json-string]: https://tools.ietf.org/html/rfc8259#section-7
+[PromQL]: https://prometheus.io/docs/prometheus/latest/querying/basics/
+[PromQL-or]: https://prometheus.io/docs/prometheus/latest/querying/operators/#logical-set-binary-operators
+[rhbz-1838007]: https://bugzilla.redhat.com/show_bug.cgi?id=1838007
+[rhbz-1858026-impact-statement-request]: https://bugzilla.redhat.com/show_bug.cgi?id=1858026#c26
+[rhbz-1858026-impact-statement]: https://bugzilla.redhat.com/show_bug.cgi?id=1858026#c28
+[support-documentation]: https://docs.openshift.com/container-platform/4.5/updating/updating-cluster-between-minor.html#upgrade-version-paths
+[uploaded-telemetry]: https://docs.openshift.com/container-platform/4.5/support/remote_health_monitoring/showing-data-collected-by-remote-health-monitoring.html#showing-data-collected-from-the-cluster_showing-data-collected-by-remote-health-monitoring
+[uploaded-telemetry-opt-out]: https://docs.openshift.com/container-platform/4.5/support/remote_health_monitoring/opting-out-of-remote-health-reporting.html