COS-1942: blocked-edges/4.12.*: Declare AWSOldBootImages #2909
Conversation
/test e2e-latest-cincinnati

Error in must-gather step:
03b0ba4 to 1f31ee0
/lgtm

Not sure whether we want to merge this while COS-1942 is still technically in progress /shrug
This one is sticky, because we don't have PromQL access to boot-image age, so we cannot automatically distinguish between "born in 4.(<=2) and still uses the old boot images" and "born in 4.(<=2), but has subsequently updated boot images". And because of a cluster-version operator bug, we don't necessarily have access to the cluster's born-in release anyway. The CVO bug fix went back via:

* 4.11.0 https://bugzilla.redhat.com/show_bug.cgi?id=2097067#c12
* 4.10.24 https://bugzilla.redhat.com/show_bug.cgi?id=2108292#c6
* 4.9.45 https://bugzilla.redhat.com/show_bug.cgi?id=2108619#c6
* 4.8.47 https://bugzilla.redhat.com/show_bug.cgi?id=2109962#c6
* 4.7.59 https://bugzilla.redhat.com/show_bug.cgi?id=2117347#c8
* 4.6.61 https://bugzilla.redhat.com/show_bug.cgi?id=2118489#c6

So it's possible for someone born in 4.2 to have spent a whole bunch of time in 4.9.z and be reporting a 4.9.0-rc.* or something as their born-in version. Work around that by declaring this risk for AWS clusters where the born-in version is 4.9 or earlier, expecting that we'll have this issue fixed soonish, so folks with old boot images will be able to update to a later 4.12, and allowing us to be overly broad/cautious with the risk matching here.

I wrote the 4.12.0-rc.0 content manually, and then copied over to the other 4.12s with:

$ curl -s 'https://api.openshift.com/api/upgrades_info/graph?channel=candidate-4.12' | jq -r '.nodes[].version' | grep '^4[.]12[.]' | grep -v '^4[.]12[.]0-rc[.]0$' | while read V; do sed "s/4[.]12[.]0-rc[.]0/${V}/g" blocked-edges/4.12.0-rc.0-AWSOldBootImage.yaml > "blocked-edges/${V}-AWSOldBootImage.yaml"; done
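The matchingRules PromQL itself isn't quoted here. As a rough sketch, a query for "AWS cluster whose born-in version is 4.9 or earlier" along the lines described above could look like the following; the exact selectors, aggregation, and version regex in the shipped blocked-edges YAML may differ, so treat this as illustrative only:

  # 1 if the born-in version is 4.9 or earlier, 0 if it is newer, and no data
  # (so the edge evaluates to Recommended=Unknown) if cluster_version{type="initial"} is broken.
  (
    group (cluster_version{type="initial", version=~"4[.][0-9][.].*"})
    or
    0 * group (cluster_version{type="initial"})
  )
  *
  # 1 if the platform is AWS, 0 otherwise, and no data if cluster_infrastructure_provider is broken.
  (
    group (cluster_infrastructure_provider{type="AWS"})
    or
    0 * group (cluster_infrastructure_provider)
  )

The "or 0 * group (...)" fallback is what distinguishes "the metric says no" (0) from "the metric is missing" (no series, so evaluation fails and the risk is reported as unknown rather than silently cleared).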
1f31ee0 to 034fa01
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, sdodson, wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
I think we should go ahead with this given our understanding of the problem. We can come back and revert if we're wrong. Just my opinion, will leave it up to OTA folks to remove the hold.
@wking: all tests passed!

Full PR test history. Your PR dashboard.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/hold cancel
[1] is still POST for 4.13, with no backport bug yet for 4.12. Generated with:

$ curl -s 'https://api.openshift.com/api/upgrades_info/graph?channel=candidate-4.12' | jq -r '.nodes[].version' | grep '^4[.]12[.]' | grep -v '^4[.]12[.]0-rc[.]0$' | while read V; do sed "s/4[.]12[.]0-rc[.]0/${V}/g" blocked-edges/4.12.0-rc.0-AWSOldBootImage.yaml > "blocked-edges/${V}-AWSOldBootImage.yaml"; done

like the original 034fa01 (blocked-edges/4.12.*: Declare AWSOldBootImages, 2022-12-14, openshift#2909).

[1]: https://issues.redhat.com/browse/OCPBUGS-4769
Like 034fa01 (blocked-edges/4.12.*: Declare AWSOldBootImages, 2022-12-14, openshift#2909), this one is sticky, because we don't have PromQL access to boot-image age, so we cannot automatically distinguish between "born in 4.1 and still uses the old boot images" and "born in 4.1, but has subsequently updated boot images". And because of a cluster-version operator bug, we don't necessarily have access to the cluster's born-in release anyway. The CVO bug fix went back via:

* 4.11.0 https://bugzilla.redhat.com/show_bug.cgi?id=2097067#c12
* 4.10.24 https://bugzilla.redhat.com/show_bug.cgi?id=2108292#c6
* 4.9.45 https://bugzilla.redhat.com/show_bug.cgi?id=2108619#c6
* 4.8.47 https://bugzilla.redhat.com/show_bug.cgi?id=2109962#c6
* 4.7.59 https://bugzilla.redhat.com/show_bug.cgi?id=2117347#c8
* 4.6.61 https://bugzilla.redhat.com/show_bug.cgi?id=2118489#c6

So it's possible for someone born in 4.1 to have spent a whole bunch of time in 4.9.z and be reporting a 4.9.0-rc.* or something as their born-in version. Work around that by declaring this risk for AWS clusters where the born-in version is 4.9 or earlier, expecting that we'll have this issue fixed soonish, so folks with old boot images will be able to update to a later 4.11, and allowing us to be overly broad/cautious with the risk matching here.

I wrote the 4.11.0 content manually, and then copied over to the other 4.11s with:

$ curl -s 'https://api.openshift.com/api/upgrades_info/graph?channel=candidate-4.11' | jq -r '.nodes[].version' | grep '^4[.]11[.]' | grep -v '^4[.]11[.]0$' | while read V; do sed "s/4[.]11[.]0/${V}/g" blocked-edges/4.11.0-AWSOldBootImageLackAfterburn.yaml > "blocked-edges/${V}-AWSOldBootImageLackAfterburn.yaml"; done
…* upgrade paths

Like 034fa01 (blocked-edges/4.12.*: Declare AWSOldBootImages, 2022-12-14, openshift#2909) and 957626a, this conditional risk is sticky, because we don't have PromQL access to boot-image age, so we cannot automatically distinguish between "born in 4.1 and still uses the old boot images" and "born in 4.1, but has subsequently updated boot images". And because of a cluster-version operator bug, we don't necessarily have access to the cluster's born-in release anyway. The CVO bug fix went back via:

* 4.11.0 https://bugzilla.redhat.com/show_bug.cgi?id=2097067#c12
* 4.10.24 https://bugzilla.redhat.com/show_bug.cgi?id=2108292#c6
* 4.9.45 https://bugzilla.redhat.com/show_bug.cgi?id=2108619#c6
* 4.8.47 https://bugzilla.redhat.com/show_bug.cgi?id=2109962#c6
* 4.7.59 https://bugzilla.redhat.com/show_bug.cgi?id=2117347#c8
* 4.6.61 https://bugzilla.redhat.com/show_bug.cgi?id=2118489#c6

So it's possible for someone born in 4.1 to have spent a whole bunch of time in 4.9.z and be reporting a 4.9.0-rc.* or something as their born-in version. Work around that by declaring this risk for AWS clusters where the born-in version is 4.9 or earlier, expecting that we'll have this issue fixed soonish, so folks with old boot images will be able to update to a later 4.11, and allowing us to be overly broad/cautious with the risk matching here.

Signed-off-by: Lalatendu Mohanty <[email protected]>
Miguel points out that the exposure set is more complicated [1] than what I'd done in 45eb9ea (blocked-edges/4.14*: Declare AzureDefaultVMType, openshift#4541). It's:

* Azure clusters born in 4.8 or earlier are exposed. Both ARO (which creates clusters with Hive?) and clusters created via openshift-installer.
* ARO clusters created in 4.13 and earlier are exposed.

Generated by updating the 4.14.1 risk by hand, and then running:

$ curl -s 'https://api.openshift.com/api/upgrades_info/graph?channel=candidate-4.14&arch=amd64' | jq -r '.nodes[] | .version' | grep '^4[.]14[.]' | grep -v '^4[.]14[.][01]$' | while read VERSION; do sed "s/4.14.1/${VERSION}/" blocked-edges/4.14.1-AzureDefaultVMType.yaml > "blocked-edges/${VERSION}-AzureDefaultVMType.yaml"; done

Breaking down the logic for my new PromQL:

a. First stanza. Using topk is likely unnecessary, but if we do happen to have multiple matches for some reason, we'll take the highest. That gives us a "we match" 1 (if any aggregated entries were 1) or a "we don't match" 0 (if they were all 0), instead of a "we're having a hard time figuring out" Recommended=Unknown.
   a. If the cluster is ARO (using cluster_operator_conditions, as in ba09198 (MCO-958: Blocking edges to 4.14.2+ and 4.13.25+, 2023-12-15, openshift#4524)), the first stanza is 1. Otherwise, 'or' falls back to...
   b. Nested block, again with the cautious topk:
      a. If there are no cluster_operator_conditions, don't return a time series. This ensures that "we didn't match a.a, but we might be ARO, and just have cluster_operator_conditions aggregation broken" gives us a Recommended=Unknown evaluation failure.
      b. Nested block, again with the cautious topk:
         a. born_by_4_9 yes case, with 4.(<=9) instead of the desired 4.(<=8) because of the "old CVO bugs make it hard to distinguish between 4.(<=9) birth-versions" issue discussed in 034fa01 (blocked-edges/4.12.*: Declare AWSOldBootImages, 2022-12-14, openshift#2909). Otherwise, 'or' falls back to...
         b. A check to ensure cluster_version{type="initial"} is working. This ensures that "we didn't match the a.b.b.a born_by_4_9 yes case, but we might be that old, and just have cluster_version aggregation broken" gives us a Recommended=Unknown evaluation failure.
b. Second stanza, again with the cautious topk:
   a. cluster_infrastructure_provider is Azure. Otherwise, 'or' falls back to...
   b. If there are no cluster_infrastructure_provider, don't return a time series. This ensures that "we didn't match b.a, but we might be Azure, and just have cluster_infrastructure_provider aggregation broken" gives us a Recommended=Unknown evaluation failure.

So walking some cases:

* Non-Azure cluster, cluster_operator_conditions, cluster_version, and cluster_infrastructure_provider all working:
  * a.a matches no series (not ARO). Fall back to...
  * a.b.a confirms cluster_operator_conditions is working.
  * a.b.b could be 1 or 0 for cluster_version.
  * b.a matches no series (not Azure).
  * b.b gives 0 (confirming cluster_infrastructure_provider is working).
  * (1 or 0) * 0 = 0, cluster does not match.
* Non-Azure cluster, cluster_version is broken:
  * a.a matches no series (not ARO). Fall back to...
  * a.b.a confirms cluster_operator_conditions is working.
  * a.b.b matches no series (cluster_version is broken).
  * b.a matches no series (not Azure).
  * b.b gives 0 (confirming cluster_infrastructure_provider is working).
  * (no-match) * 0 = no-match, evaluation fails, Recommended=Unknown. Admin gets to figure out what's broken with cluster_version and/or manually assess their exposure based on the message and linked URI.
* Non-ARO Azure cluster born in 4.9, all time-series working:
  * a.a matches no series (not ARO). Fall back to...
  * a.b.a confirms cluster_operator_conditions is working.
  * a.b.b.a matches born_by_4_9 yes.
  * b.a matches (Azure).
  * 1 * 1 = 1, cluster matches.
* ARO cluster born in 4.9, all time-series working:
  * a.a matches (ARO).
  * b.a matches (Azure).
  * 1 * 1 = 1, cluster matches.
* ARO cluster born in 4.13, all time-series working (this is the case I'm fixing with this commit):
  * a.a matches (ARO).
  * b.a matches (Azure).
  * 1 * 1 = 1, cluster matches.
* ARO cluster, cluster_operator_conditions is broken:
  * a.a matches no series (cluster_operator_conditions is broken).
  * a.b.a matches no series (cluster_operator_conditions is broken).
  * b.a matches (Azure).
  * (no-match) * 1 = no-match, evaluation fails, Recommended=Unknown.
* ARO cluster, cluster_infrastructure_provider is broken:
  * a.a matches (ARO).
  * b.a matches no series (cluster_infrastructure_provider is broken).
  * b.b matches no series (cluster_infrastructure_provider is broken).
  * 1 * (no-match) = no-match, evaluation fails, Recommended=Unknown. We could add logic like a cluster_operator_conditions{name="aro"} check to the (b) stanza if we wanted to bake in "all ARO clusters are Azure" knowledge to successfully evaluate this case. But I'd guess cluster_infrastructure_provider is working in most ARO clusters, and this PromQL is already complicated enough, so I haven't bothered with that level of tuning.
* ...lots of other combinations...

[1]: https://issues.redhat.com/browse/OCPCLOUD-2409?focusedId=23694976&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-23694976
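The query itself isn't quoted in the message. As a sketch, PromQL following the stanza structure described above might look like this; the matchers (for example cluster_operator_conditions{name="aro"} for the ARO check) and the version regex are assumptions, not the shipped file contents:

  # Stanza (a): exposed because the cluster is ARO, or because it was born in 4.9 or earlier.
  topk (1,
    # (a.a) ARO check.
    group (cluster_operator_conditions{name="aro"})
    or
    # (a.b.a) gate: only produce a value if cluster_operator_conditions is reporting at all.
    0 * group (cluster_operator_conditions)
    +
    topk (1,
      # (a.b.b.a) born_by_4_9: born-in version is 4.9 or earlier.
      group (cluster_version{type="initial", version=~"4[.][0-9][.].*"})
      or
      # (a.b.b.b) gate: only produce 0 if cluster_version{type="initial"} is reporting.
      0 * group (cluster_version{type="initial"})
    )
  )
  *
  # Stanza (b): the cluster is on Azure.
  topk (1,
    # (b.a) Azure check.
    group (cluster_infrastructure_provider{type="Azure"})
    or
    # (b.b) gate: only produce 0 if cluster_infrastructure_provider is reporting.
    0 * group (cluster_infrastructure_provider)
  )

Each group(...) collapses its selector to a single 1-valued series, so the or / + / * arithmetic lines up with the case walkthrough above: missing metrics drop out entirely (Recommended=Unknown) instead of silently evaluating to 0.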
Miguel points out that the exposure set is more complicated [1] than what I'd done in 45eb9ea (blocked-edges/4.14*: Declare AzureDefaultVMType, openshift#4541). It's:

* Azure clusters born in 4.8 or earlier are exposed. Both ARO (which creates clusters with Hive?) and clusters created via openshift-installer.
* ARO clusters created in 4.13 and earlier are exposed.

Generated by updating the 4.14.1 risk by hand, and then running:

$ curl -s 'https://api.openshift.com/api/upgrades_info/graph?channel=candidate-4.14&arch=amd64' | jq -r '.nodes[] | .version' | grep '^4[.]14[.]' | grep -v '^4[.]14[.][01]$' | while read VERSION; do sed "s/4.14.1/${VERSION}/" blocked-edges/4.14.1-AzureDefaultVMType.yaml > "blocked-edges/${VERSION}-AzureDefaultVMType.yaml"; done

Breaking down the logic for my new PromQL:

a. First stanza. Using topk is likely unnecessary, but if we do happen to have multiple matches for some reason, we'll take the highest. That gives us a "we match" 1 (if any aggregated entries were 1) or a "we don't match" 0 (if they were all 0), instead of a "we're having a hard time figuring out" Recommended=Unknown.
   a. If the cluster is ARO (using cluster_operator_conditions, as in ba09198 (MCO-958: Blocking edges to 4.14.2+ and 4.13.25+, 2023-12-15, openshift#4524)), the first stanza is 1. Otherwise, 'or' falls back to...
   b. Nested block, again with the cautious topk:
      a. If there are no cluster_operator_conditions, don't return a time series. This ensures that "we didn't match a.a, but we might be ARO, and just have cluster_operator_conditions aggregation broken" gives us a Recommended=Unknown evaluation failure.
      b. Nested block, again with the cautious topk:
         a. born_by_4_9 yes case, with 4.(<=9) instead of the desired 4.(<=8) because of the "old CVO bugs make it hard to distinguish between 4.(<=9) birth-versions" issue discussed in 034fa01 (blocked-edges/4.12.*: Declare AWSOldBootImages, 2022-12-14, openshift#2909). Otherwise, 'or' falls back to...
         b. A check to ensure cluster_version{type="initial"} is working. This ensures that "we didn't match the a.b.b.a born_by_4_9 yes case, but we might be that old, and just have cluster_version aggregation broken" gives us a Recommended=Unknown evaluation failure.
b. Second stanza, again with the cautious topk:
   a. cluster_infrastructure_provider is Azure. Otherwise, 'or' falls back to...
   b. If there are no cluster_infrastructure_provider, don't return a time series. This ensures that "we didn't match b.a, but we might be Azure, and just have cluster_infrastructure_provider aggregation broken" gives us a Recommended=Unknown evaluation failure.

All of the _id filtering is for use in hosted clusters or other PromQL stores that include multiple clusters. More background in 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="", 2023-05-09, openshift#3591).

So walking some cases:

* Non-Azure cluster, cluster_operator_conditions, cluster_version, and cluster_infrastructure_provider all working:
  * a.a matches no series (not ARO). Fall back to...
  * a.b.a confirms cluster_operator_conditions is working.
  * a.b.b could be 1 or 0 for cluster_version.
  * b.a matches no series (not Azure).
  * b.b gives 0 (confirming cluster_infrastructure_provider is working).
  * (1 or 0) * 0 = 0, cluster does not match.
* Non-Azure cluster, cluster_version is broken:
  * a.a matches no series (not ARO). Fall back to...
  * a.b.a confirms cluster_operator_conditions is working.
  * a.b.b matches no series (cluster_version is broken).
  * b.a matches no series (not Azure).
  * b.b gives 0 (confirming cluster_infrastructure_provider is working).
  * (no-match) * 0 = no-match, evaluation fails, Recommended=Unknown. Admin gets to figure out what's broken with cluster_version and/or manually assess their exposure based on the message and linked URI.
* Non-ARO Azure cluster born in 4.9, all time-series working:
  * a.a matches no series (not ARO). Fall back to...
  * a.b.a confirms cluster_operator_conditions is working.
  * a.b.b.a matches born_by_4_9 yes.
  * b.a matches (Azure).
  * 1 * 1 = 1, cluster matches.
* ARO cluster born in 4.9, all time-series working:
  * a.a matches (ARO).
  * b.a matches (Azure).
  * 1 * 1 = 1, cluster matches.
* ARO cluster born in 4.13, all time-series working (this is the case I'm fixing with this commit):
  * a.a matches (ARO).
  * b.a matches (Azure).
  * 1 * 1 = 1, cluster matches.
* ARO cluster, cluster_operator_conditions is broken:
  * a.a matches no series (cluster_operator_conditions is broken).
  * a.b.a matches no series (cluster_operator_conditions is broken).
  * b.a matches (Azure).
  * (no-match) * 1 = no-match, evaluation fails, Recommended=Unknown.
* ARO cluster, cluster_infrastructure_provider is broken:
  * a.a matches (ARO).
  * b.a matches no series (cluster_infrastructure_provider is broken).
  * b.b matches no series (cluster_infrastructure_provider is broken).
  * 1 * (no-match) = no-match, evaluation fails, Recommended=Unknown. We could add logic like a cluster_operator_conditions{name="aro"} check to the (b) stanza if we wanted to bake in "all ARO clusters are Azure" knowledge to successfully evaluate this case. But I'd guess cluster_infrastructure_provider is working in most ARO clusters, and this PromQL is already complicated enough, so I haven't bothered with that level of tuning.
* ...lots of other combinations...

[1]: https://issues.redhat.com/browse/OCPCLOUD-2409?focusedId=23694976&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-23694976
034fa01 (blocked-edges/4.12.*: Declare AWSOldBootImages, 2022-12-14, openshift#2909) explains why we need to look for 4.9-or-earlier instead of looking for the 4.8-or-earlier condition this risk is associated with. I'm also adding _id="" to the queries as a pattern to support HyperShift and other systems that could query the cluster's data out of a PromQL engine that stored data for multiple clusters. More context in 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="", 2023-05-09, openshift#3591).
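As an illustration of the _id pattern (the selector below is an example, not the exact query from the risk file), the change is an explicit empty-_id matcher on each selector:

  # Without the explicit matcher (fine when the PromQL store only holds the local cluster):
  cluster_version{type="initial", version=~"4[.][0-9][.].*"}
  # With it, so stores that aggregate many clusters (e.g. HyperShift) only match the local cluster's series:
  cluster_version{_id="", type="initial", version=~"4[.][0-9][.].*"}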
034fa01 (blocked-edges/4.12.*: Declare AWSOldBootImages, 2022-12-14, openshift#2909) explains why we need to look for 4.9-or-earlier instead of looking for the 4.8-or-earlier condition this risk is associated with. I'm also adding _id="" to the queries as a pattern to support HyperShift and other systems that could query the cluster's data out of a PromQL engine that stored data for multiple clusters. More context in 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="", 2023-05-09, openshift#3591).

Fixed in rc.4, because it has the new minor_min from f8316da (build-suggestions/4.16: Set minor_min to 4.15.17, 2024-06-06, openshift#5352):

$ oc adm release info quay.io/openshift-release-dev/ocp-release:4.16.0-rc.3-x86_64 | grep Upgrades
  Upgrades: 4.15.11, 4.15.12, 4.15.13, 4.15.14, 4.15.15, 4.15.16, 4.16.0-ec.1, 4.16.0-ec.2, 4.16.0-ec.3, 4.16.0-ec.4, 4.16.0-ec.5, 4.16.0-ec.6, 4.16.0-rc.0, 4.16.0-rc.1, 4.16.0-rc.2
$ oc adm release info quay.io/openshift-release-dev/ocp-release:4.16.0-rc.4-x86_64 | grep Upgrades
  Upgrades: 4.15.17, 4.16.0-ec.1, 4.16.0-ec.2, 4.16.0-ec.3, 4.16.0-ec.4, 4.16.0-ec.5, 4.16.0-ec.6, 4.16.0-rc.0, 4.16.0-rc.1, 4.16.0-rc.2, 4.16.0-rc.3

and the fix for [1] is in [2].

[1]: https://issues.redhat.com/browse/OCPBUGS-34492
[2]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.15.16