
Conversation

@wking wking commented Dec 15, 2022

This one is sticky, because we don't have PromQL access to boot-image age, so we cannot automatically distinguish between "born in 4.(<=2) and still uses the old boot images" and "born in 4.(<=2), but has subsequently updated boot images". And because of a cluster-version operator bug, we don't necessarily have access to the cluster's born-in release anyway. The CVO bug fix went back via:

* 4.11.0 https://bugzilla.redhat.com/show_bug.cgi?id=2097067#c12
* 4.10.24 https://bugzilla.redhat.com/show_bug.cgi?id=2108292#c6
* 4.9.45 https://bugzilla.redhat.com/show_bug.cgi?id=2108619#c6
* 4.8.47 https://bugzilla.redhat.com/show_bug.cgi?id=2109962#c6
* 4.7.59 https://bugzilla.redhat.com/show_bug.cgi?id=2117347#c8
* 4.6.61 https://bugzilla.redhat.com/show_bug.cgi?id=2118489#c6

So it's possible for someone born in 4.2 to have spent a whole bunch of time in 4.9.z and be reporting a 4.9.0-rc.* or something as their born-in version. Work around that by declaring this risk for AWS clusters where the born-in version is 4.9 or earlier, expecting that we'll have this issue fixed soonish, so folks with old boot images will be able to update to a later 4.12, and allowing us to be overly broad/cautious with the risk matching here.

I wrote the 4.12.0-rc.0 content manually, and then copied over to the other 4.12s with:

$ curl -s 'https://api.openshift.com/api/upgrades_info/graph?channel=candidate-4.12' | jq -r '.nodes[].version' | grep '^4[.]12[.]' | grep -v '^4[.]12[.]0-rc[.]0$' | while read V; do sed "s/4[.]12[.]0-rc[.]0/${V}/g" blocked-edges/4.12.0-rc.0-AWSOldBootImage.yaml > "blocked-edges/${V}-AWSOldBootImage.yaml"; done
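
For illustration, a declaration along those lines looks roughly like the sketch below. The field layout follows the usual blocked-edges convention, but the URL, the message text, and the exact PromQL are placeholders for this comment rather than the contents of the real file:

  # blocked-edges/4.12.0-rc.0-AWSOldBootImage.yaml (illustrative sketch only)
  to: 4.12.0-rc.0
  from: .*
  url: https://issues.redhat.com/browse/COS-1942
  name: AWSOldBootImage
  message: |
    AWS clusters installed as 4.2 or earlier that still run their original
    boot images may have nodes fail to join the cluster after this update.
  matchingRules:
  - type: PromQL
    promql:
      promql: |
        # AWS, and a born-in (initial) version of 4.9 or earlier; the version
        # regex is deliberately broad because of the CVO born-in bug above
        group(cluster_infrastructure_provider{type="AWS"})
        and on ()
        group(cluster_version{type="initial",version=~"4[.][0-9][.].*"})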

@openshift-ci openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Dec 15, 2022
petr-muller commented Dec 15, 2022

/test e2e-latest-cincinnati

Error in must-gather step:

INFO[2022-12-15T08:44:17Z] Running step e2e-latest-cincinnati-gather-must-gather. 
INFO[2022-12-15T08:45:27Z] Logs for container test in pod e2e-latest-cincinnati-gather-must-gather: 
INFO[2022-12-15T08:45:27Z] Running must-gather...
error: yaml: line 7: did not find expected key 

weird

@wking wking force-pushed the 4.12-vs-4.2-aws-boot-images branch 2 times, most recently from 03b0ba4 to 1f31ee0 on December 15, 2022 17:13
petr-muller commented Dec 16, 2022

/lgtm
/hold

Not sure whether we want to merge this while COS-1942 is still technically in progress /shrug

@openshift-ci openshift-ci bot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) Dec 16, 2022
@openshift-ci openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) Dec 16, 2022
@wking wking force-pushed the 4.12-vs-4.2-aws-boot-images branch from 1f31ee0 to 034fa01 on December 19, 2022 18:23
@openshift-ci openshift-ci bot removed the lgtm label (Indicates that a PR is ready to be merged.) Dec 19, 2022
wking commented Dec 19, 2022

Rebased around #2919 to pick up rc.5 with 1f31ee0 -> 034fa01.

sdodson commented Dec 19, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) Dec 19, 2022
openshift-ci bot commented Dec 19, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, sdodson, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sdodson commented Dec 19, 2022

Not sure whether we want to merge this while COS-1942 is still technically in progress /shrug

I think we should go ahead with this given our understanding of the problem. We can come back and revert if we're wrong. Just my opinion, will leave it up to OTA folks to remove the hold.

openshift-ci bot commented Dec 19, 2022

@wking: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@LalatenduMohanty

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) Dec 19, 2022
@openshift-merge-robot openshift-merge-robot merged commit 48dd256 into openshift:master Dec 19, 2022
@wking wking deleted the 4.12-vs-4.2-aws-boot-images branch December 19, 2022 19:08
wking added a commit to wking/cincinnati-graph-data that referenced this pull request Jan 4, 2023
[1] is still POST for 4.13, with no backport bug yet for 4.12.
Generated with:

  $ curl -s 'https://api.openshift.com/api/upgrades_info/graph?channel=candidate-4.12' | jq -r '.nodes[].version' | grep '^4[.]12[.]' | grep -v '^4[.]12[.]0-rc[.]0$' | while read V; do sed "s/4[.]12[.]0-rc[.]0/${V}/g" blocked-edges/4.12.0-rc.0-AWSOldBootImage.yaml > "blocked-edges/${V}-AWSOldBootImage.yaml"; done

like the original 034fa01 (blocked-edges/4.12.*: Declare
AWSOldBootImages, 2022-12-14, openshift#2909).

[1]: https://issues.redhat.com/browse/OCPBUGS-4769
wking added a commit to wking/cincinnati-graph-data that referenced this pull request Mar 2, 2023
Like 034fa01 (blocked-edges/4.12.*: Declare AWSOldBootImages,
2022-12-14, openshift#2909), this one is sticky, because we don't have PromQL
access to boot-image age, so we cannot automatically distinguish
between "born in 4.1 and still uses the old boot images" and "born
in 4.1, but has subsequently updated boot images".  And because of
a cluster-version operator bug, we don't necessarily have access to
the cluster's born-in release anyway.  The CVO bug fix went back via:

* 4.11.0 https://bugzilla.redhat.com/show_bug.cgi?id=2097067#c12
* 4.10.24 https://bugzilla.redhat.com/show_bug.cgi?id=2108292#c6
* 4.9.45 https://bugzilla.redhat.com/show_bug.cgi?id=2108619#c6
* 4.8.47 https://bugzilla.redhat.com/show_bug.cgi?id=2109962#c6
* 4.7.59 https://bugzilla.redhat.com/show_bug.cgi?id=2117347#c8
* 4.6.61 https://bugzilla.redhat.com/show_bug.cgi?id=2118489#c6

So it's possible for someone born in 4.1 to have spent a whole bunch
of time in 4.9.z and be reporting a 4.9.0-rc.* or something as their
born-in version.  Work around that by declaring this risk for AWS
clusters where the born-in version is 4.9 or earlier, expecting that
we'll have this issue fixed soonish, so folks with old boot images
will be able to update to a later 4.11, and allowing us to be overly
broad/cautious with the risk matching here.

I wrote the 4.11.0 content manually, and then copied over to the other
4.11s with:

  $ curl -s 'https://api.openshift.com/api/upgrades_info/graph?channel=candidate-4.11' | jq -r '.nodes[].version' | grep '^4[.]11[.]' | grep -v '^4[.]11[.]0$' | while read V; do sed "s/4[.]11[.]0/${V}/g" blocked-edges/4.11.0-AWSOldBootImageLackAfterburn.yaml > "blocked-edges/${V}-AWSOldBootImageLackAfterburn.yaml"; done
LalatenduMohanty added a commit to LalatenduMohanty/cincinnati-graph-data that referenced this pull request Mar 14, 2023
…* upgrade paths

Like 034fa01 (blocked-edges/4.12.*: Declare AWSOldBootImages,
    2022-12-14, openshift#2909) and 957626a, this conditional risk is sticky,
    because we don't have PromQL access to boot-image age, so we cannot automatically distinguish
    between "born in 4.1 and still uses the old boot images" and "born
    in 4.1, but has subsequently updated boot images".  And because of
    a cluster-version operator bug, we don't necessarily have access to
    the cluster's born-in release anyway.

    The CVO bug fix went back via:

    * 4.11.0 https://bugzilla.redhat.com/show_bug.cgi?id=2097067#c12
    * 4.10.24 https://bugzilla.redhat.com/show_bug.cgi?id=2108292#c6
    * 4.9.45 https://bugzilla.redhat.com/show_bug.cgi?id=2108619#c6
    * 4.8.47 https://bugzilla.redhat.com/show_bug.cgi?id=2109962#c6
    * 4.7.59 https://bugzilla.redhat.com/show_bug.cgi?id=2117347#c8
    * 4.6.61 https://bugzilla.redhat.com/show_bug.cgi?id=2118489#c6

    So it's possible for someone born in 4.1 to have spent a whole bunch
    of time in 4.9.z and be reporting a 4.9.0-rc.* or something as their
    born-in version.  Work around that by declaring this risk for AWS
    clusters where the born-in version is 4.9 or earlier, expecting that
    we'll have this issue fixed soonish, so folks with old boot images
    will be able to update to a later 4.11, and allowing us to be overly
    broad/cautious with the risk matching here.

Signed-off-by: Lalatendu Mohanty <[email protected]>
wking added a commit to wking/cincinnati-graph-data that referenced this pull request Dec 21, 2023
Miguel points out that the exposure set is more complicated [1] than
what I'd done in 45eb9ea (blocked-edges/4.14*: Declare
AzureDefaultVMType, openshift#4541).  It's:

* Azure clusters born in 4.8 or earlier are exposed.  Both ARO (which creates
  clusters with Hive?) and clusters created via openshift-installer.
* ARO clusters created in 4.13 and earlier are exposed.

Generated by updating the 4.14.1 risk by hand, and then running:

  $ curl -s 'https://api.openshift.com/api/upgrades_info/graph?channel=candidate-4.14&arch=amd64' | jq -r '.nodes[] | .version' | grep '^4[.]14[.]' | grep -v '^4[.]14[.][01]$' | while read VERSION; do sed "s/4.14.1/${VERSION}/" blocked-edges/4.14.1-AzureDefaultVMType.yaml > "blocked-edges/${VERSION}-AzureDefaultVMType.yaml"; done

Breaking down the logic for my new PromQL:

a. First stanza, using topk is likely unnecessary, but if we do happen
   to have multiple matches for some reason, we'll take the highest.
   That gives us a "we match" 1 (if any aggregated entries were 1) or
   a "we don't match" (if they were all 0), instead of "we're having a
   hard time figuring out" Recommended=Unknown.

   a. If the cluster is ARO (using cluster_operator_conditions, as in
      ba09198 (MCO-958: Blocking edges to 4.14.2+ and 4.13.25+, 2023-12-15,
      openshift#4524), first stanza is 1.  Otherwise, 'or' falls back to...

   b. Nested block, again with the cautious topk:

      a. If there are no cluster_operator_conditions, don't return a
         time series.  This ensures that "we didn't match a.a, but we
         might be ARO, and just have cluster_operator_conditions
         aggregation broken" gives us a Recommended=Unknown evaluation
         failure.

      b. Nested block, again with the cautious topk:

         a. born_by_4_9 yes case, with 4.(<=9) instead of the desired
            4.(<=8) because of the "old CVO bugs make it hard to
            distinguish between 4.(<=9) birth-versions" issue
            discussed in 034fa01 (blocked-edges/4.12.*: Declare
            AWSOldBootImages, 2022-12-14, openshift#2909).  Otherwise, 'or'
            falls back to...

         b. A check to ensure cluster_version{type="initial"} is
            working.  This ensures that "we didn't match the a.b.b.a
             born_by_4_9 yes case, but we may be that old, and just have
            cluster_version aggregation broken" gives us a
            Recommended=Unknown evaluation failure.

b. Second stanza, again with the cautious topk:

   a. cluster_infrastructure_provider is Azure.  Otherwise, 'or' falls
      back to...

   b. If there are no cluster_infrastructure_provider, don't return a
      time series.  This ensures that "we didn't match b.a, but we
      might be Azure, and just have cluster_infrastructure_provider
      aggregation broken" gives us a Recommended=Unknown evaluation
      failure.

All of the _id filtering is for use in hosted clusters or other PromQL
stores that include multiple clusters.  More background in 5cb2e93
(blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="",
2023-05-09, openshift#3591).
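
Putting the two stanzas together, a simplified reconstruction of the
rule looks roughly like the sketch below.  This is hypothetical: the
selectors, the name="aro" label value, and the guard expressions are
inferred from the breakdown above rather than copied from the shipped
query.

  matchingRules:
  - type: PromQL
    promql:
      promql: |
        (
          # stanza a: ARO, or (conditions metric present and born in 4.(<=9))
          topk(1,
            group(cluster_operator_conditions{_id="",name="aro"})
            or
            (
              topk(1,
                group(cluster_version{_id="",type="initial",version=~"4[.][0-9][.].*"})
                or
                0 * group(cluster_version{_id="",type="initial"})
              )
              and on ()
              group(cluster_operator_conditions{_id=""})
            )
          )
        )
        * on ()
        (
          # stanza b: Azure, guarded so a missing metric fails the evaluation
          topk(1,
            cluster_infrastructure_provider{_id="",type="Azure"}
            or
            0 * group(cluster_infrastructure_provider{_id=""})
          )
        )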

So walking some cases:

* Non-Azure cluster, cluster_operator_conditions, cluster_version, and
  cluster_infrastructure_provider all working:
  * a.a matches no series (not ARO).  Fall back to...
  * a.b.a confirms cluster_operator_conditions is working.
  * a.b.b could be 1 or 0 for cluster_version.
  * b.a matches no series (not Azure).
  * b.b gives 0 (confirming cluster_infrastructure_provider is working).
  * (1 or 0) * 0 = 0, cluster does not match.
* Non-Azure cluster, cluster_version is broken:
  * a.a matches no series (not ARO).  Fall back to...
  * a.b.a confirms cluster_operator_conditions is working.
  * a.b.b matches no series (cluster_version is broken).
  * b.a matches no series (not Azure).
  * b.b gives 0 (confirming cluster_infrastructure_provider is working).
  * (no-match) * 0 = no-match, evaluation fails, Recommended=Unknown.
    Admin gets to figure out what's broken with cluster_version and/or
    manually assess their exposure based on the message and linked
    URI.
* Non-ARO Azure cluster born in 4.9, all time-series working:
  * a.a matches no series (not ARO).  Fall back to...
  * a.b.a confirms cluster_operator_conditions is working.
  * a.b.b.a matches born_by_4_9 yes.
  * b.a matches (Azure).
  * 1 * 1 = 1, cluster matches.
* ARO cluster born in 4.9, all time-series working:
  * a.a matches (ARO).
  * b.a matches (Azure).
  * 1 * 1 = 1, cluster matches.
* ARO cluster born in 4.13, all time-series working (this is the case
  I'm fixing with this commit):
  * a.a matches (ARO).
  * b.a matches (Azure).
  * 1 * 1 = 1, cluster matches.
* ARO cluster, cluster_operator_conditions is broken.
  * a.a matches no series (cluster_operator_conditions is broken).
  * a.b.a matches no series (cluster_operator_conditions is broken).
  * b.a matches (Azure).
  * (no-match) * 1 = no-match, evaluation fails, Recommended=Unknown.
* ARO cluster, cluster_infrastructure_provider is broken.
  * a.a matches (ARO).
  * b.a matches no series (cluster_infrastructure_provider is broken).
  * b.b matches no series (cluster_infrastructure_provider is broken).
  * 1 * (no-match) = no-match, evaluation fails, Recommended=Unknown.
    We could add logic like a cluster_operator_conditions{name="aro"}
    check to the (b) stanza if we wanted to bake in "all ARO clusters
    are Azure" knowledge to successfully evaluate this case.  But I'd
    guess cluster_infrastructure_provider is working in most ARO
    clusters, and this PromQL is already complicated enough, so I
    haven't bothered with that level of tuning.
* ...lots of other combinations...

[1]: https://issues.redhat.com/browse/OCPCLOUD-2409?focusedId=23694976&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-23694976
wking added a commit to wking/cincinnati-graph-data that referenced this pull request Jun 11, 2024
034fa01 (blocked-edges/4.12.*: Declare AWSOldBootImages,
2022-12-14, openshift#2909) explains why we need to look for 4.9-or-earlier
instead of looking for the 4.8-or-earlier condition this risk is
associated with.

I'm also adding _id="" to the queries as a pattern to support
HyperShift and other systems that could query the cluster's data out
of a PromQL engine that stored data for multiple clusters.  More
context in 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew:
Explicit _id="", 2023-05-09, openshift#3591).
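
A hypothetical fragment showing that pattern (the selector shape is
assumed from the discussion above, not copied from the actual rule):

  promql:
    promql: |
      # pinning _id to the empty string means a shared, multi-cluster
      # PromQL store only evaluates the local cluster's own series
      group(cluster_version{_id="",type="initial",version=~"4[.][0-9][.].*"})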

Fixed in rc.4, because it has the new minor_min from f8316da
(build-suggestions/4.16: Set minor_min to 4.15.17, 2024-06-06, openshift#5352):

  $ oc adm release info quay.io/openshift-release-dev/ocp-release:4.16.0-rc.3-x86_64 | grep Upgrades
    Upgrades: 4.15.11, 4.15.12, 4.15.13, 4.15.14, 4.15.15, 4.15.16, 4.16.0-ec.1, 4.16.0-ec.2, 4.16.0-ec.3, 4.16.0-ec.4, 4.16.0-ec.5, 4.16.0-ec.6, 4.16.0-rc.0, 4.16.0-rc.1, 4.16.0-rc.2
  $ oc adm release info quay.io/openshift-release-dev/ocp-release:4.16.0-rc.4-x86_64 | grep Upgrades
    Upgrades: 4.15.17, 4.16.0-ec.1, 4.16.0-ec.2, 4.16.0-ec.3, 4.16.0-ec.4, 4.16.0-ec.5, 4.16.0-ec.6, 4.16.0-rc.0, 4.16.0-rc.1, 4.16.0-rc.2, 4.16.0-rc.3

and the fix for [1] is in [2].

[1]: https://issues.redhat.com/browse/OCPBUGS-34492
[2]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.15.16