Conversation

@wking wking commented May 9, 2023

This prepares for HyperShift, where we expect the cluster-version operator to resolve update risks via a PromQL engine that stores data from multiple clusters. Metrics in that data store use an _id label to show which cluster a time series belongs to (enhancement, implementation here and here). This label does not exist in the thanos-querier.openshift-monitoring.svc.cluster.local store for stand-alone, self-hosted clusters, so limiting to _id="" is a no-op there. Future work for HyperShift-hosted cluster-version operators can teach those operators to substitute this placeholder with _id="theActualClusterID", so they match time series from their own cluster and are not distracted by time series from other, sibling hosted clusters.

Generated with:

$ sed -i 's/cluster_infrastructure_provider{/cluster_infrastructure_provider{_id="",/;s/cluster_infrastructure_provider$/cluster_infrastructure_provider{_id=""}/;s/mcd_update_state{/mcd_update_state{_id="",/;s/mcd_update_state)/mcd_update_state{_id=""})/' blocked-edges/4.11.*-KeepalivedMulticastSkew.yaml
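
For illustration, the rewrite amounts to adding an explicit _id="" matcher to each selector; a before/after sketch for one of the touched metrics (not the full risk expression):

  # before
  cluster_infrastructure_provider{type=~"OpenStack|oVirt|VSphere"}

  # after: explicit _id="" placeholder, a no-op against the in-cluster thanos-querier store
  cluster_infrastructure_provider{_id="",type=~"OpenStack|oVirt|VSphere"}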

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 9, 2023

openshift-ci-robot commented May 9, 2023

@wking: This pull request references OTA-924 which is a valid jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 9, 2023

wking commented May 9, 2023

Reprising this testing for the various PromQL forms in a launch 4.10.58 ovirt cluster-bot cluster:

Impacted infrastructure platform, with at least one machine-config-controlled compute node

Real query

max(
  cluster_infrastructure_provider{_id="",type=~"OpenStack|oVirt|VSphere"}
  or
  0 * cluster_infrastructure_provider{_id=""}
)
*
(
  group(topk by (node) (1, mcd_update_state{_id="",config!~"|rendered-master-.*"}))
  or
  0 * group(mcd_update_state{_id=""})
)

shows the risk matching with a 1:

[screenshot]

Other platforms

With oVirt changed to oVirtNot in the type regex, simulating an AWS or other non-impacted cluster, the query drops to zero as a non-match:

[screenshot]

No MachineConfig compute

And with the config exclusion set to .*, so every MachineConfig is filtered out to simulate a cluster with no MachineConfig-managed compute, the query drops to zero as a non-match:

[screenshot]
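
Those two simulations presumably amount to small selector edits against the real query above; a sketch of the assumed variants:

  # non-impacted platform: oVirt no longer matches the type regex
  cluster_infrastructure_provider{_id="",type=~"OpenStack|oVirtNot|VSphere"}

  # no MachineConfig-managed compute: the negative config filter now excludes everything
  mcd_update_state{_id="",config!~".*"}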

@wking wking changed the title OTA-924: blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="" OTA-964: blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="" May 9, 2023

openshift-ci-robot commented May 9, 2023

@wking: This pull request references OTA-964 which is a valid jira issue.


In preparation for HyperShift, where we expect to have the
cluster-version operator resolving update risks via a PromQL engine
that stores data from multiple clusters.  Metrics in that data store
use an _id label to show which cluster the time-series belongs to
[2,3,4].  This label does not exist in the
thanos-querier.openshift-monitoring.svc.cluster.local store for
stand-alone, self-hosted clusters, so limiting to _id="" is a no-op
there.  And future work for HyperShift-hosted cluster-version
operators can teach those operators to substitute this placeholder
with _id="theActualClusterID", so they match time-series from their
cluster, and do not get distracted by time-series from other, sibling
hosted clusters.

Generated with:

  $ sed -i 's/cluster_infrastructure_provider{/cluster_infrastructure_provider{_id="",/;s/cluster_infrastructure_provider$/cluster_infrastructure_provider{_id=""}/;s/mcd_update_state{/mcd_update_state{_id="",/;s/mcd_update_state)/mcd_update_state{_id=""})/' blocked-edges/4.11.*-KeepalivedMulticastSkew.yaml

[1]: https://issues.redhat.com/browse/OTA-924
[2]: https://github.com/openshift/enhancements/blob/ba306451b99da182d26bd0631f3f1e448930316c/enhancements/monitoring/hypershift-monitoring.md#common-labels
[3]: openshift/hypershift@a0be78d
[4]: openshift/hypershift@44247e9

openshift-ci bot commented May 11, 2023

@wking: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-latest-cincinnati
Commit: 5cb2e93
Required: false
Rerun command: /test e2e-latest-cincinnati

Full PR test history. Your PR dashboard.


@LalatenduMohanty
Member

Should we wait for conditional update risks between 4.12 and 4.13?


@LalatenduMohanty LalatenduMohanty left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 17, 2023

openshift-ci bot commented May 17, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LalatenduMohanty, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [LalatenduMohanty,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit e6fc02c into openshift:master May 17, 2023
@wking wking deleted the explicit-_id branch May 17, 2023 17:42
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Jul 12, 2023
This commit will introduce new flags and logic to the CVO regarding its
PromQL target for risk evaluation of conditional updates with the CVO
being run in a hosted control plane in mind.

For the CVO to successfully access a service that provides metrics
(in the case of the CVO in a standalone OpenShift cluster it's the
thanos-querier service), it needs three things. It needs the service's
address, a CA bundle to verify the certificate provided by the service
to allow secure communication using TLS between the actors [1], and the
authorization credentials of the CVO. Currently, the CVO hardcodes the
address, the path to the CA bundle, and the path to the credentials file.

This is not ideal, as CVO is starting to be used in other repositories
such as HyperShift [2].

A path to the service CA bundle file is added to allow explicitly
setting the CA bundle file of the given query service.

Currently, the CVO is using a kube-api server to resolve the IP address
of a specific service in the cluster [4]. New flags were added to
specify the service name and its namespace as the PromQL target may
reside in different namespaces or have a different name. The port of the
service can also be different as in the case of the managed HyperShift
monitoring. This commit will make sure the port number is not hard-coded
but is rather dynamically found by the port's name.

In HyperShift multiple kube-api servers are being used. A management
cluster creates a hosted control plane where a hosted CVO resides. This
hosted CVO is then given a kubeconfig file for a cluster where the
hosted CVO reconciles manifests. However, because the CVO cannot resolve
internal DNS names and, in this case, is given a kubeconfig for a remote
cluster, the CVO is not capable of communicating with internal services on
the management cluster. Thus a new flag is added to specify the path to
the management cluster kubeconfig file. In case the management cluster
kubeconfig is not specified, the CVO defaults to using the
`serviceaccount` directory as is the behavior in the standalone OCP
cluster.

A flag to specify the path to the credentials file was not added because
the CVO hardcodes the same path to the credentials file as is the
path to the credentials file used in other service accounts in
Kubernetes [3].

    [1] https://docs.openshift.com/container-platform/4.12/security/certificate_types_descriptions/service-ca-certificates.html
    [2] https://github.com/openshift/hypershift
    [3] https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#serviceaccount-admission-controller
    [4] openshift#920

This commit also adds a new flag to specify whether the CVO resides in
a hosted control plane. In this case, the CVO will inject its cluster
ID into PromQL queries to differentiate between multiple time series
belonging to different hosted clusters [5].

[5] openshift/cincinnati-graph-data#3591
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 7, 2023
This commit will introduce new flags and logic to the CVO regarding its
PromQL target for risk evaluation of conditional updates with the CVO
being run in a hosted control plane in mind.

For the CVO to successfully access a service that provides metrics
(in the case of the CVO in a standalone OpenShift cluster it's the
thanos-querier service), it needs three things. It needs the service's
address, a CA bundle to verify the certificate provided by the service
to allow secure communication using TLS between the actors [1], and
the authorization credentials of the CVO. Currently, the CVO hardcodes
the address, the path to the CA bundle, and the path to the credentials
file.

This is not ideal, as CVO is starting to be used in other repositories
such as HyperShift [2].

A path to the service CA bundle file is added to allow explicitly
setting the CA bundle file of the given query service.

Currently, the CVO is using a kube-api server to resolve the IP address
of a specific service in the cluster [3]. Add new flags that allow
configuring the CVO to resolve the IP address via DNS when DNS is
available to the CVO. This is the case for hosted CVOs in HyperShift.
The alternative in HyperShift would be to use the management cluster
kube-api server and give the hosted CVOs additional permissions.

Add flags to specify the PromQL target URL. This URL contains the used
scheme, port, and the server's name used for TLS configuration. When
DNS is enabled, the URL is also used to query the Prometheus server.
When DNS is disabled, the server can be specified via the Kubernetes
service in the cluster that exposes the server.

A flag to specify the path to the credentials file was added for more
customizability. This flag also enables the CVO to not set the token
when it's not needed. A CVO can communicate with a Prometheus server
over HTTP. When the token is not needed, it would be undesirable for
the CVO to send its token over HTTP without reason.

This commit also adds a new flag to specify whether the CVO resides in
a hosted control plane. In this case, the CVO will inject its cluster
ID into PromQL queries to differentiate between multiple time series
belonging to different hosted clusters [4].

	[1] https://docs.openshift.com/container-platform/4.12/security/certificate_types_descriptions/service-ca-certificates.html
	[2] https://github.com/openshift/hypershift
	[3] openshift#920
	[4] openshift/cincinnati-graph-data#3591
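
Concretely, the injection described in that last paragraph replaces the _id="" placeholder from this PR with the hosted cluster's own ID; a sketch using the placeholder value from the PR description:

  # as shipped in graph-data (a no-op against a single-cluster store)
  cluster_infrastructure_provider{_id=""}

  # as rewritten by a hosted CVO before evaluation
  cluster_infrastructure_provider{_id="theActualClusterID"}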
wking added a commit to wking/cincinnati-graph-data that referenced this pull request Oct 6, 2023
…and 4.15.0-ec.0

The risk is for updating out of the impacted releases to hypothetical
future releases with the user change, e.g. from 4.14.0-rc.3 to 4.14.0.
But we're warning on * -> 4.14.0-rc.3 (and similar) to help folks
dodge the sticky updates entirely, because 4.14.0-rc.2 -> 4.14.0 are
not expected to ever have running-as-root webhook listeners.

Pattern-matching:

* apiserver_storage_objects from b573872
 (blocked-edges/4.13.*-PerformanceProfilesCPUQuota: Declare new risk,
 2023-06-27, openshift#3786),

* egressips.k8s.ovn.org from 894fab7
  (blocked-edges/4.11.9-ovn-namespace: 4.11.9 does not fix the
  regression, 2022-10-12, openshift#2628), and

* _id from 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew:
  Explicit _id="", 2023-05-09, openshift#3591).

Using cluster_installer with the hypershift invoker is new for this
commit, and in this case I'm using it to declare HyperShift clusters
not exposed to the risk.

Generated by manually writing the rc.3 risk, and then copying it around with:

  $ for VERSION in 4.14.0-rc.4 4.15.0-ec.0; do sed "s/4.14.0-rc.3/${VERSION}/" blocked-edges/4.14.0-rc.3-OVNWebhookUserConflict.yaml > "blocked-edges/${VERSION}-OVNWebhookUserConflict.yaml"; done
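
A sketch of the new cluster_installer clause described above; the exact invoker matcher and wiring are assumptions, not the shipped expression:

  # HyperShift-installed clusters report the hypershift invoker and are declared not exposed
  0 * group(cluster_installer{_id="",invoker="hypershift"})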
wking added a commit to wking/cincinnati-graph-data that referenced this pull request Dec 21, 2023
Miguel points out that the exposure set is more complicated [1] than
what I'd done in 45eb9ea (blocked-edges/4.14*: Declare
AzureDefaultVMType, openshift#4541).  It's:

* Azure clusters born in 4.8 or earlier are exposed.  Both ARO (which creates
  clusters with Hive?) and clusters created via openshift-installer.
* ARO clusters created in 4.13 and earlier are exposed.

Generated by updating the 4.14.1 risk by hand, and then running:

  $ curl -s 'https://api.openshift.com/api/upgrades_info/graph?channel=candidate-4.14&arch=amd64' | jq -r '.nodes[] | .version' | grep '^4[.]14[.]' | grep -v '^4[.]14[.][01]$' | while read VERSION; do sed "s/4.14.1/${VERSION}/" blocked-edges/4.14.1-AzureDefaultVMType.yaml > "blocked-edges/${VERSION}-AzureDefaultVMType.yaml"; done

Breaking down the logic for my new PromQL:

a. First stanza, using topk is likely unnecessary, but if we do happen
   to have multiple matches for some reason, we'll take the highest.
   That gives us a "we match" 1 (if any aggregated entries were 1) or
   a "we don't match" (if they were all 0), instead of "we're having a
   hard time figuring out" Recommended=Unknown.

   a. If the cluster is ARO (using cluster_operator_conditions, as in
      ba09198 (MCO-958: Blocking edges to 4.14.2+ and 4.13.25+, 2023-12-15,
      openshift#4524)), the first stanza is 1.  Otherwise, 'or' falls back to...

   b. Nested block, again with the cautious topk:

      a. If there are no cluster_operator_conditions, don't return a
         time series.  This ensures that "we didn't match a.a, but we
         might be ARO, and just have cluster_operator_conditions
         aggregation broken" gives us a Recommended=Unknown evaluation
         failure.

      b. Nested block, again with the cautious topk:

         a. born_by_4_9 yes case, with 4.(<=9) instead of the desired
            4.(<=8) because of the "old CVO bugs make it hard to
            distinguish between 4.(<=9) birth-versions" issue
            discussed in 034fa01 (blocked-edges/4.12.*: Declare
            AWSOldBootImages, 2022-12-14, openshift#2909).  Otherwise, 'or'
            falls back to...

         b. A check to ensure cluster_version{type="initial"} is
            working.  This ensures that "we didn't match the a.b.b.a
            born_by_4_9 yes case, but we might be that old, and just have
            cluster_version aggregation broken" gives us a
            Recommended=Unknown evaluation failure.

b. Second stanza, again with the cautious topk:

   a. cluster_infrastructure_provider is Azure.  Otherwise, 'or' falls
      back to...

   b. If there are no cluster_infrastructure_provider, don't return a
      time series.  This ensures that "we didn't match b.a, but we
      might be Azure, and just have cluster_infrastructure_provider
      aggregation broken" gives us a Recommended=Unknown evaluation
      failure.

All of the _id filtering is for use in hosted clusters or other PromQL
stores that include multiple clusters.  More background in 5cb2e93
(blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="",
2023-05-09, openshift#3591).

So walking some cases:

* Non-Azure cluster, cluster_operator_conditions, cluster_version, and
  cluster_infrastructure_provider all working:
  * a.a matches no series (not ARO).  Fall back to...
  * a.b.a confirms cluster_operator_conditions is working.
  * a.b.b could be 1 or 0 for cluster_version.
  * b.a matches no series (not Azure).
  * b.b gives 0 (confirming cluster_infrastructure_provider is working).
  * (1 or 0) * 0 = 0, cluster does not match.
* Non-Azure cluster, cluster_version is broken:
  * a.a matches no series (not ARO).  Fall back to...
  * a.b.a confirms cluster_operator_conditions is working.
  * a.b.b matches no series (cluster_version is broken).
  * b.a matches no series (not Azure).
  * b.b gives 0 (confirming cluster_infrastructure_provider is working).
  * (no-match) * 0 = no-match, evaluation fails, Recommended=Unknown.
    Admin gets to figure out what's broken with cluster_version and/or
    manually assess their exposure based on the message and linked
    URI.
* Non-ARO Azure cluster born in 4.9, all time-series working:
  * a.a matches no series (not ARO).  Fall back to...
  * a.b.a confirms cluster_operator_conditions is working.
  * a.b.b.a matches born_by_4_9 yes.
  * b.a matches (Azure).
  * 1 * 1 = 1, cluster matches.
* ARO cluster born in 4.9, all time-series working:
  * a.a matches (ARO).
  * b.a matches (Azure).
  * 1 * 1 = 1, cluster matches.
* ARO cluster born in 4.13, all time-series working (this is the case
  I'm fixing with this commit):
  * a.a matches (ARO).
  * b.a matches (Azure).
  * 1 * 1 = 1, cluster matches.
* ARO cluster, cluster_operator_conditions is broken.
  * a.a matches no series (cluster_operator_conditions is broken).
  * a.b.a matches no series (cluster_operator_conditions is broken).
  * b.a matches (Azure)
  * (no-match) * 1 = no-match, evaluation fails, Recommended=Unknown.
* ARO cluster, cluster_infrastructure_provider is broken.
  * a.a matches (ARO).
  * b.a matches no series (cluster_infrastructure_provider is broken).
  * b.b matches no series (cluster_infrastructure_provider is broken).
  * 1 * (no-match) = no-match, evaluation fails, Recommended=Unknown.
    We could add logic like a cluster_operator_conditions{name="aro"}
    check to the (b) stanza if we wanted to bake in "all ARO clusters
    are Azure" knowledge to successfully evaluate this case.  But I'd
    guess cluster_infrastructure_provider is working in most ARO
    clusters, and this PromQL is already complicated enough, so I
    haven't bothered with that level of tuning.
* ...lots of other combinations...

[1]: https://issues.redhat.com/browse/OCPCLOUD-2409?focusedId=23694976&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-23694976
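
To make the lettered walk-through above easier to follow, here is a structural PromQL sketch of that logic. It is not the shipped expression: the born-by-4.9 selector and the exact guard wiring are assumptions; only the metric names and the a/b lettering come from the commit message.

  topk(1,
    # a.a: ARO clusters match outright
    group(cluster_operator_conditions{_id="",name="aro"})
    or
    (
      topk(1,
        # a.b.b.a: born by 4.9 (assumed selector)
        group(cluster_version{_id="",type="initial",version=~"4[.][0-9][.].*"})
        or
        # a.b.b.b: cluster_version is reporting, so give a firm 0
        0 * group(cluster_version{_id="",type="initial"})
      )
      and
      # a.b.a: guard -- return nothing if cluster_operator_conditions is broken
      group(cluster_operator_conditions{_id=""})
    )
  )
  *
  topk(1,
    # b.a: the cluster is Azure
    group(cluster_infrastructure_provider{_id="",type="Azure"})
    or
    # b.b: cluster_infrastructure_provider is reporting, so give a firm 0
    0 * group(cluster_infrastructure_provider{_id=""})
  )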
wking added a commit to wking/cincinnati-graph-data that referenced this pull request Feb 9, 2024
…registry use

Some clusters do not exercise the internal image registry, and we do
not want to bother those admins about this risk.  Look at
imageregistry_request_duration_seconds_count (which only records
registry activity like BlobStore.Create) as a sign that the registry
is actively used to handle images.  Compare with
imageregistry_http_request_duration_seconds_count, which the registry
serves even when it is not actively being used to manage images, as a
sign that registry metrics are working, to distinguish between "we
know the registry is not being used" and "registry metrics collection
is broken, and we're blind to registry use".

In 4.14, the registry became an optional component [1], but
it is not optional in 4.13.  And it's 4.13 clusters evaluating this
PromQL because of the 'from: 4[.]13[.].*'.  So we don't have to worry
about "what if imageregistry_http_request_duration_seconds_count is
not available?" cases outside of the "registry metrics collection is
broken" case.

The registry does not persist these *_count values across container
restarts, so there is a window of exposure like:

1. Registry activity increments
   imageregistry_request_duration_seconds_count counters.
2. PromQL correctly identifies "this cluster is using its registry".
3. Registry container restarts, zeroing its *_count buckets.
4. If _all_ registry containers restart before there is new registry
   activity, the imageregistry_request_duration_seconds_count signal
   may go away.
5. New registry activity repopulates
   imageregistry_request_duration_seconds_count, return to step 1.

The chances that all registry pods restart at the same time seem low,
outside of:

* Updates or other registry-hosting-node reboots, where clusters that
  actively use the in-cluster registry are likely to have pods
  consuming images from that registry need re-pulling, which would
  drive fresh registry traffic and repopulate
  imageregistry_request_duration_seconds_count.
* Registry reconfiguration, which would not drive fresh registry
  traffic, but which seems like unlikely admin activity in a cluster
  where the internal-registry is not being used.

I'm also using the max_over_time with a 1 hour lookback to try to
smear over small durations of "waiting for new registry activity to
repopulate imageregistry_request_duration_seconds_count".

ARO is comfortable waiting for the bugfix to ship, and does not want
the complicated "registry in use?" detection, so they have an 'or'
entry in that stanza to ensure it will always match on ARO clusters.

And now that the PromQL is getting pretty complicated, I'm also
adjusting label management so it's easier to see from the result which
clauses are matching for a particular cluster, in case we need to
debug why we're matching when we don't expect to, or not matching when
we do expect to.

I'm also adding _id="" placeholders for HyperShift, see 5cb2e93
(blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="",
2023-05-09, openshift#3591) for more on that.

Generated by editing 4.14.1's by hand and copying out to the others with:

  $ ls blocked-edges/*AzureRegistryImagePreservation.yaml | grep -v '/4[.]14[.]1-A' | while read X; do TO="$(grep '^to: ' "${X}")"; sed "s/^to: .*/${TO}/" blocked-edges/4.14.1-AzureRegistryImagePreservation.yaml > "${X}"; done

[1]: https://docs.openshift.com/container-platform/4.14/release_notes/ocp-4-14-release-notes.html#ocp-4-14-optional-capabilities
wking added a commit to wking/cincinnati-graph-data that referenced this pull request Apr 16, 2024
The h suffix in "19h" stands for "hexadecimal", so 19h is decimal 25.
The first group-by matches CPUs exposed to the firmware bug with a
value of one.  The second group-by shows that the node_cpu_info metric
is working on at least some nodes.

I'm preserving vendor and family in the groupings, because it's nice
to be able to say "we aren't matching this risk for your cluster,
because we see..." and then point out an Intel CPU, or an AMD CPU with
a different family, or some other example of why we don't think the
risk applies to that cluster.  The topk ensures that if we have both
risk-matching CPUs and non-matching CPUs, we prefer the matching
result (value 1) over the non-matching result(s) (value 0).
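
A sketch of that shape (my reading of the description above, with the
vendor and family selectors as assumptions; the shipped query may
differ in its details):

  topk(1,
    group by (vendor, family) (node_cpu_info{_id="",vendor="AuthenticAMD",family="25"})
    or
    0 * group by (vendor, family) (node_cpu_info{_id=""})
  )

With a mixed fleet, the first clause yields a 1-valued series for the
family-25 AMD CPUs, the or adds 0-valued series for every other
vendor/family combination present, and topk keeps the highest value,
so a single matching node is enough to flag the risk.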

The _id="" portion is a pattern to support HyperShift and other
systems that could query the cluster's data out of a PromQL engine
that stored data for multiple clusters.  More context in 5cb2e93
(blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="",
2023-05-09, openshift#3591).

Generated by writing the 4.12.45 risk by hand and copying it out to
other impacted patch versions with:

  $ for Z in $(seq 46 51); do sed "s/4.12.45/4.12.${Z}/" blocked-edges/4.12.45-AMD19hFirmware.yaml > "blocked-edges/4.12.${Z}-AMD19hFirmware.yaml"; done

And then I set 'fixedIn: 4.12.53' in the 4.12.51 risk (we never had a
4.12.52 release, so 4.12.53 is the next one after 4.12.51).
wking added a commit to wking/cincinnati-graph-data that referenced this pull request May 23, 2024
Point at the fleet-scoped OCPBUGS / UpgradeBlocker assessment process,
because we don't want managed-specific risks declared without that
all-fleet assessment (even if the all-fleet assessment happens after a
managed-specific risk declaration).  If an all-fleet risk is declared,
any managed-specific risk in that space can be removed.  If an
all-fleet risk is not declared, service-delivery graph-data admins can
make their own call about what to do with a managed-specific risk.

Also point out the possibility of customer-managed admins being
confused by managed-specific risks.  For example, ca5795c (Adding
update risk for OPNET-479, 2024-03-08, openshift#4903) was an ARO-specific
declaration, but the message said "clusters running on Azure with
Accelerated Networking" which nominally includes non-ARO Azure
clusters.  Some of those cluster admins asked about why the risk
wasn't matching their clusters, and that's because we saw no evidence
of non-ARO clusters being bitten by the race [1].  The guidelines I'm adding
will hopefully reduce the chance of future managed-specific risk
declarations causing similar customer-managed admin confusion.

I'm also adding _id="" to the queries as a pattern to support
HyperShift and other systems that could query the cluster's data out
of a PromQL engine that stored data for multiple clusters.  More
context in 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew:
Explicit _id="", 2023-05-09, openshift#3591).

And I'm also adding ARO-specific PromQL.  Maybe someday ARO will join
the other managed flavors in using sre:telemetry:managed_labels, but
they haven't yet.

[1]: https://issues.redhat.com/browse/OPNET-479?focusedId=24366396&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-24366396
openshift-merge-bot bot pushed a commit that referenced this pull request May 31, 2024
The PromQL is:

* If we know IPsec is enabled, we're exposed, or...
* If we know IPsec is not enabled, we're not exposed, or...
* If we are OVN, but aren't sure whether IPsec is enabled
  (e.g. ovnkube_master_ipsec_enabled scraping is failing), we might be
  exposed.  Returning -1 will cause an evaluation failure [1].  Or...
* If we know we are not OVN, we are not exposed, or...
* If we aren't sure we're on OVN (e.g. apiserver_storage_objects
  scraping is failing), we might be exposed.  Returning no time series
  will cause an evaluation failure.

The label_replace makes it easier to tell when we're in the "we know
IPsec is not enabled" case.

The max_over_time avoids hiccuping if metrics are interrupted; see
5e4480f (blocked-edges/4.14.*-AzureRegistryImagePreservation: Look
for active registry use, 2024-02-09, #4763).
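
Putting those bullets together, the clause-by-clause shape is roughly
the sketch below (with the max_over_time and label_replace decoration
omitted, and the OVN-indicating resource picked here only as an
illustration):

  group(ovnkube_master_ipsec_enabled{_id=""} == 1)
  or
  0 * group(ovnkube_master_ipsec_enabled{_id=""})
  or
  -1 * group(apiserver_storage_objects{_id="",resource="egressips.k8s.ovn.io"})
  or
  0 * group(apiserver_storage_objects{_id=""})

Every clause aggregates down to the same empty label set, so the or
chain returns the left-most clause that produced anything: 1 when
IPsec is known to be on, 0 when it is known to be off or the cluster
is known not to be OVN, -1 (an evaluation failure) when the cluster
looks like OVN but the IPsec metric is missing, and no series at all
(also an evaluation failure) when even apiserver_storage_objects is
missing.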

I'm also adding _id="" to the queries as a pattern to support
HyperShift and other systems that could query the cluster's data out
of a PromQL engine that stored data for multiple clusters.  More
context in 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew:
Explicit _id="", 2023-05-09, #3591).

Generated by creating the 4.14.0 file by hand and copying it out to
the other 4.14 releases with:

  $ curl -s 'https://api.openshift.com/api/upgrades_info/graph?channel=candidate-4.14&arch=amd64' | jq -r '.nodes[] | .version' | grep '^4[.]14[.]' | grep -v '^4[.]14[.]0$' | while read VERSION; do sed "s/4.14.0/${VERSION}/" blocked-edges/4.14.0-OVNInterConnectTransitionIPsec.yaml > "blocked-edges/${VERSION}-OVNInterConnectTransitionIPsec.yaml"; done

[1]: https://github.com/openshift/enhancements/blob/4668a0825c59739dfafd2ae661c16cf30f540946/enhancements/update/targeted-update-edge-blocking.md?plain=1#L119
wking added a commit to wking/cincinnati-graph-data that referenced this pull request Jun 11, 2024
034fa01 (blocked-edges/4.12.*: Declare AWSOldBootImages,
2022-12-14, openshift#2909) explains why we need to look for 4.9-or-earlier
instead of looking for the 4.8-or-earlier condition this risk is
associated with.

I'm also adding _id="" to the queries as a pattern to support
HyperShift and other systems that could query the cluster's data out
of a PromQL engine that stored data for multiple clusters.  More
context in 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew:
Explicit _id="", 2023-05-09, openshift#3591).

Fixed in rc.4, because it has the new minor_min from f8316da
(build-suggestions/4.16: Set minor_min to 4.15.17, 2024-06-06, openshift#5352):

  $ oc adm release info quay.io/openshift-release-dev/ocp-release:4.16.0-rc.3-x86_64 | grep Upgrades
    Upgrades: 4.15.11, 4.15.12, 4.15.13, 4.15.14, 4.15.15, 4.15.16, 4.16.0-ec.1, 4.16.0-ec.2, 4.16.0-ec.3, 4.16.0-ec.4, 4.16.0-ec.5, 4.16.0-ec.6, 4.16.0-rc.0, 4.16.0-rc.1, 4.16.0-rc.2
  $ oc adm release info quay.io/openshift-release-dev/ocp-release:4.16.0-rc.4-x86_64 | grep Upgrades
    Upgrades: 4.15.17, 4.16.0-ec.1, 4.16.0-ec.2, 4.16.0-ec.3, 4.16.0-ec.4, 4.16.0-ec.5, 4.16.0-ec.6, 4.16.0-rc.0, 4.16.0-rc.1, 4.16.0-rc.2, 4.16.0-rc.3

and the fix for [1] is in [2].

[1]: https://issues.redhat.com/browse/OCPBUGS-34492
[2]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.15.16