OTA-964: blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="" #3591
Conversation
@wking: This pull request references OTA-924, which is a valid Jira issue.
Reprising this testing for the various PromQL: on an impacted infrastructure platform with at least one machine-config-controlled compute node, the real query shows the risk matching with a 1. Other platforms, and clusters with no MachineConfig-managed compute, were also checked.
@wking: This pull request references OTA-964, which is a valid Jira issue.
This is in preparation for HyperShift [1], where we expect to have the
cluster-version operator resolving update risks via a PromQL engine
that stores data from multiple clusters. Metrics in that data store
use an _id label to show which cluster the time-series belongs to
[2,3,4]. This label does not exist in the
thanos-querier.openshift-monitoring.svc.cluster.local store for
stand-alone, self-hosted clusters, so limiting to _id="" is a no-op
there. And future work for HyperShift-hosted cluster-version
operators can teach those operators to substitute this placeholder
with _id="theActualClusterID", so they match time-series from their
cluster, and do not get distracted by time-series from other, sibling
hosted clusters.
Generated with:
$ sed -i 's/cluster_infrastructure_provider{/cluster_infrastructure_provider{_id="",/;s/cluster_infrastructure_provider$/cluster_infrastructure_provider{_id=""}/;s/mcd_update_state{/mcd_update_state{_id="",/;s/mcd_update_state)/mcd_update_state{_id=""})/' blocked-edges/4.11.*-KeepalivedMulticastSkew.yaml
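To make the effect concrete, piping a hypothetical expression (not the actual blocked-edge file contents) through the relevant part of that sed expression shows the intended rewrite:
$ echo 'group(cluster_infrastructure_provider{type="BareMetal"})' \
    | sed 's/cluster_infrastructure_provider{/cluster_infrastructure_provider{_id="",/'
group(cluster_infrastructure_provider{_id="",type="BareMetal"})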
[1]: https://issues.redhat.com/browse/OTA-924
[2]: https://github.com/openshift/enhancements/blob/ba306451b99da182d26bd0631f3f1e448930316c/enhancements/monitoring/hypershift-monitoring.md#common-labels
[3]: openshift/hypershift@a0be78d
[4]: openshift/hypershift@44247e9
@wking: The following test failed; say /retest to rerun all failed tests. Full PR test history. Your PR dashboard.
Should we wait for conditional update risks between 4.12 and 4.13?
LalatenduMohanty left a comment:
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: LalatenduMohanty, wking. The full list of commands accepted by this bot can be found here, and the pull request process is described here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing /approve in a comment.
This commit introduces new flags and logic to the CVO regarding its
PromQL target for risk evaluation of conditional updates, with the CVO
being run in a hosted control plane in mind.
For the CVO to successfully access a service that provides metrics
(in the case of the CVO in a standalone OpenShift cluster, it is the
thanos-querier service), it needs three things: the service's
address, a CA bundle to verify the certificate provided by the service
so the actors can communicate securely over TLS [1], and the
CVO's authorization credentials. Currently, the CVO hardcodes the
address, the path to the CA bundle, and the path to the credentials file.
This is not ideal, as the CVO is starting to be used in other repositories
such as HyperShift [2].
A flag for the path to the service CA bundle file is added to allow
explicitly setting the CA bundle of the given query service.
Currently, the CVO uses a kube-api server to resolve the IP address
of a specific service in the cluster [4]. New flags were added to
specify the service name and its namespace, as the PromQL target may
reside in a different namespace or have a different name. The port of the
service can also differ, as in the case of managed HyperShift
monitoring, so this commit makes sure the port number is not hard-coded
but is instead looked up dynamically by the port's name.
In HyperShift, multiple kube-api servers are in use. A management
cluster creates a hosted control plane where a hosted CVO resides. This
hosted CVO is then given a kubeconfig file for the cluster where the
hosted CVO reconciles manifests. However, because the CVO cannot resolve
internal DNS names and, in this case, is given a kubeconfig for a remote
cluster, the CVO is not able to communicate with internal services on
the management cluster. Thus a new flag is added to specify the path to
the management cluster kubeconfig file. When the management cluster
kubeconfig is not specified, the CVO defaults to using the
`serviceaccount` directory, as is the behavior in a standalone OCP
cluster.
A flag to specify the path to the credentials file was not added because
the CVO hardcodes the same credentials-file path that is used by
other service accounts in Kubernetes [3].
[1] https://docs.openshift.com/container-platform/4.12/security/certificate_types_descriptions/service-ca-certificates.html
[2] https://github.com/openshift/hypershift
[3] https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#serviceaccount-admission-controller
[4] openshift#920
This commit also adds a new flag to specify whether the CVO resides in
a hosted control plane. In this case, the CVO will inject its cluster
ID into PromQL queries to differentiate between multiple time series
belonging to different hosted clusters [5].
[5] openshift/cincinnati-graph-data#3591
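Conceptually, the cluster-ID injection amounts to rewriting the _id="" placeholder in the query text. A rough shell illustration of that substitution (the real rewrite happens inside the CVO's PromQL handling, and the cluster ID here is made up):
$ CLUSTER_ID=1638c2b9-0743-4f39-b304-3816d547ee48   # made-up example value
$ echo 'cluster_infrastructure_provider{_id="",type="Azure"}' \
    | sed "s/_id=\"\"/_id=\"${CLUSTER_ID}\"/"
cluster_infrastructure_provider{_id="1638c2b9-0743-4f39-b304-3816d547ee48",type="Azure"}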
This commit introduces new flags and logic to the CVO regarding its PromQL target for risk evaluation of conditional updates, with the CVO being run in a hosted control plane in mind.

For the CVO to successfully access a service that provides metrics (in the case of the CVO in a standalone OpenShift cluster, it is the thanos-querier service), it needs three things: the service's address, a CA bundle to verify the certificate provided by the service so the actors can communicate securely over TLS [1], and the CVO's authorization credentials. Currently, the CVO hardcodes the address, the path to the CA bundle, and the path to the credentials file. This is not ideal, as the CVO is starting to be used in other repositories such as HyperShift [2].

A flag for the path to the service CA bundle file is added to allow explicitly setting the CA bundle of the given query service.

Currently, the CVO uses a kube-api server to resolve the IP address of a specific service in the cluster [3]. New flags allow configuring the CVO to resolve the IP address via DNS when DNS is available to the CVO, which is the case for hosted CVOs in HyperShift. The alternative in HyperShift would be to use the management cluster kube-api server and give the hosted CVOs additional permissions.

Flags are also added to specify the PromQL target URL. This URL carries the scheme, the port, and the server name used for TLS configuration. When DNS is enabled, the URL is also used to query the Prometheus server. When DNS is disabled, the server can instead be specified via the Kubernetes service in the cluster that exposes it.

A flag to specify the path to the credentials file was added for more customizability. It also enables the CVO to omit the token when it is not needed: the CVO can communicate with a Prometheus server over HTTP, and when the token is not needed it would be undesirable for the CVO to send its token over HTTP without reason.

This commit also adds a new flag to specify whether the CVO resides in a hosted control plane. In this case, the CVO will inject its cluster ID into PromQL queries to differentiate between multiple time series belonging to different hosted clusters [4].

[1] https://docs.openshift.com/container-platform/4.12/security/certificate_types_descriptions/service-ca-certificates.html
[2] https://github.com/openshift/hypershift
[3] openshift#920
[4] openshift/cincinnati-graph-data#3591
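For orientation, the kind of request the CVO ends up making against its PromQL target can be approximated by hand. This is only a sketch: the thanos-querier host and port are the standalone defaults, the CA-bundle path is left as a placeholder for whatever the new flag points at, and the query is an arbitrary example.
$ TOKEN="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
$ curl -s --cacert "${SERVICE_CA_BUNDLE:?path to the service CA bundle}" \
    -H "Authorization: Bearer ${TOKEN}" \
    --data-urlencode 'query=cluster_version{_id="",type="initial"}' \
    "https://thanos-querier.openshift-monitoring.svc.cluster.local:9091/api/v1/query"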
…and 4.15.0-ec.0

The risk is for updating out of the impacted releases to hypothetical future releases with the user change, e.g. from 4.14.0-rc.3 to 4.14.0. But we're warning on * -> 4.14.0-rc.3 (and similar) to help folks dodge the sticky updates entirely, because 4.14.0-rc.2 -> 4.14.0 are not expected to ever have running-as-root webhook listeners.

Pattern-matching:

* apiserver_storage_objects from b573872 (blocked-edges/4.13.*-PerformanceProfilesCPUQuota: Declare new risk, 2023-06-27, openshift#3786),
* egressips.k8s.ovn.org from 894fab7 (blocked-edges/4.11.9-ovn-namespace: 4.11.9 does not fix the regression, 2022-10-12, openshift#2628), and
* _id from 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="", 2023-05-09, openshift#3591).

Using cluster_installer with the hypershift invoker is new for this commit, and in this case I'm using it to declare HyperShift clusters not exposed to the risk.

Generated by manually writing the rc.3 risk, and then copying it around with:

$ for VERSION in 4.14.0-rc.4 4.15.0-ec.0; do sed "s/4.14.0-rc.3/${VERSION}/" blocked-edges/4.14.0-rc.3-OVNWebhookUserConflict.yaml > "blocked-edges/${VERSION}-OVNWebhookUserConflict.yaml"; done
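One way to sanity-check the generated copies, assuming a local graph-data checkout, is to confirm they differ from the hand-written rc.3 file only in the version string (an empty diff means they do):
$ diff <(sed 's/4.14.0-rc.4/4.14.0-rc.3/' blocked-edges/4.14.0-rc.4-OVNWebhookUserConflict.yaml) blocked-edges/4.14.0-rc.3-OVNWebhookUserConflict.yaml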
Miguel points out that the exposure set is more complicated [1] than what I'd done in 45eb9ea (blocked-edges/4.14*: Declare AzureDefaultVMType, openshift#4541). It's:

* Azure clusters born in 4.8 or earlier are exposed. Both ARO (which creates clusters with Hive?) and clusters created via openshift-installer.
* ARO clusters created in 4.13 and earlier are exposed.

Generated by updating the 4.14.1 risk by hand, and then running:

$ curl -s 'https://api.openshift.com/api/upgrades_info/graph?channel=candidate-4.14&arch=amd64' | jq -r '.nodes[] | .version' | grep '^4[.]14[.]' | grep -v '^4[.]14[.][01]$' | while read VERSION; do sed "s/4.14.1/${VERSION}/" blocked-edges/4.14.1-AzureDefaultVMType.yaml > "blocked-edges/${VERSION}-AzureDefaultVMType.yaml"; done

Breaking down the logic for my new PromQL:

a. First stanza. Using topk is likely unnecessary, but if we do happen to have multiple matches for some reason, we'll take the highest. That gives us a "we match" 1 (if any aggregated entries were 1) or a "we don't match" 0 (if they were all 0), instead of a "we're having a hard time figuring out" Recommended=Unknown.
   a. If the cluster is ARO (using cluster_operator_conditions, as in ba09198 (MCO-958: Blocking edges to 4.14.2+ and 4.13.25+, 2023-12-15, openshift#4524)), the first stanza is 1. Otherwise, 'or' falls back to...
   b. Nested block, again with the cautious topk:
      a. If there are no cluster_operator_conditions, don't return a time series. This ensures that "we didn't match a.a, but we might be ARO, and just have cluster_operator_conditions aggregation broken" gives us a Recommended=Unknown evaluation failure.
      b. Nested block, again with the cautious topk:
         a. The born_by_4_9 yes case, with 4.(<=9) instead of the desired 4.(<=8) because of the "old CVO bugs make it hard to distinguish between 4.(<=9) birth-versions" issue discussed in 034fa01 (blocked-edges/4.12.*: Declare AWSOldBootImages, 2022-12-14, openshift#2909). Otherwise, 'or' falls back to...
         b. A check to ensure cluster_version{type="initial"} is working. This ensures that "we didn't match the a.b.b.a born_by_4_9 yes case, but we might be that old, and just have cluster_version aggregation broken" gives us a Recommended=Unknown evaluation failure.
b. Second stanza, again with the cautious topk:
   a. cluster_infrastructure_provider is Azure. Otherwise, 'or' falls back to...
   b. If there are no cluster_infrastructure_provider, don't return a time series. This ensures that "we didn't match b.a, but we might be Azure, and just have cluster_infrastructure_provider aggregation broken" gives us a Recommended=Unknown evaluation failure.

All of the _id filtering is for use in hosted clusters or other PromQL stores that include multiple clusters. More background in 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="", 2023-05-09, openshift#3591).

So walking some cases:

* Non-Azure cluster, cluster_operator_conditions, cluster_version, and cluster_infrastructure_provider all working:
  * a.a matches no series (not ARO). Fall back to...
  * a.b.a confirms cluster_operator_conditions is working.
  * a.b.b could be 1 or 0 for cluster_version.
  * b.a matches no series (not Azure).
  * b.b gives 0 (confirming cluster_infrastructure_provider is working).
  * (1 or 0) * 0 = 0, cluster does not match.
* Non-Azure cluster, cluster_version is broken:
  * a.a matches no series (not ARO). Fall back to...
  * a.b.a confirms cluster_operator_conditions is working.
  * a.b.b matches no series (cluster_version is broken).
  * b.a matches no series (not Azure).
  * b.b gives 0 (confirming cluster_infrastructure_provider is working).
  * (no-match) * 0 = no-match, evaluation fails, Recommended=Unknown. The admin gets to figure out what's broken with cluster_version and/or manually assess their exposure based on the message and linked URI.
* Non-ARO Azure cluster born in 4.9, all time-series working:
  * a.a matches no series (not ARO). Fall back to...
  * a.b.a confirms cluster_operator_conditions is working.
  * a.b.b.a matches born_by_4_9 yes.
  * b.a matches (Azure).
  * 1 * 1 = 1, cluster matches.
* ARO cluster born in 4.9, all time-series working:
  * a.a matches (ARO).
  * b.a matches (Azure).
  * 1 * 1 = 1, cluster matches.
* ARO cluster born in 4.13, all time-series working (this is the case I'm fixing with this commit):
  * a.a matches (ARO).
  * b.a matches (Azure).
  * 1 * 1 = 1, cluster matches.
* ARO cluster, cluster_operator_conditions is broken:
  * a.a matches no series (cluster_operator_conditions is broken).
  * a.b.a matches no series (cluster_operator_conditions is broken).
  * b.a matches (Azure).
  * (no-match) * 1 = no-match, evaluation fails, Recommended=Unknown.
* ARO cluster, cluster_infrastructure_provider is broken:
  * a.a matches (ARO).
  * b.a matches no series (cluster_infrastructure_provider is broken).
  * b.b matches no series (cluster_infrastructure_provider is broken).
  * 1 * (no-match) = no-match, evaluation fails, Recommended=Unknown. We could add logic like a cluster_operator_conditions{name="aro"} check to the (b) stanza if we wanted to bake in "all ARO clusters are Azure" knowledge to successfully evaluate this case. But I'd guess cluster_infrastructure_provider is working in most ARO clusters, and this PromQL is already complicated enough, so I haven't bothered with that level of tuning.
* ...lots of other combinations...

[1]: https://issues.redhat.com/browse/OCPCLOUD-2409?focusedId=23694976&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-23694976
…registry use

Some clusters do not exercise the internal image registry, and we do not want to bother those admins about this risk. Look at imageregistry_request_duration_seconds_count (which only records registry activity like BlobStore.Create) as a sign that the registry is actively used to handle images. Compare with imageregistry_http_request_duration_seconds_count, which the registry serves even when it is not actively being used to manage images, as a sign that registry metrics are working, to distinguish between "we know the registry is not being used" and "registry metrics collection is broken, and we're blind to registry use".

In 4.14, the registry component became an optional component [1], but it is not optional in 4.13. And it's 4.13 clusters evaluating this PromQL because of the 'from: 4[.]13[.].*'. So we don't have to worry about "what if imageregistry_http_request_duration_seconds_count is not available?" cases outside of the "registry metrics collection is broken" case.

The registry does not persist these *_count values across container restarts, so there is a window of exposure like:

1. Registry activity increments imageregistry_request_duration_seconds_count counters.
2. PromQL correctly identifies "this cluster is using its registry".
3. Registry container restarts, zeroing its *_count buckets.
4. If _all_ registry containers restart before there is new registry activity, the imageregistry_request_duration_seconds_count signal may go away.
5. New registry activity repopulates imageregistry_request_duration_seconds_count; return to step 1.

The chance that all registry pods restart at the same time seems low, outside of:

* Updates or other registry-hosting-node reboots, where clusters that actively use the in-cluster registry are likely to have pods consuming images from that registry need re-pulling, which would drive fresh registry traffic and repopulate imageregistry_request_duration_seconds_count.
* Registry reconfiguration, which would not drive fresh registry traffic, but which seems like unlikely admin activity in a cluster where the internal registry is not being used.

I'm also using max_over_time with a 1-hour lookback to try to smear over small durations of "waiting for new registry activity to repopulate imageregistry_request_duration_seconds_count".

ARO is comfortable waiting for the bugfix to ship, and does not want the complicated "registry in use?" detection, so they have an 'or' entry in that stanza to ensure it will always match on ARO clusters.

And now that the PromQL is getting pretty complicated, I'm also adjusting label management so it's easier to see from the result which clauses are matching for a particular cluster, in case we need to debug why we're matching when we don't expect to, or not matching when we do expect to.

I'm also adding _id="" placeholders for HyperShift; see 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="", 2023-05-09, openshift#3591) for more on that.

Generated by editing 4.14.1's by hand and copying out to the others with:

$ ls blocked-edges/*AzureRegistryImagePreservation.yaml | grep -v '/4[.]14[.]1-A' | while read X; do TO="$(grep '^to: ' "${X}")"; sed "s/^to: .*/${TO}/" blocked-edges/4.14.1-AzureRegistryImagePreservation.yaml > "${X}"; done

[1]: https://docs.openshift.com/container-platform/4.14/release_notes/ocp-4-14-release-notes.html#ocp-4-14-optional-capabilities
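A rough sketch of the distinction being drawn, pieced together from the description above rather than copied from the actual blocked-edges file: value 1 when the registry activity counter is present, 0 when only the registry's HTTP metrics are present, and no time series when registry metrics are broken.
$ cat <<'EOF' > /tmp/AzureRegistryImagePreservation-sketch.promql
topk(1,
  group by (_id) (max_over_time(imageregistry_request_duration_seconds_count{_id=""}[1h]))
  or
  0 * group by (_id) (max_over_time(imageregistry_http_request_duration_seconds_count{_id=""}[1h]))
)
EOF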
The h suffix in "19h" stands for "hexadecimal", so 19h is decimal 25.

The first group-by matches CPUs exposed to the firmware bug with a value of one. The second group-by shows that the node_cpu_info metric is working on at least some nodes. I'm preserving vendor and family in the groupings, because it's nice to be able to say "we aren't matching this risk for your cluster, because we see..." and then point out an Intel CPU, or an AMD CPU with a different family, or some other example of why we don't think the risk applies to that cluster. The topk ensures that if we have both risk-matching CPUs and non-matching CPUs, we prefer the matching result (value 1) over the non-matching result(s) (value 0).

The _id="" portion is a pattern to support HyperShift and other systems that could query the cluster's data out of a PromQL engine that stored data for multiple clusters. More context in 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="", 2023-05-09, openshift#3591).

Generated by writing the 4.12.45 risk by hand and copying it out to other impacted patch versions with:

$ for Z in $(seq 46 51); do sed "s/4.12.45/4.12.${Z}/" blocked-edges/4.12.45-AMD19hFirmware.yaml > "blocked-edges/4.12.${Z}-AMD19hFirmware.yaml"; done

And then I set 'fixedIn: 4.12.53' in the 4.12.51 risk (we never had a 4.12.52 release, so 4.12.53 is the next one after 4.12.51).
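As a rough reconstruction of the shape that description implies (pieced together from the prose, not copied from the actual blocked-edges file; the vendor="AuthenticAMD",family="25" selector is my reading of "AMD family 19h"):
$ cat <<'EOF' > /tmp/AMD19hFirmware-sketch.promql
topk(1,
  group by (_id, vendor, family) (node_cpu_info{_id="",vendor="AuthenticAMD",family="25"})
  or
  0 * group by (_id, vendor, family) (node_cpu_info{_id=""})
)
EOF
The first clause yields 1 for matching vendor/family groups, the second yields 0 for every vendor/family seen, and topk(1, ...) prefers the 1 when both are present.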
Point at the fleet-scoped OCPBUGS / UpgradeBlocker assessment process, because we don't want managed to declare risks without having that all-fleet assessment (even if the all-fleet assessment happens after a managed-specific risk declaration). If an all-fleet risk is declared, any managed-specific risk in that space can be removed. If an all-fleet risk is not declared, service-delivery graph-data admins can make their own call about what to do with a managed-specific risk.

Also point out the possibility of customer-managed admins being confused by managed-specific risks. For example, ca5795c (Adding update risk for OPNET-479, 2024-03-08, openshift#4903) was an ARO-specific declaration, but the message said "clusters running on Azure with Accelerated Networking", which nominally includes non-ARO Azure clusters. Some of those cluster admins asked about why the risk wasn't matching their clusters, and it's because we saw no evidence of non-ARO clusters being bit by the race [1]. The guidelines I'm adding will hopefully reduce the chance of future managed-specific risk declarations causing similar customer-managed admin confusion.

I'm also adding _id="" to the queries as a pattern to support HyperShift and other systems that could query the cluster's data out of a PromQL engine that stored data for multiple clusters. More context in 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="", 2023-05-09, openshift#3591).

And I'm also adding ARO-specific PromQL. Maybe someday ARO will join the other managed flavors in using sre:telemetry:managed_labels, but they haven't yet.

[1]: https://issues.redhat.com/browse/OPNET-479?focusedId=24366396&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-24366396
The PromQL is:

* If we know IPsec is enabled, we're exposed, or...
* If we know IPsec is not enabled, we're not exposed, or...
* If we are OVN, but aren't sure if IPsec is enabled (e.g. ovnkube_master_ipsec_enabled scraping is failing), we might be exposed. -1 will cause an evaluation failure [1]. Or...
* If we know we are not OVN, we are not exposed, or...
* If we aren't sure we're on OVN (e.g. apiserver_storage_objects scraping is failing), we might be exposed. Returning no time series will cause an evaluation failure.

The label_replace makes it easier to tell when we're in the "we know IPsec is not enabled" case. The max_over_time avoids hiccuping if metrics are interrupted; see 5e4480f (blocked-edges/4.14.*-AzureRegistryImagePreservation: Look for active registry use, 2024-02-09, #4763).

I'm also adding _id="" to the queries as a pattern to support HyperShift and other systems that could query the cluster's data out of a PromQL engine that stored data for multiple clusters. More context in 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="", 2023-05-09, #3591).

Generated by creating the 4.14.0 file by hand and copying it out to the other 4.14 releases with:

$ curl -s 'https://api.openshift.com/api/upgrades_info/graph?channel=candidate-4.14&arch=amd64' | jq -r '.nodes[] | .version' | grep '^4[.]14[.]' | grep -v '^4[.]14[.]0$' | while read VERSION; do sed "s/4.14.0/${VERSION}/" blocked-edges/4.14.0-OVNInterConnectTransitionIPsec.yaml > "blocked-edges/${VERSION}-OVNInterConnectTransitionIPsec.yaml"; done

[1]: https://github.com/openshift/enhancements/blob/4668a0825c59739dfafd2ae661c16cf30f540946/enhancements/update/targeted-update-edge-blocking.md?plain=1#L119
034fa01 (blocked-edges/4.12.*: Declare AWSOldBootImages, 2022-12-14, openshift#2909) explains why we need to look for 4.9-or-earlier instead of looking for the 4.8-or-earlier condition this risk is associated with. I'm also adding _id="" to the queries as a pattern to support HyperShift and other systems that could query the cluster's data out of a PromQL engine that stored data for multiple clusters. More context in 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="", 2023-05-09, openshift#3591).
034fa01 (blocked-edges/4.12.*: Declare AWSOldBootImages, 2022-12-14, openshift#2909) explains why we need to look for 4.9-or-earlier instead of looking for the 4.8-or-earlier condition this risk is associated with. I'm also adding _id="" to the queries as a pattern to support HyperShift and other systems that could query the cluster's data out of a PromQL engine that stored data for multiple clusters. More context in 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="", 2023-05-09, openshift#3591).

Fixed in rc.4, because it has the new minor_min from f8316da (build-suggestions/4.16: Set minor_min to 4.15.17, 2024-06-06, openshift#5352):

$ oc adm release info quay.io/openshift-release-dev/ocp-release:4.16.0-rc.3-x86_64 | grep Upgrades
  Upgrades: 4.15.11, 4.15.12, 4.15.13, 4.15.14, 4.15.15, 4.15.16, 4.16.0-ec.1, 4.16.0-ec.2, 4.16.0-ec.3, 4.16.0-ec.4, 4.16.0-ec.5, 4.16.0-ec.6, 4.16.0-rc.0, 4.16.0-rc.1, 4.16.0-rc.2
$ oc adm release info quay.io/openshift-release-dev/ocp-release:4.16.0-rc.4-x86_64 | grep Upgrades
  Upgrades: 4.15.17, 4.16.0-ec.1, 4.16.0-ec.2, 4.16.0-ec.3, 4.16.0-ec.4, 4.16.0-ec.5, 4.16.0-ec.6, 4.16.0-rc.0, 4.16.0-rc.1, 4.16.0-rc.2, 4.16.0-rc.3

and the fix for [1] is in [2].

[1]: https://issues.redhat.com/browse/OCPBUGS-34492
[2]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.15.16


