Conversation

@wking wking commented May 9, 2023

This prepares for HyperShift, where we expect the cluster-version operator to resolve update risks via a PromQL engine that stores data from multiple clusters. Metrics in that data store use an _id label to show which cluster a time series belongs to (enhancement, implementation here and here). This label does not exist in the thanos-querier.openshift-monitoring.svc.cluster.local store for stand-alone, self-hosted clusters, so limiting to _id="" is a no-op there. Future work for HyperShift-hosted cluster-version operators can teach those operators to substitute this placeholder with _id="theActualClusterID", so they match time series from their own cluster and are not distracted by time series from other, sibling hosted clusters.

Generated with:

$ sed -i 's/cluster_infrastructure_provider{/cluster_infrastructure_provider{_id="",/;s/cluster_infrastructure_provider$/cluster_infrastructure_provider{_id=""}/;s/mcd_update_state{/mcd_update_state{_id="",/;s/mcd_update_state)/mcd_update_state{_id=""})/' blocked-edges/4.11.*-KeepalivedMulticastSkew.yaml
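
For illustration, the rewrite amounts to adding an explicit _id="" matcher to each selector; a before/after sketch for one of the touched metrics (not the full risk expression):

  # before
  cluster_infrastructure_provider{type=~"OpenStack|oVirt|VSphere"}

  # after: explicit _id="" placeholder, a no-op against the in-cluster thanos-querier store
  cluster_infrastructure_provider{_id="",type=~"OpenStack|oVirt|VSphere"}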

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 9, 2023

openshift-ci-robot commented May 9, 2023

@wking: This pull request references OTA-924 which is a valid jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 9, 2023

wking commented May 9, 2023

Reprising this testing for the various PromQL forms in a launch 4.10.58 ovirt cluster-bot cluster:

Impacted infrastructure platform, with at least one machine-config-controlled compute node

Real query

max(
  cluster_infrastructure_provider{_id="",type=~"OpenStack|oVirt|VSphere"}
  or
  0 * cluster_infrastructure_provider{_id=""}
)
*
(
  group(topk by (node) (1, mcd_update_state{_id="",config!~"|rendered-master-.*"}))
  or
  0 * group(mcd_update_state{_id=""})
)

shows the risk matching with a 1:

[screenshot]

Other platforms

With oVirt changed to oVirtNot in the type regex, simulating an AWS or other non-impacted cluster, the query drops to zero as a non-match:

[screenshot]

No MachineConfig compute

And with the config exclusion set to .*, so every MachineConfig is filtered out to simulate a cluster with no MachineConfig-managed compute, the query drops to zero as a non-match:

[screenshot]
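
Those two simulations presumably amount to small selector edits against the real query above; a sketch of the assumed variants:

  # non-impacted platform: oVirt no longer matches the type regex
  cluster_infrastructure_provider{_id="",type=~"OpenStack|oVirtNot|VSphere"}

  # no MachineConfig-managed compute: the negative config filter now excludes everything
  mcd_update_state{_id="",config!~".*"}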

@wking wking changed the title OTA-924: blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="" OTA-964: blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="" May 9, 2023

openshift-ci-robot commented May 9, 2023

@wking: This pull request references OTA-964 which is a valid jira issue.


In preparation for HyperShift, where we expect to have the
cluster-version operator resolving update risks via a PromQL engine
that stores data from multiple clusters.  Metrics in that data store
use an _id label to show which cluster the time-series belongs to
[2,3,4].  This label does not exist in the
thanos-querier.openshift-monitoring.svc.cluster.local store for
stand-alone, self-hosted clusters, so limiting to _id="" is a no-op
there.  And future work for HyperShift-hosted cluster-version
operators can teach those operators to substitute this placeholder
with _id="theActualClusterID", so they match time-series from their
cluster, and do not get distracted by time-series from other, sibling
hosted clusters.

Generated with:

  $ sed -i 's/cluster_infrastructure_provider{/cluster_infrastructure_provider{_id="",/;s/cluster_infrastructure_provider$/cluster_infrastructure_provider{_id=""}/;s/mcd_update_state{/mcd_update_state{_id="",/;s/mcd_update_state)/mcd_update_state{_id=""})/' blocked-edges/4.11.*-KeepalivedMulticastSkew.yaml

[1]: https://issues.redhat.com/browse/OTA-924
[2]: https://github.com/openshift/enhancements/blob/ba306451b99da182d26bd0631f3f1e448930316c/enhancements/monitoring/hypershift-monitoring.md#common-labels
[3]: openshift/hypershift@a0be78d
[4]: openshift/hypershift@44247e9

openshift-ci bot commented May 11, 2023

@wking: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-latest-cincinnati
Commit: 5cb2e93
Required: false
Rerun command: /test e2e-latest-cincinnati

Full PR test history. Your PR dashboard.


@LalatenduMohanty
Member

Should we wait for conditional update risks between 4.12 and 4.13?


@LalatenduMohanty LalatenduMohanty left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 17, 2023

openshift-ci bot commented May 17, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LalatenduMohanty, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [LalatenduMohanty,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit e6fc02c into openshift:master May 17, 2023
@wking wking deleted the explicit-_id branch May 17, 2023 17:42
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Jul 12, 2023
This commit will introduce new flags and logic to the CVO regarding its
PromQL target for risk evaluation of conditional updates with the CVO
being run in a hosted control plane in mind.

For the CVO to successfully access a service that provides metrics
(in the case of the CVO in a standalone OpenShift cluster it's the
thanos-querier service), it needs three things. It needs the service's
address, a CA bundle to verify the certificate provided by the service
to allow secure communication using TLS between the actors [1], and the
authorization credentials of the CVO. Currently, the CVO hardcodes the
address, the path to the CA bundle, and the path to the credentials file.

This is not ideal, as CVO is starting to be used in other repositories
such as HyperShift [2].

A path to the service CA bundle file is added to allow explicitly
setting the CA bundle file of the given query service.

Currently, the CVO is using a kube-api server to resolve the IP address
of a specific service in the cluster [4]. New flags were added to
specify the service name and its namespace as the PromQL target may
reside in different namespaces or have a different name. The port of the
service can also be different as in the case of the managed HyperShift
monitoring. This commit will make sure the port number is not hard-coded
but is rather dynamically found by the port's name.

In HyperShift multiple kube-api servers are being used. A management
cluster creates a hosted control plane where a hosted CVO resides. This
hosted CVO is then given a kubeconfig file for a cluster where the
hosted CVO reconciles manifests. However, because the CVO cannot resolve
internal DNS names and, in this case, is given a kubeconfig for a remote
cluster, the CVO is not capable of communicating with internal services on
the management cluster. Thus a new flag is added to specify the path to
the management cluster kubeconfig file. In case the management cluster
kubeconfig is not specified, the CVO defaults to using the
`serviceaccount` directory as is the behavior in the standalone OCP
cluster.

A flag to specify the path to the credentials file was not added because
the CVO hardcodes the same path to the credentials file as is the
path to the credentials file used in other service accounts in
Kubernetes [3].

    [1] https://docs.openshift.com/container-platform/4.12/security/certificate_types_descriptions/service-ca-certificates.html
    [2] https://github.com/openshift/hypershift
    [3] https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#serviceaccount-admission-controller
    [4] openshift#920

This commit also adds a new flag to specify whether the CVO resides in
a hosted control plane. In this case, the CVO will inject its cluster
ID into PromQL queries to differentiate between multiple time series
belonging to different hosted clusters [5].

[5] openshift/cincinnati-graph-data#3591
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 7, 2023
This commit will introduce new flags and logic to the CVO regarding its
PromQL target for risk evaluation of conditional updates with the CVO
being run in a hosted control plane in mind.

For the CVO to successfully access a service that provides metrics
(in the case of the CVO in a standalone OpenShift cluster it's the
thanos-querier service), it needs three things. It needs the service's
address, a CA bundle to verify the certificate provided by the service
to allow secure communication using TLS between the actors [1], and
the authorization credentials of the CVO. Currently, the CVO hardcodes
the address, the path to the CA bundle, and the path to the credentials
file.

This is not ideal, as CVO is starting to be used in other repositories
such as HyperShift [2].

A path to the service CA bundle file is added to allow explicitly
setting the CA bundle file of the given query service.

Currently, the CVO is using a kube-api server to resolve the IP address
of a specific service in the cluster [3]. Add new flags that allow
configuring the CVO to resolve the IP address via DNS when DNS is
available to the CVO. This is the case for hosted CVOs in HyperShift.
The alternative in HyperShift would be to use the management cluster
kube-api server and give the hosted CVOs additional permissions.

Add flags to specify the PromQL target URL. This URL contains the used
scheme, port, and the server's name used for TLS configuration. When
DNS is enabled, the URL is also used to query the Prometheus server.
When DNS is disabled, the server can be specified via the Kubernetes
service in the cluster that exposes the server.

A flag to specify the path to the credentials file was added for more
customizability. This flag also enables the CVO to not set the token
when it's not needed. A CVO can communicate with a Prometheus server
over HTTP. When the token is not needed, it would be undesirable for
the CVO to send its token over HTTP without reason.

This commit also adds a new flag to specify whether the CVO resides in
a hosted control plane. In this case, the CVO will inject its cluster
ID into PromQL queries to differentiate between multiple time series
belonging to different hosted clusters [4].

	[1] https://docs.openshift.com/container-platform/4.12/security/certificate_types_descriptions/service-ca-certificates.html
	[2] https://github.com/openshift/hypershift
	[3] openshift#920
	[4] openshift/cincinnati-graph-data#3591
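
Concretely, the injection described in that last paragraph replaces the _id="" placeholder from this PR with the hosted cluster's own ID; a sketch using the placeholder value from the PR description:

  # as shipped in graph-data (a no-op against a single-cluster store)
  cluster_infrastructure_provider{_id=""}

  # as rewritten by a hosted CVO before evaluation
  cluster_infrastructure_provider{_id="theActualClusterID"}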
wking added a commit to wking/cincinnati-graph-data that referenced this pull request Oct 6, 2023
…and 4.15.0-ec.0

The risk is for updating out of the impacted releases to hypothetical
future releases with the user change, e.g. from 4.14.0-rc.3 to 4.14.0.
But we're warning on * -> 4.14.0-rc.3 (and similar) to help folks
dodge the sticky updates entirely, because 4.14.0-rc.2 -> 4.14.0 are
not expected to ever have running-as-root webhook listeners.

Pattern-matching:

* apiserver_storage_objects from b573872
 (blocked-edges/4.13.*-PerformanceProfilesCPUQuota: Declare new risk,
 2023-06-27, openshift#3786),

* egressips.k8s.ovn.org from 894fab7
  (blocked-edges/4.11.9-ovn-namespace: 4.11.9 does not fix the
  regression, 2022-10-12, openshift#2628), and

* _id from 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew:
  Explicit _id="", 2023-05-09, openshift#3591).

Using cluster_installer with the hypershift invoker is new for this
commit, and in this case I'm using it to declare HyperShift clusters
not exposed to the risk.

Generated by manually writing the rc.3 risk, and then copying it around with:

  $ for VERSION in 4.14.0-rc.4 4.15.0-ec.0; do sed "s/4.14.0-rc.3/${VERSION}/" blocked-edges/4.14.0-rc.3-OVNWebhookUserConflict.yaml > "blocked-edges/${VERSION}-OVNWebhookUserConflict.yaml"; done
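
A sketch of the new cluster_installer clause described above; the exact invoker matcher and wiring are assumptions, not the shipped expression:

  # HyperShift-installed clusters report the hypershift invoker and are declared not exposed
  0 * group(cluster_installer{_id="",invoker="hypershift"})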
wking added a commit to wking/cincinnati-graph-data that referenced this pull request Dec 21, 2023
Miguel points out that the exposure set is more complicated [1] than
what I'd done in 45eb9ea (blocked-edges/4.14*: Declare
AzureDefaultVMType, openshift#4541).  It's:

* Azure clusters born in 4.8 or earlier are exposed.  Both ARO (which creates
  clusters with Hive?) and clusters created via openshift-installer.
* ARO clusters created in 4.13 and earlier are exposed.

Generated by updating the 4.14.1 risk by hand, and then running:

  $ curl -s 'https://api.openshift.com/api/upgrades_info/graph?channel=candidate-4.14&arch=amd64' | jq -r '.nodes[] | .version' | grep '^4[.]14[.]' | grep -v '^4[.]14[.][01]$' | while read VERSION; do sed "s/4.14.1/${VERSION}/" blocked-edges/4.14.1-AzureDefaultVMType.yaml > "blocked-edges/${VERSION}-AzureDefaultVMType.yaml"; done

Breaking down the logic for my new PromQL:

a. First stanza, using topk is likely unnecessary, but if we do happen
   to have multiple matches for some reason, we'll take the highest.
   That gives us a "we match" 1 (if any aggregated entries were 1) or
   a "we don't match" (if they were all 0), instead of "we're having a
   hard time figuring out" Recommended=Unknown.

   a. If the cluster is ARO (using cluster_operator_conditions, as in
      ba09198 (MCO-958: Blocking edges to 4.14.2+ and 4.13.25+, 2023-12-15,
      openshift#4524)), the first stanza is 1.  Otherwise, 'or' falls back to...

   b. Nested block, again with the cautious topk:

      a. If there are no cluster_operator_conditions, don't return a
         time series.  This ensures that "we didn't match a.a, but we
         might be ARO, and just have cluster_operator_conditions
         aggregation broken" gives us a Recommended=Unknown evaluation
         failure.

      b. Nested block, again with the cautious topk:

         a. born_by_4_9 yes case, with 4.(<=9) instead of the desired
            4.(<=8) because of the "old CVO bugs make it hard to
            distinguish between 4.(<=9) birth-versions" issue
            discussed in 034fa01 (blocked-edges/4.12.*: Declare
            AWSOldBootImages, 2022-12-14, openshift#2909).  Otherwise, 'or'
            falls back to...

         b. A check to ensure cluster_version{type="initial"} is
            working.  This ensures that "we didn't match the a.b.b.a
            born_by_4_9 yes case, but we might be that old, and just have
            cluster_version aggregation broken" gives us a
            Recommended=Unknown evaluation failure.

b. Second stanza, again with the cautious topk:

   a. cluster_infrastructure_provider is Azure.  Otherwise, 'or' falls
      back to...

   b. If there are no cluster_infrastructure_provider, don't return a
      time series.  This ensures that "we didn't match b.a, but we
      might be Azure, and just have cluster_infrastructure_provider
      aggregation broken" gives us a Recommended=Unknown evaluation
      failure.

All of the _id filtering is for use in hosted clusters or other PromQL
stores that include multiple clusters.  More background in 5cb2e93
(blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="",
2023-05-09, openshift#3591).

So walking some cases:

* Non-Azure cluster, cluster_operator_conditions, cluster_version, and
  cluster_infrastructure_provider all working:
  * a.a matches no series (not ARO).  Fall back to...
  * a.b.a confirms cluster_operator_conditions is working.
  * a.b.b could be 1 or 0 for cluster_version.
  * b.a matches no series (not Azure).
  * b.b gives 0 (confirming cluster_infrastructure_provider is working).
  * (1 or 0) * 0 = 0, cluster does not match.
* Non-Azure cluster, cluster_version is broken:
  * a.a matches no series (not ARO).  Fall back to...
  * a.b.a confirms cluster_operator_conditions is working.
  * a.b.b matches no series (cluster_version is broken).
  * b.a matches no series (not Azure).
  * b.b gives 0 (confirming cluster_infrastructure_provider is working).
  * (no-match) * 0 = no-match, evaluation fails, Recommended=Unknown.
    Admin gets to figure out what's broken with cluster_version and/or
    manually assess their exposure based on the message and linked
    URI.
* Non-ARO Azure cluster born in 4.9, all time-series working:
  * a.a matches no series (not ARO).  Fall back to...
  * a.b.a confirms cluster_operator_conditions is working.
  * a.b.b.a matches born_by_4_9 yes.
  * b.a matches (Azure).
  * 1 * 1 = 1, cluster matches.
* ARO cluster born in 4.9, all time-series working:
  * a.a matches (ARO).
  * b.a matches (Azure).
  * 1 * 1 = 1, cluster matches.
* ARO cluster born in 4.13, all time-series working (this is the case
  I'm fixing with this commit):
  * a.a matches (ARO).
  * b.a matches (Azure).
  * 1 * 1 = 1, cluster matches.
* ARO cluster, cluster_operator_conditions is broken.
  * a.a matches no series (cluster_operator_conditions is broken).
  * a.b.a matches no series (cluster_operator_conditions is broken).
  * b.a matches (Azure)
  * (no-match) * 1 = no-match, evaluation fails, Recommended=Unknown.
* ARO cluster, cluster_infrastructure_provider is broken.
  * a.a matches (ARO).
  * b.a matches no series (cluster_infrastructure_provider is broken).
  * b.b matches no series (cluster_infrastructure_provider is broken).
  * 1 * (no-match) = no-match, evaluation fails, Recommended=Unknown.
    We could add logic like a cluster_operator_conditions{name="aro"}
    check to the (b) stanza if we wanted to bake in "all ARO clusters
    are Azure" knowledge to successfully evaluate this case.  But I'd
    guess cluster_infrastructure_provider is working in most ARO
    clusters, and this PromQL is already complicated enough, so I
    haven't bothered with that level of tuning.
* ...lots of other combinations...

[1]: https://issues.redhat.com/browse/OCPCLOUD-2409?focusedId=23694976&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-23694976
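
To make the lettered walk-through above easier to follow, here is a structural PromQL sketch of that logic. It is not the shipped expression: the born-by-4.9 selector and the exact guard wiring are assumptions; only the metric names and the a/b lettering come from the commit message.

  topk(1,
    # a.a: ARO clusters match outright
    group(cluster_operator_conditions{_id="",name="aro"})
    or
    (
      topk(1,
        # a.b.b.a: born by 4.9 (assumed selector)
        group(cluster_version{_id="",type="initial",version=~"4[.][0-9][.].*"})
        or
        # a.b.b.b: cluster_version is reporting, so give a firm 0
        0 * group(cluster_version{_id="",type="initial"})
      )
      and
      # a.b.a: guard -- return nothing if cluster_operator_conditions is broken
      group(cluster_operator_conditions{_id=""})
    )
  )
  *
  topk(1,
    # b.a: the cluster is Azure
    group(cluster_infrastructure_provider{_id="",type="Azure"})
    or
    # b.b: cluster_infrastructure_provider is reporting, so give a firm 0
    0 * group(cluster_infrastructure_provider{_id=""})
  )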
wking added a commit to wking/cincinnati-graph-data that referenced this pull request Feb 9, 2024
…registry use

Some clusters do not exercise the internal image registry, and we do
not want to bother those admins about this risk.  Look at
imageregistry_request_duration_seconds_count (which only records
registry activity like BlobStore.Create) as a sign that the registry
is actively used to handle images.  Compare with
imageregistry_http_request_duration_seconds_count, which the registry
serves even when it is not actively being used to manage images, as a
sign that registry metrics are working, to distinguish between "we
know the registry is not being used" and "registry metrics collection
is broken, and we're blind to registry use".

In 4.14, the registry became an optional component [1], but
it is not optional in 4.13.  And it's 4.13 clusters evaluating this
PromQL because of the 'from: 4[.]13[.].*'.  So we don't have to worry
about "what if imageregistry_http_request_duration_seconds_count is
not available?" cases outside of the "registry metrics collection is
broken" case.

The registry does not persist these *_count values across container
restarts, so there is a window of exposure like:

1. Registry activity increments
   imageregistry_request_duration_seconds_count counters.
2. PromQL correctly identifies "this cluster is using its registry".
3. Registry container restarts, zeroing its *_count buckets.
4. If _all_ registry containers restart before there is new registry
   activity, the imageregistry_request_duration_seconds_count signal
   may go away.
5. New registry activity repopulates
   imageregistry_request_duration_seconds_count, return to step 1.

The chances that all registry pods restart at the same time seem low,
outside of:

* Updates or other registry-hosting-node reboots, where clusters that
  actively use the in-cluster registry are likely to have pods
  consuming images from that registry need re-pulling, which would
  drive fresh registry traffic and repopulate
  imageregistry_request_duration_seconds_count.
* Registry reconfiguration, which would not drive fresh registry
  traffic, but which seems like unlikely admin activity in a cluster
  where the internal-registry is not being used.

I'm also using the max_over_time with a 1 hour lookback to try to
smear over small durations of "waiting for new registry activity to
repopulate imageregistry_request_duration_seconds_count".

ARO is comfortable waiting for the bugfix to ship, and does not want
the complicated "registry in use?" detection, so they have an 'or'
entry in that stanza to ensure it will always match on ARO clusters.

And now that the PromQL is getting pretty complicated, I'm also
adjusting label management so it's easier to see from the result which
clauses are matching for a particular cluster, in case we need to
debug why we're matching when we don't expect to, or not matching when
we do expect to.

I'm also adding _id="" placeholders for HyperShift, see 5cb2e93
(blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="",
2023-05-09, openshift#3591) for more on that.

Generated by editing 4.14.1's by hand and copying out to the others with:

  $ ls blocked-edges/*AzureRegistryImagePreservation.yaml | grep -v '/4[.]14[.]1-A' | while read X; do TO="$(grep '^to: ' "${X}")"; sed "s/^to: .*/${TO}/" blocked-edges/4.14.1-AzureRegistryImagePreservation.yaml > "${X}"; done

[1]: https://docs.openshift.com/container-platform/4.14/release_notes/ocp-4-14-release-notes.html#ocp-4-14-optional-capabilities
wking added a commit to wking/cincinnati-graph-data that referenced this pull request Apr 16, 2024
The h suffix in "19h" stands for "hexadecimal", so 19h is decimal 25.
The first group-by matches CPUs exposed to the firmware bug with a
value of one.  The second group-by shows that the node_cpu_info metric
is working on at least some nodes.

I'm preserving vendor and family in the groupings, because it's nice
to be able to say "we aren't matching this risk for your cluster,
because we see..." and then point out an Intel CPU, or an AMD CPU with
a different family, or some other example of why we don't think the
risk applies to that cluster.  The topk ensures that if we have both
risk-matching CPUs and non-matching CPUs, we prefer the matching
result (value 1) over the non-matching result(s) (value 0).
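
A sketch of that shape (my reading of the description above, with the
vendor and family selectors as assumptions; the shipped query may
differ in its details):

  topk(1,
    group by (vendor, family) (node_cpu_info{_id="",vendor="AuthenticAMD",family="25"})
    or
    0 * group by (vendor, family) (node_cpu_info{_id=""})
  )

With a mixed fleet, the first clause yields a 1-valued series for the
family-25 AMD CPUs, the or adds 0-valued series for every other
vendor/family combination present, and topk keeps the highest value,
so a single matching node is enough to flag the risk.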

The _id="" portion is a pattern to support HyperShift and other
systems that could query the cluster's data out of a PromQL engine
that stored data for multiple clusters.  More context in 5cb2e93
(blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="",
2023-05-09, openshift#3591).

Generated by writing the 4.12.45 risk by hand and copying it out to
other impacted patch versions with:

  $ for Z in $(seq 46 51); do sed "s/4.12.45/4.12.${Z}/" blocked-edges/4.12.45-AMD19hFirmware.yaml > "blocked-edges/4.12.${Z}-AMD19hFirmware.yaml"; done

And then I set 'fixedIn: 4.12.53' in the 4.12.51 risk (we never had a
4.12.52 release, so 4.12.53 is the next one after 4.12.51).
wking added a commit to wking/cincinnati-graph-data that referenced this pull request May 23, 2024
Point at the fleet-scoped OCPBUGS / UpgradeBlocker assessment process,
because we don't want managed-specific risks declared without that
all-fleet assessment (even if the all-fleet assessment happens after a
managed-specific risk declaration).  If an all-fleet risk is declared,
any managed-specific risk in that space can be removed.  If an
all-fleet risk is not declared, service-delivery graph-data admins can
make their own call about what to do with a managed-specific risk.

Also point out the possibility of customer-managed admins being
confused by managed-specific risks.  For example, ca5795c (Adding
update risk for OPNET-479, 2024-03-08, openshift#4903) was an ARO-specific
declaration, but the message said "clusters running on Azure with
Accelerated Networking" which nominally includes non-ARO Azure
clusters.  Some of those cluster admins asked about why the risk
wasn't matching their clusters, and that's because we saw no evidence
of non-ARO clusters being bitten by the race [1].  The guidelines I'm adding
will hopefully reduce the chance of future managed-specific risk
declarations causing similar customer-managed admin confusion.

I'm also adding _id="" to the queries as a pattern to support
HyperShift and other systems that could query the cluster's data out
of a PromQL engine that stored data for multiple clusters.  More
context in 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew:
Explicit _id="", 2023-05-09, openshift#3591).

And I'm also adding ARO-specific PromQL.  Maybe someday ARO will join
the other managed flavors in using sre:telemetry:managed_labels, but
they haven't yet.

[1]: https://issues.redhat.com/browse/OPNET-479?focusedId=24366396&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-24366396
openshift-merge-bot bot pushed a commit that referenced this pull request May 31, 2024
The PromQL is:

* If we know IPsec is enabled, we're exposed, or...
* If we know IPsec is not enabled, we're not exposed, or...
* If we are OVN, but aren't sure whether IPsec is enabled
  (e.g. ovnkube_master_ipsec_enabled scraping is failing), we might be
  exposed.  Returning -1 will cause an evaluation failure [1].  Or...
* If we know we are not OVN, we are not exposed, or...
* If we aren't sure we're on OVN (e.g. apiserver_storage_objects
  scraping is failing), we might be exposed.  Returning no time series
  will cause an evaluation failure.

The label_replace makes it easier to tell when we're in the "we know
IPsec is not enabled" case.

The max_over_time avoids hiccuping if metrics are interrupted; see
5e4480f (blocked-edges/4.14.*-AzureRegistryImagePreservation: Look
for active registry use, 2024-02-09, #4763).
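
Putting those bullets together, the clause-by-clause shape is roughly
the sketch below (with the max_over_time and label_replace decoration
omitted, and the OVN-indicating resource picked here only as an
illustration):

  group(ovnkube_master_ipsec_enabled{_id=""} == 1)
  or
  0 * group(ovnkube_master_ipsec_enabled{_id=""})
  or
  -1 * group(apiserver_storage_objects{_id="",resource="egressips.k8s.ovn.io"})
  or
  0 * group(apiserver_storage_objects{_id=""})

Every clause aggregates down to the same empty label set, so the or
chain returns the left-most clause that produced anything: 1 when
IPsec is known to be on, 0 when it is known to be off or the cluster
is known not to be OVN, -1 (an evaluation failure) when the cluster
looks like OVN but the IPsec metric is missing, and no series at all
(also an evaluation failure) when even apiserver_storage_objects is
missing.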

I'm also adding _id="" to the queries as a pattern to support
HyperShift and other systems that could query the cluster's data out
of a PromQL engine that stored data for multiple clusters.  More
context in 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew:
Explicit _id="", 2023-05-09, #3591).

Generated by creating the 4.14.0 file by hand and copying it out to
the other 4.14 releases with:

  $ curl -s 'https://api.openshift.com/api/upgrades_info/graph?channel=candidate-4.14&arch=amd64' | jq -r '.nodes[] | .version' | grep '^4[.]14[.]' | grep -v '^4[.]14[.]0$' | while read VERSION; do sed "s/4.14.0/${VERSION}/" blocked-edges/4.14.0-OVNInterConnectTransitionIPsec.yaml > "blocked-edges/${VERSION}-OVNInterConnectTransitionIPsec.yaml"; done

[1]: https://github.com/openshift/enhancements/blob/4668a0825c59739dfafd2ae661c16cf30f540946/enhancements/update/targeted-update-edge-blocking.md?plain=1#L119
wking added a commit to wking/cincinnati-graph-data that referenced this pull request Jun 11, 2024
034fa01 (blocked-edges/4.12.*: Declare AWSOldBootImages,
2022-12-14, openshift#2909) explains why we need to look for 4.9-or-earlier
instead of looking for the 4.8-or-earlier condition this risk is
associated with.

I'm also adding _id="" to the queries as a pattern to support
HyperShift and other systems that could query the cluster's data out
of a PromQL engine that stored data for multiple clusters.  More
context in 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew:
Explicit _id="", 2023-05-09, openshift#3591).

Fixed in rc.4, because it has the new minor_min from f8316da
(build-suggestions/4.16: Set minor_min to 4.15.17, 2024-06-06, openshift#5352):

  $ oc adm release info quay.io/openshift-release-dev/ocp-release:4.16.0-rc.3-x86_64 | grep Upgrades
    Upgrades: 4.15.11, 4.15.12, 4.15.13, 4.15.14, 4.15.15, 4.15.16, 4.16.0-ec.1, 4.16.0-ec.2, 4.16.0-ec.3, 4.16.0-ec.4, 4.16.0-ec.5, 4.16.0-ec.6, 4.16.0-rc.0, 4.16.0-rc.1, 4.16.0-rc.2
  $ oc adm release info quay.io/openshift-release-dev/ocp-release:4.16.0-rc.4-x86_64 | grep Upgrades
    Upgrades: 4.15.17, 4.16.0-ec.1, 4.16.0-ec.2, 4.16.0-ec.3, 4.16.0-ec.4, 4.16.0-ec.5, 4.16.0-ec.6, 4.16.0-rc.0, 4.16.0-rc.1, 4.16.0-rc.2, 4.16.0-rc.3

and the fix for [1] is in [2].

[1]: https://issues.redhat.com/browse/OCPBUGS-34492
[2]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.15.16