
Conversation

Contributor

@DavidHurta DavidHurta commented Apr 17, 2023

This pull request will add new flags and some minor logic to enable the CVO to evaluate conditional updates in HyperShift.

This pull request references https://issues.redhat.com/browse/OTA-854

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 17, 2023
Contributor

openshift-ci-robot commented Apr 17, 2023

@Davoska: This pull request references OTA-854 which is a valid jira issue.


In response to this:

This pull request references https://issues.redhat.com/browse/OTA-854

Text copied from the commit message explaining the pull request:

This commit will introduce a new flag to the CVO regarding its PromQL target for risk evaluation of conditional updates.

For the CVO to successfully access a service that provides metrics (in the case of the CVO it's the thanos-querier service), it needs three things.

It needs the service's address, a CA bundle to verify the certificate provided by the service to allow secure communication using TLS between the actors [1], and the authorization credentials of the CVO. Currently, the CVO hardcodes the address, the path to the CA bundle, and the path to the credentials file.

This is not ideal, as CVO is starting to be used in other repositories such as HyperShift [2]. This forces other developers to look into the depths of the CVO to find these paths and forces them to put the respective files in these hardcoded paths.

A flag for the path to the service CA bundle was added because HyperShift uses many CAs, and the location of the corresponding CA bundle files may vary.

A flag for the address of the service was not added because it is not needed for HyperShift to function properly at the moment. The CVO in standalone OpenShift accesses thanos-querier for metrics. More precisely, the CVO connects to the service called thanos-querier in the openshift-monitoring namespace. The same service is also present in HyperShift and is accessible to the CVOs in the hosted control planes.

A flag to specify the path to the credentials file was not added because the CVO hardcodes the same credentials file path that Kubernetes uses for service accounts in general [3].

[1] https://docs.openshift.com/container-platform/4.12/security/certificate_types_descriptions/service-ca-certificates.html
[2] https://github.com/openshift/hypershift
[3] https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#serviceaccount-admission-controller


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 17, 2023
@DavidHurta
Contributor Author

It's a simple change but I still need to test it with a CVO that has some conditional updates and evaluates them.

Member

@petr-muller petr-muller left a comment


LGTM with a nit

/hold
Holding so you can perform the testing you mentioned

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 18, 2023
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 18, 2023
@DavidHurta DavidHurta changed the title OTA-854: Add a new flag to specify the path to the service CA bundle [WIP] OTA-854: Add a new flag to specify the path to the service CA bundle Jun 12, 2023
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 12, 2023
@DavidHurta
Contributor Author

Making multiple changes. Converting back to a draft.

@DavidHurta DavidHurta marked this pull request as draft June 12, 2023 09:54
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 13, 2023
@DavidHurta DavidHurta force-pushed the ota-854-configurable-cvo-knobs-for-promql branch from 7c20218 to 8dfcbf8 Compare July 3, 2023 09:57
@DavidHurta DavidHurta force-pushed the ota-854-configurable-cvo-knobs-for-promql branch from 420cd9e to 8cc2f51 Compare July 10, 2023 15:56
@DavidHurta DavidHurta changed the title [WIP] OTA-854: Add a new flag to specify the path to the service CA bundle [WIP] OTA-854: Add risk evaluation of conditional updates in HyperShift Jul 10, 2023
Contributor

openshift-ci-robot commented Jul 11, 2023

@Davoska: This pull request references OTA-854 which is a valid jira issue.


In response to this:

This pull request will add new flags and some minor logic to enable the CVO to evaluate conditional updates in HyperShift.

This pull request references https://issues.redhat.com/browse/OTA-854


@DavidHurta DavidHurta force-pushed the ota-854-configurable-cvo-knobs-for-promql branch 2 times, most recently from 6e743f4 to 93a6e5d Compare July 12, 2023 12:34
@DavidHurta DavidHurta changed the title [WIP] OTA-854: Add risk evaluation of conditional updates in HyperShift OTA-854: Add risk evaluation of conditional updates in HyperShift Jul 12, 2023
Contributor Author

DavidHurta commented Jul 12, 2023

Steps I have taken to test this:

  1. Follow prerequisite steps on https://hypershift-docs.netlify.app/getting-started/. Some of my notes:
  • Have AWS credentials available. When developing I have used the openshift-dev account.

  • The pull secret file will depend on the image repositories being used.

  • The Route53 public zone step may be skipped as we will use the devcluster.openshift.com base domain.

  • Note the S3 bucket step. In my case, the envsubst command did not work as expected and the variable was not substituted. Make sure the final policy.json file has the correct format with the environment variable substituted.

  • I have used a locally built and pushed HyperShift image (by running the following command in the hypershift repository directory: RUNTIME=podman IMG=quay.io/dhurta/hypershift:latest make docker-build docker-push, with the OTA-855: Enable CVO to evaluate conditional updates on self-managed HyperShift deployed on OpenShift hypershift#2807 changes checked out locally) and a release image built by the Cluster Bot (by running build https://github.com/openshift/cluster-version-operator/pull/926). I am not sure whether running build https://github.com/openshift/cluster-version-operator/pull/926, https://github.com/openshift/hypershift/pull/2807 would work to provide a single image, as HyperShift is not part of the release image (I could not see any mentions of hypershift in the build logs).

  2. Prepare some environment variables appropriately, for example:
#BUCKET_NAME=dhurta-hypershift-test #BUCKET_NAME is already set as part of the prerequisite steps
REGION="us-east-1"
AWS_CREDS="$HOME/.aws/credentials"
PULL_SECRET="$HOME/.docker/config.json"
BASE_DOMAIN="devcluster.openshift.com"
HYPERSHIFT_IMAGE="quay.io/dhurta/hypershift:latest"
RELEASE_IMAGE="registry.build05.ci.openshift.org/ci-ln-bjxgnck/release:latest"
  3. Log in to an OpenShift cluster:
oc login <url> --username <username> --password <password>

To test the managed HyperShift:

  1. Follow steps on https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-57234 for Observability Operator installation

  2. Follow steps on https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-57236 to create MonitoringStack CR to collect HyperShift hosted-control-plane metrics. The secret and the spec.prometheusConfig.remoteWrite field can be omitted.

  3. Install the HyperShift operator with RHOBS monitoring enabled. In the following examples, I am running the commands using a local binary.

RHOBS_MONITORING="1" ./bin/hypershift install \
  --oidc-storage-provider-s3-bucket-name $BUCKET_NAME \
  --oidc-storage-provider-s3-credentials $AWS_CREDS \
  --oidc-storage-provider-s3-region $REGION \
  --enable-uwm-telemetry-remote-write \
  --platform-monitoring=OperatorOnly \
  --metrics-set=All \
  --hypershift-image "$HYPERSHIFT_IMAGE" \
  --rhobs-monitoring 
  4. Create a hosted cluster.
HOSTED_CLUSTER_NAME=dhurta-test-aws

./bin/hypershift create cluster aws  \
  --name="$HOSTED_CLUSTER_NAME" \
  --pull-secret=$PULL_SECRET \
  --node-pool-replicas=1 \
  --release-image="$RELEASE_IMAGE" \
  --aws-creds=$AWS_CREDS \
  --region=$REGION \
  --base-domain="$BASE_DOMAIN" \
  --control-plane-operator-image "$HYPERSHIFT_IMAGE"
  5. Wait for the hosted cluster to finish installation. The scraping manifests (such as ServiceMonitor) are applied after the hosted cluster has completed its installation. Wait for the PROGRESS to be COMPLETED. This may take a while.
oc get hostedcluster -n clusters --watch
  6. Scale the control-plane-operator (CPO) to zero. We will be modifying some resources that would otherwise be overwritten by the CPO. We are using oc annotate to tell the HyperShift operator to scale down the CPO and not reconcile it, neat!
oc annotate -n clusters hostedcluster "$HOSTED_CLUSTER_NAME" hypershift.openshift.io/debug-deployments="control-plane-operator" --overwrite
  7. Scale down the hosted-cluster-config-operator. The operator reconciles the hosted cluster version and would overwrite our following changes.
oc scale -n "clusters-$HOSTED_CLUSTER_NAME" deployments/hosted-cluster-config-operator --replicas=0
  8. Extract the admin kubeconfig file for the hosted cluster. Make sure to specify the file appropriately!
KUBECONFIG_HOSTED_CLUSTER=kubeconfig
./bin/hypershift create kubeconfig --name "$HOSTED_CLUSTER_NAME" > "$KUBECONFIG_HOSTED_CLUSTER"
  9. View the status of the hosted cluster (note the --kubeconfig flag).
oc adm upgrade --kubeconfig="$KUBECONFIG_HOSTED_CLUSTER"
  10. Set a custom upstream. Note that the version used in the specified JSON file needs to match the hosted cluster's version. For example:
oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/upstream", "value": "https://raw.githubusercontent.com/Davoska/cincinnati-graph-data/test-promql/test/cincinnati-graph-data.json"}]' --kubeconfig="$KUBECONFIG_HOSTED_CLUSTER"
  11. Set a custom channel for the hosted cluster to start fetching the updates.
oc adm upgrade channel test --kubeconfig="$KUBECONFIG_HOSTED_CLUSTER"
  12. Wait for the evaluation of conditional updates (as of this moment, one PromQL query per ~10 minutes):
$ oc adm upgrade --include-not-recommended  --kubeconfig="$KUBECONFIG_HOSTED_CLUSTER"
Cluster version is 4.14.0-0.ci.test-2023-07-12-150458-ci-ln-bjxgnck-latest

Upstream: https://raw.githubusercontent.com/Davoska/cincinnati-graph-data/test-promql/test/cincinnati-graph-data.json
Channel: test
No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and may result in downtime or data loss.

Supported but not recommended updates:

  Version: 4.12.23
  Image: quay.io/openshift-release-dev/ocp-release@sha256:3333333333333333333333333333333333333333333333333333333333333333
  Recommended: False
  Reason: Youngest
  Message: Risk to 4.12.23 - Hosted OpenShift clusters will explode! https://example.com/youngest

  Version: 4.12.22
  Image: quay.io/openshift-release-dev/ocp-release@sha256:1111111111111111111111111111111111111111111111111111111111111111
  Recommended: Unknown
  Reason: EvaluationFailed
  Message: Exposure to Oldest is unknown due to an evaluation failure: client-side throttling: only 10.765µs has elapsed since the last match call completed for this cluster condition backend; this cached cluster condition request has been queued for later execution
  Risk to 4.12.22 - Non-Hosted OpenShift clusters will explode! https://example.com/oldest
  13. Wait for the next evaluation:
$  oc adm upgrade --include-not-recommended  --kubeconfig="$KUBECONFIG_HOSTED_CLUSTER"
Cluster version is 4.14.0-0.ci.test-2023-07-12-150458-ci-ln-bjxgnck-latest

Upstream: https://raw.githubusercontent.com/Davoska/cincinnati-graph-data/test-promql/test/cincinnati-graph-data.json
Channel: test

Recommended updates:

  VERSION     IMAGE
  4.12.22     quay.io/openshift-release-dev/ocp-release@sha256:1111111111111111111111111111111111111111111111111111111111111111

Supported but not recommended updates:

  Version: 4.12.23
  Image: quay.io/openshift-release-dev/ocp-release@sha256:3333333333333333333333333333333333333333333333333333333333333333
  Recommended: False
  Reason: Youngest
  Message: Risk to 4.12.23 - Hosted OpenShift clusters will explode! https://example.com/youngest

Clean up

  1. Scale up the CPO:
oc annotate -n clusters hostedcluster "$HOSTED_CLUSTER_NAME" hypershift.openshift.io/debug-deployments="" --overwrite
  2. Destroy the hosted cluster:
./bin/hypershift destroy cluster aws --name "$HOSTED_CLUSTER_NAME" --aws-creds $AWS_CREDS 
  3. Uninstall the HyperShift operator:
./bin/hypershift install render --format=yaml | oc delete -f -

Testing the self-managed HyperShift

  • Just like testing the managed HyperShift with a few modifications.

  • Omit the Observability Operator installation and the creation of the MonitoringStack.

  • Install the HyperShift operator without RHOBS monitoring enabled.

./bin/hypershift install \
  --oidc-storage-provider-s3-bucket-name $BUCKET_NAME \
  --oidc-storage-provider-s3-credentials $AWS_CREDS \
  --oidc-storage-provider-s3-region $REGION \
  --enable-uwm-telemetry-remote-write \
  --platform-monitoring=OperatorOnly \
  --metrics-set=All \
  --hypershift-image "$HYPERSHIFT_IMAGE"
  • Repeat the steps...

  • View not recommended updates

$  oc adm upgrade --include-not-recommended  --kubeconfig="$KUBECONFIG_HOSTED_CLUSTER"
Cluster version is 4.14.0-0.ci.test-2023-07-12-150458-ci-ln-bjxgnck-latest

Upstream: https://raw.githubusercontent.com/Davoska/cincinnati-graph-data/test-promql/test/cincinnati-graph-data.json
Channel: test
No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and may result in downtime or data loss.

Supported but not recommended updates:

  Version: 4.12.23
  Image: quay.io/openshift-release-dev/ocp-release@sha256:3333333333333333333333333333333333333333333333333333333333333333
  Recommended: False
  Reason: Youngest
  Message: Risk to 4.12.23 - Hosted OpenShift clusters will explode! https://example.com/youngest

  Version: 4.12.22
  Image: quay.io/openshift-release-dev/ocp-release@sha256:1111111111111111111111111111111111111111111111111111111111111111
  Recommended: Unknown
  Reason: EvaluationFailed
  Message: Exposure to Oldest is unknown due to an evaluation failure: client-side throttling: only 14.852µs has elapsed since the last match call completed for this cluster condition backend; this cached cluster condition request has been queued for later execution
  Risk to 4.12.22 - Non-Hosted OpenShift clusters will explode! https://example.com/oldest
  • Wait for the next evaluation:
$  oc adm upgrade --include-not-recommended  --kubeconfig="$KUBECONFIG_HOSTED_CLUSTER"
Cluster version is 4.14.0-0.ci.test-2023-07-12-150458-ci-ln-bjxgnck-latest

Upstream: https://raw.githubusercontent.com/Davoska/cincinnati-graph-data/test-promql/test/cincinnati-graph-data.json
Channel: test

Recommended updates:

  VERSION     IMAGE
  4.12.22     quay.io/openshift-release-dev/ocp-release@sha256:1111111111111111111111111111111111111111111111111111111111111111

Supported but not recommended updates:

  Version: 4.12.23
  Image: quay.io/openshift-release-dev/ocp-release@sha256:3333333333333333333333333333333333333333333333333333333333333333
  Recommended: False
  Reason: Youngest
  Message: Risk to 4.12.23 - Hosted OpenShift clusters will explode! https://example.com/youngest
  • Repeat the cleanup steps

@DavidHurta DavidHurta marked this pull request as ready for review July 12, 2023 22:17
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 12, 2023
@openshift-ci openshift-ci bot requested a review from wking July 12, 2023 22:18
pkg/cvo/cvo.go Outdated
requiredFeatureSet: requiredFeatureSet,
clusterProfile: clusterProfile,
conditionRegistry: standard.NewConditionRegistry(kubeClient),
conditionRegistry: standard.NewConditionRegistry(kubeClientMgmtCluster, promqlTarget),
Member


I think the naming would be a little clearer if we avoided the hypershift lingo and called kubeClientMgmtCluster something that indicates "cluster that we use to evaluate upgrade conditions" instead, because at this place it is not necessarily a "hypershift management" cluster client.

Member


Maybe this kubeclient should be part of promqlTarget... They are only used together.
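For illustration, a minimal sketch of what bundling the client into the PromQL target could look like, assuming a hypothetical PromQLTarget struct; the field names are illustrative, not the CVO's actual types:

package clusterconditions

import "k8s.io/client-go/kubernetes"

// PromQLTarget is a hypothetical bundle of everything the PromQL cluster
// condition needs in order to reach the metrics service: the query endpoint
// details and the client used to resolve and authenticate against it.
type PromQLTarget struct {
	// KubeClient talks to the cluster used to evaluate upgrade conditions
	// (in HyperShift this happens to be the management cluster).
	KubeClient kubernetes.Interface

	// QueryNamespace and QueryService identify the in-cluster Service that
	// exposes the Prometheus-compatible query API.
	QueryNamespace string
	QueryService   string
}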

Member


We should change the name of the variable from conditionRegistry to remoteMetricserver or just metricServer

Contributor Author


We should change the name of the variable from conditionRegistry to remoteMetricserver or just metricServer

That depends on the level of abstraction being used. The whole package clusterconditions uses the word condition to convey the meaning. There are conditionRegistries where you can Register a conditionType. You can Match a clusterCondition against a conditionRegistry.

We could change the name of the Operator struct's field conditionRegistry to metricServer.

But using metricServer as the name of a ConditionRegistry variable would imply that we register something at the metric server or prune something at the server, IMO.

type ConditionRegistry interface {
	// Register registers a condition type, and panics on any name collisions.
	Register(conditionType string, condition Condition)

	// PruneInvalid returns a new slice with recognized, valid conditions.
	// The error complains about any unrecognized or invalid conditions.
	PruneInvalid(ctx context.Context, matchingRules []configv1.ClusterCondition) ([]configv1.ClusterCondition, error)

	// Match returns whether the cluster matches the given rules (true),
	// does not match (false), or the rules fail to evaluate (error).
	Match(ctx context.Context, matchingRules []configv1.ClusterCondition) (bool, error)
}
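For context, a hedged sketch of how a caller might use such a registry; the helper name and the logging are illustrative, not the CVO's actual call sites:

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	"k8s.io/klog/v2"
)

// riskApplies is a hypothetical helper showing how a caller might consult a
// ConditionRegistry when deciding whether a conditional update's risk applies
// to this cluster.
func riskApplies(ctx context.Context, registry ConditionRegistry, rules []configv1.ClusterCondition) (bool, error) {
	// Drop unrecognized or invalid rules; the error only describes what was pruned.
	valid, err := registry.PruneInvalid(ctx, rules)
	if err != nil {
		klog.V(2).Infof("some cluster conditions were pruned: %v", err)
	}
	// Match reports whether this cluster matches the remaining rules.
	return registry.Match(ctx, valid)
}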

upstream = ""
}

if optr.isHCPModeEnabled {
Member


we may want to test for misconfigured empty clusterID?

Also, are the hosted cluster admins able to spoof their cluster's ClusterID and read other hosted clusters' metrics this way? If yes, then that seems to be somewhat security-sensitive...

Contributor Author


we may want to test for misconfigured empty clusterID?

Is the question regarding whether the substitution of _id works for an empty clusterId? Or is it to see whether something breaks, and the potential side effects?

Also, are the hosted cluster admins able to spoof their cluster's ClusterID and read other hosted clusters' metrics this way? If yes, then that seems to be somewhat security-sensitive...

Good question. The clusterId and similar things in the cluster version are being reconciled from the HCP (hosted control plane) and should be overwritten.

https://github.com/openshift/hypershift/blob/main/control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go#L956-L960

func (r *reconciler) reconcileClusterVersion(ctx context.Context, hcp *hyperv1.HostedControlPlane) error {
	clusterVersion := &configv1.ClusterVersion{ObjectMeta: metav1.ObjectMeta{Name: "version"}}
	if _, err := r.CreateOrUpdate(ctx, r.client, clusterVersion, func() error {
		clusterVersion.Spec.ClusterID = configv1.ClusterID(hcp.Spec.ClusterID)
		clusterVersion.Spec.Capabilities = nil
		clusterVersion.Spec.Upstream = ""
		clusterVersion.Spec.Channel = hcp.Spec.Channel
		clusterVersion.Spec.DesiredUpdate = nil
		return nil
	}); err != nil {
		return fmt.Errorf("failed to reconcile clusterVersion: %w", err)
	}

	return nil
}

But I am not sure which roles are bound to the hosted cluster admins at the moment. The admin-kubeconfig secret that is available in the HCP does provide the capability to change the cluster version. Although I need to check whether this kind of permission is bound to a normal hosted cluster admin...

Contributor Author


For example, when I deploy a hosted cluster using a Cluster Bot running rosa create 4.12.22 6h, I am not able to modify the cluster's cluster version. But that is a managed cluster...

oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/upstream", "value": "https://raw.githubusercontent.com/Davoska/cincinnati-graph-data/test-promql/test/cincinnati-graph-data.json"}]'
Error from server (Prevented from accessing Red Hat managed resources. This is in an effort to prevent harmful actions that may cause unintended consequences or affect the stability of the cluster. If you have any questions about this, please reach out to Red Hat support at https://access.redhat.com/support): admission webhook "regular-user-validation.managed.openshift.io" denied the request: Prevented from accessing Red Hat managed resources. This is in an effort to prevent harmful actions that may cause unintended consequences or affect the stability of the cluster. If you have any questions about this, please reach out to Red Hat support at https://access.redhat.com/support

Member


The question I think Petr was asking is: if the clusterID is empty for some reason (because of a bug in the code), do you feel confident that it is handled properly, with the right information in the log?
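As a purely illustrative aside, the _id substitution being discussed could conceptually look like the sketch below. This is an assumption about the mechanism, not the actual CVO code, and it includes a hypothetical guard for the empty-clusterID case raised above:

package main

import (
	"fmt"
	"strings"
)

// injectClusterID is a hypothetical sketch: it scopes a PromQL query to one
// hosted cluster by filling an empty _id label matcher with the cluster's ID.
// The real CVO behaviour may differ; this only illustrates the idea.
func injectClusterID(query, clusterID string) (string, error) {
	if clusterID == "" {
		// An empty ID would leave the matcher empty and effectively match
		// every hosted cluster's series, so refuse to evaluate instead.
		return "", fmt.Errorf("cluster ID is empty; refusing to evaluate PromQL query")
	}
	return strings.ReplaceAll(query, `_id=""`, fmt.Sprintf(`_id=%q`, clusterID)), nil
}

func main() {
	q, err := injectClusterID(`group(cluster_operator_conditions{_id=""})`, "d0fd7054-example")
	if err != nil {
		panic(err)
	}
	fmt.Println(q) // group(cluster_operator_conditions{_id="d0fd7054-example"})
}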

}
scheme := "https"
if p.QueryNamespace == "openshift-observability-operator" && p.QueryService == "hypershift-monitoring-stack-prometheus" {
	scheme = "http"
}
Member


🧐 why don't we have TLS in hypershift?

Contributor Author


Yes, I am not sure. The monitoring stack deployed in the managed OpenShift exposes the Prometheus server's port that serves the HTTP API via the service hypershift-monitoring-stack-prometheus. But it seems it's not configured to use the https scheme.

Contributor Author


Either I am using a wrong configuration, or the service is not expected to be queried. Investigating...

Was initially going from the comment:

Follow steps on https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-57234 for Observability Operator installation

Follow steps on https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-57236 to create MonitoringStack CR to collect HyperShift hosted-control-plane metrics. The secret and the spec.prometheusConfig.remoteWrite field can be omitted.

Member


HTTP looks odd. @wking Do you know if we have SSL/TLS auth for the monitoring stack?

Contributor Author


This was discussed a little bit in https://redhat-internal.slack.com/archives/C0VMT03S5/p1689696199941319

...we only create a http service...

This code itself was removed. The user now configures the scheme via the --metrics-url flag. Using HTTPS in managed HyperShift would need to be discussed with the appropriate folks.
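To illustrate why the scheme matters for credentials, here is a hedged sketch of deciding whether to attach a bearer token based on the configured metrics URL; the helper and the example ports are illustrative assumptions, not the CVO's actual logic:

package main

import (
	"fmt"
	"net/url"
)

// sendToken is a hypothetical helper: only attach the service-account bearer
// token when the metrics URL uses HTTPS, so the token is never sent in clear
// text. The actual CVO logic may differ.
func sendToken(metricsURL string) (bool, error) {
	u, err := url.Parse(metricsURL)
	if err != nil {
		return false, fmt.Errorf("invalid metrics URL %q: %w", metricsURL, err)
	}
	return u.Scheme == "https", nil
}

func main() {
	// The service hosts come from this thread; the ports are illustrative.
	for _, raw := range []string{
		"https://thanos-querier.openshift-monitoring.svc:9091",
		"http://hypershift-monitoring-stack-prometheus.openshift-observability-operator.svc:9090",
	} {
		ok, err := sendToken(raw)
		fmt.Println(raw, "-> send token:", ok, "err:", err)
	}
}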

@DavidHurta
Contributor Author

/hold
Modifying after the comments from Petr.

@DavidHurta
Contributor Author

I have tried to address @petr-muller's comments, resulting in the bac4941 commit. I haven't tested the new changes so far, as they require building a release image and modifying code in the HyperShift repository, so I have marked the commit as WIP for now.

@DavidHurta
Contributor Author

/hold
Addressing feedback and working on new changes due to the feedback on the HyperShift pull request.

@DavidHurta DavidHurta force-pushed the ota-854-configurable-cvo-knobs-for-promql branch 7 times, most recently from e3fbb24 to 2d62d75 Compare August 2, 2023 19:33
Member

@petr-muller petr-muller left a comment


This is shaping up really nicely! Some comments inline.

This commit introduces new flags and logic to the CVO regarding its
PromQL target for risk evaluation of conditional updates, designed with
a CVO running in a hosted control plane in mind.

For the CVO to successfully access a service that provides metrics
(in the case of the CVO in a standalone OpenShift cluster, the
thanos-querier service), it needs three things: the service's address,
a CA bundle to verify the certificate provided by the service to allow
secure communication using TLS between the actors [1], and the
authorization credentials of the CVO. Currently, the CVO hardcodes the
address, the path to the CA bundle, and the path to the credentials
file.

This is not ideal, as CVO is starting to be used in other repositories
such as HyperShift [2].

A flag for the path to the service CA bundle file is added to allow
explicitly setting the CA bundle of the given query service.

Currently, the CVO uses the kube-apiserver to resolve the IP address
of a specific service in the cluster [3]. Add new flags that allow
configuring the CVO to resolve the IP address via DNS when DNS is
available to the CVO. This is the case for hosted CVOs in HyperShift.
The alternative in HyperShift would be to use the management cluster
kube-apiserver and give the hosted CVOs additional permissions.

Add flags to specify the PromQL target URL. This URL contains the
scheme, the port, and the server name used for TLS configuration. When
DNS resolution is enabled, the URL is also used to query the Prometheus
server. When it is disabled, the server can be specified via the
Kubernetes service in the cluster that exposes the server.

A flag to specify the path to the credentials file was added for more
customizability. This flag also enables the CVO to omit the token when
it is not needed. A CVO can communicate with a Prometheus server over
HTTP; in that case, the token is not needed, and it would be
undesirable for the CVO to send its token over HTTP without reason.

This commit also adds a new flag to specify whether the CVO resides in
a hosted control plane. In this case, the CVO will inject its cluster
ID into PromQL queries to differentiate between multiple time series
belonging to different hosted clusters [4].

	[1] https://docs.openshift.com/container-platform/4.12/security/certificate_types_descriptions/service-ca-certificates.html
	[2] https://github.com/openshift/hypershift
	[3] openshift#920
	[4] openshift/cincinnati-graph-data#3591
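As a rough illustration of the knobs this commit message describes, here is a hedged sketch using Go's standard flag package; apart from --metrics-url, which is mentioned elsewhere in this thread, the flag names and defaults are illustrative assumptions and do not claim to match the CVO's actual flag set:

package main

import (
	"flag"
	"fmt"
)

// Hypothetical flag set mirroring the knobs described above; names and
// defaults are illustrative only, consult the CVO's own help output for the
// real flags.
var (
	metricsURL    = flag.String("metrics-url", "", "URL of the PromQL query endpoint (scheme, host, port)")
	metricsCAFile = flag.String("metrics-ca-bundle-file", "", "path to the CA bundle used to verify the metrics service certificate")
	tokenFile     = flag.String("metrics-token-file", "", "path to the bearer token file; empty means no token is sent")
	useDNS        = flag.Bool("use-dns", false, "resolve the metrics service via DNS instead of the kube-apiserver")
	hcpMode       = flag.Bool("hypershift", false, "run as a hosted control plane CVO and inject the cluster ID into PromQL queries")
)

func main() {
	flag.Parse()
	fmt.Printf("metrics-url=%q ca=%q token=%q dns=%v hcp=%v\n",
		*metricsURL, *metricsCAFile, *tokenFile, *useDNS, *hcpMode)
}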
@DavidHurta DavidHurta force-pushed the ota-854-configurable-cvo-knobs-for-promql branch from 2d62d75 to b1e69af Compare August 7, 2023 13:56
@DavidHurta
Contributor Author

The new commit utilizes DNS in HyperShift and adds more configurability to the CVO.

I still need to verify that no regression in this code has happened for managed HyperShift.

Member

@petr-muller petr-muller left a comment


/hold

LGTM. I am setting a hold to allow reviews by people more knowledgeable about the current state of the HyperShift effort. I think the generalization added in this PR is not that costly in terms of code complexity and should be safe to merge even if we do not have all the details on the HyperShift side sorted out.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 8, 2023
Contributor

openshift-ci bot commented Aug 8, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Davoska, petr-muller


Needs approval from an approver in each of these files:
  • OWNERS [Davoska,petr-muller]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor Author

DavidHurta commented Aug 9, 2023

Oh, I would like to test this against a standalone OCP cluster. I want to make sure there is no regression. I don't think we have a test for the evaluation of conditional updates, and I can't remember whether I have tested these new changes this way. Although the PR just propagates flags.

@petr-muller
Member

I don't think we have a test for the evaluation of conditional updates

Can we invent one?

Contributor Author

DavidHurta commented Aug 15, 2023

Can we invent one?

Since we are planning to have an e2e test for the evaluation of conditional updates for HyperShift (https://issues.redhat.com/browse/OTA-986), we should at least discuss creating one for standalone OCP; a rough sketch of such a check follows after this comment. The code is pretty stable, but the frequency of changes is increasing a little and testing this manually takes a bit of time.


I have tested this PR against a standalone OCP cluster and no regression seems to be present.

The PromQL queries got evaluated

[dhurta@fedora ~]$ oc adm upgrade --include-not-recommended
Cluster version is 4.14.0-0.ci.test-2023-08-15-130559-ci-ln-35lnhkt-latest

Upstream: https://raw.githubusercontent.com/Davoska/cincinnati-graph-data/test-promql/test/cincinnati-graph-data-new.json
Channel: test

Recommended updates:

  VERSION     IMAGE
  4.12.23     quay.io/openshift-release-dev/ocp-release@sha256:3333333333333333333333333333333333333333333333333333333333333333

Supported but not recommended updates:

  Version: 4.12.22
  Image: quay.io/openshift-release-dev/ocp-release@sha256:1111111111111111111111111111111111111111111111111111111111111111
  Recommended: False
  Reason: Oldest
  Message: Risk to 4.12.22 - Non-Hosted OpenShift clusters will explode! https://example.com/oldest

@wking, @LalatenduMohanty, feel free to have a quick look after the new changes. I'll wait a little bit and then I'll unhold the PR.
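For what it's worth, a rough sketch of what an automated check along these lines could look like, driving oc the same way the manual steps above do; the binary name, timeout, and string matching are illustrative assumptions:

package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// waitForConditionalUpdates is a hypothetical check: poll
// `oc adm upgrade --include-not-recommended` until a "Supported but not
// recommended updates" section appears, mirroring the manual steps above.
func waitForConditionalUpdates(kubeconfig string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		out, err := exec.Command("oc", "adm", "upgrade", "--include-not-recommended",
			"--kubeconfig="+kubeconfig).CombinedOutput()
		if err == nil && strings.Contains(string(out), "Supported but not recommended updates:") {
			return nil
		}
		time.Sleep(30 * time.Second)
	}
	return fmt.Errorf("conditional updates were not evaluated within %s", timeout)
}

func main() {
	if err := waitForConditionalUpdates("kubeconfig", 20*time.Minute); err != nil {
		panic(err)
	}
	fmt.Println("conditional updates evaluated")
}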

@petr-muller
Member

We may want to unhold this only after the branching though, no reason to increase the overall risk before that

Member

petr-muller commented Aug 22, 2023

/retitle OTA-854: Add configurable CVO knobs for risk-evaluation PromQL target

@openshift-ci openshift-ci bot changed the title OTA-854: Add risk evaluation of conditional updates in HyperShift OTA-854: Add configurable CVO knobs for risk-evaluation PromQL target Aug 22, 2023
@petr-muller
Member

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 19, 2023
@petr-muller
Member

The merge gate was lifted; we can now merge this one.

@petr-muller
Member

/test e2e-agnostic-ovn-upgrade-into-change

Contributor

openshift-ci bot commented Sep 20, 2023

@Davoska: all tests passed!



