OTA-854: Add configurable CVO knobs for risk-evaluation PromQL target #926
Conversation
@Davoska: This pull request references OTA-854, which is a valid Jira issue.
It's a simple change, but I still need to test it with a CVO that has some conditional updates and evaluates them.
petr-muller
left a comment
LGTM with a nit
/hold
Holding so you can perform the testing you mentioned.
Making multiple changes. Converting back to a draft.
force-pushed from 7c20218 to 8dfcbf8
force-pushed from 420cd9e to 8cc2f51
@Davoska: This pull request references OTA-854, which is a valid Jira issue.
force-pushed from 6e743f4 to 93a6e5d
Steps I have taken to test this:
- To test the managed HyperShift
- Clean up
- Testing the self-managed HyperShift
pkg/cvo/cvo.go (outdated)
	requiredFeatureSet: requiredFeatureSet,
	clusterProfile:     clusterProfile,
-	conditionRegistry:  standard.NewConditionRegistry(kubeClient),
+	conditionRegistry:  standard.NewConditionRegistry(kubeClientMgmtCluster, promqlTarget),
I think the naming would be a little clearer if we avoided the hypershift lingo and called kubeClientMgmtCluster something that indicates "cluster that we use to evaluate upgrade conditions" instead, because at this place it is not necessarily a "hypershift management" cluster client.
Maybe this kubeclient should be part of promqlTarget... They are only used together.
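For illustration, a minimal sketch of what folding the kube client into the PromQL target could look like (the PromQLTarget type, its field names, and the package placement below are assumptions for the sake of the example, not the PR's actual types):

package clusterconditions // hypothetical placement for this sketch

import "k8s.io/client-go/kubernetes"

// PromQLTarget bundles everything needed to reach the Prometheus endpoint used
// for risk evaluation, so callers pass one value instead of a kube client and a
// target that are only ever used together.
type PromQLTarget struct {
	// KubeClient resolves the in-cluster Service address when DNS is not used.
	KubeClient kubernetes.Interface
	// Namespace and Name identify the Kubernetes Service exposing the PromQL API.
	Namespace string
	Name      string
}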
We should change the name of the variable from conditionRegistry to remoteMetricserver or just metricServer
> We should change the name of the variable from conditionRegistry to remoteMetricserver or just metricServer

That depends on the level of abstraction being used. The whole clusterconditions package uses the word condition to convey the meaning: there are conditionRegistries where you can Register a conditionType, and you can Match clusterConditions against a conditionRegistry.
We could change the name of the Operator struct's field conditionRegistry to metricServer.
But using metricServer as the name of a ConditionRegistry variable would imply that we register something at the metric server, or prune something at the server, IMO.
type ConditionRegistry interface {
// Register registers a condition type, and panics on any name collisions.
Register(conditionType string, condition Condition)
// PruneInvalid returns a new slice with recognized, valid conditions.
// The error complains about any unrecognized or invalid conditions.
PruneInvalid(ctx context.Context, matchingRules []configv1.ClusterCondition) ([]configv1.ClusterCondition, error)
// Match returns whether the cluster matches the given rules (true),
// does not match (false), or the rules fail to evaluate (error).
Match(ctx context.Context, matchingRules []configv1.ClusterCondition) (bool, error)
}
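For context, a hedged sketch of how a caller consumes this interface (the evaluateRisk helper and its import paths below are illustrative assumptions, not code from this PR):

package example

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	"k8s.io/klog/v2"

	"github.com/openshift/cluster-version-operator/pkg/clusterconditions"
)

// evaluateRisk reports whether the cluster matches a conditional update's
// matching rules: unrecognized or invalid rules are pruned (and logged), and
// the remaining rules are matched against the cluster.
func evaluateRisk(ctx context.Context, registry clusterconditions.ConditionRegistry, rules []configv1.ClusterCondition) (bool, error) {
	valid, err := registry.PruneInvalid(ctx, rules)
	if err != nil {
		klog.V(2).Infof("Ignoring unrecognized or invalid cluster conditions: %v", err)
	}
	return registry.Match(ctx, valid)
}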
pkg/cvo/availableupdates.go (outdated)
		upstream = ""
	}

	if optr.isHCPModeEnabled {
We may want to test for a misconfigured empty clusterID?
Also, are the hosted cluster admins able to spoof their cluster's ClusterID and read other hosted clusters' metrics this way? If yes, that seems somewhat security-sensitive...
> We may want to test for a misconfigured empty clusterID?

Is the question whether the substitution of _id works for an empty clusterId, or whether something breaks with potential side effects?

> Also, are the hosted cluster admins able to spoof their cluster's ClusterID and read other hosted clusters' metrics this way? If yes, that seems somewhat security-sensitive...

Good question. The clusterId and similar fields in the ClusterVersion are reconciled from the HCP (hosted control plane) and should be overwritten:
func (r *reconciler) reconcileClusterVersion(ctx context.Context, hcp *hyperv1.HostedControlPlane) error {
clusterVersion := &configv1.ClusterVersion{ObjectMeta: metav1.ObjectMeta{Name: "version"}}
if _, err := r.CreateOrUpdate(ctx, r.client, clusterVersion, func() error {
clusterVersion.Spec.ClusterID = configv1.ClusterID(hcp.Spec.ClusterID)
clusterVersion.Spec.Capabilities = nil
clusterVersion.Spec.Upstream = ""
clusterVersion.Spec.Channel = hcp.Spec.Channel
clusterVersion.Spec.DesiredUpdate = nil
return nil
}); err != nil {
return fmt.Errorf("failed to reconcile clusterVersion: %w", err)
}
return nil
}
But I am not sure which roles are bound to the hosted cluster admins at the moment. The admin-kubeconfig secret that is available in the HCP does provide the capability to change the ClusterVersion, although I need to check whether this kind of permission is bound to a normal hosted cluster admin...
For example, when I deploy a hosted cluster using Cluster Bot with rosa create 4.12.22 6h, I am not able to modify the cluster's ClusterVersion. But that is a managed cluster...
oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/upstream", "value": "https://raw.githubusercontent.com/Davoska/cincinnati-graph-data/test-promql/test/cincinnati-graph-data.json"}]'
Error from server (Prevented from accessing Red Hat managed resources. This is in an effort to prevent harmful actions that may cause unintended consequences or affect the stability of the cluster. If you have any questions about this, please reach out to Red Hat support at https://access.redhat.com/support): admission webhook "regular-user-validation.managed.openshift.io" denied the request: Prevented from accessing Red Hat managed resources. This is in an effort to prevent harmful actions that may cause unintended consequences or affect the stability of the cluster. If you have any questions about this, please reach out to Red Hat support at https://access.redhat.com/support
The question I think Petr was asking: if the clusterID is empty for some reason (because of a bug in the code), do you feel confident that it is handled properly, with the right information in the log?
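A minimal sketch of the kind of guard being discussed, assuming the cluster ID is injected by substituting an _id label into the query (the function name and the substitution mechanism below are hypothetical, not the PR's actual code):

package clusterconditions // hypothetical placement for this sketch

import (
	"fmt"
	"strings"
)

// injectClusterID scopes a PromQL query to one hosted cluster by filling in the
// _id label. An empty cluster ID is rejected loudly so a misconfigured CVO does
// not silently evaluate risks against every hosted cluster's time series.
func injectClusterID(query, clusterID string) (string, error) {
	if clusterID == "" {
		return "", fmt.Errorf("refusing to evaluate PromQL: ClusterVersion clusterID is empty")
	}
	return strings.ReplaceAll(query, `_id=""`, fmt.Sprintf(`_id=%q`, clusterID)), nil
}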
	}
	scheme := "https"
	if p.QueryNamespace == "openshift-observability-operator" && p.QueryService == "hypershift-monitoring-stack-prometheus" {
		scheme = "http"
🧐 why don't we have TLS in hypershift?
Yes, I am not sure either. The monitoring stack deployed in managed OpenShift exposes the Prometheus server's HTTP API port via the hypershift-monitoring-stack-prometheus service, but it seems it is not configured to use the https scheme.
Either I am using the wrong configuration, or the service is not expected to be queried. Investigating...
I was initially going from this comment:
- Follow the steps on https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-57234 for the Observability Operator installation.
- Follow the steps on https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-57236 to create a MonitoringStack CR to collect HyperShift hosted-control-plane metrics. The secret and the spec.prometheusConfig.remoteWrite field can be omitted.
HTTP looks odd. @wking do you know if we have SSL/TLS auth for the monitoring stack?
This was discussed a little bit in https://redhat-internal.slack.com/archives/C0VMT03S5/p1689696199941319:

> ...we only create a http service...

This code itself was removed. The user now configures the scheme via the --metrics-url flag. Using HTTPS in managed HyperShift would need to be discussed with the appropriate folks.
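A minimal sketch of what taking the scheme from the configured metrics URL could look like (--metrics-url is the flag mentioned above; the helper and its package placement are illustrative assumptions, not the PR's code):

package metricsclient // hypothetical placement for this sketch

import (
	"fmt"
	"net/url"
)

// schemeFromMetricsURL returns the scheme ("http" or "https") of the endpoint
// configured via --metrics-url, so the operator no longer hardcodes per-service
// exceptions for plain-HTTP Prometheus endpoints.
func schemeFromMetricsURL(metricsURL string) (string, error) {
	u, err := url.Parse(metricsURL)
	if err != nil {
		return "", fmt.Errorf("invalid --metrics-url %q: %w", metricsURL, err)
	}
	switch u.Scheme {
	case "http", "https":
		return u.Scheme, nil
	default:
		return "", fmt.Errorf("--metrics-url %q must use http or https", metricsURL)
	}
}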
/hold
I have tried to address @petr-muller's comments, resulting in the bac4941 commit. I haven't tested the new changes so far, as they require building a release image and modifying code in the HyperShift repository, and thus I have marked the commit as
/hold
force-pushed from e3fbb24 to 2d62d75
petr-muller
left a comment
This is shaping up really nicely! Some comments inline.
This commit introduces new flags and logic to the CVO regarding its PromQL target for risk evaluation of conditional updates, with the CVO being run in a hosted control plane in mind.

For the CVO to successfully access a service that provides metrics (for a CVO in a standalone OpenShift cluster it is the thanos-querier service), it needs three things: the service's address, a CA bundle to verify the certificate provided by the service to allow secure communication using TLS between the actors [1], and the authorization credentials of the CVO. Currently, the CVO hardcodes the address, the path to the CA bundle, and the path to the credentials file. This is not ideal, as the CVO is starting to be used in other repositories such as HyperShift [2].

A path to the service CA bundle file is added to allow explicitly setting the CA bundle file of the given query service.

Currently, the CVO uses a kube-apiserver to resolve the IP address of a specific service in the cluster [3]. New flags allow configuring the CVO to resolve the IP address via DNS when DNS is available to the CVO, which is the case for hosted CVOs in HyperShift. The alternative in HyperShift would be to use the management cluster's kube-apiserver and give the hosted CVOs additional permissions.

Flags are added to specify the PromQL target URL. This URL contains the scheme, the port, and the server name used for TLS configuration. When DNS is enabled, the URL is also used to query the Prometheus server. When DNS is disabled, the server can be specified via the Kubernetes service in the cluster that exposes the server.

A flag to specify the path to the credentials file is added for more customizability. This flag also enables the CVO to not set the token when it is not needed. A CVO can communicate with a Prometheus server over HTTP, and when the token is not needed, it would be undesirable for the CVO to send its token over HTTP without reason.

This commit also adds a new flag to specify whether the CVO resides in a hosted control plane. In that case, the CVO injects its cluster ID into PromQL queries to differentiate between multiple time series belonging to different hosted clusters [4].

[1] https://docs.openshift.com/container-platform/4.12/security/certificate_types_descriptions/service-ca-certificates.html
[2] https://github.com/openshift/hypershift
[3] openshift#920
[4] openshift/cincinnati-graph-data#3591
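A rough sketch of the kind of flag wiring the commit message describes (all flag names and fields below are illustrative assumptions except --metrics-url, which is mentioned earlier in this conversation; this is not the PR's actual flag set):

package options // hypothetical placement for this sketch

import "flag"

// promQLTargetOptions collects the knobs described above: where the PromQL
// endpoint lives, how to resolve and authenticate to it, and whether the CVO
// runs in a hosted control plane and must scope queries by cluster ID.
type promQLTargetOptions struct {
	metricsURL      string // scheme, host, and port of the PromQL endpoint
	useDNS          bool   // resolve the endpoint via DNS instead of the kube-apiserver
	caBundleFile    string // CA bundle used to verify the endpoint's serving certificate
	credentialsFile string // bearer-token file; empty means no Authorization header is sent
	hcpMode         bool   // hosted control plane: inject the cluster ID into PromQL queries
}

func (o *promQLTargetOptions) addFlags(fs *flag.FlagSet) {
	fs.StringVar(&o.metricsURL, "metrics-url", "", "URL of the PromQL endpoint used for risk evaluation")
	fs.BoolVar(&o.useDNS, "use-dns-for-metrics", false, "resolve the metrics endpoint via DNS instead of the kube-apiserver")
	fs.StringVar(&o.caBundleFile, "metrics-ca-bundle-file", "", "path to the CA bundle for the metrics endpoint")
	fs.StringVar(&o.credentialsFile, "metrics-credentials-file", "", "path to a bearer-token file; leave empty to send no token")
	fs.BoolVar(&o.hcpMode, "hcp-mode", false, "run as a hosted-control-plane CVO and inject the cluster ID into PromQL queries")
}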
force-pushed from 2d62d75 to b1e69af
The new commit utilizes DNS in HyperShift and adds more configurability to the CVO. I still need to verify that no regression in this code has happened for managed HyperShift.
petr-muller
left a comment
/hold
LGTM. I am setting a hold to allow reviews by people more knowledgeable about the current state of the HyperShift effort. I think the generalization added in this PR is not that costly in terms of code complexity and should be safe to merge even if we do not have all the details on the HyperShift side sorted out.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: Davoska, petr-muller.
Oh, I would like to test this against a standalone OCP cluster to make sure there is no regression. I don't think we have a test for the evaluation of conditional updates, and I can't remember whether I have tested these new changes this way, although the PR just propagates flags.
Can we invent one?
Since we are planning to have an e2e test for the evaluation of conditional updates for HyperShift (https://issues.redhat.com/browse/OTA-986), we should at least discuss creating one for standalone OCP. The code is pretty stable, but the frequency of changes is increasing a little bit, and testing this manually takes a little bit of time. I have tested this PR against a standalone OCP cluster and no regression seems to be present; the PromQL queries got evaluated. @wking, @LalatenduMohanty, feel free to have a quick look after the new changes. I'll wait a little bit and then I'll unhold the PR.
We may want to unhold this only after the branching, though; no reason to increase the overall risk before that.
/retitle OTA-854: Add configurable CVO knobs for risk-evaluation PromQL target
/hold cancel
The merge gate was lifted; we can now merge this one.
/test e2e-agnostic-ovn-upgrade-into-change
@Davoska: all tests passed!
This pull request will add new flags and some minor logic to enable the CVO to evaluate conditional updates in HyperShift.
This pull request references https://issues.redhat.com/browse/OTA-854.