@DavidHurta
Contributor

@DavidHurta DavidHurta commented Nov 20, 2023

This pull request makes the CVO send GET requests to the Prometheus API server.

The Prometheus client library currently in use doesn't allow the request type (GET or POST) to be specified explicitly. The tenancy port of the thanos-querier.openshift-monitoring.svc service used in self-managed HyperShift requires different permissions depending on the request type. This PR makes sure that the CVO uses GET requests to the PromQL service.

The client library attempts a POST request first and falls back to a GET request on a 501 or 405 response status code. This PR returns the 501 status code before the HTTP POST request is ever sent, so the library falls back to a GET request.

An alternative is to use a different Prometheus client library (see DavidHurta@d4f4c9b). However, that vendors many new packages, which could introduce vulnerabilities in the future. Another alternative is to fully implement the specific interfaces, which risks regressions and bugs.
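
For readers who want to see the 501 trick in isolation, here is a minimal sketch using only the Go standard library. It is not the code from this PR: `getOnlyRoundTripper`, `queryWithFallback`, and `observedMethods` are hypothetical names, and `queryWithFallback` is only a stand-in that imitates the POST-then-GET fallback described above, not the client library itself.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
	"sync"
)

// getOnlyRoundTripper is a hypothetical wrapper. It answers every POST
// locally with 501 Not Implemented, so the request never leaves the process.
type getOnlyRoundTripper struct{ next http.RoundTripper }

func (rt getOnlyRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	if req.Method != http.MethodPost {
		return rt.next.RoundTrip(req)
	}
	if req.Body != nil {
		req.Body.Close() // a RoundTripper must always close the request body
	}
	return &http.Response{
		Status:     "501 Not Implemented",
		StatusCode: http.StatusNotImplemented,
		Proto:      "HTTP/1.1",
		ProtoMajor: 1,
		ProtoMinor: 1,
		Header:     make(http.Header),
		Body:       io.NopCloser(strings.NewReader("")),
		Request:    req,
	}, nil
}

// queryWithFallback imitates a client that tries POST first and retries
// with GET on 405 or 501 (an assumption standing in for the real library).
func queryWithFallback(client *http.Client, endpoint, promQL string) (*http.Response, error) {
	args := url.Values{"query": {promQL}}
	resp, err := client.PostForm(endpoint, args)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusNotImplemented && resp.StatusCode != http.StatusMethodNotAllowed {
		return resp, nil
	}
	resp.Body.Close()
	return client.Get(endpoint + "?" + args.Encode())
}

// observedMethods runs one query against a local test server and returns
// the HTTP methods that actually reached it.
func observedMethods() []string {
	var mu sync.Mutex
	var methods []string
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		methods = append(methods, r.Method)
		mu.Unlock()
		fmt.Fprint(w, `{"status":"success"}`)
	}))
	defer srv.Close()

	client := &http.Client{Transport: getOnlyRoundTripper{next: http.DefaultTransport}}
	resp, err := queryWithFallback(client, srv.URL+"/api/v1/query", "0 * max(cluster_version)")
	if err != nil {
		return nil
	}
	resp.Body.Close()

	mu.Lock()
	defer mu.Unlock()
	return append([]string(nil), methods...)
}

func main() {
	fmt.Println(observedMethods()) // prints "[GET]"
}
```

Because the POST is answered inside the transport, the server only ever sees the GET. A recording handler like the one in `observedMethods` is also one way to check which method a client really sends.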

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 20, 2023
@openshift-ci
Contributor

openshift-ci bot commented Nov 20, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 20, 2023
@DavidHurta DavidHurta force-pushed the exchange-prometheus-client branch from d47c148 to d4f4c9b Compare November 21, 2023 15:00
@DavidHurta
Contributor Author

/test all

@DavidHurta DavidHurta changed the title from "WIP: Exchange of prometheus client" to "WIP: Create GET requests to the Prometheus API server" Nov 27, 2023
@DavidHurta DavidHurta force-pushed the exchange-prometheus-client branch from 870cd9a to d00ab44 Compare November 27, 2023 20:15
@DavidHurta DavidHurta changed the title from "WIP: Create GET requests to the Prometheus API server" to "OTA-855: Create GET requests to the Prometheus API server" Nov 28, 2023
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 28, 2023
@openshift-ci-robot
Contributor

openshift-ci-robot commented Nov 28, 2023

@Davoska: This pull request references OTA-855 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.

@DavidHurta DavidHurta marked this pull request as ready for review November 28, 2023 00:33
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 28, 2023
@DavidHurta
Contributor Author

/test unit

Member

@wking wking left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 28, 2023
@DavidHurta DavidHurta force-pushed the exchange-prometheus-client branch from d00ab44 to 2230cd1 Compare November 28, 2023 23:04
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 28, 2023
This commit will create GET requests to the Prometheus API server.

The library in use doesn't allow specifying the type of requests
(GET/POST) being made to the Prometheus API server. The tenancy port of
the thanos-querier.openshift-monitoring.svc service being utilized in
self-managed HyperShift requires different permissions depending on the
type of requests. This commit will make sure that the CVO uses GET
requests to the PromQL service.

The Prometheus client library in use attempts POST requests and, on 501
and 405 response status codes, falls back to GET requests. This commit
will return the 501 status code even before sending a given HTTP POST
request, resulting in a GET request being made as a fallback by the
library.
@DavidHurta DavidHurta force-pushed the exchange-prometheus-client branch from 2230cd1 to 64906e9 Compare November 28, 2023 23:08
@DavidHurta
Contributor Author

The ci/prow/unit seems to be failing quite often lately 👀

@DavidHurta
Contributor Author

DavidHurta commented Nov 28, 2023

/hold

I want to test the new changes and make sure we have an agreement on whether to have this new logic only for HyperShift.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 28, 2023
@DavidHurta
Contributor Author

/retest

@DavidHurta
Contributor Author

DavidHurta commented Nov 30, 2023

/unhold

Feedback addressed.

@DavidHurta
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 30, 2023
Member

@wking wking left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 1, 2023
@wking
Member

wking commented Dec 2, 2023

I see David tested this earlier, but repeating to fill in some details, using the flow from OTA-520. Cluster Bot launch 4.15,openshift/cluster-version-operator#999 azure, then:

$ oc get -o json clusterversion version | jq -r '.status.desired'
{
  "image": "registry.build04.ci.openshift.org/ci-ln-83x5tnb/release@sha256:def72cb591c2b8f990f05309df5f0a32c714cb0534247d55d25cd80f34dfd58d",
  "version": "4.15.0-0.test-2023-12-01-230104-ci-ln-83x5tnb-latest"
}

I updated my demo-cache-warming branch to use that as the initial release. Then:

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/upstream", "value": "https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo-cache-warming/cincinnati-graph.json"}]'
$ oc adm upgrade channel buggy
warning: No channels known to be compatible with the current version "4.15.0-0.test-2023-12-01-230104-ci-ln-83x5tnb-latest"; unable to validate "buggy". Setting the update channel to "buggy" anyway.
$ oc adm upgrade --include-not-recommended
Cluster version is 4.15.0-0.test-2023-12-01-230104-ci-ln-83x5tnb-latest

Upstream: https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo-cache-warming/cincinnati-graph.json
Channel: buggy
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.

Supported but not recommended updates:

  Version: 4.15.1
  Image: quay.io/openshift-release-dev/ocp-release@sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
  Recommended: False
  Reason: MultipleReasons
  Message: Could not evaluate exposure to update risk A (evaluation is throttled)
    A description: A.
    A URL: https://bug.example.com/a
  
  B. https://bug.example.com/b
  
  Could not evaluate exposure to update risk C (evaluation is throttled until 00:19:00Z)
    C description: C.
    C URL: https://bug.example.com/c
  
  Could not evaluate exposure to update risk D (evaluation is throttled until 00:19:00Z)
    D description: D.
    D URL: https://bug.example.com/d
  
  Could not evaluate exposure to update risk E (evaluation is throttled until 00:19:00Z)
    E description: E.
    E URL: https://bug.example.com/e
$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 | grep 'evaluate PromQL'
I1202 00:05:40.751294       1 promql.go:170] evaluate PromQL cluster condition: "max(cluster_proxy_enabled{type=~\"https?\"})"
I1202 00:05:41.783447       1 promql.go:170] evaluate PromQL cluster condition: "group(cluster_version_available_updates{channel=\"buggy\"})\nor\n0 * group(cluster_version_available_updates{channel!=\"buggy\"})"
I1202 00:09:54.193249       1 promql.go:170] evaluate PromQL cluster condition: "group(csv_succeeded{name=~\"local-storage-operator[.].*\"}) or 0 * group(csv_count)"
I1202 00:13:42.027462       1 promql.go:170] evaluate PromQL cluster condition: "0 * max(cluster_version)"
I1202 00:18:59.637248       1 promql.go:170] evaluate PromQL cluster condition: "0 * 0 * max(cluster_version)"

Not clear why that isn't getting 518b446 (#939)'s 1s MinBetweenMatches or 965bfb2 reques... Also not clear if we're using GET or POST. But it is clear that we aren't breaking PromQL resolution, because the cluster knows it is matching the B risk by having a channel named buggy.

Member

@petr-muller petr-muller left a comment

LGTM

Member

@LalatenduMohanty LalatenduMohanty left a comment

/lgtm

@openshift-ci
Contributor

openshift-ci bot commented Dec 4, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Davoska, LalatenduMohanty, petr-muller, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [Davoska,LalatenduMohanty,petr-muller,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking
Member

wking commented Dec 4, 2023

Ok, I understand the slower PromQL evaluation now:

  1. I1202 00:05:40.751294 1 promql.go:170] evaluate PromQL cluster condition: "max(cluster_proxy_enabled{type=~\"https?\"})", was quick, and evaluated "does not match this cluster", so the update was still Recommended=Unknown, and we queued up an AddAfter next-check.
  2. I1202 00:05:41.783447 1 promql.go:170] evaluate PromQL cluster condition: "group(cluster_version_available_updates{channel=\"buggy\"})\nor\n0 * group(cluster_version_available_updates{channel!=\"buggy\"})" I1202 00:09:54.193249 1 promql.go:170] evaluate PromQL cluster condition: "group(csv_succeeded{name=~\"local-storage-operator[.].*\"}) or 0 * group(csv_count)" was quick, and evaluated "does match this cluster", so the update moved to Recommended=False, and we didn't rush to evaluate the remaining conditions.

Makes sense, and is not related to the change made in this pull.
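
The reasoning above can be sketched as a small roll-up function. This is an illustration under stated assumptions, not the CVO's actual code: the `exposure` type and the `recommended` function are hypothetical names, and the "all risks evaluated, none match" case returning `True` is an assumption about how conditional updates are meant to resolve.

```go
package main

import "fmt"

// exposure is a hypothetical per-risk evaluation state.
type exposure int

const (
	notEvaluated exposure = iota // e.g. evaluation is still throttled
	doesNotMatch                 // the PromQL says the cluster is not exposed
	matches                      // the PromQL says the cluster is exposed
)

// recommended rolls the per-risk states up into one status string.
// A single matching risk settles the answer as "False", which is why
// there is no need to rush evaluating the remaining risks.
func recommended(risks []exposure) string {
	pending := false
	for _, r := range risks {
		switch r {
		case matches:
			return "False"
		case notEvaluated:
			pending = true
		}
	}
	if pending {
		return "Unknown"
	}
	return "True"
}

func main() {
	// Roughly the state in the transcript: A not yet evaluated, B matches,
	// C through E still throttled.
	state := []exposure{notEvaluated, matches, notEvaluated, notEvaluated, notEvaluated}
	fmt.Println(recommended(state)) // prints "False"
}
```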

@shellyyang1989
Contributor

Pre-merge testing: ran a regression test against conditional updates.

// Launch a cluster using cluster-bot

# oc adm upgrade 
Cluster version is 4.15.0-0.test-2023-12-05-123655-ci-ln-chyhc0b-latest

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

// Patch a dummy cincinnati

# oc patch clusterversion/version --patch '{"spec":{"upstream":"https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy-conditional-edge.json"}}' --type=merge
clusterversion.config.openshift.io/version patched

// Check conditional updates

# oc adm upgrade --include-not-recommended
Cluster version is 4.15.0-0.test-2023-12-05-123655-ci-ln-chyhc0b-latest

Upstream: https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy-conditional-edge.json
Channel: stable-4.15

Recommended updates:

  VERSION                            IMAGE
  4.15.0-0.nightly-2023-12-20-222222 registry.ci.openshift.org/ocp/release@sha256:caf073ce29232978c331d421c06ca5c2736ce5461962775fdd760b05fb2496a0
  4.15.0-0.nightly-2023-12-19-222222 registry.ci.openshift.org/ocp/release@sha256:e385a786f122c6c0e8848ecb9901f510676438f17af8a5c4c206807a9bc0bf28

Supported but not recommended updates:

  Version: 4.15.0-0.nightly-2023-12-18-111111
  Image: registry.ci.openshift.org/ocp/release@sha256:a5cd1b44e5b25b8a617d92a1f947297f56fc9bad104c117a8e452f932e1e2fd0
  Recommended: False
  Reason: ReleaseIsRejected
  Message: Too many CI failures on this release, so do not update to it https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.10.0-0.nightly/release/4.10.0-0.nightly-2021-11-24-075634

  Version: 4.15.0-0.nightly-2023-12-17-000000
  Image: registry.ci.openshift.org/ocp/release@sha256:66c753e8b75d172f2a3f7ba13363383a76ecbc7ecdc00f3a423bef4ea8560405
  Recommended: False
  Reason: MultipleReasons
  Message: On clusters on default invoker user, this imaginary bug can happen. https://bug.example.com/a
  
  Could not evaluate exposure to update risk SomeChannelThing (evaluation is throttled until 13:38:25Z)
    SomeChannelThing description: On clusters with the channel set to 'buggy', this imaginary bug can happen.
    SomeChannelThing URL: https://bug.example.com/b

// CVO evaluates the unknown risks every second

# grep "evaluate PromQL cluster condition" cvo
I1205 13:38:20.816935       1 promql.go:170] evaluate PromQL cluster condition: "0 * group(cluster_version)"
I1205 13:38:21.871236       1 promql.go:170] evaluate PromQL cluster condition: "0 * 0 * group(cluster_version)"
I1205 13:38:22.898123       1 promql.go:170] evaluate PromQL cluster condition: "0 * 0* 0 * group(cluster_version)"
I1205 13:38:23.933570       1 promql.go:170] evaluate PromQL cluster condition: "cluster_infrastructure_provider{type=~\"nonexist\"}\nor\n0 * cluster_infrastructure_provider"
I1205 13:38:24.953166       1 promql.go:170] evaluate PromQL cluster condition: "cluster_installer"
I1205 13:44:54.989790       1 promql.go:170] evaluate PromQL cluster condition: "group(cluster_version_available_updates{channel=\"buggy\"})\nor\n0 * group(cluster_version_available_updates{channel!=\"buggy\"})"

// Patch to an invalid cincinnati

# oc patch clusterversion/version --patch '{"spec":{"upstream":"https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy-conditional-edge-invalid-promql.json"}}' --type=merge
clusterversion.config.openshift.io/version patched
# oc adm upgrade --include-not-recommended
Cluster version is 4.15.0-0.test-2023-12-05-123655-ci-ln-chyhc0b-latest

Upstream: https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy-conditional-edge-invalid-promql.json
Channel: stable-4.15
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.

Supported but not recommended updates:

  Version: 4.15.0-0.nightly-2023-12-11-065245
  Image: registry.ci.openshift.org/ocp/release@sha256:d9759e7c8ec5e2555419d84ff36aff2a4c8f9367236c18e722a3fe4d7c4f6dee
  Recommended: Unknown
  Reason: EvaluationFailed
  Message: Could not evaluate exposure to update risk InvalidPromQL (executing PromQL query: bad_data: 1:49: parse error: unexpected identifier "buggy" in label matching, expected string)
    InvalidPromQL description: Invalid Promql
    InvalidPromQL URL: https://invalid.com/a

// CVO looks good

# oc get pod -n openshift-cluster-version
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-568694b7fd-n62x5   1/1     Running   0          154m

@shellyyang1989
Contributor

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Dec 5, 2023
@openshift-ci-robot
Contributor

openshift-ci-robot commented Dec 5, 2023

@Davoska: This pull request references OTA-855 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.

@wking
Member

wking commented Dec 5, 2023

HyperShift CI is struggling more broadly. It passed earlier on this pull, and we know this is working for standalone.

/override ci/prow/e2e-hypershift

@openshift-ci
Contributor

openshift-ci bot commented Dec 5, 2023

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-hypershift

@petr-muller
Member

/override ci/prow/e2e-hypershift-conformance

@openshift-ci
Contributor

openshift-ci bot commented Dec 5, 2023

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-hypershift-conformance

@openshift-ci
Contributor

openshift-ci bot commented Dec 5, 2023

@Davoska: all tests passed!

@openshift-merge-bot openshift-merge-bot bot merged commit fb57321 into openshift:master Dec 5, 2023
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-version-operator-container-v4.15.0-202312051932.p0.gfb57321.assembly.stream for distgit cluster-version-operator.
All builds following this will include this PR.
