Skip to content

Conversation

@gcs278
Copy link
Contributor

@gcs278 gcs278 commented May 16, 2024

Bump openshift/library-go to get update (openshift/library-go#1697) for NewPrometheusClient function that switches from using a legacy service account API to TokenRequest API. This update resolves issues in our E2E tests that depend on this function.

go get github.com/openshift/library-go@f1541d628746a866ef2ffecbb4fd15ec9e64b2a1
go mod tidy
go mod vendor

Resolve ca.MakeServerCert hostnames parameter change from deprecated sets.String to generic Set.

This was caused by the work done in https://issues.redhat.com/browse/API-1644, more specifically openshift/openshift-controller-manager#305

Slack Thread

@openshift-ci openshift-ci bot requested review from alebedev87 and frobware May 16, 2024 01:49
@gcs278 gcs278 changed the title Bump openshift/library-go to resolve NewPrometheusClient E2E failures [WIP] Bump openshift/library-go to resolve NewPrometheusClient E2E failures May 16, 2024
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 16, 2024
Bump openshift/library-go to get update for NewPrometheusClient function
that switches from using a legacy service account API to TokenRequest API.
This update resolves issues in our E2E tests that depend on this function.

   go get github.com/openshift/library-go@f1541d628746a866ef2ffecbb4fd15ec9e64b2a1
   go mod tidy
   go mod vendor

Resolve ca.MakeServerCert hostnames parameter type change from deprecated
sets.String to generic Set.
@gcs278
Copy link
Contributor Author

gcs278 commented May 16, 2024

Whoops, I needed to resolve a parameter type change in ca.MakeServerCert. I've fixed that, and now Github isn't updating my PR with my recent changes...I've forced pushed twice.

@gcs278 gcs278 force-pushed the bump-library-go-for-prometheus branch 2 times, most recently from a5eb14c to 741f83c Compare May 16, 2024 04:38
@gcs278
Copy link
Contributor Author

gcs278 commented May 16, 2024

Seems like it resolved the permafailing issue, but I did get a:
route_metrics_test.go:365: failed to fetch expected metrics: error waiting for route metrics: timed out waiting for the condition
in e2e-azure-operator. Not sure if this is a new flake or old flake. Sounds like it's an old flake, but search.CI isn't working well for me to confirm.
/test e2e-azure-operator

@gcs278
Copy link
Contributor Author

gcs278 commented May 16, 2024

Let's try again.
/test e2e-aws-operator
/test e2e-gcp-operator

@gcs278 gcs278 changed the title [WIP] Bump openshift/library-go to resolve NewPrometheusClient E2E failures OCPBUGS-33792: Bump openshift/library-go to resolve NewPrometheusClient E2E failures May 16, 2024
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 16, 2024
@openshift-ci-robot
Copy link
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-33792, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Bump openshift/library-go to get update (openshift/library-go#1697) for NewPrometheusClient function that switches from using a legacy service account API to TokenRequest API. This update resolves issues in our E2E tests that depend on this function.

go get github.com/openshift/library-go@f1541d628746a866ef2ffecbb4fd15ec9e64b2a1
go mod tidy
go mod vendor

Resolve ca.MakeServerCert hostnames parameter change from deprecated sets.String to generic Set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 16, 2024
@gcs278
Copy link
Contributor Author

gcs278 commented May 16, 2024

Definitely looks like this solves the problem. I also followed up in slack to confirm this is the right solution. It is:

  • As of OCP 4.11, service account API token secrets (now referred to as “legacy”) were no longer generated automatically for each service account.
  • All users were supposed to migrate to using the TokenRequest API.
  • Unfortunately, the integrated image registry was also generating a legacy service account API token for its own use, and some users started to accidentally pick up the token intended for the image registry and missed migrating to the TokenRequest API.
  • Starting in v4.15, if the image registry is not enabled, the tokens will not be generated. The image registry is still enabled by default, so again another missed opportunity to catch people not using the TokenRequest API.
  • Finally, in v4.16 the image registry no longer generates the legacy service account API token at all. I tried to catch all the stragglers out there before merging (took me about 4 months of fixing tests), but alas it seems that you found another one.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label May 16, 2024
@openshift-ci-robot
Copy link
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-33792, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

Details

In response to this:

Bump openshift/library-go to get update (openshift/library-go#1697) for NewPrometheusClient function that switches from using a legacy service account API to TokenRequest API. This update resolves issues in our E2E tests that depend on this function.

go get github.com/openshift/library-go@f1541d628746a866ef2ffecbb4fd15ec9e64b2a1
go mod tidy
go mod vendor

Resolve ca.MakeServerCert hostnames parameter change from deprecated sets.String to generic Set.

This was caused by the work done in https://issues.redhat.com/browse/API-1644, more specifically openshift/openshift-controller-manager#305

Slack Thread

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 16, 2024
@gcs278
Copy link
Contributor Author

gcs278 commented May 16, 2024

/retest-required

@Miciah
Copy link
Contributor

Miciah commented May 16, 2024

in e2e-azure-operator. Not sure if this is a new flake or old flake. Sounds like it's an old flake, but search.CI isn't working well for me to confirm.

I only see that one failure when I search for that failure over the past 14 days, filtering by jobs with names containing "cluster-ingress-operator".

Anyway, your changes look good. Thanks!
/approve
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 16, 2024
@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 16, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 16, 2024
@Miciah
Copy link
Contributor

Miciah commented May 16, 2024

I would ask you to link to the OCPBUGS ticket in the commit message, but fixing CI is urgent, and I'd rather not delay merging this fix.

@Miciah
Copy link
Contributor

Miciah commented May 16, 2024

Finally, in v4.16 the image registry no longer generates the legacy service account API token at all. I tried to catch all the stragglers out there before merging (took me about 4 months of fixing tests), but alas it seems that you found another one.

Seems to me that it would be a simple matter of reviewing the audit logs for a few different CI runs for various CI jobs (for example, e2e, serial, and techpreview)? But oh well, that's in the past.

@Miciah
Copy link
Contributor

Miciah commented May 16, 2024

Finally, in v4.16 the image registry no longer generates the legacy service account API token at all. I tried to catch all the stragglers out there before merging (took me about 4 months of fixing tests), but alas it seems that you found another one.

Seems to me that it would be a simple matter of reviewing the audit logs for a few different CI runs for various CI jobs (for example, e2e, serial, and techpreview)? But oh well, that's in the past.

It would also be prudent to follow up on "prometheus service account not found" errors in search.ci.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 16, 2024

@gcs278: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node 4a4aec7 link false /test e2e-aws-ovn-single-node

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@gcs278
Copy link
Contributor Author

gcs278 commented May 16, 2024

Thanks @Miciah. I think we are going to have to override e2e-hypershift for the same problem: prometheus service account not found. Or we wait until openshift/hypershift#4044 has been merged, which should fix their E2E tests.

Feel comfortable to override @Miciah?

@Miciah
Copy link
Contributor

Miciah commented May 16, 2024

Thanks @Miciah. I think we are going to have to override e2e-hypershift for the same problem: prometheus service account not found. Or we wait until openshift/hypershift#4044 has been merged, which should fix their E2E tests.

Feel comfortable to override @Miciah?

Not totally comfortable. How confident are you that the e2e-hypershift failures are all caused by the same Prometheus issue (or some other flake)? Do you know whether openshift/hypershift#4044 is likely to be /lgtm'd and merged today or whether it will need to wait for tomorrow? Alternatively, maybe there's a HyperShift engineer around who could check the e2e-hypershift failures on this PR for us.

@gcs278
Copy link
Contributor Author

gcs278 commented May 16, 2024

Thanks @Miciah. I think we are going to have to override e2e-hypershift for the same problem: prometheus service account not found. Or we wait until openshift/hypershift#4044 has been merged, which should fix their E2E tests.
Feel comfortable to override @Miciah?

Not totally comfortable. How confident are you that the e2e-hypershift failures are all caused by the same Prometheus issue (or some other flake)? Do you know whether openshift/hypershift#4044 is likely to be /lgtm'd and merged today or whether it will need to wait for tomorrow? Alternatively, maybe there's a HyperShift engineer around who could check the e2e-hypershift failures on this PR for us.

Following up with the hypershift team.

I feel confident that prometheus service account not found errors are preventing e2e-hypershift from succeeding, but I can't guarantee that there isn't another failure being masked by the prometheus service account failure. The safest is to wait for openshift/hypershift#4044 if we can afford the time.

@Miciah
Copy link
Contributor

Miciah commented May 16, 2024

I feel confident that prometheus service account not found errors are preventing e2e-hypershift from succeeding, but I can't guarantee that there isn't another failure being masked by the prometheus service account failure. The safest is to wait for openshift/hypershift#4044 if we can afford the time.

Can you do a Cluster Bot test openshift/cluster-ingress-operator#1054,openshift/hypershift#4044 e2e-hypershift? I'm not sure about the exact syntax or whether it's possible to run e2e-hypershift using Cluster Bot.

@candita
Copy link
Contributor

candita commented May 16, 2024

I feel confident that prometheus service account not found errors are preventing e2e-hypershift from succeeding, but I can't guarantee that there isn't another failure being masked by the prometheus service account failure. The safest is to wait for openshift/hypershift#4044 if we can afford the time.

I've been monitoring another release oversight slack thread where they suspect that the new DNS Name Resolver functionality is causing them an issue, so it is going to be worth it to check with them.

@lihongan
Copy link
Contributor

/label qe-approved
e2e-[aws|gcp|azure]-operator are passed

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label May 17, 2024
@openshift-ci-robot
Copy link
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-33792, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

Details

In response to this:

Bump openshift/library-go to get update (openshift/library-go#1697) for NewPrometheusClient function that switches from using a legacy service account API to TokenRequest API. This update resolves issues in our E2E tests that depend on this function.

go get github.com/openshift/library-go@f1541d628746a866ef2ffecbb4fd15ec9e64b2a1
go mod tidy
go mod vendor

Resolve ca.MakeServerCert hostnames parameter change from deprecated sets.String to generic Set.

This was caused by the work done in https://issues.redhat.com/browse/API-1644, more specifically openshift/openshift-controller-manager#305

Slack Thread

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@gcs278
Copy link
Contributor Author

gcs278 commented May 17, 2024

I feel confident that prometheus service account not found errors are preventing e2e-hypershift from succeeding, but I can't guarantee that there isn't another failure being masked by the prometheus service account failure. The safest is to wait for openshift/hypershift#4044 if we can afford the time.

Can you do a Cluster Bot test openshift/cluster-ingress-operator#1054,openshift/hypershift#4044 e2e-hypershift? I'm not sure about the exact syntax or whether it's possible to run e2e-hypershift using Cluster Bot.

I don't think the e2e-hypershift job is a "supported test suite", they are:

e2e, e2e-serial, e2e-all, e2e-disruptive, e2e-disruptive-all, e2e-builds, e2e-image-ecosystem, e2e-image-registry, e2e-network-stress.

I assume e2e here are openshit/origin tests. I asked about something similar how to do this in clusterbot in #forum-ocp-crt and they recommended to launch a cluster and they run the test manually. That's pretty painful, but possible I suppose. I can look into it.

@openshift-bot
Copy link
Contributor

/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.

@openshift-ci-robot
Copy link
Contributor

@openshift-bot: This pull request references Jira Issue OCPBUGS-33792, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

Details

In response to this:

/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@Miciah
Copy link
Contributor

Miciah commented May 17, 2024

With branch-cut imminent, let's go ahead and get this change merged in order to unbreak CI.
/override ci/prow/e2e-hypershift

@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 17, 2024

@Miciah: Overrode contexts on behalf of Miciah: ci/prow/e2e-hypershift

Details

In response to this:

With branch-cut imminent, let's go ahead and get this change merged in order to unbreak CI.
/override ci/prow/e2e-hypershift

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Miciah
Copy link
Contributor

Miciah commented May 17, 2024

/label acknowledge-critical-fixes-only

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label May 17, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 487fe81 into openshift:master May 17, 2024
@openshift-ci-robot
Copy link
Contributor

@gcs278: Jira Issue OCPBUGS-33792: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-33792 has been moved to the MODIFIED state.

Details

In response to this:

Bump openshift/library-go to get update (openshift/library-go#1697) for NewPrometheusClient function that switches from using a legacy service account API to TokenRequest API. This update resolves issues in our E2E tests that depend on this function.

go get github.com/openshift/library-go@f1541d628746a866ef2ffecbb4fd15ec9e64b2a1
go mod tidy
go mod vendor

Resolve ca.MakeServerCert hostnames parameter change from deprecated sets.String to generic Set.

This was caused by the work done in https://issues.redhat.com/browse/API-1644, more specifically openshift/openshift-controller-manager#305

Slack Thread

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-cluster-ingress-operator-container-v4.17.0-202405171741.p0.g487fe81.assembly.stream.el9 for distgit ose-cluster-ingress-operator.
All builds following this will include this PR.

@gcs278
Copy link
Contributor Author

gcs278 commented Jun 4, 2024

Maybe this will work? Vendoring doesn't usually cleanly cherry-pick, but lets see...
/cherry-pick release-4.16

@openshift-cherrypick-robot

@gcs278: new pull request created: #1074

Details

In response to this:

Maybe this will work? Vendoring doesn't usually cleanly cherry-pick, but lets see...
/cherry-pick release-4.16

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants