
OCPBUGS-60620: e2e: Deflake tests by using ReplicaSet for test workload #1262

Merged
openshift-merge-bot[bot] merged 3 commits into openshift:master from alebedev87:e2e-echo-rs-2
Aug 21, 2025

Conversation

@alebedev87
Contributor

@alebedev87 alebedev87 commented Aug 10, 2025

This PR aims to de-flake the e2e tests that were flaking during the CI testing of #1257:

  • Gateway API e2e test: switch the test HTTPRoute's backend workload to a ReplicaSet to better tolerate pod evictions or deletions during runs.
  • IdleConnection test: switch the test route's backend workload to a ReplicaSet to better tolerate pod evictions or deletions during runs.

@alebedev87 alebedev87 changed the title e2e: GWAPI - Use ReplicaSet for backend pod e2e: Deflake some tests by switching to a ReplicaSet for test workload Aug 11, 2025
@alebedev87 alebedev87 changed the title e2e: Deflake some tests by switching to a ReplicaSet for test workload e2e: Deflake tests by using ReplicaSet for test workload Aug 11, 2025
@alebedev87 alebedev87 force-pushed the e2e-echo-rs-2 branch 3 times, most recently from 7bb7c34 to 5647a21 Compare August 11, 2025 14:29
@alebedev87
Contributor Author

/hold

Holding this PR to better split between different flakes (e.g. #1265).

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 12, 2025
@alebedev87
Contributor Author

alebedev87 commented Aug 12, 2025

All operator tests finished green. Let's retry them by pushing an amended commit with no real changes.

@alebedev87
Contributor Author

No operator test failures, second time in a row. Let's try again...

@alebedev87 alebedev87 changed the title e2e: Deflake tests by using ReplicaSet for test workload OCPBUGS-60620: e2e: Deflake tests by using ReplicaSet for test workload Aug 18, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Aug 18, 2025
@openshift-ci-robot
Contributor

@alebedev87: This pull request references Jira Issue OCPBUGS-60620, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (iamin@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

This PR aims to de-flake the e2e tests that were flaking during the CI testing of #1257:

  • Gateway API e2e test: switch the test HTTPRoute's backend workload to a ReplicaSet to better tolerate pod evictions or deletions during runs.
  • IdleConnection test: switch the test route's backend workload to a ReplicaSet to better tolerate pod evictions or deletions during runs.
  • Gateway API e2e test: Increase timeouts for CVO scale-ups/downs and VAP assertions.
  • Gateway API e2e test: Retry Gateway CRD VAP creation on network errors.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Aug 18, 2025
@alebedev87
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 18, 2025
@openshift-ci-robot
Contributor

@alebedev87: This pull request references Jira Issue OCPBUGS-60620, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (iamin@redhat.com), skipping review request.

Details

In response to this:

This PR aims to de-flake the e2e tests that were flaking during the CI testing of #1257:

  • Gateway API e2e test: switch the test HTTPRoute's backend workload to a ReplicaSet to better tolerate pod evictions or deletions during runs.
  • IdleConnection test: switch the test route's backend workload to a ReplicaSet to better tolerate pod evictions or deletions during runs.


@rikatz
Member

rikatz commented Aug 19, 2025

/assign

}

return pod, nil
if err := kclient.Create(ctx, rs); err != nil {
Member

do you wanna retry here for the sake of network problems? (as we've been doing on other PRs)

Contributor Author

@alebedev87 alebedev87 Aug 20, 2025

Added createWithRetryOnConflict helper function (similar to existing update*WithRetryOnConflict helpers) and introduced it here.

return nil, fmt.Errorf("failed to create pod %s/%s: %v", namespace, echoPod.Name, err)
// buildEchoReplicaSet builds a replicaset which creates a pod that listens on port 8080.
echoRs := buildEchoReplicaSet(backendRefname, namespace)
if err := kclient.Create(context.TODO(), echoRs); err != nil {
Member

Same comment as above on create and retry. I am wondering whether we should create a helper function for these "create" operations, something like "create or retry", to avoid network issues.

Contributor Author

Added createWithRetryOnConflict helper function and introduced it here.


// createWithRetryOnConflict creates the given object. If there is a conflict error on create
// then the create is retried until the timeout is reached.
func createWithRetryOnConflict[T client.Object](ctx context.Context, obj T, timeout time.Duration) error {
Member

I don't think you need generics/type parameters here, you can just pass client.Object as the argument of object. kclient already accepts a client.Object

Contributor Author

Converted it to a non-generic one.
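To illustrate the reviewer's point, here is a minimal sketch using only the standard library. `Object`, `replicaSet`, `createGeneric`, and `createPlain` are hypothetical names made up for this example; `Object` merely stands in for `client.Object`. Both functions accept the same argument, so the type parameter adds nothing when the function only calls interface methods.

```go
package main

import "fmt"

// Object is a hypothetical stand-in for the client.Object interface.
type Object interface {
	GetName() string
}

// replicaSet is a hypothetical type that satisfies Object.
type replicaSet struct{ name string }

func (r *replicaSet) GetName() string { return r.name }

// createGeneric mirrors the original shape: a type parameter
// constrained by the interface.
func createGeneric[T Object](obj T) string {
	return "created " + obj.GetName()
}

// createPlain takes the interface value directly. Callers pass
// exactly the same arguments as with the generic version.
func createPlain(obj Object) string {
	return "created " + obj.GetName()
}

func main() {
	rs := &replicaSet{name: "echo"}
	fmt.Println(createGeneric(rs))
	fmt.Println(createPlain(rs))
}
```

A type parameter would only start to pay off if the helper needed to return the concrete type to the caller, which is not the case for a create-and-report-error helper.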

func createWithRetryOnConflict[T client.Object](ctx context.Context, obj T, timeout time.Duration) error {
return wait.PollUntilContextTimeout(ctx, 2*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
if err := kclient.Create(ctx, obj); err != nil && !errors.IsAlreadyExists(err) {
if errors.IsConflict(err) {
Member

I don't see how this can happen on a kclient.Create operation. As far as I know, the API answers with Conflict only when you are doing a patch/update operation. In the case of Create, I can think of:

  • Already exists - you cover on the if branch
  • Does not exist - So you create it
  • Another error (network, permission, etc) - you retry

Maybe I am missing something on the Conflict situation here

Contributor Author

> Maybe I am missing something on the Conflict situation here

No, you are right. I copied the function from an existing update*OnConflict function. Updated it to retry on any error but AlreadyExists.

Switch the test HTTPRoute's backend workload to a ReplicaSet
to better tolerate pod evictions or deletions during runs.

Switch the test route's backend workload to a ReplicaSet to better
tolerate pod evictions or deletions during runs.

Update test logic to expect a response prefix, since the echo pod names
returned in the response are now suffixed with a random hash by the
ReplicaSet controller.
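
As a rough illustration of the prefix check described in the commit message above: pods created by a ReplicaSet are named after the ReplicaSet plus a generated suffix, so an exact-name comparison no longer works. The function name and sample pod names below are hypothetical, and the sketch assumes the echo response is just the pod name, which this page does not show.

```go
package main

import (
	"fmt"
	"strings"
)

// isFromEchoWorkload reports whether podName looks like a pod created
// by the ReplicaSet named rsName, i.e. "<rsName>-<generated suffix>".
func isFromEchoWorkload(podName, rsName string) bool {
	prefix := rsName + "-"
	return strings.HasPrefix(podName, prefix) && len(podName) > len(prefix)
}

func main() {
	// Sample names are made up for illustration.
	fmt.Println(isFromEchoWorkload("idle-close-echo-x7k2p", "idle-close-echo"))
	fmt.Println(isFromEchoWorkload("some-other-pod", "idle-close-echo"))
}
```

Matching on the prefix keeps the assertion valid even if the pod is evicted and the ReplicaSet controller replaces it with a differently suffixed pod mid-run.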
@rikatz
Member

rikatz commented Aug 20, 2025

/lgtm
/approve
Thanks!

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 20, 2025
@openshift-ci
Contributor

openshift-ci bot commented Aug 20, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rikatz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 20, 2025
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 52a0048 and 2 for PR HEAD edfab23 in total

@rikatz
Member

rikatz commented Aug 20, 2025

/retest

@rikatz
Member

rikatz commented Aug 20, 2025

The failing tests are not required.
The only required test that is failing is the GCP one, which is a well-known failure.

@Miciah this test may need an override

@alebedev87
Contributor Author

alebedev87 commented Aug 20, 2025

@rikatz: just for info, we can use /retest-required if we see optional jobs failing on things that we cannot relate to the PR.

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 52a0048 and 2 for PR HEAD edfab23 in total


@candita
Contributor

candita commented Aug 21, 2025

This test is failing and has been reported to the Cloud Platform team. Resolution depends at least on merge of openshift/cloud-provider-gcp#86.

/override ci/prow/e2e-gcp-operator

@openshift-ci
Contributor

openshift-ci bot commented Aug 21, 2025

@candita: Overrode contexts on behalf of candita: ci/prow/e2e-gcp-operator

Details

In response to this:

This test is failing and has been reported to the Cloud Platform team. Resolution depends at least on merge of openshift/cloud-provider-gcp#86.

/override ci/prow/e2e-gcp-operator

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci
Contributor

openshift-ci bot commented Aug 21, 2025

@alebedev87: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/okd-scos-e2e-aws-ovn | edfab23 | link | false | /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 2f21242 into openshift:master Aug 21, 2025
20 of 21 checks passed
@openshift-ci-robot
Contributor

@alebedev87: Jira Issue OCPBUGS-60620: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-60620 has been moved to the MODIFIED state.

Details

In response to this:

This PR aims at trying to de-flake the e2e tests which were flaking during the CI testing of #1257:

  • Gateway API e2e test: switch the test HTTPRoute's backend workload to a ReplicaSet to better tolerate pod evictions or deletions during runs.
  • IdleConnection test: switch the test route's backend workload to a ReplicaSet to better tolerate pod evictions or deletions during runs.



Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
