OCPBUGS-81192: Fix race condition in internal-to-external LB migration test #1407

Open
gcs278 wants to merge 1 commit into openshift:master from gcs278:fix-testUnmanagedDNSToManagedDNSInternalIngressController
Conversation


@gcs278 gcs278 commented Mar 27, 2026

Summary

Fixes a race condition in TestUnmanagedDNSToManagedDNSInternalIngressController that causes the test to fail with connection timeouts when migrating from internal to external load balancer scope.

Problem

The test migrates the IngressController as follows:

  • From: Internal LB (10.0.128.8) + Unmanaged DNS
  • To: External LB (136.116.125.243) + Managed DNS

After the migration, the IngressController may report DNSReady=True before the ingress-operator has reconciled the DNSRecord with the new external IP. The test then uses the old internal IP (from wildcardRecord.Spec.Targets[0]) for connectivity verification, which results in timeouts.

Solution

Use the Service's load balancer address directly (lbAddress) instead of waiting for the DNSRecord target to sync. The Service LB address is the authoritative source and is immediately available after the cloud provider provisions the external LB.
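
To illustrate the race, here is a minimal standard-library sketch of the state the test can observe right after the scope change. The types `recordSpec` and `serviceStatus` and the helper `staleTarget` are hypothetical stand-ins invented for this example; they are not the operator's or Kubernetes' real types, and the addresses are the ones quoted above.

```go
package main

import "fmt"

// recordSpec is a hypothetical, simplified stand-in for the DNSRecord
// fields the test reads (Spec.Targets).
type recordSpec struct {
	Targets []string
}

// serviceStatus is a hypothetical, simplified stand-in for the Service
// status fields the test reads (Status.LoadBalancer.Ingress[i].IP).
type serviceStatus struct {
	IngressIPs []string
}

// staleTarget reports whether the DNSRecord does not yet point at the
// address the Service currently advertises.
func staleTarget(rec recordSpec, svc serviceStatus) bool {
	if len(rec.Targets) == 0 || len(svc.IngressIPs) == 0 {
		return true
	}
	return rec.Targets[0] != svc.IngressIPs[0]
}

func main() {
	// Snapshot taken right after the scope change: the external LB has
	// been provisioned, but the DNSRecord has not been reconciled yet.
	rec := recordSpec{Targets: []string{"10.0.128.8"}}
	svc := serviceStatus{IngressIPs: []string{"136.116.125.243"}}

	fmt.Println("DNSRecord target stale:", staleTarget(rec, svc))
	fmt.Println("address to verify against:", svc.IngressIPs[0])
}
```

In this model, a connectivity check against `rec.Targets[0]` would still hit the internal address, while the Service status already carries the external one.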

Additional Changes

  • Updated to use PollUntilContextTimeout instead of deprecated PollImmediate
  • Added header comments to all DNS migration tests for consistency

Test Plan

  • Verify TestUnmanagedDNSToManagedDNSInternalIngressController passes on GCP
  • Verify other DNS migration tests still pass

🤖 Generated with Claude Code


coderabbitai bot commented Mar 27, 2026

📝 Walkthrough

This change updates test logic in test/e2e/unmanaged_dns_test.go: adds comments documenting dnsManagementPolicy transition behavior for external ingress controllers; replaces a wait.PollImmediate loop with wait.PollUntilContextTimeout to wait for load balancer Service status changes and ensure LoadBalancer.Ingress is populated; captures the load balancer address (IP or Hostname) and uses it for the final external verification instead of the wildcard record target. No exported/public declarations were modified.

@openshift-ci openshift-ci bot requested review from Thealisyed and rikatz March 27, 2026 15:40

openshift-ci bot commented Mar 27, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gcs278 for approval. For more information see the Code Review Process.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gcs278 gcs278 changed the title Fix race condition in internal-to-external LB migration test OCPBUGS-81192: Fix race condition in internal-to-external LB migration test Mar 27, 2026
@openshift-ci-robot openshift-ci-robot added jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Mar 27, 2026
@openshift-ci-robot

@gcs278: This pull request references Jira Issue OCPBUGS-81192, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.


When migrating from internal to external scope, use the Service LB
address directly instead of DNSRecord target to avoid race condition
where IngressController reports DNSReady before DNSRecord reconciles.

Also add header comments to DNS migration tests for consistency.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@gcs278 gcs278 force-pushed the fix-testUnmanagedDNSToManagedDNSInternalIngressController branch from 2a2730d to 2cdc405 Compare April 3, 2026 14:01

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
test/e2e/unmanaged_dns_test.go (1)

290-293: Use poll callback context for API reads.

Inside wait.PollUntilContextTimeout, kclient.Get(context.TODO(), ...) ignores the poll context; use ctx so cancellation/timeouts propagate cleanly.

Proposed change
-		if err := kclient.Get(context.TODO(), controller.LoadBalancerServiceName(ic), lbService); err != nil {
+		if err := kclient.Get(ctx, controller.LoadBalancerServiceName(ic), lbService); err != nil {

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/unmanaged_dns_test.go` around lines 290 - 293, The poll callback is
using context.TODO() for the API read which ignores the poll's
cancellation/timeout; inside the callback passed to
wait.PollUntilContextTimeout, replace the kclient.Get call's context.TODO() with
the provided ctx so that kclient.Get(ctx,
controller.LoadBalancerServiceName(ic), lbService) correctly observes
cancellation/timeouts from the poll.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 2c92fca6-44a4-46ab-852b-df3243b934cc

📥 Commits

Reviewing files that changed from the base of the PR and between 2a2730d and 2cdc405.

📒 Files selected for processing (1)
  • test/e2e/unmanaged_dns_test.go

Comment on lines +302 to 310
if len(lbService.Status.LoadBalancer.Ingress) == 0 {
	t.Logf("service %s has no load balancer ingress, retrying...", lbService.Name)
	return false, nil
}
lbAddress = lbService.Status.LoadBalancer.Ingress[0].IP
if lbAddress == "" {
	lbAddress = lbService.Status.LoadBalancer.Ingress[0].Hostname
}
return true, nil

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

head -n 330 test/e2e/unmanaged_dns_test.go | tail -n +290

Repository: openshift/cluster-ingress-operator

Length of output: 2160


🏁 Script executed:

# Get more context on how lbAddress is used in the test
rg "lbAddress" test/e2e/unmanaged_dns_test.go -A 3 -B 3

Repository: openshift/cluster-ingress-operator

Length of output: 1511


Guard against empty LB ingress address before returning success.

The polling loop at lines 306-309 can return success even when both IP and Hostname are empty strings. This causes lbAddress to be empty when passed to the subsequent verifyExternalIngressController() call, resulting in flaky connectivity checks. Add a validation check to ensure lbAddress is not empty before the polling loop succeeds:

Proposed change
lbAddress = lbService.Status.LoadBalancer.Ingress[0].IP
if lbAddress == "" {
	lbAddress = lbService.Status.LoadBalancer.Ingress[0].Hostname
}
+if lbAddress == "" {
+	t.Logf("service %s has ingress entry without IP/hostname, retrying...", lbService.Name)
+	return false, nil
+}
return true, nil
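
The guard can also be checked in isolation. The sketch below pulls the address selection out into a small pure function; `ingressEntry` and `lbAddressReady` are hypothetical names made up for this example (the PR keeps the logic inline in the poll callback), and `lb.example.com` is a placeholder hostname.

```go
package main

import "fmt"

// ingressEntry is a hypothetical stand-in for one entry of
// Status.LoadBalancer.Ingress, keeping only the two fields the test reads.
type ingressEntry struct {
	IP       string
	Hostname string
}

// lbAddressReady returns the first ingress entry's IP, falling back to its
// hostname. It reports done=false when no usable address exists yet, so a
// surrounding poll would keep retrying.
func lbAddressReady(entries []ingressEntry) (addr string, done bool) {
	if len(entries) == 0 {
		return "", false
	}
	addr = entries[0].IP
	if addr == "" {
		addr = entries[0].Hostname
	}
	return addr, addr != ""
}

func main() {
	cases := [][]ingressEntry{
		nil,                            // no ingress yet
		{{}},                           // entry present, both fields empty
		{{IP: "136.116.125.243"}},      // IP-based load balancer
		{{Hostname: "lb.example.com"}}, // hostname-based load balancer
	}
	for _, c := range cases {
		addr, done := lbAddressReady(c)
		fmt.Printf("addr=%q done=%v\n", addr, done)
	}
}
```

Only the second case differs from the unguarded version: without the extra check, it would report success with an empty address.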
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/unmanaged_dns_test.go` around lines 302 - 310, The polling callback
that reads lbService.Status.LoadBalancer.Ingress[0] must validate that the
chosen address is non-empty before returning success: after extracting lbAddress
from lbService.Status.LoadBalancer.Ingress[0].IP or .Hostname, ensure lbAddress
!= "" and if it is empty return false, nil so the poll continues; update the
logic around lbService, lbAddress in the polling loop (the snippet that
currently returns true,nil) so verifyExternalIngressController() never receives
an empty lbAddress.


openshift-ci bot commented Apr 3, 2026

@gcs278: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-gcp-operator | 2cdc405 | link | true | /test e2e-gcp-operator |
| ci/prow/e2e-aws-operator-techpreview | 2cdc405 | link | false | /test e2e-aws-operator-techpreview |
| ci/prow/e2e-aws-ovn-hypershift-conformance | 2cdc405 | link | true | /test e2e-aws-ovn-hypershift-conformance |
| ci/prow/hypershift-e2e-aks | 2cdc405 | link | true | /test hypershift-e2e-aks |
| ci/prow/e2e-aws-operator | 2cdc405 | link | true | /test e2e-aws-operator |
| ci/prow/e2e-aws-ovn | 2cdc405 | link | true | /test e2e-aws-ovn |

