Skip to content

fix: zone-aware-routing e2e falkiness#6152

Merged
arkodg merged 5 commits intoenvoyproxy:mainfrom
jukie:zone-aware-routing-e2e-fix
May 23, 2025
Merged

fix: zone-aware-routing e2e falkiness#6152
arkodg merged 5 commits intoenvoyproxy:mainfrom
jukie:zone-aware-routing-e2e-fix

Conversation

@jukie
Copy link
Contributor

@jukie jukie commented May 22, 2025

What type of PR is this?

Fixes e2e flakiness after a refactoring

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #6119

Release Notes: No

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>
@jukie jukie changed the title NamespacesMustBeReady after GatewayAndHTTPRoutesMustBeAccepted fix: zone-aware-routing e2e falkiness May 22, 2025
@codecov
Copy link

codecov bot commented May 22, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.57%. Comparing base (9b3d897) to head (c5503ff).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6152      +/-   ##
==========================================
+ Coverage   70.50%   70.57%   +0.06%     
==========================================
  Files         219      219              
  Lines       36345    36345              
==========================================
+ Hits        25626    25651      +25     
+ Misses       9196     9177      -19     
+ Partials     1523     1517       -6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

jukie and others added 2 commits May 22, 2025 00:40
Signed-off-by: Jukie <10012479+Jukie@users.noreply.github.com>
@jukie
Copy link
Contributor Author

jukie commented May 22, 2025

/retest

jukie and others added 2 commits May 22, 2025 09:19
Signed-off-by: Jukie <10012479+Jukie@users.noreply.github.com>
Comment on lines +69 to +71
// Make sure all test resources are ready
kubernetes.NamespacesMustBeReady(t, suite.Client, suite.TimeoutConfig, []string{ConformanceInfraNamespace})

Copy link
Contributor Author

@jukie jukie May 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhaohuabing @zirain I narrowed this down to being a race condition if NamespacesMustBeReady() executes prior to GatewayAndHTTPRoutesMustBeAccepted() and that's what was changed with the refactor.

The manifest apply action happens before test execution but the client cache seems to still be outdated and as a result the list actions in NamespacesMustBeReady() are empty and the intended wait action never happens.

I was able to reproduce the failure locally ~50% of the time but by moving NamespacesMustBeReady() into runWeightedBackendTest() after GatewayAndHTTPRoutesMustBeAccepted() execution it consistently suceeds.

Copy link
Member

@zirain zirain May 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTY, can you explain more about this?
NamespacesMustBeReady() make sure that all the pods in the namespace should be ready, it should be more strict than GatewayAndHTTPRoutesMustBeAccepted(), why move it after solve this?
Need more time to refresh the pod list? if so, prefer to bring back

WaitForPods(t, suite.Client, ns, map[string]string{"app": "zone-aware-backend"}, corev1.PodRunning, podReady)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any real proof about the issue being client cache but that's a theory.
I suspect that by first calling GatewayAndHTTPRoutesMustBeAccepted() that implicitly ensures the proxy pods are also up and running. In the case of zone aware routing tests, this is needed due to the affinity rules.

It seems useful that for all tests we'd want to wait for all test resources to be completely ready before proceeding. Limiting to just Pods allows for other edge cases in the future so a comprehensive function like NamespacesMustBeReady() (tbf it currently just checks gateways and pods but could be expanded further) would be my preference but if you'd like to instead bring back WaitForPods() specifically for zone-aware-routing I can make that update.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're using same-namespace gateway, so it's always there.

Let us test with, we can bring WaitForPods() back if needed in the future.

@jukie jukie marked this pull request as ready for review May 22, 2025 17:03
@jukie jukie requested a review from a team as a code owner May 22, 2025 17:03
@jukie
Copy link
Contributor Author

jukie commented May 22, 2025

/retest

Copy link
Member

@zirain zirain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for debugging!

@arkodg arkodg merged commit 668d84b into envoyproxy:main May 23, 2025
44 of 45 checks passed
@jukie jukie deleted the zone-aware-routing-e2e-fix branch May 23, 2025 00:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky: TestE2E/ZoneAwareRouting

3 participants