fix: zone-aware-routing e2e falkiness by jukie · Pull Request #6152 · envoyproxy/gateway

jukie · 2025-05-22T05:40:49Z

What type of PR is this?

Fixes e2e flakiness after a refactoring

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #6119

Release Notes: No

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

codecov · 2025-05-22T05:49:48Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.57%. Comparing base (9b3d897) to head (c5503ff).
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #6152      +/-   ##
==========================================
+ Coverage   70.50%   70.57%   +0.06%     
==========================================
  Files         219      219              
  Lines       36345    36345              
==========================================
+ Hits        25626    25651      +25     
+ Misses       9196     9177      -19     
+ Partials     1523     1517       -6

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: Jukie <10012479+Jukie@users.noreply.github.com>

jukie · 2025-05-22T14:51:17Z

/retest

Signed-off-by: Jukie <10012479+Jukie@users.noreply.github.com>

jukie · 2025-05-22T17:03:21Z

test/e2e/tests/weighted_backend.go

+	// Make sure all test resources are ready
+	kubernetes.NamespacesMustBeReady(t, suite.Client, suite.TimeoutConfig, []string{ConformanceInfraNamespace})
+


@zhaohuabing @zirain I narrowed this down to being a race condition if NamespacesMustBeReady() executes prior to GatewayAndHTTPRoutesMustBeAccepted() and that's what was changed with the refactor.

The manifest apply action happens before test execution but the client cache seems to still be outdated and as a result the list actions in NamespacesMustBeReady() are empty and the intended wait action never happens.

I was able to reproduce the failure locally ~50% of the time but by moving NamespacesMustBeReady() into runWeightedBackendTest() after GatewayAndHTTPRoutesMustBeAccepted() execution it consistently suceeds.

BTY, can you explain more about this?
NamespacesMustBeReady() make sure that all the pods in the namespace should be ready, it should be more strict than GatewayAndHTTPRoutesMustBeAccepted(), why move it after solve this?
Need more time to refresh the pod list? if so, prefer to bring back

WaitForPods(t, suite.Client, ns, map[string]string{"app": "zone-aware-backend"}, corev1.PodRunning, podReady)

I don't have any real proof about the issue being client cache but that's a theory.
I suspect that by first calling GatewayAndHTTPRoutesMustBeAccepted() that implicitly ensures the proxy pods are also up and running. In the case of zone aware routing tests, this is needed due to the affinity rules.

It seems useful that for all tests we'd want to wait for all test resources to be completely ready before proceeding. Limiting to just Pods allows for other edge cases in the future so a comprehensive function like NamespacesMustBeReady() (tbf it currently just checks gateways and pods but could be expanded further) would be my preference but if you'd like to instead bring back WaitForPods() specifically for zone-aware-routing I can make that update.

You're using same-namespace gateway, so it's always there.

Let us test with, we can bring WaitForPods() back if needed in the future.

jukie · 2025-05-22T17:50:57Z

/retest

zirain

Thanks for debugging!

NamespacesMustBeReady after GatewayAndHTTPRoutesMustBeAccepted

618a0ee

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

jukie changed the title ~~NamespacesMustBeReady after GatewayAndHTTPRoutesMustBeAccepted~~ fix: zone-aware-routing e2e falkiness May 22, 2025

jukie and others added 2 commits May 22, 2025 00:40

lint

65118ed

Signed-off-by: Jukie <10012479+Jukie@users.noreply.github.com>

Merge branch 'main' into zone-aware-routing-e2e-fix

0d4adb9

jukie and others added 2 commits May 22, 2025 09:19

trigger tests

a6c3f5f

Signed-off-by: Jukie <10012479+Jukie@users.noreply.github.com>

Merge branch 'main' into zone-aware-routing-e2e-fix

c5503ff

jukie commented May 22, 2025

View reviewed changes

jukie marked this pull request as ready for review May 22, 2025 17:03

jukie requested a review from a team as a code owner May 22, 2025 17:03

jukie mentioned this pull request May 22, 2025

Flaky: TestE2E/ZoneAwareRouting #6119

Closed

zirain approved these changes May 23, 2025

View reviewed changes

arkodg approved these changes May 23, 2025

View reviewed changes

arkodg merged commit 668d84b into envoyproxy:main May 23, 2025
44 of 45 checks passed

jukie deleted the zone-aware-routing-e2e-fix branch May 23, 2025 00:18

arkodg mentioned this pull request Jun 26, 2025

Nominating jukie as a Reviewer #6415

Closed

BrewTestBot mentioned this pull request Aug 8, 2025

egctl 1.5.0 Homebrew/homebrew-core#232799

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: zone-aware-routing e2e falkiness#6152

fix: zone-aware-routing e2e falkiness#6152
arkodg merged 5 commits intoenvoyproxy:mainfrom
jukie:zone-aware-routing-e2e-fix

jukie commented May 22, 2025 •

edited

Loading

Uh oh!

codecov bot commented May 22, 2025 •

edited

Loading

Uh oh!

jukie commented May 22, 2025

Uh oh!

jukie May 22, 2025 •

edited

Loading

Uh oh!

zirain May 22, 2025 •

edited

Loading

Uh oh!

jukie May 22, 2025

Uh oh!

zirain May 23, 2025

Uh oh!

jukie commented May 22, 2025

Uh oh!

zirain left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		// Make sure all test resources are ready
		kubernetes.NamespacesMustBeReady(t, suite.Client, suite.TimeoutConfig, []string{ConformanceInfraNamespace})

Conversation

jukie commented May 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented May 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jukie commented May 22, 2025

Uh oh!

jukie May 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zirain May 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jukie May 22, 2025

Choose a reason for hiding this comment

Uh oh!

zirain May 23, 2025

Choose a reason for hiding this comment

Uh oh!

jukie commented May 22, 2025

Uh oh!

zirain left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jukie commented May 22, 2025 •

edited

Loading

codecov bot commented May 22, 2025 •

edited

Loading

jukie May 22, 2025 •

edited

Loading

zirain May 22, 2025 •

edited

Loading