fix: zone-aware-routing e2e flakiness #6152
Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #6152 +/- ##
==========================================
+ Coverage 70.50% 70.57% +0.06%
==========================================
Files 219 219
Lines 36345 36345
==========================================
+ Hits 25626 25651 +25
+ Misses 9196 9177 -19
+ Partials 1523 1517 -6
/retest
Signed-off-by: Jukie <10012479+Jukie@users.noreply.github.com>
// Make sure all test resources are ready
kubernetes.NamespacesMustBeReady(t, suite.Client, suite.TimeoutConfig, []string{ConformanceInfraNamespace})
There was a problem hiding this comment.
@zhaohuabing @zirain I narrowed this down to a race condition that occurs when NamespacesMustBeReady() executes before GatewayAndHTTPRoutesMustBeAccepted(), which is the ordering the refactor introduced.
The manifests are applied before the test executes, but the client cache still seems to be stale at that point. As a result, the list calls in NamespacesMustBeReady() come back empty and the intended wait never happens.
I was able to reproduce the failure locally ~50% of the time, but after moving NamespacesMustBeReady() into runWeightedBackendTest() so that it runs after GatewayAndHTTPRoutesMustBeAccepted(), it consistently succeeds.
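If the stale-list theory above holds, the failure mode can be sketched in isolation: a readiness check that iterates over an empty list passes without waiting for anything. The snippet below is only an illustration under that assumption. `pod`, `lister`, `allReady`, and `allReadyNonEmpty` are hypothetical names, not the conformance helpers or their real signatures.

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a hypothetical stand-in for a Kubernetes Pod; only readiness matters here.
type pod struct {
	name  string
	ready bool
}

// lister is a hypothetical stand-in for a cached client's List call.
type lister func() []pod

// allReady mirrors the suspected failure mode: with an empty list the loop
// body never runs, so the check passes without waiting for anything.
func allReady(list lister) bool {
	for _, p := range list() {
		if !p.ready {
			return false
		}
	}
	return true
}

// allReadyNonEmpty additionally treats an empty list as "not ready yet",
// so a caller polling on it would keep waiting instead of passing vacuously.
func allReadyNonEmpty(list lister) (bool, error) {
	pods := list()
	if len(pods) == 0 {
		return false, errors.New("no pods listed yet")
	}
	for _, p := range pods {
		if !p.ready {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	// Simulates a cache that has not yet observed the applied manifests.
	stale := func() []pod { return nil }

	fmt.Println(allReady(stale)) // prints: true

	ok, err := allReadyNonEmpty(stale)
	fmt.Println(ok, err) // prints: false no pods listed yet
}
```

Running the readiness check only after a step that is known to observe the resources, as this PR does, avoids the vacuous pass without changing the check itself.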
There was a problem hiding this comment.
BTW, can you explain more about this?
NamespacesMustBeReady() makes sure that all the pods in the namespace are ready, so it should be stricter than GatewayAndHTTPRoutesMustBeAccepted(). Why does moving it afterwards solve this?
Does it need more time to refresh the pod list? If so, I'd prefer to bring back
WaitForPods(t, suite.Client, ns, map[string]string{"app": "zone-aware-backend"}, corev1.PodRunning, podReady)
There was a problem hiding this comment.
I don't have any real proof that the issue is the client cache; that's only a theory.
I suspect that calling GatewayAndHTTPRoutesMustBeAccepted() first implicitly ensures the proxy pods are also up and running. For the zone-aware routing tests, this is needed because of the affinity rules.
It seems useful for all tests to wait until every test resource is completely ready before proceeding. Limiting the wait to just Pods leaves room for other edge cases in the future, so my preference would be a comprehensive function like NamespacesMustBeReady() (tbf it currently just checks gateways and pods, but it could be expanded further). If you'd rather bring back WaitForPods() specifically for zone-aware-routing, I can make that update.
There was a problem hiding this comment.
You're using a same-namespace gateway, so it's always there.
Let's test with this; we can bring WaitForPods() back if needed in the future.
/retest
What type of PR is this?
Fixes e2e flakiness after a refactoring
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #6119
Release Notes: No