fix(ci): improve reliability of the router-tests #2577
Conversation
Note: Reviews paused. This branch is under active development, so CodeRabbit automatically paused its review to avoid overwhelming you with comments from an influx of new commits.
Walkthrough
CI job timeouts/parallelism and Kafka health checks updated; EngineStats gained a predicate-based Wait with a condition variable and broadcasts; testenv WaitFor* helpers now use EngineStats.Wait; multiple event tests reorganized into separate flaky suites.
Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
📝 Coding Plan
Router image scan passed ✅ No security vulnerabilities found in image:
🧹 Nitpick comments (1)
.github/workflows/router-ci.yaml (1)
373-374: Consider lifting `GOMAXPROCS` to job-level `env` to avoid drift. Both test steps set the same value; placing it once under the `integration_test` job's `env` keeps config centralized.
Also applies to: 395-396
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/router-ci.yaml around lines 373 - 374, Move the duplicated GOMAXPROCS environment variable out of the individual test steps and into the job-level env so both integration_test steps inherit the value; specifically, remove the GOMAXPROCS entries currently set under the two test step env blocks and add a single GOMAXPROCS: 4 entry under the job's env section so the integration_test steps no longer repeat it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In @.github/workflows/router-ci.yaml:
- Around line 373-374: Move the duplicated GOMAXPROCS environment variable out
of the individual test steps and into the job-level env so both integration_test
steps inherit the value; specifically, remove the GOMAXPROCS entries currently
set under the two test step env blocks and add a single GOMAXPROCS: 4 entry
under the job's env section so the integration_test steps no longer repeat it.
Codecov Report ❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2577 +/- ##
===========================================
- Coverage 89.30% 62.65% -26.66%
===========================================
Files 21 244 +223
Lines 4527 25831 +21304
Branches 1248 0 -1248
===========================================
+ Hits 4043 16184 +12141
- Misses 484 8298 +7814
- Partials 0 1349 +1349
Actionable comments posted: 1
🧹 Nitpick comments (1)
router-tests/events/nats_events_test.go (1)
1719-1719: Consider reducing the polling interval for faster test feedback. The 10-second polling interval on `require.Eventually` seems long given the 30-second `NatsWaitTimeout`. A shorter interval (e.g., 1 second) would provide faster test feedback while still being reasonable.

```diff
- }, NatsWaitTimeout, 10*time.Second)
+ }, NatsWaitTimeout, time.Second)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@router-tests/events/nats_events_test.go` at line 1719, The test uses require.Eventually with a long polling interval (10*time.Second) against NatsWaitTimeout; change the polling interval in that require.Eventually call to a shorter value (e.g., 1*time.Second) so the test checks more frequently while keeping the same overall timeout (NatsWaitTimeout) — locate the require.Eventually invocation in the test (around the block using NatsWaitTimeout) and replace the interval argument (currently 10*time.Second) with 1*time.Second.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@router/pkg/statistics/engine_stats.go`:
- Around line 68-90: The wait goroutine in EngineStats.Wait can leak when ctx is
cancelled because after s.cond.Broadcast() the goroutine wakes, re-evaluates the
predicate and may call s.cond.Wait() again and block forever; modify the
goroutine started in EngineStats.Wait (the anonymous func) to also observe ctx
cancellation inside its loop (e.g., after acquiring s.mu and before calling
s.cond.Wait()), so it breaks out and closes done when ctx.Done() is signaled;
ensure you reference s.cond.Wait, predicate(s.GetReport()), done and ctx to
perform a select or explicit check and return early to avoid re-waiting on the
cond after context cancellation.
---
Nitpick comments:
In `@router-tests/events/nats_events_test.go`:
- Line 1719: The test uses require.Eventually with a long polling interval
(10*time.Second) against NatsWaitTimeout; change the polling interval in that
require.Eventually call to a shorter value (e.g., 1*time.Second) so the test
checks more frequently while keeping the same overall timeout (NatsWaitTimeout)
— locate the require.Eventually invocation in the test (around the block using
NatsWaitTimeout) and replace the interval argument (currently 10*time.Second)
with 1*time.Second.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
router-tests/events/nats_events_test.gorouter-tests/testenv/testenv.gorouter/pkg/statistics/engine_stats.go
Actionable comments posted: 1
♻️ Duplicate comments (1)
router/pkg/statistics/engine_stats.go (1)
74-92: ⚠️ Potential issue | 🔴 Critical: Cancellation can still miss the wake-up and strand the waiter goroutine.
At lines 90-91, `Broadcast()` is not synchronized with `s.mu`. If cancellation fires just before the waiter calls `s.cond.Wait()`, the wake-up can be missed and the goroutine can block indefinitely (also, `done` is never closed on the cancel path).
🐛 Proposed fix

```diff
 func (s *EngineStats) Wait(ctx context.Context, predicate func(*UsageReport) bool) {
 	if predicate(s.GetReport()) {
 		return
 	}
 	done := make(chan struct{})
 	go func() {
 		s.mu.Lock()
 		defer s.mu.Unlock()
-		for !predicate(s.GetReport()) {
-			s.cond.Wait()
-			if ctx.Err() != nil {
-				return
-			}
-		}
-		close(done)
+		defer close(done)
+		for !predicate(s.GetReport()) {
+			if ctx.Err() != nil {
+				return
+			}
+			s.cond.Wait()
+		}
 	}()
 	select {
 	case <-done:
 	case <-ctx.Done():
-		// Unblock the goroutine waiting on cond.Wait()
-		s.cond.Broadcast()
+		// Serialize with waiter loop to avoid lost wake-ups.
+		s.mu.Lock()
+		s.cond.Broadcast()
+		s.mu.Unlock()
+		<-done
 	}
 }
```

In Go sync.Cond usage, can a Broadcast be missed if it happens before a goroutine starts Wait, and does taking the same mutex around Broadcast avoid this lost-wakeup race with cancellation?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@router/pkg/statistics/engine_stats.go` around lines 74 - 92, The waiter goroutine can be stranded because s.cond.Broadcast() is called without holding s.mu and done is never closed on cancel; fix by taking s.mu before broadcasting and closing done under that same lock so the wakeup cannot be missed: in the cancel branch of the select acquire s.mu, call s.cond.Broadcast(), close(done) (or set a cancellation flag and close done) and then release s.mu so the goroutine waiting in the anonymous function will observe ctx.Err() and return; reference s.mu, s.cond.Broadcast(), done, predicate, and the anonymous goroutine that calls s.cond.Wait().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/router-ci.yaml:
- Around line 373-375: The current test_params uses the flawed regex string
"-run '^Test[^(Flaky)]'" which incorrectly excludes any test whose second
character is one of (F,l,a,k,y,) and thus omits valid tests like
TestFallbackErrors and others; replace this pattern by running all tests
matching "^Test" and explicitly skipping flaky ones by adding a separate skip
flag, e.g. use "-run '^Test' -skip '^TestFlaky'" (remove the "[^(Flaky)]"
pattern) so TestFallbackErrors, TestFeatureFlags, TestFileUpload_*,
TestForwardHeaders, TestForwardRenamedHeaders, etc., are included while flaky
tests named like TestFlaky* are excluded.
---
Duplicate comments:
In `@router/pkg/statistics/engine_stats.go`:
- Around line 74-92: The waiter goroutine can be stranded because
s.cond.Broadcast() is called without holding s.mu and done is never closed on
cancel; fix by taking s.mu before broadcasting and closing done under that same
lock so the wakeup cannot be missed: in the cancel branch of the select acquire
s.mu, call s.cond.Broadcast(), close(done) (or set a cancellation flag and close
done) and then release s.mu so the goroutine waiting in the anonymous function
will observe ctx.Err() and return; reference s.mu, s.cond.Broadcast(), done,
predicate, and the anonymous goroutine that calls s.cond.Wait().
📒 Files selected for processing (6)
.github/workflows/router-ci.yamlrouter-tests/events/kafka_events_test.gorouter-tests/events/nats_events_test.gorouter-tests/testenv/testenv.gorouter-tests/websocket_test.gorouter/pkg/statistics/engine_stats.go
🚧 Files skipped from review as they are similar to previous changes (1)
- router-tests/events/nats_events_test.go
🧹 Nitpick comments (1)
.github/workflows/router-ci.yaml (1)
375-395: Consider deduplicating shared test flags to avoid drift between flaky/non-flaky jobs.
`--timeout` and `--parallel` are repeated in both commands; centralizing them in job env makes future tuning safer.
♻️ Proposed refactor

```diff
 integration_test:
   runs-on: ubuntu-latest-l
   timeout-minutes: 30
   env:
     GOMAXPROCS: 4
+    TEST_TIMEOUT: 8m
+    TEST_PARALLEL: 2
@@
       - name: Run Integration tests ${{ matrix.test_target }}
         working-directory: ./router-tests
-        run: make test-coverage test_retry_count=0 test_params="-run '^Test' -skip '^TestFlaky' --timeout=8m -p 1 --parallel 2" test_target="${{ matrix.test_target }}"
+        run: make test-coverage test_retry_count=0 test_params="-run '^Test' -skip '^TestFlaky' --timeout=${TEST_TIMEOUT} -p 1 --parallel ${TEST_PARALLEL}" test_target="${{ matrix.test_target }}"
@@
       - name: Run Flaky Integration tests ${{ matrix.test_target }}
         working-directory: ./router-tests
-        run: make test-coverage test_retry_count=3 test_params="-run '^TestFlaky' --timeout=8m -p 1 --parallel 2" test_target="${{ matrix.test_target }}"
+        run: make test-coverage test_retry_count=3 test_params="-run '^TestFlaky' --timeout=${TEST_TIMEOUT} -p 1 --parallel ${TEST_PARALLEL}" test_target="${{ matrix.test_target }}"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/router-ci.yaml around lines 375 - 395, The repeated test flags (--timeout and --parallel) in the two make invocations should be centralized into job-level environment variables to avoid drift: define e.g. env entries TEST_TIMEOUT and TEST_PARALLEL (or a single TEST_PARAMS) at the job level, then replace the inline test_params in both make test-coverage calls (the run lines invoking make test-coverage with test_retry_count and test_target) to reference those env vars; also propagate the same variable into the flaky and non-flaky commands so matrix.test_target and steps.artifact_name.outputs.sanitized logic remains unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In @.github/workflows/router-ci.yaml:
- Around line 375-395: The repeated test flags (--timeout and --parallel) in the
two make invocations should be centralized into job-level environment variables
to avoid drift: define e.g. env entries TEST_TIMEOUT and TEST_PARALLEL (or a
single TEST_PARAMS) at the job level, then replace the inline test_params in
both make test-coverage calls (the run lines invoking make test-coverage with
test_retry_count and test_target) to reference those env vars; also propagate
the same variable into the flaky and non-flaky commands so matrix.test_target
and steps.artifact_name.outputs.sanitized logic remains unchanged.
📒 Files selected for processing (4)
.github/workflows/router-ci.yamlrouter-tests/events/kafka_events_test.gorouter-tests/events/nats_events_test.gorouter-tests/events/redis_events_test.go
Actionable comments posted: 2
🧹 Nitpick comments (2)
router-tests/events/redis_events_test.go (1)
785-865: Good isolation of timing-sensitive tests. Creating a separate `TestFlakyRedisEvents` suite for tests that are more sensitive to timing/resource constraints is a reasonable approach for test organization and CI stability.
The "subscribe sync sse legacy method works" test (lines 792-865) is structurally very similar to "subscribe sync sse" (lines 552-624). If these tests are intentionally covering different code paths (current vs. legacy SSE handling), consider adding a brief comment explaining what distinguishes the "legacy method" to help future maintainers understand why both tests exist.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@router-tests/events/redis_events_test.go` around lines 785 - 865, Add a short clarifying comment above the t.Run block named "subscribe sync sse legacy method works" inside TestFlakyRedisEvents explaining what makes this test exercise the legacy SSE handling versus the "subscribe sync sse" test (e.g., different request headers/format, older endpoint, or alternate server path), so future maintainers can see why both tests exist; locate the t.Run with the exact name "subscribe sync sse legacy method works" in TestFlakyRedisEvents and insert the one-line explanation just before the test setup (after the t.Run(...) line and before building subscribePayload).
router-tests/events/kafka_events_test.go (1)
587-588: Avoid duplicated magic read timeout values. Line 587 and Line 657 both hard-code `5 * time.Second`. Consider extracting a small constant (e.g., `KafkaReadDeadline`) so timeout tuning stays centralized.
Also applies to: 657-658
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@router-tests/events/kafka_events_test.go` around lines 587 - 588, Extract the duplicated magic timeout into a single constant (e.g., KafkaReadDeadline time.Duration = 5 * time.Second) at the top of the test file and replace the hard-coded uses in the conn.SetReadDeadline calls and any related ReadJSON timeout usages (currently the 5 * time.Second literals used around conn.SetReadDeadline and subsequent reads) with that constant; ensure the constant is a time.Duration so callers like conn.SetReadDeadline(time.Now().Add(KafkaReadDeadline)) continue to work and update both occurrences (the ones around conn.SetReadDeadline and the read calls mentioned) to reference KafkaReadDeadline.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@router-tests/events/kafka_events_test.go`:
- Around line 934-936: The new top-level integration test TestFlakyKafkaEvents
lacks a short-mode guard and will run during go test -short; update
TestFlakyKafkaEvents to check testing.Short() and call t.Skip("skipping
integration test in short mode") (same pattern as TestKafkaEvents) at the top of
the function so Kafka-dependent integration tests are skipped when tests run in
short mode.
In `@router-tests/singleflight_test.go`:
- Line 566: Update the other three tests that call
xEnv.WaitForSubscriptionCount(..., time.Second*5) to use time.Second*15 instead
so the timeout increase is applied consistently; locate the calls to
WaitForSubscriptionCount in the tests labeled "subscription deduplication with
multiple subgraphs", "subscription deduplication with multiple subgraphs -
single flight disabled", and "subscription deduplication with multiple subgraphs
- same headers" (they pass uint64(numOfOperations) and use the xEnv variable)
and change the timeout argument from time.Second*5 to time.Second*15.
---
Nitpick comments:
In `@router-tests/events/kafka_events_test.go`:
- Around line 587-588: Extract the duplicated magic timeout into a single
constant (e.g., KafkaReadDeadline time.Duration = 5 * time.Second) at the top of
the test file and replace the hard-coded uses in the conn.SetReadDeadline calls
and any related ReadJSON timeout usages (currently the 5 * time.Second literals
used around conn.SetReadDeadline and subsequent reads) with that constant;
ensure the constant is a time.Duration so callers like
conn.SetReadDeadline(time.Now().Add(KafkaReadDeadline)) continue to work and
update both occurrences (the ones around conn.SetReadDeadline and the read calls
mentioned) to reference KafkaReadDeadline.
In `@router-tests/events/redis_events_test.go`:
- Around line 785-865: Add a short clarifying comment above the t.Run block
named "subscribe sync sse legacy method works" inside TestFlakyRedisEvents
explaining what makes this test exercise the legacy SSE handling versus the
"subscribe sync sse" test (e.g., different request headers/format, older
endpoint, or alternate server path), so future maintainers can see why both
tests exist; locate the t.Run with the exact name "subscribe sync sse legacy
method works" in TestFlakyRedisEvents and insert the one-line explanation just
before the test setup (after the t.Run(...) line and before building
subscribePayload).
📒 Files selected for processing (4)
router-tests/events/kafka_events_test.gorouter-tests/events/redis_events_test.gorouter-tests/singleflight_test.gorouter-tests/websocket_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- router-tests/websocket_test.go
Have you considered using something like a "SyncReporter" that fulfills the Reporter interface in these tests, instead of the synchronisation hacked onto EngineStats? It would remove a lot of the complexity around waiting for the engine to report something if it were directly channel-based and blocking. I don't believe any of the methods on Reporter are called on a timer or anything, but if they are, it could either be configurable which report functions to care about, or there could be separate reporters that block on each signal, or a mask of signals to mitigate the issue.
88188d0 to eae1390 · Compare
9bcc31f to 1d21013 · Compare
- Extract duplicate event test helpers into shared helpers_test.go - Rename import alias from `integration` to `routertests` in observability tests - Replace AGENTS.md symlink with standalone file referencing CLAUDE.md Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t aliases Move utils.go from root integration package to router-tests/testutils/ package, update all 25 consuming files to import testutils instead of using the confusing integration alias, and remove the "trigger CI 2" comment artifact. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Noroth
left a comment
There was a problem hiding this comment.
Approved from my end, but it's a lot. You might want a second approval.
Merge from main introduced references to NatsWaitTimeout which doesn't exist. The correct constant is EventWaitTimeout from helpers_test.go. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Recent CI failures were OOM kills (`signal: killed`), not actual test bugs.
Changes
- Split test targets: './. ./fuzzquery ./lifecycle ./modules' → './.' + './fuzzquery ./lifecycle ./modules' (5 jobs instead of 4)
- --parallel 10 → --parallel 4 (each parallel test spins up a full router + gRPC plugin subprocess; 10 concurrent with -race exhausts 16 GB RAM)
- Kafka health check: kafka-broker-api-versions.sh --version only checks the binary exists; now uses --bootstrap-server localhost:9092 to verify broker readiness
Evidence
Analyzed the 7 most recent CI failures — all showed signal: killed on plugin subprocesses (OOM), zero actual --- FAIL test assertions.
Test plan
- CI runs complete without signal: killed
🤖 Generated with Claude Code
Summary by CodeRabbit