fix(ci): improve reliability of the router-tests #2577

Merged
jensneuse merged 489 commits into main from jensneuse/fix-flaky-tests
Mar 13, 2026

Conversation

@jensneuse
Member

@jensneuse jensneuse commented Mar 3, 2026

Summary

  • All recent CI failures on integration tests were caused by OOM (signal: killed), not actual test bugs
  • Reduces memory pressure by splitting the largest matrix entry, reducing parallelism (10→4), adding GOMAXPROCS=4, and fixing the Kafka health check
  • Increases test timeout (5m→8m) to accommodate lower parallelism

Changes

  1. Split test matrix: './. ./fuzzquery ./lifecycle ./modules' → './.' + './fuzzquery ./lifecycle ./modules' (5 jobs instead of 4)
  2. Reduce parallelism: --parallel 10 → --parallel 4 (each parallel test spins up a full router + gRPC plugin subprocess; 10 concurrent with -race exhausts 16 GB RAM)
  3. Add GOMAXPROCS=4: Limits Go scheduler threads, reducing race detector memory overhead
  4. Fix Kafka health check: kafka-broker-api-versions.sh --version only checks binary exists; now uses --bootstrap-server localhost:9092 to verify broker readiness
  5. Increase test timeout: 5m → 8m to accommodate fewer parallel tests

Evidence

Analyzed 7 most recent CI failures — all showed signal: killed on plugin subprocesses (OOM), zero actual --- FAIL test assertions:

| Run | killed count | test failures |
|---|---|---|
| 22612622051 | 3 | 0 |
| 22612367177 | 6 | 0 |
| 22602367418 | 3 | 0 |
| 22593260747 | 4 | 0 |
| 22592631652 | 6 | 0 |
| 22592307099 | 6 | 0 |
| 22591585050 | 6 | 0 |

Test plan

  • All 5 integration_test matrix jobs pass without signal: killed
  • Wall-clock time stays under 25 minutes per job
  • Re-run CI 2-3 times to verify stability

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
    • Added separate "flaky" event test suites for NATS, Kafka and Redis; moved and expanded several scenarios into them, added explicit synchronization after message production, and adjusted read deadline handling.
  • Refactor
    • Test helpers switched to condition-based waiting tied to engine statistics for more deterministic waits.
  • Chores
    • CI adjustments: increased test timeouts, reduced parallelism, set GOMAXPROCS, and strengthened Kafka health checks and retries.

@coderabbitai
Contributor

coderabbitai Bot commented Mar 3, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

CI job timeouts/parallelism and Kafka health checks updated; EngineStatistics gained a predicate-based Wait with condition variable and broadcasts; testenv WaitFor* helpers now use EngineStats.Wait; multiple event tests reorganized into TestFlaky* suites and synchronized via message-count waits/deadlines.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| CI Workflow | /.github/workflows/router-ci.yaml | Adjusted integration job matrix entries; set GOMAXPROCS=4; changed Kafka health check to use `--bootstrap-server ... |
| Engine statistics core | router/pkg/statistics/engine_stats.go | Added Wait(ctx, predicate) to EngineStatistics; EngineStats now includes cond *sync.Cond and implements predicate-based Wait; state-mutating methods call cond.Broadcast(); NoopEngineStats implements Wait by waiting for context cancellation. |
| Test environment helpers | router-tests/testenv/testenv.go | Imported statistics; replaced manual polling loops in multiple WaitFor* helpers with EngineStats.Wait predicate-based waits and perform final assertions using the latest report. |
| NATS tests | router-tests/events/nats_events_test.go | Added TestFlakyNatsEvents; removed two subtests from the original suite; tightened one retry timeout (10s→1s); added explicit read-deadline set/reset around ReadJSON; reorganized flaky scenarios and client cleanup. |
| Kafka tests | router-tests/events/kafka_events_test.go | Moved two subtests into new TestFlakyKafkaEvents; introduced expectedMessages counters and WaitForMessagesSent gating before reads; switched to explicit per-read deadlines (5s) and added additional WaitForMessagesSent synchronization. |
| Redis tests | router-tests/events/redis_events_test.go | Removed a legacy SSE subtest from main suite; added TestFlakyRedisEvents containing moved/reworked tests (mutation/typename checks, invalid JSON/missing entity cases, multi-client lifecycle); added WaitForMessagesSent gating after produces. |
| WebSocket test | router-tests/websocket_test.go | Replaced atomic completion flag and require.Eventually with WaitForMessagesSent(1, 10s); increased some subscription wait timeouts (5s→10s). |
| Singleflight test | router-tests/singleflight_test.go | Increased a subscription wait timeout from 5s to 15s inside TestSingleFlight goroutine. |
| Module / Misc | go.mod, manifest_file | Minor module/dependency adjustments accompanying test changes. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 10.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The PR title claims to 'improve reliability' but the actual objectives focus on reducing OOM kills and resource pressure through specific optimizations. The title is vague and doesn't capture the main technical changes. | Clarify the title to more specifically reference OOM/memory reduction (e.g., 'fix(ci): reduce OOM kills in router integration tests') or resource optimization rather than generic 'reliability improvement'. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@github-actions

github-actions Bot commented Mar 3, 2026

Router image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-da87bfa75ea16eaf734d53c764215f25a8d1947e

Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
.github/workflows/router-ci.yaml (1)

373-374: Consider lifting GOMAXPROCS to job-level env to avoid drift.

Both test steps set the same value; placing it once under integration_test.env keeps config centralized.

Also applies to: 395-396

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/router-ci.yaml around lines 373 - 374, Move the duplicated
GOMAXPROCS environment variable out of the individual test steps and into the
job-level env so both integration_test steps inherit the value; specifically,
remove the GOMAXPROCS entries currently set under the two test step env blocks
and add a single GOMAXPROCS: 4 entry under the job's env section so the
integration_test steps no longer repeat it.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b5bc471 and 7daf34b.

📒 Files selected for processing (1)
  • .github/workflows/router-ci.yaml

@codecov

codecov Bot commented Mar 3, 2026

Codecov Report

❌ Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.65%. Comparing base (56859f0) to head (655cbf5).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| router/core/router.go | 66.66% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2577       +/-   ##
===========================================
- Coverage   89.30%   62.65%   -26.66%     
===========================================
  Files          21      244      +223     
  Lines        4527    25831    +21304     
  Branches     1248        0     -1248     
===========================================
+ Hits         4043    16184    +12141     
- Misses        484     8298     +7814     
- Partials        0     1349     +1349     

| Files with missing lines | Coverage Δ |
|---|---|
| router/core/cache_warmup.go | 88.94% <100.00%> (ø) |
| router/pkg/pubsub/kafka/adapter.go | 65.46% <100.00%> (ø) |
| router/pkg/statistics/engine_stats.go | 3.38% <ø> (ø) |
| router/pkg/trace/meter.go | 44.35% <ø> (ø) |
| router/core/router.go | 69.41% <66.66%> (ø) |

... and 260 files with indirect coverage changes

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
router-tests/events/nats_events_test.go (1)

1719-1719: Consider reducing the polling interval for faster test feedback.

The 10-second polling interval on require.Eventually seems long given the 30-second NatsWaitTimeout. A shorter interval (e.g., 1 second) would provide faster test feedback while still being reasonable.

-			}, NatsWaitTimeout, 10*time.Second)
+			}, NatsWaitTimeout, time.Second)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router-tests/events/nats_events_test.go` at line 1719, The test uses
require.Eventually with a long polling interval (10*time.Second) against
NatsWaitTimeout; change the polling interval in that require.Eventually call to
a shorter value (e.g., 1*time.Second) so the test checks more frequently while
keeping the same overall timeout (NatsWaitTimeout) — locate the
require.Eventually invocation in the test (around the block using
NatsWaitTimeout) and replace the interval argument (currently 10*time.Second)
with 1*time.Second.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@router/pkg/statistics/engine_stats.go`:
- Around line 68-90: The wait goroutine in EngineStats.Wait can leak when ctx is
cancelled because after s.cond.Broadcast() the goroutine wakes, re-evaluates the
predicate and may call s.cond.Wait() again and block forever; modify the
goroutine started in EngineStats.Wait (the anonymous func) to also observe ctx
cancellation inside its loop (e.g., after acquiring s.mu and before calling
s.cond.Wait()), so it breaks out and closes done when ctx.Done() is signaled;
ensure you reference s.cond.Wait, predicate(s.GetReport()), done and ctx to
perform a select or explicit check and return early to avoid re-waiting on the
cond after context cancellation.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7daf34b and a4160ea.

📒 Files selected for processing (3)
  • router-tests/events/nats_events_test.go
  • router-tests/testenv/testenv.go
  • router/pkg/statistics/engine_stats.go

Comment thread router/pkg/statistics/engine_stats.go Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
router/pkg/statistics/engine_stats.go (1)

74-92: ⚠️ Potential issue | 🔴 Critical

Cancellation can still miss the wake-up and strand the waiter goroutine.

At Line 90–Line 91, Broadcast() is not synchronized with s.mu. If cancellation fires just before the waiter calls s.cond.Wait(), the wake-up can be missed and the goroutine can block indefinitely (also done is never closed on cancel path).

🐛 Proposed fix
 func (s *EngineStats) Wait(ctx context.Context, predicate func(*UsageReport) bool) {
 	if predicate(s.GetReport()) {
 		return
 	}

 	done := make(chan struct{})
 	go func() {
 		s.mu.Lock()
 		defer s.mu.Unlock()
-		for !predicate(s.GetReport()) {
-			s.cond.Wait()
-			if ctx.Err() != nil {
-				return
-			}
-		}
-		close(done)
+		defer close(done)
+		for !predicate(s.GetReport()) {
+			if ctx.Err() != nil {
+				return
+			}
+			s.cond.Wait()
+		}
 	}()

 	select {
 	case <-done:
 	case <-ctx.Done():
-		// Unblock the goroutine waiting on cond.Wait()
-		s.cond.Broadcast()
+		// Serialize with waiter loop to avoid lost wake-ups.
+		s.mu.Lock()
+		s.cond.Broadcast()
+		s.mu.Unlock()
+		<-done
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router/pkg/statistics/engine_stats.go` around lines 74 - 92, The waiter
goroutine can be stranded because s.cond.Broadcast() is called without holding
s.mu and done is never closed on cancel; fix by taking s.mu before broadcasting
and closing done under that same lock so the wakeup cannot be missed: in the
cancel branch of the select acquire s.mu, call s.cond.Broadcast(), close(done)
(or set a cancellation flag and close done) and then release s.mu so the
goroutine waiting in the anonymous function will observe ctx.Err() and return;
reference s.mu, s.cond.Broadcast(), done, predicate, and the anonymous goroutine
that calls s.cond.Wait().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/router-ci.yaml:
- Around line 373-375: The current test_params uses the flawed regex string
"-run '^Test[^(Flaky)]'". The bracket expression is a negated character class,
not a negative lookahead, so it rejects any test whose first character after the
Test prefix is one of (, F, l, a, k, y or ), and thus omits valid tests like
TestFallbackErrors and others; replace this pattern by running all tests
matching "^Test" and explicitly skipping flaky ones by adding a separate skip
flag, e.g. use "-run '^Test' -skip '^TestFlaky'" (remove the "[^(Flaky)]"
pattern) so TestFallbackErrors, TestFeatureFlags, TestFileUpload_*,
TestForwardHeaders, TestForwardRenamedHeaders, etc., are included while flaky
tests named like TestFlaky* are excluded.
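
The difference between the two selections can be checked with Go's `regexp` package. This is a minimal sketch: `selectedFlawed` and `selectedFixed` are hypothetical helpers that only approximate how `go test` applies `-run` and `-skip` to top-level test names.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	flawed = regexp.MustCompile(`^Test[^(Flaky)]`) // negated character class
	run    = regexp.MustCompile(`^Test`)
	skip   = regexp.MustCompile(`^TestFlaky`)
)

// selectedFlawed reports whether the original pattern would pick the test.
func selectedFlawed(name string) bool { return flawed.MatchString(name) }

// selectedFixed reports whether "-run '^Test' -skip '^TestFlaky'" would.
func selectedFixed(name string) bool {
	return run.MatchString(name) && !skip.MatchString(name)
}

func main() {
	for _, name := range []string{"TestKafkaEvents", "TestFallbackErrors", "TestFlakyKafkaEvents"} {
		fmt.Printf("%s flawed=%v fixed=%v\n", name, selectedFlawed(name), selectedFixed(name))
	}
	// Prints:
	// TestKafkaEvents flawed=true fixed=true
	// TestFallbackErrors flawed=false fixed=true
	// TestFlakyKafkaEvents flawed=false fixed=false
}
```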

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a4160ea and 97bae6a.

📒 Files selected for processing (6)
  • .github/workflows/router-ci.yaml
  • router-tests/events/kafka_events_test.go
  • router-tests/events/nats_events_test.go
  • router-tests/testenv/testenv.go
  • router-tests/websocket_test.go
  • router/pkg/statistics/engine_stats.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • router-tests/events/nats_events_test.go

Comment thread .github/workflows/router-ci.yaml Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
.github/workflows/router-ci.yaml (1)

375-395: Consider deduplicating shared test flags to avoid drift between flaky/non-flaky jobs.

--timeout and --parallel are repeated in both commands; centralizing them in job env makes future tuning safer.

♻️ Proposed refactor
   integration_test:
     runs-on: ubuntu-latest-l
     timeout-minutes: 30
     env:
       GOMAXPROCS: 4
+      TEST_TIMEOUT: 8m
+      TEST_PARALLEL: 2
@@
       - name: Run Integration tests ${{ matrix.test_target }}
         working-directory: ./router-tests
-        run: make test-coverage test_retry_count=0 test_params="-run '^Test' -skip '^TestFlaky' --timeout=8m -p 1 --parallel 2" test_target="${{ matrix.test_target }}"
+        run: make test-coverage test_retry_count=0 test_params="-run '^Test' -skip '^TestFlaky' --timeout=${TEST_TIMEOUT} -p 1 --parallel ${TEST_PARALLEL}" test_target="${{ matrix.test_target }}"
@@
       - name: Run Flaky Integration tests ${{ matrix.test_target }}
         working-directory: ./router-tests
-        run: make test-coverage test_retry_count=3 test_params="-run '^TestFlaky' --timeout=8m -p 1 --parallel 2" test_target="${{ matrix.test_target }}"
+        run: make test-coverage test_retry_count=3 test_params="-run '^TestFlaky' --timeout=${TEST_TIMEOUT} -p 1 --parallel ${TEST_PARALLEL}" test_target="${{ matrix.test_target }}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/router-ci.yaml around lines 375 - 395, The repeated test
flags (--timeout and --parallel) in the two make invocations should be
centralized into job-level environment variables to avoid drift: define e.g. env
entries TEST_TIMEOUT and TEST_PARALLEL (or a single TEST_PARAMS) at the job
level, then replace the inline test_params in both make test-coverage calls (the
run lines invoking make test-coverage with test_retry_count and test_target) to
reference those env vars; also propagate the same variable into the flaky and
non-flaky commands so matrix.test_target and
steps.artifact_name.outputs.sanitized logic remains unchanged.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 97bae6a and b04b401.

📒 Files selected for processing (4)
  • .github/workflows/router-ci.yaml
  • router-tests/events/kafka_events_test.go
  • router-tests/events/nats_events_test.go
  • router-tests/events/redis_events_test.go

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
router-tests/events/redis_events_test.go (1)

785-865: Good isolation of timing-sensitive tests.

Creating a separate TestFlakyRedisEvents suite for tests that are more sensitive to timing/resource constraints is a reasonable approach for test organization and CI stability.

The "subscribe sync sse legacy method works" test (lines 792-865) is structurally very similar to "subscribe sync sse" (lines 552-624). If these tests are intentionally covering different code paths (current vs. legacy SSE handling), consider adding a brief comment explaining what distinguishes the "legacy method" to help future maintainers understand why both tests exist.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router-tests/events/redis_events_test.go` around lines 785 - 865, Add a short
clarifying comment above the t.Run block named "subscribe sync sse legacy method
works" inside TestFlakyRedisEvents explaining what makes this test exercise the
legacy SSE handling versus the "subscribe sync sse" test (e.g., different
request headers/format, older endpoint, or alternate server path), so future
maintainers can see why both tests exist; locate the t.Run with the exact name
"subscribe sync sse legacy method works" in TestFlakyRedisEvents and insert the
one-line explanation just before the test setup (after the t.Run(...) line and
before building subscribePayload).
router-tests/events/kafka_events_test.go (1)

587-588: Avoid duplicated magic read timeout values.

Line 587 and Line 657 both hard-code 5 * time.Second. Consider extracting a small constant (e.g., KafkaReadDeadline) so timeout tuning stays centralized.

Also applies to: 657-658

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router-tests/events/kafka_events_test.go` around lines 587 - 588, Extract the
duplicated magic timeout into a single constant (e.g., KafkaReadDeadline
time.Duration = 5 * time.Second) at the top of the test file and replace the
hard-coded uses in the conn.SetReadDeadline calls and any related ReadJSON
timeout usages (currently the 5 * time.Second literals used around
conn.SetReadDeadline and subsequent reads) with that constant; ensure the
constant is a time.Duration so callers like
conn.SetReadDeadline(time.Now().Add(KafkaReadDeadline)) continue to work and
update both occurrences (the ones around conn.SetReadDeadline and the read calls
mentioned) to reference KafkaReadDeadline.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@router-tests/events/kafka_events_test.go`:
- Around line 934-936: The new top-level integration test TestFlakyKafkaEvents
lacks a short-mode guard and will run during go test -short; update
TestFlakyKafkaEvents to check testing.Short() and call t.Skip("skipping
integration test in short mode") (same pattern as TestKafkaEvents) at the top of
the function so Kafka-dependent integration tests are skipped when tests run in
short mode.

In `@router-tests/singleflight_test.go`:
- Line 566: Update the other three tests that call
xEnv.WaitForSubscriptionCount(..., time.Second*5) to use time.Second*15 instead
so the timeout increase is applied consistently; locate the calls to
WaitForSubscriptionCount in the tests labeled "subscription deduplication with
multiple subgraphs", "subscription deduplication with multiple subgraphs -
single flight disabled", and "subscription deduplication with multiple subgraphs
- same headers" (they pass uint64(numOfOperations) and use the xEnv variable)
and change the timeout argument from time.Second*5 to time.Second*15.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b04b401 and 80e9f89.

📒 Files selected for processing (4)
  • router-tests/events/kafka_events_test.go
  • router-tests/events/redis_events_test.go
  • router-tests/singleflight_test.go
  • router-tests/websocket_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • router-tests/websocket_test.go

Comment thread router-tests/events/kafka_events_test.go Outdated
Comment thread router-tests/operations/singleflight_test.go
Comment thread router-tests/observability/main_test.go Outdated
Comment thread router-tests/events/nats_events_test.go
Comment thread router/pkg/statistics/engine_stats.go Outdated
Comment thread router-tests/utils.go Outdated
Comment thread router/pkg/statistics/engine_stats.go Outdated
Comment thread router/pkg/statistics/engine_stats.go Outdated
@jensneuse jensneuse changed the title from "fix(ci): reduce OOM kills in router integration tests" to "fix(ci): improve reliability of the router-tests" Mar 5, 2026
@endigma
Member

endigma commented Mar 5, 2026

Have you considered using something like a "SyncReporter" that fulfills the Reporter interface in these tests, instead of the hacked-on synchronisation in EngineStats?

It would remove a lot of the complexity around waiting for the engine to report something if it were directly channel-based and blocking.

I don't believe any of the methods on the reporter are called on a timer or anything like that. If they are, the issue could be mitigated by making it configurable which report functions to care about, by using separate reporters that each block on one signal, or by using a mask of signals.

@jensneuse jensneuse force-pushed the jensneuse/fix-flaky-tests branch from 88188d0 to eae1390 Compare March 5, 2026 17:11
@jensneuse jensneuse closed this Mar 5, 2026
@jensneuse jensneuse reopened this Mar 5, 2026
@jensneuse jensneuse force-pushed the jensneuse/fix-flaky-tests branch from 9bcc31f to 1d21013 Compare March 5, 2026 22:51
@jensneuse jensneuse closed this Mar 5, 2026
@jensneuse jensneuse reopened this Mar 5, 2026
@jensneuse jensneuse closed this Mar 5, 2026
@jensneuse jensneuse reopened this Mar 5, 2026
Comment thread router-tests/events/nats_events_test.go Outdated
Comment thread router-tests/observability/graphql_metrics_test.go
Comment thread router-tests/persisted_operations_test.go
Comment thread router/core/router.go
Comment thread router-tests/testutils/utils.go
Comment thread router-tests/protocol/integration_test.go Outdated
jensneuse and others added 4 commits March 10, 2026 13:57
- Extract duplicate event test helpers into shared helpers_test.go
- Rename import alias from `integration` to `routertests` in observability tests
- Replace AGENTS.md symlink with standalone file referencing CLAUDE.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t aliases

Move utils.go from root integration package to router-tests/testutils/ package,
update all 25 consuming files to import testutils instead of using the confusing
integration alias, and remove the "trigger CI 2" comment artifact.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jensneuse jensneuse requested a review from Noroth March 10, 2026 13:53
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jensneuse jensneuse requested a review from StarpTech March 10, 2026 13:53
Contributor

@Noroth Noroth left a comment

Approved from my end, but it's a lot. You might want a second approval.

@jensneuse jensneuse requested a review from SkArchon as a code owner March 11, 2026 09:50
Merge from main introduced references to NatsWaitTimeout which doesn't
exist. The correct constant is EventWaitTimeout from helpers_test.go.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread router-tests/protocol/integration_test.go Outdated
Comment thread router-tests/protocol/router_config_test.go Outdated
Comment thread router-tests/protocol/integration_test.go
Comment thread router-tests/testenv/sync_reporter.go Outdated
@jensneuse jensneuse requested a review from endigma March 11, 2026 22:29
@jensneuse jensneuse enabled auto-merge (squash) March 13, 2026 12:37
@jensneuse jensneuse disabled auto-merge March 13, 2026 12:45
@jensneuse jensneuse merged commit 4aacdee into main Mar 13, 2026
36 checks passed
@jensneuse jensneuse deleted the jensneuse/fix-flaky-tests branch March 13, 2026 12:45

4 participants