fix(ci): improve reliability of the router-tests by jensneuse · Pull Request #2577 · wundergraph/cosmo

jensneuse · 2026-03-03T08:11:40Z

Summary

- All recent CI failures on integration tests were caused by OOM (signal: killed), not actual test bugs
- Reduces memory pressure by splitting the largest matrix entry, reducing parallelism (10→4), adding GOMAXPROCS=4, and fixing the Kafka health check
- Increases test timeout (5m→8m) to accommodate lower parallelism

Changes

- Split test matrix: './. ./fuzzquery ./lifecycle ./modules' → './.' + './fuzzquery ./lifecycle ./modules' (5 jobs instead of 4)
- Reduce parallelism: --parallel 10 → --parallel 4 (each parallel test spins up a full router + gRPC plugin subprocess; 10 concurrent with -race exhausts 16 GB RAM)
- Add GOMAXPROCS=4: Limits Go scheduler threads, reducing race detector memory overhead
- Fix Kafka health check: kafka-broker-api-versions.sh --version only checks that the binary exists; now uses --bootstrap-server localhost:9092 to verify broker readiness
- Increase test timeout: 5m → 8m to accommodate fewer parallel tests

Evidence

Analyzed 7 most recent CI failures — all showed signal: killed on plugin subprocesses (OOM), zero actual --- FAIL test assertions:

| Run | killed count | test failures |
| --- | --- | --- |
| 22612622051 | 3 | 0 |
| 22612367177 | 6 | 0 |
| 22602367418 | 3 | 0 |
| 22593260747 | 4 | 0 |
| 22592631652 | 6 | 0 |
| 22592307099 | 6 | 0 |
| 22591585050 | 6 | 0 |

Test plan

All 5 integration_test matrix jobs pass without signal: killed
Wall-clock time stays under 25 minutes per job
Re-run CI 2-3 times to verify stability

🤖 Generated with Claude Code

Summary by CodeRabbit

Tests
- Added separate "flaky" event test suites for NATS, Kafka and Redis; moved and expanded several scenarios into them, added explicit synchronization after message production, and adjusted read deadline handling.
Refactor
- Test helpers switched to condition-based waiting tied to engine statistics for more deterministic waits.
Chores
- CI adjustments: increased test timeouts, reduced parallelism, set GOMAXPROCS, and strengthened Kafka health checks and retries.

coderabbitai · 2026-03-03T08:11:58Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Walkthrough

CI job timeouts/parallelism and Kafka health checks updated; EngineStatistics gained a predicate-based Wait with condition variable and broadcasts; testenv WaitFor* helpers now use EngineStats.Wait; multiple event tests reorganized into TestFlaky* suites and synchronized via message-count waits/deadlines.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CI Workflow `/.github/workflows/router-ci.yaml` | Adjusted integration job matrix entries; set `GOMAXPROCS=4`; changed Kafka health check to use `--bootstrap-server ... |
| Engine statistics core `router/pkg/statistics/engine_stats.go` | Added `Wait(ctx, predicate)` to `EngineStatistics`; `EngineStats` now includes `cond *sync.Cond` and implements predicate-based `Wait`; state-mutating methods call `cond.Broadcast()`; `NoopEngineStats` implements `Wait` by waiting for context cancellation. |
| Test environment helpers `router-tests/testenv/testenv.go` | Imported `statistics`; replaced manual polling loops in multiple `WaitFor*` helpers with `EngineStats.Wait` predicate-based waits and perform final assertions using the latest report. |
| NATS tests `router-tests/events/nats_events_test.go` | Added `TestFlakyNatsEvents`; removed two subtests from the original suite; tightened one retry timeout (10s→1s); added explicit read-deadline set/reset around `ReadJSON`; reorganized flaky scenarios and client cleanup. |
| Kafka tests `router-tests/events/kafka_events_test.go` | Moved two subtests into new `TestFlakyKafkaEvents`; introduced `expectedMessages` counters and `WaitForMessagesSent` gating before reads; switched to explicit per-read deadlines (5s) and added additional WaitForMessagesSent synchronization. |
| Redis tests `router-tests/events/redis_events_test.go` | Removed a legacy SSE subtest from main suite; added `TestFlakyRedisEvents` containing moved/reworked tests (mutation/typename checks, invalid JSON/missing entity cases, multi-client lifecycle); added WaitForMessagesSent gating after produces. |
| WebSocket test `router-tests/websocket_test.go` | Replaced atomic completion flag and `require.Eventually` with `WaitForMessagesSent(1, 10s)`; increased some subscription wait timeouts (5s→10s). |
| Singleflight test `router-tests/singleflight_test.go` | Increased a subscription wait timeout from 5s to 15s inside TestSingleFlight goroutine. |
| Module / Misc `go.mod`, `manifest_file` | Minor module/dependency adjustments accompanying test changes. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

fix: closing a graph server should force all subscriptions to close #2188 — touches router-tests/events/nats_events_test.go and modifies NATS subscription-related tests and lifecycle behavior, closely related to the NATS test reorganizations in this PR.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 10.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The PR title claims to 'improve reliability' but the actual objectives focus on reducing OOM kills and resource pressure through specific optimizations. The title is vague and doesn't capture the main technical changes. | Clarify the title to more specifically reference OOM/memory reduction (e.g., 'fix(ci): reduce OOM kills in router integration tests') or resource optimization rather than generic 'reliability improvement'. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


github-actions · 2026-03-03T08:13:44Z

Router image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-da87bfa75ea16eaf734d53c764215f25a8d1947e

coderabbitai

🧹 Nitpick comments (1)

.github/workflows/router-ci.yaml (1)
373-374: Consider lifting GOMAXPROCS to job-level env to avoid drift.

Both test steps set the same value; placing it once under integration_test.env keeps config centralized.

Also applies to: 395-396
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/router-ci.yaml around lines 373 - 374, Move the duplicated
GOMAXPROCS environment variable out of the individual test steps and into the
job-level env so both integration_test steps inherit the value; specifically,
remove the GOMAXPROCS entries currently set under the two test step env blocks
and add a single GOMAXPROCS: 4 entry under the job's env section so the
integration_test steps no longer repeat it.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b5bc471 and 7daf34b.

📒 Files selected for processing (1)

.github/workflows/router-ci.yaml

codecov · 2026-03-03T08:17:51Z

Codecov Report

❌ Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.65%. Comparing base (56859f0) to head (655cbf5).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| router/core/router.go | 66.66% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #2577       +/-   ##
===========================================
- Coverage   89.30%   62.65%   -26.66%     
===========================================
  Files          21      244      +223     
  Lines        4527    25831    +21304     
  Branches     1248        0     -1248     
===========================================
+ Hits         4043    16184    +12141     
- Misses        484     8298     +7814     
- Partials        0     1349     +1349

| Files with missing lines | Coverage Δ |
| --- | --- |
| router/core/cache_warmup.go | `88.94% <100.00%> (ø)` |
| router/pkg/pubsub/kafka/adapter.go | `65.46% <100.00%> (ø)` |
| router/pkg/statistics/engine_stats.go | `3.38% <ø> (ø)` |
| router/pkg/trace/meter.go | `44.35% <ø> (ø)` |
| router/core/router.go | `69.41% <66.66%> (ø)` |

... and 260 files with indirect coverage changes

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

router-tests/events/nats_events_test.go (1)
1719-1719: Consider reducing the polling interval for faster test feedback.

The 10-second polling interval on require.Eventually seems long given the 30-second NatsWaitTimeout. A shorter interval (e.g., 1 second) would provide faster test feedback while still being reasonable.
-			}, NatsWaitTimeout, 10*time.Second)
+			}, NatsWaitTimeout, time.Second)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router-tests/events/nats_events_test.go` at line 1719, The test uses
require.Eventually with a long polling interval (10*time.Second) against
NatsWaitTimeout; change the polling interval in that require.Eventually call to
a shorter value (e.g., 1*time.Second) so the test checks more frequently while
keeping the same overall timeout (NatsWaitTimeout) — locate the
require.Eventually invocation in the test (around the block using
NatsWaitTimeout) and replace the interval argument (currently 10*time.Second)
with 1*time.Second.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@router/pkg/statistics/engine_stats.go`:
- Around line 68-90: The wait goroutine in EngineStats.Wait can leak when ctx is
cancelled because after s.cond.Broadcast() the goroutine wakes, re-evaluates the
predicate and may call s.cond.Wait() again and block forever; modify the
goroutine started in EngineStats.Wait (the anonymous func) to also observe ctx
cancellation inside its loop (e.g., after acquiring s.mu and before calling
s.cond.Wait()), so it breaks out and closes done when ctx.Done() is signaled;
ensure you reference s.cond.Wait, predicate(s.GetReport()), done and ctx to
perform a select or explicit check and return early to avoid re-waiting on the
cond after context cancellation.

📥 Commits

Reviewing files that changed from the base of the PR and between 7daf34b and a4160ea.

📒 Files selected for processing (3)

router-tests/events/nats_events_test.go
router-tests/testenv/testenv.go
router/pkg/statistics/engine_stats.go

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

router/pkg/statistics/engine_stats.go (1)

74-92: ⚠️ Potential issue | 🔴 Critical

Cancellation can still miss the wake-up and strand the waiter goroutine.

At Line 90–Line 91, Broadcast() is not synchronized with s.mu. If cancellation fires just before the waiter calls s.cond.Wait(), the wake-up can be missed and the goroutine can block indefinitely (also done is never closed on cancel path).

🐛 Proposed fix

 func (s *EngineStats) Wait(ctx context.Context, predicate func(*UsageReport) bool) {
 	if predicate(s.GetReport()) {
 		return
 	}

 	done := make(chan struct{})
 	go func() {
 		s.mu.Lock()
 		defer s.mu.Unlock()
-		for !predicate(s.GetReport()) {
-			s.cond.Wait()
-			if ctx.Err() != nil {
-				return
-			}
-		}
-		close(done)
+		defer close(done)
+		for !predicate(s.GetReport()) {
+			if ctx.Err() != nil {
+				return
+			}
+			s.cond.Wait()
+		}
 	}()

 	select {
 	case <-done:
 	case <-ctx.Done():
-		// Unblock the goroutine waiting on cond.Wait()
-		s.cond.Broadcast()
+		// Serialize with waiter loop to avoid lost wake-ups.
+		s.mu.Lock()
+		s.cond.Broadcast()
+		s.mu.Unlock()
+		<-done
 	}
 }

In Go sync.Cond usage, can a Broadcast be missed if it happens before a goroutine starts Wait, and does taking the same mutex around Broadcast avoid this lost-wakeup race with cancellation?

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@router/pkg/statistics/engine_stats.go` around lines 74 - 92, The waiter
goroutine can be stranded because s.cond.Broadcast() is called without holding
s.mu and done is never closed on cancel; fix by taking s.mu before broadcasting
and closing done under that same lock so the wakeup cannot be missed: in the
cancel branch of the select acquire s.mu, call s.cond.Broadcast(), close(done)
(or set a cancellation flag and close done) and then release s.mu so the
goroutine waiting in the anonymous function will observe ctx.Err() and return;
reference s.mu, s.cond.Broadcast(), done, predicate, and the anonymous goroutine
that calls s.cond.Wait().

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/router-ci.yaml:
- Around line 373-375: The current test_params uses the flawed regex string
"-run '^Test[^(Flaky)]'". Here "[^(Flaky)]" is a negated character class, not a
negated word, so the pattern excludes any test whose first character after the
"Test" prefix is one of (, F, l, a, k, y or ), and thus omits valid tests like
TestFallbackErrors and others; replace this pattern by running all tests
matching "^Test" and explicitly skipping flaky ones by adding a separate skip
flag, e.g. use "-run '^Test' -skip '^TestFlaky'" (remove the "[^(Flaky)]"
pattern) so TestFallbackErrors, TestFeatureFlags, TestFileUpload_*,
TestForwardHeaders, TestForwardRenamedHeaders, etc., are included while flaky
tests named like TestFlaky* are excluded.

📥 Commits

Reviewing files that changed from the base of the PR and between a4160ea and 97bae6a.

📒 Files selected for processing (6)

.github/workflows/router-ci.yaml
router-tests/events/kafka_events_test.go
router-tests/events/nats_events_test.go
router-tests/testenv/testenv.go
router-tests/websocket_test.go
router/pkg/statistics/engine_stats.go

🚧 Files skipped from review as they are similar to previous changes (1)

router-tests/events/nats_events_test.go

coderabbitai

🧹 Nitpick comments (1)

.github/workflows/router-ci.yaml (1)

375-395: Consider deduplicating shared test flags to avoid drift between flaky/non-flaky jobs.

--timeout and --parallel are repeated in both commands; centralizing them in job env makes future tuning safer.

♻️ Proposed refactor

   integration_test:
     runs-on: ubuntu-latest-l
     timeout-minutes: 30
     env:
       GOMAXPROCS: 4
+      TEST_TIMEOUT: 8m
+      TEST_PARALLEL: 2
@@
       - name: Run Integration tests ${{ matrix.test_target }}
         working-directory: ./router-tests
-        run: make test-coverage test_retry_count=0 test_params="-run '^Test' -skip '^TestFlaky' --timeout=8m -p 1 --parallel 2" test_target="${{ matrix.test_target }}"
+        run: make test-coverage test_retry_count=0 test_params="-run '^Test' -skip '^TestFlaky' --timeout=${TEST_TIMEOUT} -p 1 --parallel ${TEST_PARALLEL}" test_target="${{ matrix.test_target }}"
@@
       - name: Run Flaky Integration tests ${{ matrix.test_target }}
         working-directory: ./router-tests
-        run: make test-coverage test_retry_count=3 test_params="-run '^TestFlaky' --timeout=8m -p 1 --parallel 2" test_target="${{ matrix.test_target }}"
+        run: make test-coverage test_retry_count=3 test_params="-run '^TestFlaky' --timeout=${TEST_TIMEOUT} -p 1 --parallel ${TEST_PARALLEL}" test_target="${{ matrix.test_target }}"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.github/workflows/router-ci.yaml around lines 375 - 395, The repeated test
flags (--timeout and --parallel) in the two make invocations should be
centralized into job-level environment variables to avoid drift: define e.g. env
entries TEST_TIMEOUT and TEST_PARALLEL (or a single TEST_PARAMS) at the job
level, then replace the inline test_params in both make test-coverage calls (the
run lines invoking make test-coverage with test_retry_count and test_target) to
reference those env vars; also propagate the same variable into the flaky and
non-flaky commands so matrix.test_target and
steps.artifact_name.outputs.sanitized logic remains unchanged.

📥 Commits

Reviewing files that changed from the base of the PR and between 97bae6a and b04b401.

📒 Files selected for processing (4)

.github/workflows/router-ci.yaml
router-tests/events/kafka_events_test.go
router-tests/events/nats_events_test.go
router-tests/events/redis_events_test.go

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

router-tests/events/redis_events_test.go (1)
785-865: Good isolation of timing-sensitive tests.

Creating a separate TestFlakyRedisEvents suite for tests that are more sensitive to timing/resource constraints is a reasonable approach for test organization and CI stability.

The "subscribe sync sse legacy method works" test (lines 792-865) is structurally very similar to "subscribe sync sse" (lines 552-624). If these tests are intentionally covering different code paths (current vs. legacy SSE handling), consider adding a brief comment explaining what distinguishes the "legacy method" to help future maintainers understand why both tests exist.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router-tests/events/redis_events_test.go` around lines 785 - 865, Add a short
clarifying comment above the t.Run block named "subscribe sync sse legacy method
works" inside TestFlakyRedisEvents explaining what makes this test exercise the
legacy SSE handling versus the "subscribe sync sse" test (e.g., different
request headers/format, older endpoint, or alternate server path), so future
maintainers can see why both tests exist; locate the t.Run with the exact name
"subscribe sync sse legacy method works" in TestFlakyRedisEvents and insert the
one-line explanation just before the test setup (after the t.Run(...) line and
before building subscribePayload).
router-tests/events/kafka_events_test.go (1)
587-588: Avoid duplicated magic read timeout values.

Line 587 and Line 657 both hard-code 5 * time.Second. Consider extracting a small constant (e.g., KafkaReadDeadline) so timeout tuning stays centralized.

Also applies to: 657-658
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router-tests/events/kafka_events_test.go` around lines 587 - 588, Extract the
duplicated magic timeout into a single constant (e.g., KafkaReadDeadline
time.Duration = 5 * time.Second) at the top of the test file and replace the
hard-coded uses in the conn.SetReadDeadline calls and any related ReadJSON
timeout usages (currently the 5 * time.Second literals used around
conn.SetReadDeadline and subsequent reads) with that constant; ensure the
constant is a time.Duration so callers like
conn.SetReadDeadline(time.Now().Add(KafkaReadDeadline)) continue to work and
update both occurrences (the ones around conn.SetReadDeadline and the read calls
mentioned) to reference KafkaReadDeadline.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@router-tests/events/kafka_events_test.go`:
- Around line 934-936: The new top-level integration test TestFlakyKafkaEvents
lacks a short-mode guard and will run during go test -short; update
TestFlakyKafkaEvents to check testing.Short() and call t.Skip("skipping
integration test in short mode") (same pattern as TestKafkaEvents) at the top of
the function so Kafka-dependent integration tests are skipped when tests run in
short mode.

In `@router-tests/singleflight_test.go`:
- Line 566: Update the other three tests that call
xEnv.WaitForSubscriptionCount(..., time.Second*5) to use time.Second*15 instead
so the timeout increase is applied consistently; locate the calls to
WaitForSubscriptionCount in the tests labeled "subscription deduplication with
multiple subgraphs", "subscription deduplication with multiple subgraphs -
single flight disabled", and "subscription deduplication with multiple subgraphs
- same headers" (they pass uint64(numOfOperations) and use the xEnv variable)
and change the timeout argument from time.Second*5 to time.Second*15.

📥 Commits

Reviewing files that changed from the base of the PR and between b04b401 and 80e9f89.

📒 Files selected for processing (4)

router-tests/events/kafka_events_test.go
router-tests/events/redis_events_test.go
router-tests/singleflight_test.go
router-tests/websocket_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

router-tests/websocket_test.go

endigma · 2026-03-05T14:19:12Z

Have you considered using something like a "SyncReporter" that fulfills the Reporter interface in these tests, instead of the hacked-on synchronisation in EngineStats?

It would remove a lot of the complexity around waiting for the engine to report something if it were directly channel-based and blocking.

I don't believe any of the methods on the reporter are called on a timer or anything, but if they are, the issue could be mitigated by making it configurable which report functions to care about, by using separate reporters that each block on one signal, or by using a mask of signals.

- Extract duplicate event test helpers into shared helpers_test.go
- Rename import alias from `integration` to `routertests` in observability tests
- Replace AGENTS.md symlink with standalone file referencing CLAUDE.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…t aliases Move utils.go from root integration package to router-tests/testutils/ package, update all 25 consuming files to import testutils instead of using the confusing integration alias, and remove the "trigger CI 2" comment artifact. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Noroth

Approved from my end, but it's a lot. You might want a second approval

jensneuse requested review from a team as code owners March 3, 2026 08:11

jensneuse requested review from JivusAyrus, StarpTech, alepane21 and thisisnithin March 3, 2026 08:11

coderabbitai Bot reviewed Mar 3, 2026

View reviewed changes

jensneuse requested review from Noroth, devsergiy and endigma as code owners March 3, 2026 09:10

github-actions Bot added the router label Mar 3, 2026

coderabbitai Bot reviewed Mar 3, 2026

View reviewed changes

Comment thread router/pkg/statistics/engine_stats.go Outdated

jensneuse added the query-planner-skip label Mar 3, 2026

coderabbitai Bot reviewed Mar 3, 2026

View reviewed changes

Comment thread .github/workflows/router-ci.yaml Outdated

coderabbitai Bot reviewed Mar 3, 2026

View reviewed changes

Comment thread router-tests/events/kafka_events_test.go Outdated

Comment thread router-tests/operations/singleflight_test.go

endigma requested changes Mar 5, 2026

View reviewed changes

jensneuse changed the title ~~fix(ci): reduce OOM kills in router integration tests~~ fix(ci): improve reliability of the router-tests Mar 5, 2026

jensneuse force-pushed the jensneuse/fix-flaky-tests branch from 88188d0 to eae1390 Compare March 5, 2026 17:11

jensneuse closed this Mar 5, 2026

jensneuse reopened this Mar 5, 2026

jensneuse force-pushed the jensneuse/fix-flaky-tests branch from 9bcc31f to 1d21013 Compare March 5, 2026 22:51

jensneuse closed this Mar 5, 2026

jensneuse reopened this Mar 5, 2026

jensneuse closed this Mar 5, 2026

jensneuse reopened this Mar 5, 2026

jensneuse added 2 commits March 9, 2026 19:34

Merge branch 'main' into jensneuse/fix-flaky-tests

3a7de10

Merge branch 'main' into jensneuse/fix-flaky-tests

b089c9a

Noroth reviewed Mar 10, 2026

View reviewed changes

jensneuse and others added 4 commits March 10, 2026 13:57

fix: restore cacheHashNotStored constant and comment in persisted ops…

acd896b

… tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge branch 'main' into jensneuse/fix-flaky-tests

1a5c5b8

jensneuse requested a review from Noroth March 10, 2026 13:53

chore: add .serena/ to .gitignore

4b68521

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jensneuse requested a review from StarpTech March 10, 2026 13:53

Noroth approved these changes Mar 10, 2026

View reviewed changes

jensneuse added 2 commits March 10, 2026 15:52

Merge branch 'main' into jensneuse/fix-flaky-tests

9c92bd4

Merge branch 'main' into jensneuse/fix-flaky-tests

6d52b19

jensneuse requested a review from SkArchon as a code owner March 11, 2026 09:50

fix: replace undefined NatsWaitTimeout with EventWaitTimeout

21be65e

Merge from main introduced references to NatsWaitTimeout which doesn't exist. The correct constant is EventWaitTimeout from helpers_test.go. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

endigma requested changes Mar 11, 2026

View reviewed changes

Comment thread router-tests/protocol/integration_test.go Outdated

Comment thread router-tests/protocol/router_config_test.go Outdated

Comment thread router-tests/protocol/integration_test.go

Comment thread router-tests/testenv/sync_reporter.go Outdated

jensneuse added 4 commits March 11, 2026 16:29

Merge branch 'main' into jensneuse/fix-flaky-tests

ac05837

Merge branch 'main' into jensneuse/fix-flaky-tests

acb2a02

test(router-tests): address review follow-ups

8d43d87

test(router-tests): reduce protocol test diff noise

dbfbaf9

jensneuse requested a review from endigma March 11, 2026 22:29

jensneuse added 3 commits March 12, 2026 09:24

Merge remote-tracking branch 'origin/main' into jensneuse/fix-flaky-t…

5cbde3c

…ests

Merge remote-tracking branch 'origin/main' into jensneuse/fix-flaky-t…

e33e467

…ests

refactor(router-tests): move wait sync into SyncReporter

83e5e17

endigma approved these changes Mar 13, 2026

View reviewed changes

Merge branch 'main' into jensneuse/fix-flaky-tests

655cbf5

jensneuse enabled auto-merge (squash) March 13, 2026 12:37

jensneuse disabled auto-merge March 13, 2026 12:45

jensneuse merged commit 4aacdee into main Mar 13, 2026
36 checks passed

jensneuse deleted the jensneuse/fix-flaky-tests branch March 13, 2026 12:45
