fix: fix flaky engine subscription tests#1318

Merged
dkorittki merged 10 commits into topic/streams-v1 from
dominik/eng-8289-fix-flaky-engine-subscription-tests
Oct 16, 2025

Conversation

@dkorittki
Contributor

@dkorittki dkorittki commented Oct 9, 2025

@coderabbitai summary

Checklist

  • I have discussed my proposed changes in an issue and have received approval to proceed.
  • I have followed the coding standards of the project.
  • Tests or benchmarks have been added or updated.

Context

Some notes first:

  • It's only happening for tests introduced on the cosmo streams topic branch
  • It seems to be a race condition in tests rather than actual engine code

I spotted two tests failing on GitHub Actions due to race conditions. They pass locally; the failures depend on CPU timing.

Those two tests are

  • test 1 SubscriptionOnStart ctx updater only updates the right subscription
  • test 2 SubscriptionOnStart ctx updater on multiple subscriptions with same trigger works

test 1:

There is a race condition. Here is the output of the test on GitHub runners with engine logs enabled.

resolver:trigger:subscription:add:17241709254077376921:1
resolver:create:trigger:17241709254077376921
resolver:trigger:start:17241709254077376921
resolver:subscription_updater:update:17241709254077376921
resolver:trigger:initialized:17241709254077376921
resolver:subscription_updater:update:17241709254077376921
resolver:trigger:subscription:update:17241709254077376921:1,1
resolver:trigger:update:17241709254077376921
resolver:trigger:subscription:add:17241709254077376921:2
resolver:trigger:subscription:added:17241709254077376921:2
resolver:trigger:subscription:update:1
resolver:trigger:subscription:flushed:1
resolver:trigger:subscription:update:1
resolver:trigger:subscription:flushed:1
resolver:trigger:started:17241709254077376921
resolver:subscription_updater:complete:17241709254077376921
resolver:subscription_updater:complete:sent_event:17241709254077376921
resolver:trigger:complete:17241709254077376921
resolver:trigger:complete:17241709254077376921
resolver:trigger:subscription:closed:17241709254077376921:1
resolver:trigger:subscription:closed:17241709254077376921:2

recorder 1 messages: [{"data":{"counter":1000}} {"data":{"counter":0}}]
recorder 2 messages: []

As you can see, recorder 2 misses its one expected message. The reason is that the trigger is updated with the counter=0 message (line 8) before the second subscriber is added (line 9), so the second subscriber never sees it. This happens because the test didn't wait for the subscriber to finish registering on the trigger before sending the counter=0 message. The test now waits for that registration.
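The fix described above can be sketched in isolation. This is a minimal illustration, not the test code from this PR: `broadcaster`, `subscribe`, `publish`, and `runGated` are hypothetical stand-ins for the engine's trigger and the test harness. Only the idea is taken from the description: a wait group counts registrations, and the stream is released only after all of them are done.

```go
package main

import (
	"fmt"
	"sync"
)

// broadcaster is a hypothetical stand-in for the engine's trigger: it fans
// each published message out to every subscriber registered at that moment.
type broadcaster struct {
	mu   sync.Mutex
	subs []chan string
}

// subscribe registers a new subscriber and returns its message channel.
func (b *broadcaster) subscribe() <-chan string {
	ch := make(chan string, 8)
	b.mu.Lock()
	b.subs = append(b.subs, ch)
	b.mu.Unlock()
	return ch
}

// publish delivers msg only to subscribers that are already registered.
func (b *broadcaster) publish(msg string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, ch := range b.subs {
		ch <- msg
	}
}

// runGated registers n subscribers concurrently and publishes one message
// only after every subscriber has registered. It returns what each received.
func runGated(n int) [][]string {
	b := &broadcaster{}
	streamCanStart := make(chan struct{})

	var registered, done sync.WaitGroup
	registered.Add(n)
	done.Add(n)

	results := make([][]string, n)
	for i := 0; i < n; i++ {
		go func(i int) {
			defer done.Done()
			ch := b.subscribe()
			registered.Done() // signal: this subscriber is on the trigger
			results[i] = append(results[i], <-ch)
		}(i)
	}

	// The fake stream blocks until the test releases it.
	go func() {
		<-streamCanStart
		b.publish(`{"data":{"counter":0}}`)
	}()

	registered.Wait()     // wait for all registrations first
	close(streamCanStart) // then let the stream emit
	done.Wait()
	return results
}

func main() {
	for i, msgs := range runGated(2) {
		fmt.Printf("recorder %d messages: %v\n", i+1, msgs)
	}
}
```

Without the `registered.Wait()` gate, the publish could run before the second `subscribe` call, which is the ordering seen in the failing log.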

test 2:

Much the same error. Here is the engine debug output from a failing GitHub Actions run:

resolver:trigger:subscription:add:15889878720417707388:1
resolver:create:trigger:15889878720417707388
resolver:trigger:start:15889878720417707388
resolver:subscription_updater:update:15889878720417707388
resolver:trigger:initialized:15889878720417707388
resolver:subscription_updater:update:15889878720417707388
resolver:trigger:subscription:update:15889878720417707388:1,1
resolver:trigger:update:15889878720417707388
resolver:trigger:subscription:add:15889878720417707388:2
resolver:trigger:subscription:added:15889878720417707388:2
resolver:subscription_updater:update:15889878720417707388
resolver:trigger:subscription:update:15889878720417707388:1,2
resolver:trigger:subscription:update:2
resolver:trigger:started:15889878720417707388
resolver:trigger:subscription:update:1
resolver:trigger:subscription:flushed:2
resolver:trigger:subscription:flushed:1
resolver:trigger:subscription:update:1
resolver:trigger:subscription:flushed:1
resolver:subscription_updater:complete:15889878720417707388
resolver:subscription_updater:complete:sent_event:15889878720417707388
resolver:trigger:complete:15889878720417707388
resolver:trigger:complete:15889878720417707388
resolver:trigger:subscription:closed:15889878720417707388:1
resolver:trigger:subscription:closed:15889878720417707388:2

recorder 1 messages: [{"data":{"counter":1000}} {"data":{"counter":0}}]
recorder 2 messages: [{"data":{"counter":1000}}]

As you can see, recorder 2 misses the counter=0 message. Both recorders are expected to receive the same messages in the same order. Both have the counter=1000 message, which is delivered via the subscription-on-start hook, but recorder 2 misses the counter=0 message, which is delivered via the fake stream. The counter=0 message is delivered (line 8) before recorder 2 is subscribed (line 9). This happens because this test, like the other, didn't wait for the recorders to finish subscribing to the trigger and sent the counter=0 message via the fake stream too early. It's fixed by waiting until both subscriptions are complete.
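A rough sketch of that ordering guarantee, under stated assumptions: the `recorder` type, its `awaitAny` helper, and `runOrdered` are hypothetical and only imitate the shape of the test. The start hook is simulated by a goroutine that writes counter=1000, and the fake stream is a goroutine blocked on a `streamCanStart` channel.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// recorder is a hypothetical test double that stores every message it gets.
type recorder struct {
	mu       sync.Mutex
	messages []string
}

func (r *recorder) write(msg string) {
	r.mu.Lock()
	r.messages = append(r.messages, msg)
	r.mu.Unlock()
}

// snapshot returns a copy of the messages recorded so far.
func (r *recorder) snapshot() []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	return append([]string(nil), r.messages...)
}

// awaitAny polls until at least one message arrived or the timeout elapsed.
func (r *recorder) awaitAny(timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if len(r.snapshot()) > 0 {
			return true
		}
		time.Sleep(time.Millisecond)
	}
	return len(r.snapshot()) > 0
}

// runOrdered delivers counter=1000 through a simulated start hook, waits for
// both recorders to see it, and only then releases the fake stream.
func runOrdered() ([]string, []string, bool) {
	rec1, rec2 := &recorder{}, &recorder{}
	recorders := []*recorder{rec1, rec2}
	streamCanStart := make(chan struct{})
	streamDone := make(chan struct{})

	// Fake stream: blocked until released, then sends counter=0 to both.
	go func() {
		defer close(streamDone)
		<-streamCanStart
		for _, r := range recorders {
			r.write(`{"data":{"counter":0}}`)
		}
	}()

	// Simulated start hook: each subscription gets counter=1000 on startup.
	for _, r := range recorders {
		go func(r *recorder) { r.write(`{"data":{"counter":1000}}`) }(r)
	}

	// A first message on each recorder shows that both are subscribed.
	ok := rec1.awaitAny(time.Second) && rec2.awaitAny(time.Second)
	close(streamCanStart)
	<-streamDone
	return rec1.snapshot(), rec2.snapshot(), ok
}

func main() {
	r1, r2, ok := runOrdered()
	fmt.Println("both subscribed:", ok)
	fmt.Println("recorder 1 messages:", r1)
	fmt.Println("recorder 2 messages:", r2)
}
```

Because counter=0 cannot be sent until each recorder already holds counter=1000, both recorders end up with the same two messages in the same order.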

@dkorittki dkorittki changed the base branch from master to topic/streams-v1 October 9, 2025 16:54
@coderabbitai
Contributor

coderabbitai Bot commented Oct 9, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Synchronizes GraphQL subscription startup in tests by adding startup hooks, a coordination channel, and a wait group. Test logic now waits for both subscriptions to register before emitting messages, restructures fake stream wiring, updates assertions for ordering, and expands cases to cover multi-startup sequences and cross-recorder interactions.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Resolve tests synchronization: v2/pkg/engine/resolve/resolve_test.go | Added startup synchronization for subscriptions (startup hooks, wait group, streamCanStart), introduced onStart/subscriptionOnStart handlers updating shared state via Updater, delayed first message until all subscriptions are ready, adjusted assertions for ordered delivery, and expanded test coverage around multi-recorder sequencing. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title succinctly captures the primary change, which is fixing flaky engine subscription tests, and directly aligns with the synchronization updates described in the summary. It clearly conveys the PR’s purpose without extraneous details or vague language. A reviewer scanning the list will immediately understand that the focus is on making subscription tests reliable. |


@dkorittki dkorittki marked this pull request as ready for review October 13, 2025 15:48
@dkorittki
Contributor Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Oct 13, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
v2/pkg/engine/resolve/resolve_test.go (2)

5560-5561: Minor: Make UniqueRequestID explicit (consistency with next test)

Here uniqueRequestFn returns nil without writing to the digest (produces the same zero-hash). Consider writing a constant (like "unique") to xxh, as you do in the next test, for clarity and to avoid accidental collisions if other cases appear.

Example:

fakeStream.uniqueRequestFn = func(ctx *Context, input []byte, xxh *xxhash.Digest) error {
    _, err := xxh.WriteString("unique")
    return err
}

5620-5622: Optional: fail fast on final asserts

Using require.True would stop immediately on failure and print recorder messages as you already pass them.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ba9a827 and 62d6e70.

📒 Files selected for processing (1)
  • v2/pkg/engine/resolve/resolve_test.go (4 hunks)
🔇 Additional comments (6)
v2/pkg/engine/resolve/resolve_test.go (6)

5539-5544: Deterministic gating of stream emission: LGTM

Blocking message production on streamCanStart and asserting onStart input makes the test deterministic and fixes the race.

Also applies to: 5546-5547


5549-5558: Startup hook handling is correct

Using defer startupHookWaitGroup.Done ensures both hooks count down; atomic flag avoids duplicate Updater calls. Good.
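The pattern praised here can be illustrated with a small standalone sketch. It is not the code under review: `runHooks` and the `updates` counter are hypothetical, with the counter standing in for a call to the Updater. Each hook defers `Done` so the wait group always counts down, and a compare-and-swap on an atomic flag lets only the first hook perform the update.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runHooks runs a startup hook concurrently for n subscriptions and returns
// how many times the (hypothetical) updater was invoked.
func runHooks(n int) int32 {
	var (
		startupHookWaitGroup sync.WaitGroup
		updaterCalled        int32 // 0 = not yet called, 1 = already called
		updates              int32 // stands in for calls to the Updater
	)
	startupHookWaitGroup.Add(n)

	onStart := func() {
		// Deferred so every hook counts down, even if it returns early.
		defer startupHookWaitGroup.Done()

		// Only the first hook to flip the flag performs the update.
		if atomic.CompareAndSwapInt32(&updaterCalled, 0, 1) {
			atomic.AddInt32(&updates, 1)
		}
	}

	for i := 0; i < n; i++ {
		go onStart()
	}

	startupHookWaitGroup.Wait()
	return atomic.LoadInt32(&updates)
}

func main() {
	fmt.Println("updater calls:", runHooks(2)) // prints "updater calls: 1"
}
```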


5606-5607: Readability improvement: LGTM

Collecting recorders into a slice simplifies the loop below.


5633-5651: Second test synchronization: LGTM

  • Gating messageFn on streamCanStart is solid.
  • Using subscription-on-start to push the first message to both subscribers verifies proper registration.
  • Explicit UniqueRequestID hashing with "unique" is clear.

5686-5692: Wait for initial messages before releasing the stream: LGTM

Awaiting any message from both recorders proves both are registered before sending the final stream message.


5696-5701: Order validations after synchronization: LGTM

Both recorders getting [1000, 0] confirms correct sequencing post‑sync.

Comment thread v2/pkg/engine/resolve/resolve_test.go
@ysmolski
Contributor

@dkorittki since I am not familiar with [topic/streams-v1](https://github.com/wundergraph/graphql-go-tools/tree/topic/streams-v1), should I even try to review it? If not, who would be a good reviewer for it?

@dkorittki
Contributor Author

@ysmolski Yeah, sorry, you got auto-selected as the reviewer. The best person to do this is @alepane21. He's already on it.

@dkorittki dkorittki requested a review from alepane21 October 15, 2025 12:29
@dkorittki dkorittki merged commit 1353de9 into topic/streams-v1 Oct 16, 2025
10 checks passed
@dkorittki dkorittki deleted the dominik/eng-8289-fix-flaky-engine-subscription-tests branch October 16, 2025 13:57
3 participants