fix(resolve): fix flaky singleflight deduplication tests #1393

Merged: jensneuse merged 8 commits into master from jensneuse/flaky-test-eval on Feb 19, 2026
@jensneuse (Member) commented Feb 19, 2026

Summary by CodeRabbit

  • Refactor

    • Improved request deduplication and follower tracking by switching to a low‑contention atomic counter for more reliable concurrency and streamlined lifecycle handling.
  • Tests

    • Updated test coordination to poll for follower registration with time‑based deadlines and added gating for deterministic ordering.
    • Converted several previously concurrent test actions to synchronous calls to stabilize timing.

The inbound request singleflight tests used a channel-based synchronization pattern that signalled "entered" before followers actually called GetOrCreate. Under -race, the leader could finish and delete the singleflight key before followers registered, causing them to start fresh requests instead of deduplicating — making TestResolver_ArenaResolveGraphQLResponse_RequestDeduplication and two related tests flaky.

Replace HasFollowers bool + Mu sync.Mutex on InflightRequest with a single followerCount atomic.Int32, incremented inside GetOrCreate at the exact point a follower joins. Tests now poll this counter to guarantee all followers have registered before the leader is unblocked, eliminating the race entirely.
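
The change described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `registry` and `inflight` types, the `getOrCreate` and `finish` names, and the mutex-guarded map are all assumptions made for the example. Only the idea of an `atomic.Int32` incremented at the exact point a follower joins comes from the description.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// inflight is a hypothetical stand-in for InflightRequest.
type inflight struct {
	done          chan struct{} // closed when the leader finishes
	followerCount atomic.Int32  // followers that joined this request
}

// registry is a hypothetical, simplified singleflight table.
type registry struct {
	mu    sync.Mutex
	items map[string]*inflight
}

func newRegistry() *registry {
	return &registry{items: map[string]*inflight{}}
}

// getOrCreate returns the in-flight entry for key and reports whether
// the caller is the leader. A follower is counted at the moment it joins.
func (r *registry) getOrCreate(key string) (*inflight, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if existing, ok := r.items[key]; ok {
		existing.followerCount.Add(1)
		return existing, false
	}
	created := &inflight{done: make(chan struct{})}
	r.items[key] = created
	return created, true
}

// finish deletes the key and wakes any followers. A caller arriving
// after this point starts a fresh request instead of deduplicating.
func (r *registry) finish(key string, req *inflight) {
	r.mu.Lock()
	delete(r.items, key)
	r.mu.Unlock()
	close(req.done)
}

func main() {
	r := newRegistry()
	leader, isLeader := r.getOrCreate("query-hash")
	follower, followerLeads := r.getOrCreate("query-hash")
	fmt.Println(isLeader, followerLeads, leader == follower, leader.followerCount.Load())

	r.finish("query-hash", leader)
	_, lateLeads := r.getOrCreate("query-hash")
	fmt.Println(lateLeads) // a late caller becomes a new leader
}
```

The last call shows the failure mode the tests were hitting: once the leader has finished and the key is gone, a late "follower" is no longer deduplicated.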

Checklist

  • I have discussed my proposed changes in an issue and have received approval to proceed.
  • I have followed the coding standards of the project.
  • Tests or benchmarks have been added or updated.

jensneuse and others added 2 commits February 19, 2026 01:23
The tests used a followersEntered channel that signaled before followers
actually called GetOrCreate/LoadOrStore. Under the race detector, the
leader could finish and delete the singleflight key before followers
entered, causing them to start fresh instead of deduplicating.

Add followerCount atomic counter to InflightRequest and poll it in tests
to confirm all followers have registered before releasing the data source.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The followerCount atomic.Int32 already tracks follower presence. Replace
the mutex-guarded HasFollowers bool with followerCount.Load() > 0, removing
two struct fields and ~10 lines of lock/unlock code with identical correctness.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai (Bot, Contributor) commented Feb 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Replaced mutex-guarded follower flag in InflightRequest with an atomic follower counter and accessor methods; updated GetOrCreate/FinishOk to use the atomic counter. Tests and helpers were modified to poll HasFollowers/waitForFollowerCount or use small gating channels instead of channel-based follower signaling. Minor test timing adjustments.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Core synchronization: v2/pkg/engine/resolve/inbound_request_singleflight.go | Removed Mu sync.Mutex and HasFollowers bool; added followerCount atomic.Int32, AddFollower() and HasFollowers() methods. Updated GetOrCreate to call AddFollower() when reusing an inflight request and FinishOk to use HasFollowers(). |
| Singleflight tests: v2/pkg/engine/resolve/inbound_request_singleflight_test.go | Replaced channel-based follower signaling with polling on inflight.HasFollowers() using a deadline loop; added time import and removed followerReady channel usage. |
| Resolver tests & helpers: v2/pkg/engine/resolve/resolve_test.go | Added helpers findAnyInflight and waitForFollowerCount; replaced legacy channel coordination in multiple deduplication/shared-data tests with waitForFollowerCount polling; introduced a sub2Ready gating channel for one subscription test. |
| Subscription test timing: execution/subscription/legacy_handler_test.go | Removed two goroutine-based concurrent mutation invocations; mutations are now invoked synchronously in those tests, altering timing/order but not APIs. |

Sequence Diagram(s)

(Skipped — changes are internal synchronization refactors and test synchronization adjustments, not a new multi-component control flow that requires a sequence diagram.)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

🚥 Pre-merge checks: ✅ 2 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main objective: fixing flaky tests in the singleflight deduplication logic by replacing channel-based synchronization with atomic operations. |


@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
v2/pkg/engine/resolve/resolve_test.go (1)

169-183: Consider adding a tiny backoff to the polling loop.

The tight Gosched() loop can burn CPU on slow CI. A minimal sleep keeps behavior deterministic while reducing spin.

♻️ Suggested tweak
 	default:
-		runtime.Gosched()
+		runtime.Gosched()
+		time.Sleep(time.Millisecond)
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@v2/pkg/engine/resolve/resolve_test.go` around lines 169 - 183, The polling
loop in waitForFollowerCount spins with runtime.Gosched(), which can waste CPU
on slow CI; change the loop to use a tiny backoff (e.g., time.Sleep for 1–5ms)
instead of Gosched() while preserving the deadline and the check against
inflight.followerCount.Load() (and the use of findAnyInflight and t.Fatal on
timeout) so the test remains deterministic but less CPU-intensive.

jensneuse and others added 2 commits February 19, 2026 10:39
The "two subscriptions to the same trigger" test was flaky because the
data source's emitting goroutine could send counter=0 before sub2's
addSubscription event was processed on the unbuffered events channel.
Gate the data source start via the onStart callback until sub2 is
registered.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
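
As a rough illustration of the gating idea in the commit message above, the sketch below holds back an emitting goroutine until a second subscriber is registered. The `trigger` type, `subscribe`, `start`, and `firstValues` are hypothetical names invented for this example; only the `sub2Ready` channel name echoes the walkthrough, and the real test wires the gate through the data source's onStart callback rather than a plain function argument.

```go
package main

import (
	"fmt"
	"sync"
)

// trigger is a hypothetical stand-in for a shared subscription trigger.
type trigger struct {
	mu   sync.Mutex
	subs []chan int
}

// subscribe registers a new subscriber and returns its channel.
func (t *trigger) subscribe() <-chan int {
	ch := make(chan int, 4) // buffered so the emitter never blocks here
	t.mu.Lock()
	t.subs = append(t.subs, ch)
	t.mu.Unlock()
	return ch
}

// start launches the emitter. It waits on ready before sending anything,
// so every subscriber registered before ready is closed sees value 0.
func (t *trigger) start(ready <-chan struct{}, values int) {
	go func() {
		<-ready
		t.mu.Lock()
		defer t.mu.Unlock()
		for i := 0; i < values; i++ {
			for _, ch := range t.subs {
				ch <- i
			}
		}
		for _, ch := range t.subs {
			close(ch)
		}
	}()
}

// firstValues returns the first value each of two subscribers receives.
func firstValues() (int, int) {
	t := &trigger{}
	sub2Ready := make(chan struct{})

	sub1 := t.subscribe()
	t.start(sub2Ready, 3) // the emitter exists but is gated
	sub2 := t.subscribe()
	close(sub2Ready) // release the emitter only after sub2 is registered

	return <-sub1, <-sub2
}

func main() {
	a, b := firstValues()
	fmt.Println(a, b) // both subscribers start at counter 0
}
```

Without the gate, the emitter could send its first value before the second subscriber exists, which is the ordering problem the commit describes.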
Go 1.24+ panics when t.Fail() is called from a goroutine after the
test has completed. Two sendChatMutation calls were launched in
goroutines that could outlive their subtests. Call them synchronously
instead — the HTTP request completes quickly and doesn't need to be
async.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jensneuse jensneuse requested a review from a team as a code owner February 19, 2026 09:45

@dkorittki dkorittki left a comment

Just some minor questions/suggestions, otherwise LGTM

Comment thread v2/pkg/engine/resolve/inbound_request_singleflight.go Outdated
Comment thread v2/pkg/engine/resolve/inbound_request_singleflight_test.go Outdated
Comment thread v2/pkg/engine/resolve/inbound_request_singleflight_test.go Outdated
Comment thread v2/pkg/engine/resolve/resolve_test.go Outdated
jensneuse and others added 2 commits February 19, 2026 12:04
Replace runtime.Gosched() with time.Sleep(10ms) to reduce CPU usage
in test polling loops. Increase deadline from 1s to 3s to accommodate
slow CI runners.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…quest

Wrap the atomic followerCount operations behind AddFollower() and
HasFollowers() methods to keep production code cleaner.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@v2/pkg/engine/resolve/resolve_test.go`:
- Line 186: This file (resolve_test.go) fails gci/imports formatting; re-run the
import and formatting tools (gci and goimports/gofmt) on
v2/pkg/engine/resolve/resolve_test.go to fix import ordering and formatting so
the CI lint passes—ensure imports are grouped/ordered per gci and files are
formatted (e.g., run gci then goimports/gofmt) and re-run tests to confirm no
other import-related changes are needed.

Comment thread v2/pkg/engine/resolve/resolve_test.go Outdated
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jensneuse jensneuse merged commit 4105082 into master Feb 19, 2026
9 checks passed
@jensneuse jensneuse deleted the jensneuse/flaky-test-eval branch February 19, 2026 12:50
jensneuse pushed a commit that referenced this pull request Feb 19, 2026
🤖 I have created a release *beep* *boop*
---


## [2.0.0-rc.252](v2.0.0-rc.251...v2.0.0-rc.252) (2026-02-19)

### Features

* forward headers to grpc subgraphs ([#1382](#1382)) ([8459b34](8459b34))

### Bug Fixes

* **resolve:** fix flaky singleflight deduplication tests ([#1393](#1393)) ([4105082](4105082))
* **resolve:** guard OnFinished against nil loaderHookContext on skipped fetches ([#1394](#1394)) ([f79d071](f79d071))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
jensneuse pushed a commit that referenced this pull request Feb 19, 2026
🤖 I have created a release *beep* *boop*
---


## [1.8.1](execution/v1.8.0...execution/v1.8.1) (2026-02-19)

### Bug Fixes

* **resolve:** fix flaky singleflight deduplication tests ([#1393](#1393)) ([4105082](4105082))
* **resolve:** guard OnFinished against nil loaderHookContext on skipped fetches ([#1394](#1394)) ([f79d071](f79d071))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).