Skip to content

fix: if blocked, defer async event insertion to prevent deadlocks#1298

Merged
endigma merged 4 commits intomasterfrom
jesse/eng-8055-high-subscription-load-causes-deadlocks
Sep 23, 2025
Merged

fix: if blocked, defer async event insertion to prevent deadlocks#1298
endigma merged 4 commits intomasterfrom
jesse/eng-8055-high-subscription-load-causes-deadlocks

Conversation

@endigma
Copy link
Copy Markdown
Member

@endigma endigma commented Sep 17, 2025

Summary by CodeRabbit

  • Bug Fixes

    • Improved subscription lifecycle handling to avoid deadlocks and ensure reliable unsubscribe on errors or shutdown.
    • Fixed rare freezes by making unsubscribe emission non-blocking under contention.
  • Performance

    • Reduced stalls by deferring enqueue of removal events when channels are full and by pre-checking shutdown/disconnect during subscription setup.
  • Chores

    • Public subscription API signature updated.

Checklist

  • I have discussed my proposed changes in an issue and have received approval to proceed.
  • I have followed the coding standards of the project.
  • Tests or benchmarks have been added or updated.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Sep 17, 2025

Walkthrough

Added non-blocking enqueue behavior for unsubscribe paths by deferring event posting to goroutines when the events channel is full; updated AsyncResolveGraphQLSubscription to use a named error return and to perform pre-enqueue shutdown/client-disconnect checks within a select before emitting AddSubscription events. (≤50 words)

Changes

Cohort / File(s) Summary of changes
Resolver subscription flow
v2/pkg/engine/resolve/resolve.go
- AsyncUnsubscribeSubscription and AsyncUnsubscribeClient: added default branches that spawn a goroutine to enqueue removal events when the events channel cannot accept immediately, preventing blocking.
- AsyncResolveGraphQLSubscription: changed signature to use a named error return (err error) and consolidated event emission into a select that pre-checks resolver shutdown and client disconnect before sending the AddSubscription event; adjusted earlier enqueue paths accordingly.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly and accurately describes the primary change—deferring asynchronous event insertion when the events channel is blocked to prevent deadlocks—and it aligns with the changes in v2/pkg/engine/resolve/resolve.go that add non-blocking enqueue paths and deferred goroutine enqueuing; it also follows the Conventional Commits "fix:" style.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
✨ Finishing touches
  • 📝 Generate Docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
v2/pkg/engine/resolve/resolve.go (1)

983-997: Same goroutine-fan-out risk; apply the same bounded-forwarder pattern.

Mirror the fix from AsyncUnsubscribeSubscription to avoid unbounded goroutine creation here as well. Also tweak comment grammar.

-        // In the event we cannot insert immediately, defer insertion a goroutine, this should prevent deadlocks, at the cost of goroutine creation.
+        // If we cannot insert immediately, defer insertion via a goroutine to avoid deadlocks, at the cost of a goroutine.
🧹 Nitpick comments (1)
v2/pkg/engine/resolve/resolve.go (1)

1155-1171: Also short‑circuit on client ctx cancellation to avoid enqueuing a dead subscription.

Add ctx.ctx.Done() to the select to skip creating a subscription if the client already disconnected. Keeps work off the event loop under churn.

 select {
-case <-r.ctx.Done():
+case <-r.ctx.Done():
 	// Stop resolving if the resolver is shutting down
 	return r.ctx.Err()
+case <-ctx.ctx.Done():
+	// Client went away before we could enqueue the subscription
+	return ctx.ctx.Err()
 case r.events <- subscriptionEvent{
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8686cb4 and c9d5e84.

📒 Files selected for processing (1)
  • v2/pkg/engine/resolve/resolve.go (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and test (go 1.25 / windows-latest)
  • GitHub Check: Build and test (go 1.25 / ubuntu-latest)
  • GitHub Check: Build and test (go 1.25 / windows-latest)
  • GitHub Check: Build and test (go 1.25 / ubuntu-latest)
🔇 Additional comments (1)
v2/pkg/engine/resolve/resolve.go (1)

957-969: ```shell
#!/bin/bash

Safe, tolerant inspection script (avoids failing if utilities are missing)

echo "PWD: $(pwd)"
echo

FILE="v2/pkg/engine/resolve/resolve.go"
if [ -f "$FILE" ]; then
echo "=== Excerpt: $FILE (lines 930-1000) ==="
awk 'NR>=930 && NR<=1000 {printf "%6d %s\n", NR, $0}' "$FILE" || true
else
echo "File not found: $FILE"
fi
echo

Search helper: prefer rg, fall back to grep

run_search() {
pattern="$1"
label="$2"
echo "=== Search ($label): $pattern ==="
if command -v rg >/dev/null 2>&1; then
rg -n --hidden -S --glob 'v2/**' "$pattern" || true
else
# fallback to grep (may produce more noise)
grep -R --line-number -n --exclude-dir=.git -E "$pattern" v2 || true
fi
echo
}

run_search '\btype\s+Resolver\b' "type Resolver"
run_search 'type\s+Resolver' "type Resolver (loose)"
run_search '\bevents\b' "events identifier"
run_search '\br.events\b' "r.events usages"
run_search '\bdeferred\b' "deferred identifier"
run_search '\bpostEventNonBlocking\b' "postEventNonBlocking helper"
run_search '\bsubscriptionEvent\b' "subscriptionEvent type/usages"
run_search 'subscriptionEventKindRemoveSubscription' "specific kind"
run_search 'make(\schan\s+subscriptionEvent' "make(chan subscriptionEvent"
run_search 'go func()\s
{[^}]r.events\s<' "goroutine sending to r.events (rg-only regex)"


</blockquote></details>

</blockquote></details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
v2/pkg/engine/resolve/resolve.go (1)

350-359: Unsubscribe on heartbeat/work error — LGTM; consider coalescing repeats.

This correctly breaks circular waits. Optional: guard repeated calls (e.g., sub.unsubOnce.CompareAndSwap(false, true)) to avoid spamming events while removal is pending.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d326df3 and 78dfefe.

📒 Files selected for processing (1)
  • v2/pkg/engine/resolve/resolve.go (14 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
v2/pkg/engine/resolve/resolve.go (4)
v2/pkg/engine/resolve/response.go (2)
  • SubscriptionCloseKindGoingAway (64-64)
  • SubscriptionCloseKind (56-59)
v2/pkg/engine/resolve/context.go (1)
  • Context (16-35)
v2/pkg/graphqlerrors/response.go (1)
  • Response (9-14)
pkg/engine/resolve/resolve.go (1)
  • Fetches (318-318)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and test (go 1.25 / ubuntu-latest)
  • GitHub Check: Build and test (go 1.25 / windows-latest)
  • GitHub Check: Build and test (go 1.25 / windows-latest)
  • GitHub Check: Build and test (go 1.25 / ubuntu-latest)
🔇 Additional comments (9)
v2/pkg/engine/resolve/resolve.go (9)

303-305: workItem returns error — LGTM

The change to fn func() error is consistent with the new error‑propagation path from workers.


384-387: Unsubscribe on work error — LGTM

Erroring work item triggers async removal as intended.


397-406: complete() now returns error — LGTM

Signature aligns with workItem.fn. Returning nil is fine since writer.Complete() has no error.


409-418: close(kind) now returns error — LGTM

Matches the new worker contract; writer.Close(kind) is still void, returning nil here is appropriate.


420-483: Error propagation from subscription update — LGTM; verify metrics on failure.

Behavior is correct. Minor: you call reporter.SubscriptionUpdateSent() even on init/load/resolve errors. Confirm if “sent” should include failed attempts; otherwise introduce a separate metric.


532-562: Heartbeat returns error — LGTM; check metric semantics.

Counting heartbeats via SubscriptionUpdateSent() may inflate “updates sent”. Confirm intent or split metrics.


803-809: Good: closure captures and error propagation.

Capturing c, s := c, s avoids the loop‑var trap; returning the error to trigger unsubscribe is correct.


914-916: Close work item wiring — LGTM

Wrapping s.close(closeKind) to match func() error is correct.


1170-1189: Async subscription early‑exit checks — LGTM

Good inline send with cancel checks for resolver/client contexts before enqueue.

Comment thread v2/pkg/engine/resolve/resolve.go
Comment thread v2/pkg/engine/resolve/resolve.go
Comment thread v2/pkg/engine/resolve/resolve.go
Comment thread v2/pkg/engine/resolve/resolve.go
@endigma endigma force-pushed the jesse/eng-8055-high-subscription-load-causes-deadlocks branch from 78dfefe to 02e215f Compare September 19, 2025 14:51
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (5)
v2/pkg/engine/resolve/resolve.go (5)

350-358: Fail-fast heartbeat/work → evented unsubscribe — correct and safe.

Scheduling via AsyncUnsubscribeSubscription avoids races and honors the loop’s invariants. Consider a tiny debug hook for reason codes (heartbeat vs write) if you need postmortems.


420-483: executeSubscriptionUpdate now returns errors; unsubscribe on flush error — solid.

  • Timeout-scoped ctx, input copy, and worker-threaded writes are correct.
  • Minor: SubscriptionUpdateSent is incremented on error paths (init/load/resolve). If this metric means “success,” consider splitting into Sent/Failed (or Attempted).

532-561: handleHeartbeat() returns error and checks ctxs — correct threading discipline.

Heartbeat increments SubscriptionUpdateSent; if that metric is intended for payload updates only, consider a separate heartbeat counter to avoid skew.


999-1011: Same non-blocking defer for client-wide unsubscribe — consistent.

Consider using the same enqueueEventNonBlocking helper to avoid duplication.


972-984: Approve — de-duplicate non-blocking enqueue into helper

Per-call goroutine fallback is fine and matches team guidance. I ran ripgrep for AsyncUnsubscribeSubscription and found calls only from subscription worker goroutines and tests (resolve.go:352, 357, 386, 467; resolve_test.go:5270, 5439) — none inside processEvents/handleEvent, so no self-send deadlock risk.

Replace the duplicated goroutine block with:

default:
-  // In the event we cannot insert immediately, defer insertion a goroutine, this should prevent deadlocks, at the cost of goroutine creation.
-  go func() {
-    select {
-    case <-r.ctx.Done():
-      return
-    case r.events <- subscriptionEvent{
-      id:   id,
-      kind: subscriptionEventKindRemoveSubscription,
-    }:
-    }
-  }()
+  r.enqueueEventNonBlocking(subscriptionEvent{
+    id:   id,
+    kind: subscriptionEventKindRemoveSubscription,
+  })

Add once in this file:

func (r *Resolver) enqueueEventNonBlocking(ev subscriptionEvent) {
	go func() {
		select {
		case <-r.ctx.Done():
			return
		case r.events <- ev:
		}
	}()
}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 78dfefe and 02e215f.

📒 Files selected for processing (1)
  • v2/pkg/engine/resolve/resolve.go (14 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: endigma
PR: wundergraph/graphql-go-tools#1298
File: v2/pkg/engine/resolve/resolve.go:999-1011
Timestamp: 2025-09-19T14:50:19.506Z
Learning: In the graphql-go-tools resolver, when handling backpressure for async unsubscribe operations, the team prefers using individual goroutines over bounded deferrer implementations. The rationale is that bounded implementations can also overflow, and the preferred solution for overload is scaling the deployment rather than internal buffering. The current approach prevents deadlocks at the cost of goroutine creation, which is an acceptable tradeoff.
Learnt from: endigma
PR: wundergraph/graphql-go-tools#1298
File: v2/pkg/engine/resolve/resolve.go:339-347
Timestamp: 2025-09-19T14:51:33.690Z
Learning: In the graphql-go-tools resolver, the event loop follows a strict single-threaded design where all subscription lifecycle management (add/remove from triggers) must happen through the events channel to the main processEvents() goroutine. Worker goroutines should not directly modify subscription state or call unsubscribe methods, as this would break the single-threaded event loop architecture and create race conditions.
📚 Learning: 2025-09-19T14:51:33.690Z
Learnt from: endigma
PR: wundergraph/graphql-go-tools#1298
File: v2/pkg/engine/resolve/resolve.go:339-347
Timestamp: 2025-09-19T14:51:33.690Z
Learning: In the graphql-go-tools resolver, the event loop follows a strict single-threaded design where all subscription lifecycle management (add/remove from triggers) must happen through the events channel to the main processEvents() goroutine. Worker goroutines should not directly modify subscription state or call unsubscribe methods, as this would break the single-threaded event loop architecture and create race conditions.

Applied to files:

  • v2/pkg/engine/resolve/resolve.go
📚 Learning: 2025-09-19T14:50:19.506Z
Learnt from: endigma
PR: wundergraph/graphql-go-tools#1298
File: v2/pkg/engine/resolve/resolve.go:999-1011
Timestamp: 2025-09-19T14:50:19.506Z
Learning: In the graphql-go-tools resolver, when handling backpressure for async unsubscribe operations, the team prefers using individual goroutines over bounded deferrer implementations. The rationale is that bounded implementations can also overflow, and the preferred solution for overload is scaling the deployment rather than internal buffering. The current approach prevents deadlocks at the cost of goroutine creation, which is an acceptable tradeoff.

Applied to files:

  • v2/pkg/engine/resolve/resolve.go
🧬 Code graph analysis (1)
v2/pkg/engine/resolve/resolve.go (3)
v2/pkg/engine/resolve/response.go (2)
  • SubscriptionCloseKindGoingAway (64-64)
  • SubscriptionCloseKind (56-59)
v2/pkg/engine/resolve/context.go (1)
  • Context (16-35)
v2/pkg/graphqlerrors/response.go (1)
  • Response (9-14)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and test (go 1.25 / windows-latest)
  • GitHub Check: Build and test (go 1.25 / ubuntu-latest)
  • GitHub Check: Build and test (go 1.25 / ubuntu-latest)
  • GitHub Check: Build and test (go 1.25 / windows-latest)
🔇 Additional comments (7)
v2/pkg/engine/resolve/resolve.go (7)

373-387: No-heartbeat worker mirrors error/close handling — looks good.

Parity with heartbeat path maintained.


397-406: complete() now returns error (always nil) — fine for uniformity with work items.

Keeps signatures consistent; no behavior change.


803-809: Enqueue update as workItem with error propagation — correct place for writer-only ops.

Error path writes via asyncErrorWriter and lets worker trigger unsubscribe.


914-916: Close via closure conforms to new workItem signature — good.

Keeps all writer calls on the worker goroutine and marks final.


409-418: Approved — no double-close risk found.

complete() and close() both defer close(s.completed), but the code ensures only one terminal action runs: the event-loop enqueues exactly one terminal workItem (complete or close) then closes s.workChan, and if the worker invokes s.close directly (context/resolver done) it returns before processing queued terminal work — no path executes close twice while the worker is active.


303-305: Approve — workItem initializations all wrap func() error.
Verified: in v2/pkg/engine/resolve/resolve.go the workItem usages are workItem{fn, false} (fn := func() error {...}), workItem{s.complete, true} (complete() returns error), and workItem{func() error { return s.close(closeKind) }, true).


1170-1189: Approve: async add-subscription with early-exit is correct.

Inline select prevents stale sends and respects resolver/client cancellation. Verified: processEvents/handleEvent do not invoke AsyncUnsubscribeSubscription/AsyncUnsubscribeClient/AsyncResolveGraphQLSubscription; Async* calls live outside the single-threaded event loop (subscription worker/flush paths).

Comment thread v2/pkg/engine/resolve/resolve.go
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
v2/pkg/engine/resolve/resolve.go (2)

958-958: Comment grammar nit.

Tweak wording for clarity.

-        // In the event we cannot insert immediately, defer insertion a goroutine, this should prevent deadlocks, at the cost of goroutine creation.
+        // If the channel is full, defer insertion to a goroutine to prevent deadlocks (at the cost of goroutine creation).

985-985: Comment grammar nit.

Same clarity tweak as above.

-        // In the event we cannot insert immediately, defer insertion a goroutine, this should prevent deadlocks, at the cost of goroutine creation.
+        // If the channel is full, defer insertion to a goroutine to prevent deadlocks (at the cost of goroutine creation).
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 02e215f and 7c67341.

📒 Files selected for processing (1)
  • v2/pkg/engine/resolve/resolve.go (4 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: endigma
PR: wundergraph/graphql-go-tools#1298
File: v2/pkg/engine/resolve/resolve.go:999-1011
Timestamp: 2025-09-19T14:50:19.506Z
Learning: In the graphql-go-tools resolver, when handling backpressure for async unsubscribe operations, the team prefers using individual goroutines over bounded deferrer implementations. The rationale is that bounded implementations can also overflow, and the preferred solution for overload is scaling the deployment rather than internal buffering. The current approach prevents deadlocks at the cost of goroutine creation, which is an acceptable tradeoff.
Learnt from: endigma
PR: wundergraph/graphql-go-tools#1298
File: v2/pkg/engine/resolve/resolve.go:339-347
Timestamp: 2025-09-19T14:51:33.690Z
Learning: In the graphql-go-tools resolver, the event loop follows a strict single-threaded design where all subscription lifecycle management (add/remove from triggers) must happen through the events channel to the main processEvents() goroutine. Worker goroutines should not directly modify subscription state or call unsubscribe methods, as this would break the single-threaded event loop architecture and create race conditions.
📚 Learning: 2025-09-19T14:50:19.506Z
Learnt from: endigma
PR: wundergraph/graphql-go-tools#1298
File: v2/pkg/engine/resolve/resolve.go:999-1011
Timestamp: 2025-09-19T14:50:19.506Z
Learning: In the graphql-go-tools resolver, when handling backpressure for async unsubscribe operations, the team prefers using individual goroutines over bounded deferrer implementations. The rationale is that bounded implementations can also overflow, and the preferred solution for overload is scaling the deployment rather than internal buffering. The current approach prevents deadlocks at the cost of goroutine creation, which is an acceptable tradeoff.

Applied to files:

  • v2/pkg/engine/resolve/resolve.go
📚 Learning: 2025-09-19T14:51:33.690Z
Learnt from: endigma
PR: wundergraph/graphql-go-tools#1298
File: v2/pkg/engine/resolve/resolve.go:339-347
Timestamp: 2025-09-19T14:51:33.690Z
Learning: In the graphql-go-tools resolver, the event loop follows a strict single-threaded design where all subscription lifecycle management (add/remove from triggers) must happen through the events channel to the main processEvents() goroutine. Worker goroutines should not directly modify subscription state or call unsubscribe methods, as this would break the single-threaded event loop architecture and create race conditions.

Applied to files:

  • v2/pkg/engine/resolve/resolve.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and test (go 1.25 / windows-latest)
  • GitHub Check: Build and test (go 1.25 / ubuntu-latest)
  • GitHub Check: Build and test (go 1.25 / ubuntu-latest)
  • GitHub Check: Build and test (go 1.25 / windows-latest)
🔇 Additional comments (4)
v2/pkg/engine/resolve/resolve.go (4)

957-968: Non-blocking deferral on unsubscribe looks good.

This aligns with the single-threaded event-loop design and the team’s stated preference to prevent deadlocks by deferring via per-call goroutines under backpressure.


983-996: Non-blocking deferral on client-unsubscribe looks good.

Consistent with the approach above and with the event-loop ownership model.


1155-1173: Pre-enqueue shutdown/disconnect checks are solid; confirm blocking send is intentional.

This still blocks on r.events (no default/goroutine fallback). If that’s by design for add-subscription semantics, all good; otherwise consider mirroring the non-blocking pattern.


1101-1105: Verify: sync unsubscribe uses a blocking send.

This path directly sends to r.events without a non-blocking fallback. Is this safe in all shutdown/teardown scenarios, or should it also use a select with ctx/r.ctx (or defer via goroutine) to avoid potential stalls?

Copy link
Copy Markdown
Collaborator

@StarpTech StarpTech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@endigma endigma merged commit df38c31 into master Sep 23, 2025
11 checks passed
@endigma endigma deleted the jesse/eng-8055-high-subscription-load-causes-deadlocks branch September 23, 2025 12:35
endigma pushed a commit that referenced this pull request Sep 23, 2025
🤖 I have created a release *beep* *boop*
---


##
[2.0.0-rc.227](v2.0.0-rc.226...v2.0.0-rc.227)
(2025-09-23)


### Bug Fixes

* if blocked, defer async event insertion to prevent deadlocks
([#1298](#1298))
([df38c31](df38c31))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Resolved potential deadlocks by deferring async event insertion when
operations are blocked, improving stability and responsiveness under
load.
* **Documentation**
  * Updated changelog for version 2.0.0-rc.227.
* **Chores**
  * Bumped version to 2.0.0-rc.227.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@coderabbitai coderabbitai Bot mentioned this pull request Mar 18, 2026
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants