Skip to content

Fix unbounded retry loop on durable receiver shutdown (GH-2671)#2701

Merged
jeremydmiller merged 1 commit intomainfrom
issue-2671-shutdown-retry-loop
May 7, 2026
Merged

Fix unbounded retry loop on durable receiver shutdown (GH-2671)#2701
jeremydmiller merged 1 commit intomainfrom
issue-2671-shutdown-retry-loop

Conversation

@jeremydmiller
Copy link
Copy Markdown
Member

Closes #2671.

Diagnosis

DurableReceiver.executeWithRetriesAsync was an unbounded while (true) loop that:

  1. Never gave up — no retry cap
  2. Ignored cancellation — neither the loop guard nor the Task.Delay honoured the durability cancellation token
  3. Always logged at Error level — including expected failures during host teardown

On shutdown, DrainAsync calls _inbox.ReleaseIncomingAsync(...) through this helper. When the Npgsql DbDataSource has already been disposed by DI ordering, every retry hammers a dead socket and emits a SocketException at Error level. The user reported the flood started after upgrading 5.22 → 5.27 — PR #2391's shutdown reordering made DrainAsync reliably reach this code path on a torn-down data source, exposing the latent bug.

Fix

Three changes to executeWithRetriesAsync:

  1. Bounded retries — cap at MaxReleaseRetries = 5. After the budget is exhausted, log a single Error and return rather than spinning forever. Any owned envelopes left as owner_id = node_id are reclaimed by DurabilityAgent's recovery polling on the next live node, so silently giving up is functionally safe for this best-effort cleanup path.

  2. Cancellation-aware exit — when _settings.Cancellation is signalled (which happens via DurabilitySettings.Cancel() during shutdown), bail out on the first failure. The cancellation token is also passed to Task.Delay so the backoff doesn't sleep through the entire shutdown sequence.

  3. Demoted log level on cancelled exit — these failures are expected during teardown and don't deserve Error-level noise. Outside cancellation, the existing Error log is preserved with attempt counters added for diagnostics.

Test plan

New durable_receiver_release_incoming_during_shutdown test class:

  • drain_terminates_within_seconds_when_inbox_release_throws_repeatedly asserts DrainAsync finishes inside a 5-second budget when the inbox throws SocketException on every call — the pre-fix code loops forever here.
  • drain_exits_immediately_when_cancellation_is_signalled cancels DurabilitySettings.Cancel() first and asserts the drain exits inside one second.
  • Full Runtime.WorkerQueues suite green (25/25) on net9.0 — existing receiver tests are unaffected.

🤖 Generated with Claude Code

…lease-ownership loop

Closes GH-2671.

`DurableReceiver.executeWithRetriesAsync` was an unbounded `while (true)`
loop that ignored both the retry budget and the cancellation token. On
host shutdown, when the Npgsql `DbDataSource` had already been disposed
by DI ordering, every retry hammered a dead socket and emitted an
Error-level `SocketException` log line — sigridbra reported a flood of
these on application shutdown after upgrading from 5.22 to 5.27 (the
underlying bug was always latent, but PR #2391's shutdown reordering
made `DrainAsync.ReleaseIncomingAsync` reliably reach the dead socket).

Three changes to the helper:

1. Bounded retries — cap at `MaxReleaseRetries = 5`. After the budget
   is exhausted, log a single Error and return rather than spinning
   forever. Any owned envelopes left as `owner_id = node_id` are
   reclaimed by `DurabilityAgent`'s recovery polling on the next live
   node, so silently giving up is functionally safe for this
   best-effort cleanup path.

2. Cancellation-aware exit — when `_settings.Cancellation` is
   signalled (which happens on `DurabilitySettings.Cancel()` during
   shutdown), bail out of the retry loop on the very first failure.
   Retrying is futile when the data source is being torn down by the
   host. The cancellation token is also passed to `Task.Delay` for
   the backoff so we don't sit in a sleep through the entire shutdown
   sequence.

3. Demoted log level on cancelled exit — these failures are expected
   during teardown (the connection pool is gone) and don't deserve
   Error-level noise. Outside cancellation, the existing Error log is
   preserved with attempt counters added for diagnostics.

Regression coverage: new
`durable_receiver_release_incoming_during_shutdown` test class with
two cases:
* `drain_terminates_within_seconds_when_inbox_release_throws_repeatedly`
  asserts `DrainAsync` finishes inside a 5-second budget when the
  inbox throws `SocketException` on every call — the pre-fix code
  loops forever here.
* `drain_exits_immediately_when_cancellation_is_signalled` cancels
  `DurabilitySettings.Cancel()` first and asserts the drain exits
  inside one second, validating the shutdown short-circuit path.

Tests: full `Runtime.WorkerQueues` suite green (25/25) on net9.0 — the
existing receiver tests are unaffected, only the retry shape changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SocketException on application shutdown from Wolverine.Runtime.WorkerQueues.DurableReceiver

1 participant