Fix unbounded retry loop on durable receiver shutdown (GH-2671) by jeremydmiller · Pull Request #2701 · JasperFx/wolverine

jeremydmiller · 2026-05-07T17:55:13Z

Closes #2671.

Diagnosis

DurableReceiver.executeWithRetriesAsync was an unbounded while (true) loop that:

Never gave up — no retry cap
Ignored cancellation — neither the loop guard nor the Task.Delay honoured the durability cancellation token
Always logged at Error level — including expected failures during host teardown

On shutdown, DrainAsync calls _inbox.ReleaseIncomingAsync(...) through this helper. When the Npgsql DbDataSource has already been disposed by DI ordering, every retry hammers a dead socket and emits a SocketException at Error level. The user reported the flood started after upgrading 5.22 → 5.27 — PR #2391's shutdown reordering made DrainAsync reliably reach this code path on a torn-down data source, exposing the latent bug.

Fix

Three changes to executeWithRetriesAsync:

Bounded retries — cap at MaxReleaseRetries = 5. After the budget is exhausted, log a single Error and return rather than spinning forever. Any owned envelopes left as owner_id = node_id are reclaimed by DurabilityAgent's recovery polling on the next live node, so silently giving up is functionally safe for this best-effort cleanup path.
Cancellation-aware exit — when _settings.Cancellation is signalled (which happens via DurabilitySettings.Cancel() during shutdown), bail out on the first failure. The cancellation token is also passed to Task.Delay so the backoff doesn't sleep through the entire shutdown sequence.
Demoted log level on cancelled exit — these failures are expected during teardown and don't deserve Error-level noise. Outside cancellation, the existing Error log is preserved with attempt counters added for diagnostics.

Test plan

New durable_receiver_release_incoming_during_shutdown test class:

drain_terminates_within_seconds_when_inbox_release_throws_repeatedly asserts DrainAsync finishes inside a 5-second budget when the inbox throws SocketException on every call — the pre-fix code loops forever here.
drain_exits_immediately_when_cancellation_is_signalled cancels DurabilitySettings.Cancel() first and asserts the drain exits inside one second.
Full Runtime.WorkerQueues suite green (25/25) on net9.0 — existing receiver tests are unaffected.

🤖 Generated with Claude Code

…lease-ownership loop Closes GH-2671. `DurableReceiver.executeWithRetriesAsync` was an unbounded `while (true)` loop that ignored both the retry budget and the cancellation token. On host shutdown, when the Npgsql `DbDataSource` had already been disposed by DI ordering, every retry hammered a dead socket and emitted an Error-level `SocketException` log line — sigridbra reported a flood of these on application shutdown after upgrading from 5.22 to 5.27 (the underlying bug was always latent, but PR #2391's shutdown reordering made `DrainAsync.ReleaseIncomingAsync` reliably reach the dead socket). Three changes to the helper: 1. Bounded retries — cap at `MaxReleaseRetries = 5`. After the budget is exhausted, log a single Error and return rather than spinning forever. Any owned envelopes left as `owner_id = node_id` are reclaimed by `DurabilityAgent`'s recovery polling on the next live node, so silently giving up is functionally safe for this best-effort cleanup path. 2. Cancellation-aware exit — when `_settings.Cancellation` is signalled (which happens on `DurabilitySettings.Cancel()` during shutdown), bail out of the retry loop on the very first failure. Retrying is futile when the data source is being torn down by the host. The cancellation token is also passed to `Task.Delay` for the backoff so we don't sit in a sleep through the entire shutdown sequence. 3. Demoted log level on cancelled exit — these failures are expected during teardown (the connection pool is gone) and don't deserve Error-level noise. Outside cancellation, the existing Error log is preserved with attempt counters added for diagnostics. Regression coverage: new `durable_receiver_release_incoming_during_shutdown` test class with two cases: * `drain_terminates_within_seconds_when_inbox_release_throws_repeatedly` asserts `DrainAsync` finishes inside a 5-second budget when the inbox throws `SocketException` on every call — the pre-fix code loops forever here. * `drain_exits_immediately_when_cancellation_is_signalled` cancels `DurabilitySettings.Cancel()` first and asserts the drain exits inside one second, validating the shutdown short-circuit path. Tests: full `Runtime.WorkerQueues` suite green (25/25) on net9.0 — the existing receiver tests are unaffected, only the retry shape changed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jeremydmiller merged commit 32ec347 into main May 7, 2026
21 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix unbounded retry loop on durable receiver shutdown (GH-2671)#2701

Fix unbounded retry loop on durable receiver shutdown (GH-2671)#2701
jeremydmiller merged 1 commit intomainfrom
issue-2671-shutdown-retry-loop

jeremydmiller commented May 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

jeremydmiller commented May 7, 2026

Diagnosis

Fix

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant