Fix unbounded retry loop on durable receiver shutdown (GH-2671) #2701
Merged
jeremydmiller merged 1 commit into main on May 7, 2026
Conversation
…lease-ownership loop

Closes GH-2671.

`DurableReceiver.executeWithRetriesAsync` was an unbounded `while (true)` loop that ignored both the retry budget and the cancellation token. On host shutdown, when the Npgsql `DbDataSource` had already been disposed by DI ordering, every retry hammered a dead socket and emitted an Error-level `SocketException` log line. sigridbra reported a flood of these on application shutdown after upgrading from 5.22 to 5.27; the underlying bug was always latent, but PR #2391's shutdown reordering made `DrainAsync`'s call to `ReleaseIncomingAsync` reliably reach the dead socket.

Three changes to the helper:

1. Bounded retries — cap at `MaxReleaseRetries = 5`. After the budget is exhausted, log a single Error and return rather than spinning forever. Any owned envelopes left as `owner_id = node_id` are reclaimed by `DurabilityAgent`'s recovery polling on the next live node, so silently giving up is functionally safe for this best-effort cleanup path.
2. Cancellation-aware exit — when `_settings.Cancellation` is signalled (which happens on `DurabilitySettings.Cancel()` during shutdown), bail out of the retry loop on the very first failure. Retrying is futile when the data source is being torn down by the host. The cancellation token is also passed to `Task.Delay` for the backoff so we don't sit in a sleep through the entire shutdown sequence.
3. Demoted log level on cancelled exit — these failures are expected during teardown (the connection pool is gone) and don't deserve Error-level noise. Outside cancellation, the existing Error log is preserved with attempt counters added for diagnostics.

Regression coverage: new `durable_receiver_release_incoming_during_shutdown` test class with two cases:

* `drain_terminates_within_seconds_when_inbox_release_throws_repeatedly` asserts `DrainAsync` finishes inside a 5-second budget when the inbox throws `SocketException` on every call — the pre-fix code loops forever here.
* `drain_exits_immediately_when_cancellation_is_signalled` calls `DurabilitySettings.Cancel()` first and asserts the drain exits inside one second, validating the shutdown short-circuit path.

Tests: full `Runtime.WorkerQueues` suite green (25/25) on net9.0 — the existing receiver tests are unaffected; only the retry shape changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #2671.
Diagnosis
`DurableReceiver.executeWithRetriesAsync` was an unbounded `while (true)` loop that:

* ignored the retry budget
* ignored the cancellation token — neither the loop condition nor the `Task.Delay` backoff honoured the durability cancellation token

On shutdown, `DrainAsync` calls `_inbox.ReleaseIncomingAsync(...)` through this helper. When the Npgsql `DbDataSource` has already been disposed by DI ordering, every retry hammers a dead socket and emits a `SocketException` at Error level. The user reported the flood started after upgrading 5.22 → 5.27 — PR #2391's shutdown reordering made `DrainAsync` reliably reach this code path on a torn-down data source, exposing the latent bug.

Fix
Three changes to `executeWithRetriesAsync`:

1. Bounded retries — cap at `MaxReleaseRetries = 5`. After the budget is exhausted, log a single Error and return rather than spinning forever. Any owned envelopes left as `owner_id = node_id` are reclaimed by `DurabilityAgent`'s recovery polling on the next live node, so silently giving up is functionally safe for this best-effort cleanup path.
2. Cancellation-aware exit — when `_settings.Cancellation` is signalled (which happens via `DurabilitySettings.Cancel()` during shutdown), bail out on the first failure. The cancellation token is also passed to `Task.Delay` so the backoff doesn't sleep through the entire shutdown sequence.
3. Demoted log level on cancelled exit — these failures are expected during teardown and don't deserve Error-level noise. Outside cancellation, the existing Error log is preserved with attempt counters added for diagnostics.
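The three changes combine into a retry shape roughly like the following. This is a sketch only — the real helper's signature, backoff interval, and log messages are not quoted from the diff; `MaxReleaseRetries`, `_settings.Cancellation`, and the attempt counters come from the description above, while the method signature and delay are assumptions:

```csharp
// Illustrative fragment; _settings and _logger are fields of the real
// DurableReceiver class, and the actual method shape may differ.
private const int MaxReleaseRetries = 5; // retry budget from this PR

private async Task executeWithRetriesAsync(Func<Task> action)
{
    for (var attempt = 1; attempt <= MaxReleaseRetries; attempt++)
    {
        try
        {
            await action();
            return;
        }
        catch (Exception e)
        {
            // (2) Cancellation-aware exit: the host is tearing down the
            // data source, so retrying is futile — bail on first failure.
            if (_settings.Cancellation.IsCancellationRequested)
            {
                // (3) Demoted log level: expected during teardown.
                _logger.LogDebug(e, "Release failed during shutdown after {Attempts} attempt(s); giving up", attempt);
                return;
            }

            if (attempt == MaxReleaseRetries)
            {
                // (1) Budget exhausted: log a single Error and return.
                // Orphaned envelopes (owner_id = node_id) are reclaimed
                // later by DurabilityAgent recovery polling.
                _logger.LogError(e, "Release failed after {Attempts} attempts", attempt);
                return;
            }

            try
            {
                // Pass the token so the backoff itself doesn't sleep
                // through the shutdown sequence.
                await Task.Delay(TimeSpan.FromMilliseconds(250 * attempt), _settings.Cancellation);
            }
            catch (OperationCanceledException)
            {
                return; // cancelled mid-backoff
            }
        }
    }
}
```

Returning instead of throwing on exhaustion matches the "best-effort cleanup" framing: the drain path must terminate, and correctness is recovered elsewhere.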
Test plan
New `durable_receiver_release_incoming_during_shutdown` test class:

* `drain_terminates_within_seconds_when_inbox_release_throws_repeatedly` asserts `DrainAsync` finishes inside a 5-second budget when the inbox throws `SocketException` on every call — the pre-fix code loops forever here.
* `drain_exits_immediately_when_cancellation_is_signalled` calls `DurabilitySettings.Cancel()` first and asserts the drain exits inside one second.

Full `Runtime.WorkerQueues` suite green (25/25) on net9.0 — existing receiver tests are unaffected.

🤖 Generated with Claude Code
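The first regression case could be shaped roughly as below (xUnit style). The stub inbox type, the `BuildReceiver` helper, and the wiring are illustrative stand-ins, not the PR's actual fixture code — only the test name, the always-throwing `SocketException` behaviour, and the 5-second budget come from the description above:

```csharp
// Illustrative shape only; the real test lives in Wolverine's
// Runtime.WorkerQueues suite with its own fixture types.
[Fact]
public async Task drain_terminates_within_seconds_when_inbox_release_throws_repeatedly()
{
    // Stub inbox whose ReleaseIncomingAsync always throws, simulating a
    // disposed Npgsql DbDataSource behind a dead socket.
    var inbox = new AlwaysThrowingInbox(() => new SocketException());
    var receiver = BuildReceiver(inbox); // hypothetical wiring helper

    var drain = receiver.DrainAsync();
    var finished = await Task.WhenAny(drain, Task.Delay(TimeSpan.FromSeconds(5)));

    // Pre-fix, the unbounded retry loop never returns and this fails.
    Assert.Same(drain, finished);
}
```

Racing the drain against `Task.Delay` via `Task.WhenAny` is one common way to assert a time budget without blocking the test runner indefinitely on a hang.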