Fix stale messages during rollout and improve graceful shutdown #2287

Merged
jeremydmiller merged 1 commit into main from fix/2279-stale-message-recovery on Mar 11, 2026

@jeremydmiller (Member) commented Mar 11, 2026

Summary

Orphaned message recovery (#2279)

  • Adds periodic recovery of messages stranded by dead nodes during rolling deployments
  • ReleaseOrphanedMessagesOperation (main databases): uses a NOT IN (SELECT node_number FROM wolverine_nodes) subquery against the co-located wolverine_nodes table
  • ReleaseOrphanedMessagesForAncillaryOperation (ancillary/tenant databases): fetches the active node numbers from the main store and builds an explicit list, e.g. NOT IN (1, 2, 3)
  • Both operations run every recovery cycle in DurabilityAgent.buildOperationBatch()
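The two release strategies described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions, not Wolverine's actual C# operations: the `incoming` table name, the function names, the `owner_id != 0` guard, and the choice to skip the release when the active node list is empty are all hypothetical. Only `owner_id`, `wolverine_nodes`, and `node_number` come from the description.

```python
import sqlite3


def release_orphans_main(conn):
    """Main-store variant: the nodes table lives in the same database,
    so a NOT IN subquery can find owners that no longer exist."""
    cur = conn.execute(
        "UPDATE incoming SET owner_id = 0 "
        "WHERE owner_id != 0 "  # assumption: 0 already means 'unowned'
        "AND owner_id NOT IN (SELECT node_number FROM wolverine_nodes)"
    )
    return cur.rowcount


def release_orphans_ancillary(conn, active_nodes):
    """Ancillary variant: there is no local nodes table, so the caller
    passes in node numbers fetched from the main store."""
    active = list(active_nodes)
    if not active:
        # Assumption: with no known active nodes, release nothing rather
        # than risk releasing every owned message.
        return 0
    placeholders = ", ".join("?" for _ in active)
    cur = conn.execute(
        "UPDATE incoming SET owner_id = 0 "
        f"WHERE owner_id != 0 AND owner_id NOT IN ({placeholders})",
        active,
    )
    return cur.rowcount


def fetch_active_nodes(main_conn):
    """Read the active node numbers from the main store."""
    rows = main_conn.execute("SELECT node_number FROM wolverine_nodes")
    return [row[0] for row in rows]
```

In this sketch a recovery cycle would call `release_orphans_main` on the main store, then pass `fetch_active_nodes(main)` to `release_orphans_ancillary` for each ancillary database. The node numbers are bound as parameters rather than formatted into the SQL text.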

Graceful shutdown improvements (#2282)

  • Reorder shutdown sequence: drain endpoints first (complete in-flight handlers), then release ownership and tear down agents
  • Add bounded WaitForCompletionAsync in DurableReceiver.DrainAsync() and BufferedReceiver.DrainAsync() so in-flight message handlers finish before ownership is released
  • Add configurable DrainTimeout setting (default 30 seconds) to DurabilitySettings
  • New GracefulShutdown and RollingRestart chaos test scripts exercising shutdown scenarios
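The drain-then-release ordering with a bounded wait, as the summary describes it, can be sketched with asyncio. This is a language-neutral illustration and not the code in DurableReceiver or BufferedReceiver: the function names, the `release_ownership` callback, and the boolean return value are hypothetical. The 30-second default mirrors the DrainTimeout default stated above.

```python
import asyncio


async def drain(in_flight, drain_timeout=30.0):
    """Wait up to drain_timeout seconds for in-flight handler tasks.

    Returns True if every handler finished, False if the timeout expired
    with handlers still running. Unfinished tasks are not cancelled here.
    """
    if not in_flight:
        return True  # asyncio.wait rejects an empty set, so return early
    _done, pending = await asyncio.wait(in_flight, timeout=drain_timeout)
    return not pending


async def shutdown(in_flight, release_ownership, drain_timeout=30.0):
    """Drain first, then release ownership, so messages whose handlers
    are about to finish are not handed back prematurely."""
    drained = await drain(in_flight, drain_timeout)
    release_ownership()  # hypothetical callback standing in for the release step
    return drained
```

Bounding the wait keeps a stuck handler from blocking shutdown indefinitely. In this sketch, ownership is still released after the timeout, so another node can pick up whatever did not complete.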

Test plan

  • CoreTests: 1160 passed, 0 failed
  • PostgreSQL tests: 331 passed, 4 failed (pre-existing compliance test flakiness)
  • SQL Server tests: 294 passed, 1 failed (pre-existing tracking/timing issue)
  • Added GracefulShutdown and RollingRestart chaos test scripts with 6 new test methods

Closes #2279
Closes #2282

🤖 Generated with Claude Code

During rolling deployments, a race condition can strand messages with
owner_id pointing to a node that no longer exists. The DurabilityAgent
now periodically releases these orphaned messages by resetting owner_id
to 0 for any message whose owner is not in the active nodes list.

For main databases, uses a subquery against the co-located wolverine_nodes
table. For ancillary/tenant databases, fetches active node numbers from
the main store and builds an explicit NOT IN list.

Closes #2279

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@jeremydmiller merged commit e93531b into main on Mar 11, 2026
6 of 11 checks passed
@jeremydmiller changed the title from "Release orphaned inbox/outbox messages owned by dead nodes" to "Fix stale messages during rollout and improve graceful shutdown" on Mar 11, 2026
@linxuhao

Weird, I don't see changes related to Graceful shutdown improvements (#2282) about WaitForCompletionAsync

Development

Successfully merging this pull request may close these issues.

  • Wolverine instance graceful shutdown
  • Wolverine stale messages during rollout (Teardown Lifecycle Race Condition)
