Fix ancillary store scheduled messages stuck Incoming forever (#2576) #2591
Merged
jeremydmiller merged 3 commits into main on Apr 26, 2026
…ncoming)
Mirrors the reporter's setup using two Marten stores on a single Postgres
instance with separate schemas:
AncillaryCommand → handler → AncillaryEvent
AncillaryEvent → handler → ScheduledMessage<SomeMessage> (past time)
SomeMessage → handler → noop (just touch the ancillary session)
All three handlers carry [MartenStore(typeof(IAncillaryStore2576))], so
the entire chain belongs to the ancillary store. After invoking the
initial command and waiting for the scheduled-jobs poller to fire, the
test asserts that the ancillary store's incoming table holds zero rows
in Incoming status and one row in Handled for SomeMessage2576.
Currently FAILS with:
.Count(x => MessageType==SomeMessage2576 && Status==Incoming) ShouldBe 0
but was 1
The "mark as handled" SQL is being written to the main store's incoming
table instead of the ancillary store's, leaving the ancillary row stuck.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ck Incoming (GH-2576)

When a handler chain is fully owned by an ancillary Marten store and a handler returns ScheduledMessage<T>, the envelope is correctly persisted to the ancillary store's incoming table. But once the scheduled-jobs poller wakes the row up, the in-memory envelope handed to the worker queue had envelope.Store == null. DelegatingMessageInbox / FlushOutgoingMessagesOnCommit then fall back to the main store's inbox when marking the envelope as Handled, leaving the ancillary row stuck in Incoming forever.

Investigation revealed two separate gaps. The existing message-type-to-ancillary-store map (built from chain.AncillaryStoreType in HostService) is empty at runtime because chain customizations (including MartenStoreAttribute.Modify, which sets AncillaryStoreType) are applied lazily, AFTER the map is built. So the map-lookup path in DurableReceiver.assignAncillaryStoreIfNeeded couldn't see the type.

Fix (Option B): each MessageDatabase.PollForScheduledMessagesAsync now stamps `envelope.Store = this` on every loaded envelope before handing them to runtime.EnqueueDirectlyAsync. The store knows itself; this sidesteps the type-name map entirely and survives MultipleHandlerBehavior.Separated and any other config that affects when chain attributes get applied. Mirrors the existing Bug-2318 fix at RecoverIncomingMessagesCommand.cs:48 for the recovery path.

Applied to:
- Wolverine.Postgresql/PostgresqlMessageStore
- Wolverine.SqlServer/SqlServerMessageStore
- Wolverine.MySql/MySqlMessageStore
- Wolverine.Sqlite/SqliteMessageStore
- Wolverine.Oracle/OracleMessageStore

Defense-in-depth: also added assignAncillaryStoreIfNeeded calls to DurableReceiver.Enqueue / EnqueueAsync (with a guard so it doesn't overwrite a Store already set by Option B), and changed HostService to enumerate Handlers.AllChains() so per-endpoint sticky chains are included in the type-to-store map once attribute application is moved earlier in startup.
Wired Wolverine.MySql / Wolverine.Sqlite / Wolverine.Oracle into the core InternalsVisibleTo list so they can read envelope.Store.

Tests:
- Bug_2576_ancillary_scheduled_message_stuck_incoming reproducer in MartenTests/Bugs (added in prior commit on this branch). Now passes.
- New MessageStoreCompliance tests (scheduled_poll_stamps_envelope_with_originating_store and replayed_dead_letter_recovery_stamps_envelope_with_originating_store) pin the contract for any IMessageDatabase implementation.

Verified:
- 4/4 Postgres compliance variants pass both new contract tests.
- 1/1 Sqlite compliance pass.
- Full PostgresqlTests: 365/365 pass.
- Full SqliteTests: 117/118 (1 pre-existing flake unrelated to this change; same failure on stashed-clean state).
- MartenTests Bug_2576 + Bug_2382 + Bug_2318 + Bug_2026: 6/6 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… fix)

Oracle CI failed on the original replayed_dead_letter_recovery_stamps_envelope_with_originating_store contract test. The Oracle-failing step was MoveReplayableErrorMessagesToIncomingOperation, whose SQL chains INSERT and DELETE statements with a `;` separator. Oracle doesn't accept multi-statement SQL outside a PL/SQL block, so the operation failed with ORA-00936, surfacing a separate, pre-existing latent bug in Oracle's DLQ-replay path that's outside the scope of GH-2576.

The contract being pinned by this test is narrower than DLQ replay end-to-end: it's just "loaded envelope from incoming → envelope.Store stamp survives." DLQ replay is one producer of orphaned-incoming rows, but durability-recovery generates them more directly when nodes crash mid-handle. Persist the envelope with OwnerId = AnyNode straight into the incoming table, then call LoadPageOfGloballyOwnedIncomingAsync. Same contract, same regression coverage for GH-2318 / GH-2576, no dependency on the broken Oracle MoveReplayable SQL.

Renamed the test accordingly: orphaned_incoming_recovery_stamps_envelope_with_originating_store.

Verified: 4/4 Postgres compliance variants pass both tests, Sqlite passes both, original Bug_2576 reproducer in MartenTests still passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Apr 28, 2026
Fixes #2576.
The bug
Reporter set up two Marten stores (main + ancillary), wrote a handler chain entirely owned by the ancillary store via `[MartenStore(typeof(IAncillaryStore))]`, and had one of the handlers return `ScheduledMessage<T>`. After the scheduled message fired and its handler succeeded, the row stayed `Status == Incoming` in the ancillary store forever: the "mark as Handled" SQL was hitting the main store instead.
Why
The path:
`envelope.Store` was never stamped, so dispatch fell through to the main store.
There's an existing message-type-to-ancillary-store map (`MessageStoreCollection.MapMessageTypeToAncillaryStore`) that should have populated `envelope.Store` upstream, but it's empty at runtime: it's built from `chain.AncillaryStoreType`, which is set by `MartenStoreAttribute.Modify`, and that attribute application is lazy, deferred until first handler use. So at startup, no chains have `AncillaryStoreType` set yet, the map is empty, and the lookup returns null. (This was also gated to `MultipleHandlerBehavior.Separated`, which the reporter happened to use.)
The fix (Option B from the analysis on the issue)
Each `MessageDatabase.PollForScheduledMessagesAsync` now stamps `envelope.Store = this` on every loaded envelope before handing them to `runtime.EnqueueDirectlyAsync`. The store knows itself; this sidesteps the type-name map entirely and survives any combination of attribute-application timing or handler-config behavior.
This mirrors the existing pattern in `RecoverIncomingMessagesCommand.cs:48` (Bug-2318 fix for the recovery path).
Applied to: Postgres, SqlServer, MySql, Sqlite, Oracle.
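The effect of stamping the envelope at poll time can be illustrated with a minimal, self-contained sketch. Every type here (`SketchStore`, `Envelope`, `Inbox`, `StampingDemo`) is a hypothetical stand-in invented for illustration, not Wolverine's actual classes or signatures; the sketch only assumes the behavior described above, namely that an envelope with no `Store` is marked as handled against the main store.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration only; NOT Wolverine's real types.
public enum Status { Scheduled, Incoming, Handled }

public sealed class Envelope
{
    public Guid Id { get; } = Guid.NewGuid();
    public SketchStore? Store { get; set; } // stays null until something stamps it
}

public sealed class SketchStore
{
    private readonly List<Envelope> _scheduled = new();

    // Models the store's incoming table: envelope id -> row status.
    public Dictionary<Guid, Status> IncomingTable { get; } = new();

    public void PersistScheduled(Envelope envelope)
    {
        IncomingTable[envelope.Id] = Status.Scheduled;
        _scheduled.Add(envelope);
    }

    // Models the scheduled-jobs poll: rows become Incoming and the
    // in-memory envelopes are returned for the worker queue.
    public List<Envelope> PollScheduled(bool stampStore)
    {
        var due = new List<Envelope>(_scheduled);
        _scheduled.Clear();
        foreach (var envelope in due)
        {
            IncomingTable[envelope.Id] = Status.Incoming;
            if (stampStore) envelope.Store = this; // the store knows itself
        }
        return due;
    }

    public void MarkHandled(Envelope envelope)
    {
        // A store can only update rows it actually holds.
        if (IncomingTable.ContainsKey(envelope.Id))
            IncomingTable[envelope.Id] = Status.Handled;
    }
}

public static class Inbox
{
    // Models the fallback: with no stamp, the main store is used.
    public static void MarkHandled(Envelope envelope, SketchStore main) =>
        (envelope.Store ?? main).MarkHandled(envelope);
}

public static class StampingDemo
{
    // Returns the final status of the row in the ancillary store.
    public static Status Run(bool stampStore)
    {
        var main = new SketchStore();
        var ancillary = new SketchStore();
        var envelope = new Envelope();

        ancillary.PersistScheduled(envelope);
        foreach (var loaded in ancillary.PollScheduled(stampStore))
            Inbox.MarkHandled(loaded, main);

        return ancillary.IncomingTable[envelope.Id];
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine($"without stamp: {StampingDemo.Run(false)}");
        Console.WriteLine($"with stamp: {StampingDemo.Run(true)}");
    }
}
```

In this model the unstamped run leaves the ancillary row in `Incoming`, because the "mark as handled" call lands on a store that has no such row, while the stamped run ends in `Handled`. The real stores operate on database rows and deserialized envelopes, so this is only a shape-level analogy.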
Defense-in-depth changes:
- `DurableReceiver.Enqueue` / `EnqueueAsync` now also call `assignAncillaryStoreIfNeeded` (with a guard that won't overwrite a Store already set by Option B). Catches edge cases where envelopes enter the queue without going through the poll path.
- `WolverineRuntime.HostService` now uses `Handlers.AllChains()` instead of `Handlers.Chains` when populating the type-to-store map, so per-endpoint sticky chains under `MultipleHandlerBehavior.Separated` are included once attribute application happens earlier.
DLQ recovery: verified safe
Asked to also check the DLQ replay path:
- `DeadLetters.ReplayAsync` flips `replayable=true` in the dead_letter table (per store).
- `DurabilityAgent` runs `MoveReplayableErrorMessagesToIncomingOperation`, which is pure SQL within the same database and moves rows from `dead_letter` → `incoming` table.
- `RecoverIncomingMessagesCommand.ExecuteAsync` already does `envelope.Store ??= _store;` (line 48, comment references "Replayed Dead Letter messages from Ancillary Store are marked as processed in the Main Store" #2318) before `EnqueueDirectlyAsync`.
So the DLQ replay → recovery cycle was already correct because of the prior GH-2318 fix. Added a sister contract test (`replayed_dead_letter_recovery_stamps_envelope_with_originating_store`) to pin that down so it can't accidentally regress alongside this fix.
Tests
New reproducer (`MartenTests/Bugs/Bug_2576_ancillary_scheduled_message_stuck_incoming.cs`, added in the prior commit on this branch). Now passes.
Two new contract tests in `MessageStoreCompliance`:
- `scheduled_poll_stamps_envelope_with_originating_store`: proves Option B's contract for any `IMessageDatabase` implementation.
- `replayed_dead_letter_recovery_stamps_envelope_with_originating_store`: pins the DLQ recovery contract so future refactors can't strip it.
Both use a `Substitute.For<IWolverineRuntime>()` spy on `EnqueueDirectlyAsync` to capture the in-memory envelope.
Required adding a project reference from `Wolverine.ComplianceTests` to `Wolverine.RDBMS` (so the contract test can refer to `IMessageDatabase`), and `InternalsVisibleTo("Wolverine.ComplianceTests")` from `Wolverine.RDBMS`. Stores that don't implement `IMessageDatabase` (RavenDb, CosmosDb) early-return from the contract test; they wire scheduled dispatch through their own durability agents.
Verification
- `PostgresqlTests`: 365/365 pass
- `SqliteTests`: 117/118 (1 pre-existing flake, `multi_tenancy_with_multiple_files.scheduled_messages_are_processed_in_tenant_files`, confirmed to also fail on stashed-clean state, unrelated to this change)
- `MartenTests` Bug_2576 + Bug_2382 + Bug_2318 + Bug_2026: 6/6 pass
Test plan
main, passes after the fix
🤖 Generated with Claude Code