
Fix ancillary store scheduled messages stuck Incoming forever (#2576)#2591

Merged
jeremydmiller merged 3 commits into main from
bugfix/2576-ancillary-store-scheduled-stuck-incoming
Apr 26, 2026

Conversation

@jeremydmiller
Member

Fixes #2576.

The bug

Reporter set up two Marten stores (main + ancillary), wrote a handler chain entirely owned by the ancillary store via [MartenStore(typeof(IAncillaryStore))], and had one of the handlers return ScheduledMessage<T>. After the scheduled message fired and its handler succeeded, the row stayed Status == Incoming in the ancillary store forever — the "mark as Handled" SQL was hitting the main store instead.

Why

The path:

PollForScheduledMessagesAsync (per-store)
  → loaded envelopes hand-off
  → runtime.EnqueueDirectlyAsync(envelopes)
  → ListeningAgent → DurableReceiver.EnqueueAsync(envelope)
  → handler runs, succeeds
  → DurableReceiver._markAsHandled
  → DelegatingMessageInbox.MarkIncomingEnvelopeAsHandledAsync
       (uses envelope.Store?.Inbox ?? _inner)
  → _inner == main store ✗

envelope.Store was never stamped, so dispatch fell through to the main store.
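To make the fall-through concrete, here is a minimal sketch of the `envelope.Store?.Inbox ?? _inner` fallback. The types below (`Store`, `Envelope`, `DelegatingInbox`) are hypothetical stand-ins written for this illustration, not Wolverine's real classes; only the shape of the null-coalescing expression comes from the trace above.

```csharp
#nullable enable
using System;

// Hypothetical stand-ins for illustration only; not Wolverine's real types.
public sealed class Inbox
{
    public Inbox(string name) => Name = name;
    public string Name { get; }
}

public sealed class Store
{
    public Store(string name) => Inbox = new Inbox(name);
    public Inbox Inbox { get; }
}

public sealed class Envelope
{
    // Null until some code path stamps the owning store.
    public Store? Store { get; set; }
}

public sealed class DelegatingInbox
{
    private readonly Inbox _inner; // the main store's inbox

    public DelegatingInbox(Inbox inner) => _inner = inner;

    // Same shape as the fallback in the trace: an unstamped envelope
    // silently resolves to the main store's inbox.
    public Inbox Resolve(Envelope envelope) => envelope.Store?.Inbox ?? _inner;
}

public static class FallThroughDemo
{
    public static (string Unstamped, string Stamped) Run()
    {
        var main = new Store("main");
        var ancillary = new Store("ancillary");
        var inbox = new DelegatingInbox(main.Inbox);

        var unstamped = new Envelope();                   // Store never set
        var stamped = new Envelope { Store = ancillary }; // Store set by its owner

        return (inbox.Resolve(unstamped).Name, inbox.Resolve(stamped).Name);
    }
}
```

In this sketch the unstamped envelope resolves to the main inbox and the stamped one to the ancillary inbox, which is the behavior difference the reporter observed.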

There's an existing message-type-to-ancillary-store map (MessageStoreCollection.MapMessageTypeToAncillaryStore) that should have populated envelope.Store upstream, but it's empty at runtime: it's built from chain.AncillaryStoreType, which is set by MartenStoreAttribute.Modify — and that attribute application is lazy, deferred until first handler use. So at startup, no chains have AncillaryStoreType set yet, the map is empty, and the lookup returns null. (This was also gated to MultipleHandlerBehavior.Separated, which the reporter happened to use.)
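The ordering problem can be sketched as follows. `Chain`, `ApplyAttributes`, and `BuildMap` are hypothetical names invented for this example; they only model "the map is built before the attribute has been applied" and do not reflect Wolverine's actual internals.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model of a handler chain whose attribute is applied lazily.
public sealed class Chain
{
    private readonly Type? _declaredStore;

    public Chain(string messageType, Type? declaredStore)
    {
        MessageType = messageType;
        _declaredStore = declaredStore;
    }

    public string MessageType { get; }

    // Stays null until ApplyAttributes() runs.
    public Type? AncillaryStoreType { get; private set; }

    // Stands in for the deferred attribute application on first handler use.
    public void ApplyAttributes() => AncillaryStoreType = _declaredStore;
}

// Hypothetical marker interface for the ancillary store.
public interface IAncillaryStoreMarker { }

public static class LazyMapDemo
{
    // Builds a message-type-to-store-type map from whatever is set right now.
    public static Dictionary<string, Type> BuildMap(IEnumerable<Chain> chains) =>
        chains.Where(c => c.AncillaryStoreType != null)
              .ToDictionary(c => c.MessageType, c => c.AncillaryStoreType!);

    public static (int AtStartup, int AfterFirstUse) Run()
    {
        var chains = new[] { new Chain("SomeMessage", typeof(IAncillaryStoreMarker)) };

        var atStartup = BuildMap(chains).Count;     // built before attributes are applied
        chains[0].ApplyAttributes();                // happens later, on first use
        var afterFirstUse = BuildMap(chains).Count; // only a rebuild would see the type

        return (atStartup, afterFirstUse);
    }
}
```

A map built once at startup keeps the empty result, so any later lookup by message type returns nothing even though the attribute is present on the handler.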

The fix (Option B from the analysis on the issue)

Each MessageDatabase.PollForScheduledMessagesAsync now stamps envelope.Store = this on every loaded envelope before handing them to runtime.EnqueueDirectlyAsync. The store knows itself; this sidesteps the type-name map entirely and survives any combination of attribute-application timing or handler-config behavior.

This mirrors the existing pattern in RecoverIncomingMessagesCommand.cs:48 (Bug-2318 fix for the recovery path).

Applied to: Postgres, SqlServer, MySql, Sqlite, Oracle.
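A minimal sketch of the stamping step, assuming simplified stand-in types: `IRuntimeSketch`, `RecordingRuntime`, and `MessageDatabaseSketch` are hypothetical and exist only for this example. The real method loads due envelopes from the database; here they are passed in so the sketch stays self-contained.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-ins for illustration only.
public sealed class Envelope
{
    public object? Store { get; set; }
}

public interface IRuntimeSketch
{
    Task EnqueueDirectlyAsync(IReadOnlyList<Envelope> envelopes);
}

// Records what was enqueued so the stamp can be inspected afterwards.
public sealed class RecordingRuntime : IRuntimeSketch
{
    public List<Envelope> Enqueued { get; } = new List<Envelope>();

    public Task EnqueueDirectlyAsync(IReadOnlyList<Envelope> envelopes)
    {
        Enqueued.AddRange(envelopes);
        return Task.CompletedTask;
    }
}

public sealed class MessageDatabaseSketch
{
    private readonly List<Envelope> _due;

    // In the real code the due envelopes come from a SQL query (assumption:
    // simplified here to an in-memory list).
    public MessageDatabaseSketch(List<Envelope> due) => _due = due;

    public async Task PollForScheduledMessagesAsync(IRuntimeSketch runtime)
    {
        // The store stamps itself on every envelope before hand-off,
        // so no type-name lookup is needed downstream.
        foreach (var envelope in _due)
        {
            envelope.Store = this;
        }

        await runtime.EnqueueDirectlyAsync(_due);
    }
}
```

Because the stamp is applied by the store that loaded the row, it does not depend on when handler attributes are applied or on which handler behavior is configured.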

Defense-in-depth changes:

  • DurableReceiver.Enqueue / EnqueueAsync now also call assignAncillaryStoreIfNeeded (with a guard that won't overwrite a Store already set by Option B). Catches edge cases where envelopes enter the queue without going through the poll path.
  • WolverineRuntime.HostService now uses Handlers.AllChains() instead of Handlers.Chains when populating the type-to-store map, so per-endpoint sticky chains under MultipleHandlerBehavior.Separated will be included in the map once attribute application is moved earlier in startup.
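The "won't overwrite" guard can be sketched like this. `AncillaryStoreGuard.AssignIfNeeded` and the dictionary-based map are hypothetical simplifications for illustration; the real assignAncillaryStoreIfNeeded is internal to DurableReceiver and its exact signature is not shown in this PR description.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-in for illustration only.
public sealed class Envelope
{
    public Envelope(string messageType) => MessageType = messageType;
    public string MessageType { get; }
    public object? Store { get; set; }
}

public static class AncillaryStoreGuard
{
    // Assigns a store from the type map only when none has been stamped yet.
    public static void AssignIfNeeded(
        Envelope envelope,
        IReadOnlyDictionary<string, object> storesByMessageType)
    {
        // Guard: a stamp applied upstream (for example by the poll path) wins.
        if (envelope.Store != null) return;

        if (storesByMessageType.TryGetValue(envelope.MessageType, out var store))
        {
            envelope.Store = store;
        }
        // Otherwise Store stays null and dispatch falls back to the main store.
    }
}
```

The guard keeps the two mechanisms from fighting: the poll path is authoritative, and the map lookup only fills in envelopes that arrive without a stamp.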

DLQ recovery — verified safe

Asked to also check the DLQ replay path:

  1. DeadLetters.ReplayAsync flips replayable=true in the dead_letter table (per store).
  2. Per-store DurabilityAgent runs MoveReplayableErrorMessagesToIncomingOperation, which is pure SQL within the same database: it moves rows from the dead_letter table → incoming table.
  3. RecoverIncomingMessagesCommand.ExecuteAsync already does envelope.Store ??= _store; before EnqueueDirectlyAsync (line 48; the comment there references #2318, "Replayed Dead Letter messages from Ancillary Store are marked as processed in the Main Store").

So the DLQ replay → recovery cycle was already correct because of the prior GH-2318 fix. Added a sister contract test (replayed_dead_letter_recovery_stamps_envelope_with_originating_store) to pin that down so it can't accidentally regress alongside this fix.

Tests

New reproducer (MartenTests/Bugs/Bug_2576_ancillary_scheduled_message_stuck_incoming.cs, added in the prior commit on this branch). Now passes.

Two new contract tests in MessageStoreCompliance:

  • scheduled_poll_stamps_envelope_with_originating_store — proves Option B's contract for any IMessageDatabase implementation.
  • replayed_dead_letter_recovery_stamps_envelope_with_originating_store — pins the DLQ recovery contract so future refactors can't strip it.

Both use a Substitute.For<IWolverineRuntime>() spy on EnqueueDirectlyAsync to capture the in-memory envelope.

Required adding a project reference from Wolverine.ComplianceTests to Wolverine.RDBMS (so the contract test can refer to IMessageDatabase), and InternalsVisibleTo("Wolverine.ComplianceTests") from Wolverine.RDBMS. Stores that don't implement IMessageDatabase (RavenDb, CosmosDb) early-return from the contract test — they wire scheduled dispatch through their own durability agents.

Verification

  • 4/4 Postgres compliance variants pass both new contract tests
  • 1/1 Sqlite compliance passes
  • Full PostgresqlTests: 365/365 pass
  • Full SqliteTests: 117/118 (1 pre-existing flake — multi_tenancy_with_multiple_files.scheduled_messages_are_processed_in_tenant_files — confirmed to also fail on stashed-clean state, unrelated to this change)
  • MartenTests Bug_2576 + Bug_2382 + Bug_2318 + Bug_2026: 6/6 pass

Test plan

🤖 Generated with Claude Code

jeremydmiller and others added 3 commits April 25, 2026 20:17
…ncoming)

Mirrors the reporter's setup using two Marten stores on a single Postgres
instance with separate schemas:

  AncillaryCommand → handler → AncillaryEvent
  AncillaryEvent   → handler → ScheduledMessage<SomeMessage> (past time)
  SomeMessage      → handler → noop (just touch the ancillary session)

All three handlers carry [MartenStore(typeof(IAncillaryStore2576))], so
the entire chain belongs to the ancillary store. After invoking the
initial command and waiting for the scheduled-jobs poller to fire, the
test asserts that the ancillary store's incoming table holds zero rows
in Incoming status and one row in Handled for SomeMessage2576.

Currently FAILS with:
    .Count(x => MessageType==SomeMessage2576 && Status==Incoming) ShouldBe 0
    but was 1

The "mark as handled" SQL is being written to the main store's incoming
table instead of the ancillary store's, leaving the ancillary row stuck.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ck Incoming (GH-2576)

When a handler chain is fully owned by an ancillary Marten store and a
handler returns ScheduledMessage<T>, the envelope is correctly persisted
to the ancillary store's incoming table. But once the scheduled-jobs
poller wakes the row up, the in-memory envelope handed to the worker
queue had envelope.Store == null. DelegatingMessageInbox /
FlushOutgoingMessagesOnCommit then fall back to the main store's inbox
when marking the envelope as Handled, leaving the ancillary row stuck in
Incoming forever.

Investigation revealed two separate gaps. The existing
message-type-to-ancillary-store map (built from chain.AncillaryStoreType
in HostService) is empty at runtime because chain customizations,
including MartenStoreAttribute.Modify which sets AncillaryStoreType, are
applied lazily, AFTER the map is built. So the map-lookup path in
DurableReceiver.assignAncillaryStoreIfNeeded couldn't see the type.

Fix (Option B): each MessageDatabase.PollForScheduledMessagesAsync now
stamps `envelope.Store = this` on every loaded envelope before handing
them to runtime.EnqueueDirectlyAsync. The store knows itself; this
sidesteps the type-name map entirely and survives
MultipleHandlerBehavior.Separated and any other config that affects when
chain attributes get applied. Mirrors the existing Bug-2318 fix at
RecoverIncomingMessagesCommand.cs:48 for the recovery path.

Applied to:
- Wolverine.Postgresql/PostgresqlMessageStore
- Wolverine.SqlServer/SqlServerMessageStore
- Wolverine.MySql/MySqlMessageStore
- Wolverine.Sqlite/SqliteMessageStore
- Wolverine.Oracle/OracleMessageStore

Defense-in-depth: also added assignAncillaryStoreIfNeeded calls to
DurableReceiver.Enqueue / EnqueueAsync (with a guard so it doesn't
overwrite a Store already set by Option B), and changed HostService to
enumerate Handlers.AllChains() so per-endpoint sticky chains are
included in the type-to-store map once attribute application is moved
earlier in startup.

Wired Wolverine.MySql / Wolverine.Sqlite / Wolverine.Oracle into the
core InternalsVisibleTo list so they can read envelope.Store.

Tests:
- Bug_2576_ancillary_scheduled_message_stuck_incoming reproducer in
  MartenTests/Bugs (added in prior commit on this branch). Now passes.
- New MessageStoreCompliance tests
  (scheduled_poll_stamps_envelope_with_originating_store and
  replayed_dead_letter_recovery_stamps_envelope_with_originating_store)
  pin the contract for any IMessageDatabase implementation.

Verified:
- 4/4 Postgres compliance variants pass both new contract tests.
- 1/1 Sqlite compliance pass.
- Full PostgresqlTests: 365/365 pass.
- Full SqliteTests: 117/118 (1 pre-existing flake unrelated to this
  change — same failure on stashed-clean state).
- MartenTests Bug_2576 + Bug_2382 + Bug_2318 + Bug_2026: 6/6 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… fix)

Oracle CI failed on the original
replayed_dead_letter_recovery_stamps_envelope_with_originating_store
contract test. The Oracle-failing step was
MoveReplayableErrorMessagesToIncomingOperation, whose SQL chains INSERT
and DELETE statements with a `;` separator. Oracle doesn't accept
multi-statement SQL outside a PL/SQL block, so the operation failed with
ORA-00936, surfacing a separate, pre-existing latent bug in Oracle's
DLQ-replay path that's outside the scope of GH-2576.

The contract being pinned by this test is narrower than DLQ replay
end-to-end: it's just "loaded envelope from incoming → envelope.Store
stamp survives." DLQ replay is one producer of orphaned-incoming rows,
but durability-recovery generates them more directly when nodes crash
mid-handle. Persist the envelope with OwnerId = AnyNode straight into
the incoming table, then call LoadPageOfGloballyOwnedIncomingAsync.

Same contract, same regression coverage for GH-2318 / GH-2576, no
dependency on the broken Oracle MoveReplayable SQL. Renamed the test
accordingly: orphaned_incoming_recovery_stamps_envelope_with_originating_store.

Verified: 4/4 Postgres compliance variants pass both tests, Sqlite
passes both, original Bug_2576 reproducer in MartenTests still passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

Scheduled messages from ancillary Marten store get stuck in "Incoming" status.