Skip to content

Fix scheduled-cascade loss from [ReadAggregate]/[DocumentExists] handlers (closes #2941)#2943

Merged
jeremydmiller merged 4 commits into
mainfrom
fix-2941-readaggregate-scheduled-cascade
May 28, 2026
Merged

Fix scheduled-cascade loss from [ReadAggregate]/[DocumentExists] handlers (closes #2941)#2943
jeremydmiller merged 4 commits into
mainfrom
fix-2941-readaggregate-scheduled-cascade

Conversation

@jeremydmiller

@jeremydmiller jeremydmiller commented May 28, 2026

Copy link
Copy Markdown
Member

Closes #2941. Builds on @JurJean's repro from PR #2941 (test cherry-picked + extended).

The bug (verified)

A handler whose only Marten/Polecat dependency comes from an attribute - [ReadAggregate], [DocumentExists<T>], [DocumentDoesNotExist<T>] - silently lost any cascading DeliveryMessage<T>.DelayedFor(...) it emitted. Non-scheduled cascades from the same handlers were unaffected (they take Wolverine's in-memory delivery path), so the bug only surfaces when scheduling.

TrackedSession trace before the fix:

ScheduleViaReader     | 622 ms | ExecutionFinished
SomethingWasScheduled | 623 ms | Sent              ← recorded as Sent
ScheduleViaReader     | 626 ms | MessageSucceeded
                                  (never Received - 30s timeout)

The reporter saw it directly: StoreIncomingEnvelope shows up in the unit of work but is never saved.

Root cause (Wolverine side)

  1. AutoApplyTransactions decides which IPersistenceFrameProvider to apply by calling CanApply, which queries chain.ServiceDependencies() for IDocumentSession.
  2. Chain.serviceDependencies (src/Wolverine/Configuration/Chain.cs:276) only walks Middleware.OfType<MethodCall>() - other frame types are not inspected.
  3. [ReadAggregate] injects FetchLatestAggregateFrame (AsyncFrame). [DocumentExists<T>]/[DocumentDoesNotExist<T>] inject DocumentExistenceCheckFrame (AsyncFrame). Both depend on IDocumentSession but neither is a MethodCall, so the dependency is invisible to ServiceDependencies.
  4. WolverineParameterAttribute.Modify and ModifyChainAttribute.Modify both run lazily in HandlerChain.applyCustomizations - long after AutoApplyTransactions has evaluated CanApply. So self-declaring the dependency from inside Modify is too late.

Result: CanApply returns false, no DocumentSessionSaveChanges postprocessor is attached, and MartenEnvelopeTransaction.PersistIncomingAsync / its Polecat equivalent queue StoreIncoming(...) on the session that is never flushed.

[WriteAggregate] and [BoundaryModel] do not need this fix - their Modify() paths explicitly call ApplyTransactionSupport themselves (AggregateHandling.Apply line 43; BoundaryModelAttribute.cs:86). Saga chains hit the SagaChain short-circuit in CanApply.

Polecat side companion fix (shipped in Polecat 4.2.1)

Diagnosing this surfaced a second bug on the Polecat side: DocumentSessionBase.SaveChangesAsync early-returned when _workTracker.HasOutstandingWork() was false. _workTracker tracks document ops + event streams - it does NOT include ITransactionParticipants. Wolverine.Polecat's Session.StoreIncoming(...) adds a StoreIncomingEnvelopeParticipant whose BeforeCommitAsync writes the scheduled inbox row; if that's the only thing on the session (handler had no doc ops), SaveChangesAsync skipped the entire pipeline and the participant's BeforeCommitAsync never ran. Marten doesn't have this gap.

Fixed upstream in JasperFx/polecat#161, shipped as Polecat 4.2.1. This PR bumps the Polecat pin to 4.2.1 so the Wolverine-side CanApply fix actually closes the loop.

Fix (this PR, Wolverine side)

Direct attribute detection in MartenPersistenceFrameProvider.CanApply and PolecatPersistenceFrameProvider.CanApply. Walks handler-method parameters for [ReadAggregate], and handler-method / handler-type / message-type attributes for DocumentExistsAttribute<> / DocumentDoesNotExistAttribute<>. Detection works by reflection on existing handler metadata, so it does not depend on attribute Modify() having run yet.

Also bumps Polecat 4.1.1 → 4.2.1 in Directory.Packages.props to pick up polecat#161.

Tests + verification

Path Test Status
Marten [ReadAggregate] Bug_aggregate_should_still_publish (cherry-picked from #2941; fixed the test bug of PublishReader missing from IncludeType) local 6/6 pass; CI green
Marten [DocumentExists<T>]/[DocumentDoesNotExist<T>] Bug_2941_document_exists_scheduled_cascade (new) local 2/2 pass; negative-control 2/2 fail without the DocumentExists branch
Polecat [ReadAggregate] Bug_2941_read_aggregate_scheduled_cascade (new) exercised end-to-end on CI (was [Skip]'d in earlier commits pending polecat#161; unskipped now that Polecat 4.2.1 ships the upstream companion fix)
Polecat [DocumentExists<T>]/[DocumentDoesNotExist<T>] Bug_2941_document_exists_scheduled_cascade (new) same — unskipped now that 4.2.1 ships

Full Marten Bugs sweep: 45/45 pass. Full dotnet build wolverine.slnx -c Release clean (0 warnings, 0 errors).

Out of scope (follow-up worth tracking separately)

The IMartenDataRequirement / IPolecatDataRequirement return-value continuation strategies create MartenDataRequirementFrame / PolecatDataRequirementFrame (AsyncFrames using IDocumentSession) via a different code path - they are produced by *DataRequirementContinuationStrategy.TryFindContinuationHandler when a handler returns I*DataRequirement. The same scheduled-cascade-loss class could in theory occur there too; would want its own test scenarios in a follow-up.

🤖 Generated with Claude Code

jeremydmiller and others added 4 commits May 28, 2026 07:44
…lers (GH-2941)

Closes #2941. Co-authored from @JurJean's repro in PR #2941.

## Root cause

A handler whose only Marten/Polecat dependency comes from a parameter or chain
attribute - [ReadAggregate], [DocumentExists<T>], [DocumentDoesNotExist<T>] -
silently lost any cascading DeliveryMessage<T>.DelayedFor(...) it emitted. The
non-scheduled cascade case is unaffected because local non-scheduled cascades
take Wolverine's in-memory delivery path that does not require flushing the
session.

Mechanism:

1. AutoApplyTransactions decides which persistence provider to apply by calling
   IPersistenceFrameProvider.CanApply, which queries chain.ServiceDependencies()
   for IDocumentSession.

2. Chain.serviceDependencies only walks Middleware.OfType<MethodCall>() - other
   frame types (AsyncFrame, etc.) are not inspected.

3. [ReadAggregate] injects FetchLatestAggregateFrame (AsyncFrame). [DocumentExists<T>]/
   [DocumentDoesNotExist<T>] inject DocumentExistenceCheckFrame (AsyncFrame).
   Both depend on IDocumentSession but neither is a MethodCall, so the
   dependency is invisible to ServiceDependencies.

4. Worse: WolverineParameterAttribute.Modify and ModifyChainAttribute.Modify
   both run lazily inside HandlerChain.applyCustomizations - long AFTER
   AutoApplyTransactions has evaluated CanApply. Even adding chain.AddDependencyType
   inside Modify is too late.

Result: CanApply returns false, no DocumentSessionSaveChanges postprocessor is
attached, and MartenEnvelopeTransaction.PersistIncomingAsync / its Polecat
equivalent queue StoreIncoming(...) on the session that is never flushed. The
scheduled envelope never lands in wolverine_incoming_envelopes, so the scheduler
never picks it up.

[WriteAggregate] and [BoundaryModel] do not need this fix - their Modify()
paths explicitly call new XxxPersistenceFrameProvider().ApplyTransactionSupport(...)
themselves (AggregateHandling.Apply, BoundaryModelAttribute.Modify). Saga
chains hit the SagaChain short-circuit in CanApply.

## Fix

Add direct attribute detection to MartenPersistenceFrameProvider.CanApply and
PolecatPersistenceFrameProvider.CanApply - walk handler-method parameters for
[ReadAggregate], and handler-method/handler-type/message-type attributes for
DocumentExistsAttribute<>/DocumentDoesNotExistAttribute<>. Detection happens by
reflection on the existing handler metadata, so it doesn't depend on
attribute Modify having run yet.

## Tests

- src/Persistence/MartenTests/Bugs/Bug_aggregate_should_still_publish.cs:
  carries @JurJean's PR #2941 scheduled-cascade scenarios. Also fixed the test
  setup by adding PublishReader to IncludeType - the PR refactored the single
  ScheduleReader class into PublishReader (non-scheduled) and ScheduleReader
  (scheduled) but only kept ScheduleReader registered, which made the publishes
  test fail for "no handler" reasons rather than the real bug.

- Bug_2941_document_exists_scheduled_cascade.cs (Marten + Polecat parallels):
  pin the same contract for [DocumentExists<T>]/[DocumentDoesNotExist<T>].
  Negative-control verified locally: temporarily disabling the DocumentExists
  branch of the CanApply fix makes both tests fail (2/2).

- Bug_2941_read_aggregate_scheduled_cascade.cs (Polecat parallel of the Marten
  Bug_aggregate test): pins the [ReadAggregate] contract on the Polecat side.

Local verification: Marten Bugs sweep 45/45 (full Postgres-backed integration
tests including the new scheduled-cascade pins) + full `wolverine.slnx -c Release`
clean (0 warnings, 0 errors). The Polecat tests build cleanly; local SQL Server
2025 emulation on Apple Silicon is too slow to validate them locally (existing
Polecat durable tests time out the same way against this image), so they will
be validated by CI's native Linux SQL Server.

## Out of scope (follow-up)

The IMartenDataRequirement / IPolecatDataRequirement return-value continuation
strategies create MartenDataRequirementFrame / PolecatDataRequirementFrame
(AsyncFrames using IDocumentSession) via a different code path (handler return
value, not attribute injection). They could in theory have a parallel issue
when combined with a scheduled cascade, but require a separate test surface.

Co-Authored-By: Jur Balledux <JurJean@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ng upstream fix

Reverts the diagnostic; restores the proper CanApply attribute-detection in
PolecatPersistenceFrameProvider with a comment explaining that the Wolverine-side
fix alone is insufficient on Polecat.

Polecat 4.1.1's DocumentSessionBase.SaveChangesAsync (DocumentSessionBase.cs:312)
early-returns when _workTracker.HasOutstandingWork() is false. _workTracker tracks
document operations + event streams, not transaction participants. So a chain that
has only added a StoreIncomingEnvelopeParticipant via Session.StoreIncoming(...) -
the path PolecatEnvelopeTransaction.PersistIncomingAsync takes for scheduled
cascades from [ReadAggregate] / [DocumentExists] handlers - never has its
participant executed.

Marten doesn't have this gap; its session.SaveChangesAsync flushes queued ops
unconditionally.

The Polecat scheduled-cascade tests for [ReadAggregate] and [DocumentExists]/
[DocumentDoesNotExist] are temporarily Skip'd with a reason pointing at the
upstream Polecat issue. Unskip when a fixed Polecat version is pinned.
@jeremydmiller jeremydmiller merged commit 7e883e7 into main May 28, 2026
23 checks passed
outofrange-consulting pushed a commit to outofrange-consulting/wolverine that referenced this pull request May 28, 2026
…scade tests

Polecat 4.2.1 ships polecat#161, the upstream companion fix to JasperFxGH-2941. Its
DocumentSessionBase.SaveChangesAsync no longer early-returns when only
ITransactionParticipants are queued, so the StoreIncomingEnvelopeParticipant
added via Session.StoreIncoming(...) inside PolecatEnvelopeTransaction.
PersistIncomingAsync actually runs for scheduled cascades from
[ReadAggregate] / [DocumentExists] handlers that don't write any documents.

That lets the three previously [Skip]'d Polecat tests (added in JasperFx#2943 with a
Skip reason pointing at polecat#161) exercise the end-to-end path on CI:
  - Bug_2941_read_aggregate_scheduled_cascade.read_aggregate_handler_schedules_its_cascading_message
  - Bug_2941_document_exists_scheduled_cascade.document_exists_handler_schedules_its_cascading_message
  - Bug_2941_document_exists_scheduled_cascade.document_does_not_exist_handler_schedules_its_cascading_message

Also tightens the PolecatPersistenceFrameProvider.CanApply comment so it
documents the pairing with polecat#161 / Polecat 4.2.1 rather than the
'necessary but not sufficient' caveat from the original commit.

Local: full wolverine.slnx -c Release builds clean (0 warnings, 0 errors).
Marten Bugs sweep 45/45 pass. Polecat tests will be validated by CI - the
local SQL Server 2025 image on Apple Silicon is too slow to run them
(existing Polecat durable tests time out the same way, unrelated to this
change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
outofrange-consulting pushed a commit to outofrange-consulting/wolverine that referenced this pull request May 28, 2026
…JasperFx#2949)

Closes JasperFx#2949. The flake symptom on three consecutive PRs (JasperFx#2943, JasperFx#2947, JasperFx#2948
in May 2026) was:

  System.TimeoutException : Timed out waiting for expected response
  Wolverine.Runtime.Agents.AgentsStarted for original message <id>
  of type Wolverine.Runtime.Agents.StartAgents with a configured timeout
  of 10000 milliseconds

Tests affected: RavenDbTests.LeaderElection.leadership_election_compliance.
take_over_leader_ship_if_leader_becomes_stale and .leader_switchover_between_nodes.

Root cause: WolverineRuntime.Agents.InvokeAsync<T>(NodeDestination, IAgentCommand)
had an asymmetric timeout - same-node calls got 30s, remote-node calls got 10s.
The asymmetry is backwards: a remote request-reply traverses the control
endpoint + serialization + network, so it should have AT LEAST as much budget
as a same-node in-memory invocation, not less. Under load on shared GitHub
runners that 10s was a real timing race on the cross-node leadership-takeover
scenarios the failing tests exercise, where StartAgents -> AgentsStarted is
sent from the new leader to a target node and the runner can stall just long
enough for the reply not to land in time.

Fix: align remote with same-node at 30s. This is the production constant, not a
test-only setting - the same flake would (rarely) bite users on busy or
slow-network multi-node Wolverine clusters whose leadership-takeover scenario
hits the same StartAgents -> AgentsStarted ack. Same-node already accepts 30s
so the change is conservatively in-line with existing precedent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant