Fix scheduled-cascade loss from [ReadAggregate]/[DocumentExists] handlers (closes #2941)#2943
Merged
Merged
Conversation
…lers (GH-2941) Closes #2941. Co-authored from @JurJean's repro in PR #2941. ## Root cause A handler whose only Marten/Polecat dependency comes from a parameter or chain attribute - [ReadAggregate], [DocumentExists<T>], [DocumentDoesNotExist<T>] - silently lost any cascading DeliveryMessage<T>.DelayedFor(...) it emitted. The non-scheduled cascade case is unaffected because local non-scheduled cascades take Wolverine's in-memory delivery path that does not require flushing the session. Mechanism: 1. AutoApplyTransactions decides which persistence provider to apply by calling IPersistenceFrameProvider.CanApply, which queries chain.ServiceDependencies() for IDocumentSession. 2. Chain.serviceDependencies only walks Middleware.OfType<MethodCall>() - other frame types (AsyncFrame, etc.) are not inspected. 3. [ReadAggregate] injects FetchLatestAggregateFrame (AsyncFrame). [DocumentExists<T>]/ [DocumentDoesNotExist<T>] inject DocumentExistenceCheckFrame (AsyncFrame). Both depend on IDocumentSession but neither is a MethodCall, so the dependency is invisible to ServiceDependencies. 4. Worse: WolverineParameterAttribute.Modify and ModifyChainAttribute.Modify both run lazily inside HandlerChain.applyCustomizations - long AFTER AutoApplyTransactions has evaluated CanApply. Even adding chain.AddDependencyType inside Modify is too late. Result: CanApply returns false, no DocumentSessionSaveChanges postprocessor is attached, and MartenEnvelopeTransaction.PersistIncomingAsync / its Polecat equivalent queue StoreIncoming(...) on the session that is never flushed. The scheduled envelope never lands in wolverine_incoming_envelopes, so the scheduler never picks it up. [WriteAggregate] and [BoundaryModel] do not need this fix - their Modify() paths explicitly call new XxxPersistenceFrameProvider().ApplyTransactionSupport(...) themselves (AggregateHandling.Apply, BoundaryModelAttribute.Modify). Saga chains hit the SagaChain short-circuit in CanApply. ## Fix Add direct attribute detection to MartenPersistenceFrameProvider.CanApply and PolecatPersistenceFrameProvider.CanApply - walk handler-method parameters for [ReadAggregate], and handler-method/handler-type/message-type attributes for DocumentExistsAttribute<>/DocumentDoesNotExistAttribute<>. Detection happens by reflection on the existing handler metadata, so it doesn't depend on attribute Modify having run yet. ## Tests - src/Persistence/MartenTests/Bugs/Bug_aggregate_should_still_publish.cs: carries @JurJean's PR #2941 scheduled-cascade scenarios. Also fixed the test setup by adding PublishReader to IncludeType - the PR refactored the single ScheduleReader class into PublishReader (non-scheduled) and ScheduleReader (scheduled) but only kept ScheduleReader registered, which made the publishes test fail for "no handler" reasons rather than the real bug. - Bug_2941_document_exists_scheduled_cascade.cs (Marten + Polecat parallels): pin the same contract for [DocumentExists<T>]/[DocumentDoesNotExist<T>]. Negative-control verified locally: temporarily disabling the DocumentExists branch of the CanApply fix makes both tests fail (2/2). - Bug_2941_read_aggregate_scheduled_cascade.cs (Polecat parallel of the Marten Bug_aggregate test): pins the [ReadAggregate] contract on the Polecat side. Local verification: Marten Bugs sweep 45/45 (full Postgres-backed integration tests including the new scheduled-cascade pins) + full `wolverine.slnx -c Release` clean (0 warnings, 0 errors). The Polecat tests build cleanly; local SQL Server 2025 emulation on Apple Silicon is too slow to validate them locally (existing Polecat durable tests time out the same way against this image), so they will be validated by CI's native Linux SQL Server. ## Out of scope (follow-up) The IMartenDataRequirement / IPolecatDataRequirement return-value continuation strategies create MartenDataRequirementFrame / PolecatDataRequirementFrame (AsyncFrames using IDocumentSession) via a different code path (handler return value, not attribute injection). They could in theory have a parallel issue when combined with a scheduled cascade, but require a separate test surface. Co-Authored-By: Jur Balledux <JurJean@users.noreply.github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ng upstream fix Reverts the diagnostic; restores the proper CanApply attribute-detection in PolecatPersistenceFrameProvider with a comment explaining that the Wolverine-side fix alone is insufficient on Polecat. Polecat 4.1.1's DocumentSessionBase.SaveChangesAsync (DocumentSessionBase.cs:312) early-returns when _workTracker.HasOutstandingWork() is false. _workTracker tracks document operations + event streams, not transaction participants. So a chain that has only added a StoreIncomingEnvelopeParticipant via Session.StoreIncoming(...) - the path PolecatEnvelopeTransaction.PersistIncomingAsync takes for scheduled cascades from [ReadAggregate] / [DocumentExists] handlers - never has its participant executed. Marten doesn't have this gap; its session.SaveChangesAsync flushes queued ops unconditionally. The Polecat scheduled-cascade tests for [ReadAggregate] and [DocumentExists]/ [DocumentDoesNotExist] are temporarily Skip'd with a reason pointing at the upstream Polecat issue. Unskip when a fixed Polecat version is pinned.
This was referenced May 28, 2026
Closed
outofrange-consulting
pushed a commit
to outofrange-consulting/wolverine
that referenced
this pull request
May 28, 2026
…scade tests Polecat 4.2.1 ships polecat#161, the upstream companion fix to JasperFxGH-2941. Its DocumentSessionBase.SaveChangesAsync no longer early-returns when only ITransactionParticipants are queued, so the StoreIncomingEnvelopeParticipant added via Session.StoreIncoming(...) inside PolecatEnvelopeTransaction. PersistIncomingAsync actually runs for scheduled cascades from [ReadAggregate] / [DocumentExists] handlers that don't write any documents. That lets the three previously [Skip]'d Polecat tests (added in JasperFx#2943 with a Skip reason pointing at polecat#161) exercise the end-to-end path on CI: - Bug_2941_read_aggregate_scheduled_cascade.read_aggregate_handler_schedules_its_cascading_message - Bug_2941_document_exists_scheduled_cascade.document_exists_handler_schedules_its_cascading_message - Bug_2941_document_exists_scheduled_cascade.document_does_not_exist_handler_schedules_its_cascading_message Also tightens the PolecatPersistenceFrameProvider.CanApply comment so it documents the pairing with polecat#161 / Polecat 4.2.1 rather than the 'necessary but not sufficient' caveat from the original commit. Local: full wolverine.slnx -c Release builds clean (0 warnings, 0 errors). Marten Bugs sweep 45/45 pass. Polecat tests will be validated by CI - the local SQL Server 2025 image on Apple Silicon is too slow to run them (existing Polecat durable tests time out the same way, unrelated to this change). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
outofrange-consulting
pushed a commit
to outofrange-consulting/wolverine
that referenced
this pull request
May 28, 2026
…JasperFx#2949) Closes JasperFx#2949. The flake symptom on three consecutive PRs (JasperFx#2943, JasperFx#2947, JasperFx#2948 in May 2026) was: System.TimeoutException : Timed out waiting for expected response Wolverine.Runtime.Agents.AgentsStarted for original message <id> of type Wolverine.Runtime.Agents.StartAgents with a configured timeout of 10000 milliseconds Tests affected: RavenDbTests.LeaderElection.leadership_election_compliance. take_over_leader_ship_if_leader_becomes_stale and .leader_switchover_between_nodes. Root cause: WolverineRuntime.Agents.InvokeAsync<T>(NodeDestination, IAgentCommand) had an asymmetric timeout - same-node calls got 30s, remote-node calls got 10s. The asymmetry is backwards: a remote request-reply traverses the control endpoint + serialization + network, so it should have AT LEAST as much budget as a same-node in-memory invocation, not less. Under load on shared GitHub runners that 10s was a real timing race on the cross-node leadership-takeover scenarios the failing tests exercise, where StartAgents -> AgentsStarted is sent from the new leader to a target node and the runner can stall just long enough for the reply not to land in time. Fix: align remote with same-node at 30s. This is the production constant, not a test-only setting - the same flake would (rarely) bite users on busy or slow-network multi-node Wolverine clusters whose leadership-takeover scenario hits the same StartAgents -> AgentsStarted ack. Same-node already accepts 30s so the change is conservatively in-line with existing precedent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #2941. Builds on @JurJean's repro from PR #2941 (test cherry-picked + extended).
The bug (verified)
A handler whose only Marten/Polecat dependency comes from an attribute -
[ReadAggregate],[DocumentExists<T>],[DocumentDoesNotExist<T>]- silently lost any cascadingDeliveryMessage<T>.DelayedFor(...)it emitted. Non-scheduled cascades from the same handlers were unaffected (they take Wolverine's in-memory delivery path), so the bug only surfaces when scheduling.TrackedSession trace before the fix:
The reporter saw it directly:
StoreIncomingEnvelopeshows up in the unit of work but is never saved.Root cause (Wolverine side)
AutoApplyTransactionsdecides whichIPersistenceFrameProviderto apply by callingCanApply, which querieschain.ServiceDependencies()forIDocumentSession.Chain.serviceDependencies(src/Wolverine/Configuration/Chain.cs:276) only walksMiddleware.OfType<MethodCall>()- other frame types are not inspected.[ReadAggregate]injectsFetchLatestAggregateFrame(AsyncFrame).[DocumentExists<T>]/[DocumentDoesNotExist<T>]injectDocumentExistenceCheckFrame(AsyncFrame). Both depend onIDocumentSessionbut neither is aMethodCall, so the dependency is invisible toServiceDependencies.WolverineParameterAttribute.ModifyandModifyChainAttribute.Modifyboth run lazily inHandlerChain.applyCustomizations- long afterAutoApplyTransactionshas evaluatedCanApply. So self-declaring the dependency from insideModifyis too late.Result:
CanApplyreturns false, noDocumentSessionSaveChangespostprocessor is attached, andMartenEnvelopeTransaction.PersistIncomingAsync/ its Polecat equivalent queueStoreIncoming(...)on the session that is never flushed.[WriteAggregate]and[BoundaryModel]do not need this fix - theirModify()paths explicitly callApplyTransactionSupportthemselves (AggregateHandling.Applyline 43;BoundaryModelAttribute.cs:86). Saga chains hit theSagaChainshort-circuit inCanApply.Polecat side companion fix (shipped in Polecat 4.2.1)
Diagnosing this surfaced a second bug on the Polecat side:
DocumentSessionBase.SaveChangesAsyncearly-returned when_workTracker.HasOutstandingWork()was false._workTrackertracks document ops + event streams - it does NOT includeITransactionParticipants. Wolverine.Polecat'sSession.StoreIncoming(...)adds aStoreIncomingEnvelopeParticipantwhoseBeforeCommitAsyncwrites the scheduled inbox row; if that's the only thing on the session (handler had no doc ops),SaveChangesAsyncskipped the entire pipeline and the participant'sBeforeCommitAsyncnever ran. Marten doesn't have this gap.Fixed upstream in JasperFx/polecat#161, shipped as Polecat 4.2.1. This PR bumps the Polecat pin to 4.2.1 so the Wolverine-side
CanApplyfix actually closes the loop.Fix (this PR, Wolverine side)
Direct attribute detection in
MartenPersistenceFrameProvider.CanApplyandPolecatPersistenceFrameProvider.CanApply. Walks handler-method parameters for[ReadAggregate], and handler-method / handler-type / message-type attributes forDocumentExistsAttribute<>/DocumentDoesNotExistAttribute<>. Detection works by reflection on existing handler metadata, so it does not depend on attributeModify()having run yet.Also bumps
Polecat4.1.1 → 4.2.1 inDirectory.Packages.propsto pick up polecat#161.Tests + verification
[ReadAggregate]Bug_aggregate_should_still_publish(cherry-picked from #2941; fixed the test bug ofPublishReadermissing fromIncludeType)[DocumentExists<T>]/[DocumentDoesNotExist<T>]Bug_2941_document_exists_scheduled_cascade(new)[ReadAggregate]Bug_2941_read_aggregate_scheduled_cascade(new)[Skip]'d in earlier commits pending polecat#161; unskipped now that Polecat 4.2.1 ships the upstream companion fix)[DocumentExists<T>]/[DocumentDoesNotExist<T>]Bug_2941_document_exists_scheduled_cascade(new)Full Marten Bugs sweep: 45/45 pass. Full
dotnet build wolverine.slnx -c Releaseclean (0 warnings, 0 errors).Out of scope (follow-up worth tracking separately)
The
IMartenDataRequirement/IPolecatDataRequirementreturn-value continuation strategies createMartenDataRequirementFrame/PolecatDataRequirementFrame(AsyncFrames usingIDocumentSession) via a different code path - they are produced by*DataRequirementContinuationStrategy.TryFindContinuationHandlerwhen a handler returnsI*DataRequirement. The same scheduled-cascade-loss class could in theory occur there too; would want its own test scenarios in a follow-up.🤖 Generated with Claude Code