[#4462] Don't classify caller-cancelled OCE as a daemon error in ForceAllMartenDaemonActivityToCatchUpAsync#4463
Merged
jeremydmiller merged 1 commit intoMay 18, 2026
Conversation
… a daemon error ForceAllMartenDaemonActivityToCatchUpAsync's contract is "catch all exceptions and return them as a list so callers can assert against them." Pre-fix, an OperationCanceledException raised by the user-supplied CancellationToken (test CTS firing mid-catch-up) flowed through the catch-all block and landed in the returned list, then surfaced as a fake "exceptions should be empty but had 1 item" assertion failure in tests that wrap the call in a Shouldly assertion. The real signal here is caller-initiated cancellation, not a daemon error. Tighten both the primary-store and per-T variants to re-throw an OCE when `cancellation.IsCancellationRequested` is true — caller's CT cancellations propagate as a normal cancellation signal (or TaskCanceledException on the test side, which fails the test with a clear "timed out" message rather than a misleading assertion). Non-cancellation exceptions and OCEs from sources other than the caller's CT still flow into the returned list as before. The Bug_4441 test that caught this in #4461's CI run gets the skip removed, plus headroom — outer Fact timeout 30s→60s, CT 15s→45s — so slow CI runners working through the custom-outbox lifecycle path (CreateBatch → BeforeCommit → AfterCommit per shard, on top of the per-shard catch-up loop) don't pre-cancel. Closes #4462. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jeremydmiller
added a commit
that referenced
this pull request
May 18, 2026
…nc's cancellation guard #4463's OCE-rethrow guard only matched bare OperationCanceledException. JasperFxAsyncDaemon.CatchUpAsync bundles per-shard cancellation exceptions into an AggregateException before throwing, so a CI run where the test's CTS fires partway through the catch-up still surfaces as a misleading "exceptions should be empty but had 1 item" assertion failure — the AggregateException wraps a single OperationCanceledException (wrapping a Postgres 57014 'canceling statement due to user request') but the outer type didn't match the guard. Tighten the guard with an IsCallerCancellation helper that recursively treats AggregateException-of-cancellations as caller cancellation too. On the cancelled-CT path the helper paves over both the bare OCE and the AggregateException cases, replaces both with a single OperationCanceledException re-throw, and leaves genuine daemon exceptions flowing into the list as before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jeremydmiller
added a commit
that referenced
this pull request
May 18, 2026
#4463 added an OCE-rethrow guard, and `4f715820` extended it to unwrap AggregateException-of-OCEs. Neither helped the latest CI flake: https://github.com/JasperFx/marten/actions/runs/26035890428/job/76533776264?pr=4464 The new evidence: failure lands in 176 ms with the test's CTS set to 45s, so `cancellation.IsCancellationRequested` is false in the catch handler — the OCE isn't caller cancellation. It originates at GroupedProjectionExecution.processRangeAsync line 192 → range.Agent.ReportCriticalFailureAsync(e), where the per-shard internal CTS gets cancelled during the batch build. The daemon's Recorder captures it and JasperFxAsyncDaemon.CatchUpAsync line 716 re-throws as AggregateException — by which point ForceAllMartenDaemonActivityToCatchUpAsync's guards can't tell apart "user cancelled" from "shard's own state got cancelled by its own lifecycle." The real fix lives in the JasperFx.Events daemon's internal cancellation contract (StopAllAsync followed by CatchUpAsync leaking a cancelled per-shard CTS into the new agents under timing pressure). Tracking that in #4462. The AggregateException unwrap stays in place as defense in depth for the caller-cancellation path — it's still correct, just insufficient on its own. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jeremydmiller
added a commit
that referenced
this pull request
May 18, 2026
…ody (#4464) * [#4409] JSON-encode scalar projections when streaming to a response body WriteArray / StreamJsonArray / WriteOne / StreamJsonOne against a scalar projection (`.Select(x => x.SomeEnum)` under `EnumStorage.AsString`, or any `.Select(x => x.StringProperty)`) used to emit raw bytes from Postgres between the array brackets and commas. Postgres returns `data->>'X'` as text (no JSON quoting), so the body landed as `[FooValue,BarValue]` — invalid JSON, and downstream `System.Text.Json` reads blew up with "'F' is an invalid start of a value". Numeric / boolean projections happened to look like valid JSON, masking the bug. Centralize per-row writes through a new `WriteJsonValueAsync` on `NpgsqlDataReader` that branches on `GetDataTypeName`: - `jsonb` / `json` columns — copy the field stream byte-for-byte with the existing SOH-skip (unchanged document-streaming behavior). - DBNull — write the JSON `null` literal. - Everything else — materialize the .NET value via `reader.GetFieldValueAsync<object>` and round-trip through `JsonSerializer.SerializeAsync` so strings get JSON-quoted and escaped (handling embedded `"`, `\`, control characters), numerics / booleans / dates get their JSON literal representation, and enums-as-string come out as `"FooValue"` instead of `FooValue`. Routes the existing `StreamMany` / `StreamOne` extensions and the inline copy in `OneResultHandler.StreamJson` through the new helper. Document streaming paths (whole-doc jsonb column) keep the previous fast-path verbatim — `WriteJsonValueAsync`'s first branch is the same SOH-skip copy that was there before. Closes #4409. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Unwrap AggregateException in ForceAllMartenDaemonActivityToCatchUpAsync's cancellation guard #4463's OCE-rethrow guard only matched bare OperationCanceledException. JasperFxAsyncDaemon.CatchUpAsync bundles per-shard cancellation exceptions into an AggregateException before throwing, so a CI run where the test's CTS fires partway through the catch-up still surfaces as a misleading "exceptions should be empty but had 1 item" assertion failure — the AggregateException wraps a single OperationCanceledException (wrapping a Postgres 57014 'canceling statement due to user request') but the outer type didn't match the guard. Tighten the guard with an IsCallerCancellation helper that recursively treats AggregateException-of-cancellations as caller cancellation too. On the cancelled-CT path the helper paves over both the bare OCE and the AggregateException cases, replaces both with a single OperationCanceledException re-throw, and leaves genuine daemon exceptions flowing into the list as before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Re-skip Bug_4441 outbox-lifecycle test pending the deeper #4462 fix #4463 added an OCE-rethrow guard, and `4f715820` extended it to unwrap AggregateException-of-OCEs. Neither helped the latest CI flake: https://github.com/JasperFx/marten/actions/runs/26035890428/job/76533776264?pr=4464 The new evidence: failure lands in 176 ms with the test's CTS set to 45s, so `cancellation.IsCancellationRequested` is false in the catch handler — the OCE isn't caller cancellation. It originates at GroupedProjectionExecution.processRangeAsync line 192 → range.Agent.ReportCriticalFailureAsync(e), where the per-shard internal CTS gets cancelled during the batch build. The daemon's Recorder captures it and JasperFxAsyncDaemon.CatchUpAsync line 716 re-throws as AggregateException — by which point ForceAllMartenDaemonActivityToCatchUpAsync's guards can't tell apart "user cancelled" from "shard's own state got cancelled by its own lifecycle." The real fix lives in the JasperFx.Events daemon's internal cancellation contract (StopAllAsync followed by CatchUpAsync leaking a cancelled per-shard CTS into the new agents under timing pressure). Tracking that in #4462. The AggregateException unwrap stays in place as defense in depth for the caller-cancellation path — it's still correct, just insufficient on its own. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2 tasks
jeremydmiller
added a commit
that referenced
this pull request
May 19, 2026
JasperFx PR #285 ("Stop daemon-internal cancellations from leaking into CatchUpAsync") shipped in JasperFx.Events 2.0.0-alpha.11 and is consumed by Marten's current alpha.13 pin (#4475 Phase 2). The daemon-internal cancellation flake that motivated the re-skip in #4462/#4463 is gone — confirmed locally 5/5 on both net9.0 and net10.0. Test-only change. Closes #4466. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #4462. The
force_catch_up_invokes_message_batch_lifecycle_with_custom_outboxtest flaked in CI under timing pressure during the Phase 5 PR (#4461). Root cause is theForceAllMartenDaemonActivityToCatchUpAsynccontract treating a caller-initiatedOperationCanceledExceptionas if it were a daemon error.The flake
The test passes a 15s
CancellationTokenSourceto the helper. Under CI load, the per-shard catch-up pipeline (Polly-wrapped event-loader execution →SpinUpMessageBatchAsync→ custom outbox lifecycle:CreateBatch→BeforeCommit→AfterCommit) doesn't finish in time. The CTS fires, the OCE propagates up through Polly, thecatch (Exception)inForceAllMartenDaemonActivityToCatchUpAsyncgrabs it, and it lands in the returned list. The test assertsexceptions.ShouldBeEmpty()and fails — the assertion message implies a daemon bug when the real signal was "the test ran out of time."Fix
AsyncProjectionTestingExtensions.cs— both the primary-store and the<T>variants — get a newcatch (OperationCanceledException) when (cancellation.IsCancellationRequested)guard that re-throws. Non-cancellation exceptions (and OCEs from sources other than the caller's CT) still flow into the returned list as before, preserving the documented "catch-and-list" contract for real daemon errors. Caller's own cancellations propagate as a normal cancellation signal (orTaskCanceledExceptionon the test side), failing the test with a clear "timed out" message rather than a misleading assertion.The Bug_4441 test:
[Fact(Timeout = ...)]raised from 30s → 60s.Two pieces of timing headroom because the custom-outbox variant has measurable extra hops (
CreateBatch+ per-shardBeforeCommit/AfterCommit) on top of the catch-up loop the siblingforce_catch_up_returns_*test exercises.Test plan
EventSourcingTests— 1333 pass, 5 pre-existing skips, 0 fail (net10.0) — Bug_4441 is back in the green countCloses #4462.
🤖 Generated with Claude Code