Skip to content

[#4462] Don't classify caller-cancelled OCE as a daemon error in ForceAllMartenDaemonActivityToCatchUpAsync#4463

Merged
jeremydmiller merged 1 commit into
masterfrom
fix/4462-force-catch-up-cancellation-contract
May 18, 2026
Merged

[#4462] Don't classify caller-cancelled OCE as a daemon error in ForceAllMartenDaemonActivityToCatchUpAsync#4463
jeremydmiller merged 1 commit into
masterfrom
fix/4462-force-catch-up-cancellation-contract

Conversation

@jeremydmiller
Copy link
Copy Markdown
Member

Summary

Closes #4462. The force_catch_up_invokes_message_batch_lifecycle_with_custom_outbox test flaked in CI under timing pressure during the Phase 5 PR (#4461). Root cause is the ForceAllMartenDaemonActivityToCatchUpAsync contract treating a caller-initiated OperationCanceledException as if it were a daemon error.

The flake

Shouldly.ShouldAssertException : exceptions
    should be empty but had
1
    item and was
[System.AggregateException: One or more errors occurred. (The operation was canceled.)
 ---> System.OperationCanceledException: The operation was canceled.
   at Polly.Utils.ExceptionUtilities.TrySetStackTrace[T](T exception)
   ...
   at Marten.Events.TestingExtensions.ForceAllMartenDaemonActivityToCatchUpAsync(...)]

The test passes a 15s CancellationTokenSource to the helper. Under CI load, the per-shard catch-up pipeline (Polly-wrapped event-loader execution → SpinUpMessageBatchAsync → custom outbox lifecycle: CreateBatchBeforeCommitAfterCommit) doesn't finish in time. The CTS fires, the OCE propagates up through Polly, the catch (Exception) in ForceAllMartenDaemonActivityToCatchUpAsync grabs it, and it lands in the returned list. The test asserts exceptions.ShouldBeEmpty() and fails — the assertion message implies a daemon bug when the real signal was "the test ran out of time."

Fix

AsyncProjectionTestingExtensions.cs — both the primary-store and the <T> variants — get a new catch (OperationCanceledException) when (cancellation.IsCancellationRequested) guard that re-throws. Non-cancellation exceptions (and OCEs from sources other than the caller's CT) still flow into the returned list as before, preserving the documented "catch-and-list" contract for real daemon errors. Caller's own cancellations propagate as a normal cancellation signal (or TaskCanceledException on the test side), failing the test with a clear "timed out" message rather than a misleading assertion.

The Bug_4441 test:

  • Skip attribute removed.
  • Outer [Fact(Timeout = ...)] raised from 30s → 60s.
  • Inner CTS raised from 15s → 45s.

Two pieces of timing headroom because the custom-outbox variant has measurable extra hops (CreateBatch + per-shard BeforeCommit/AfterCommit) on top of the catch-up loop the sibling force_catch_up_returns_* test exercises.

Test plan

  • Build clean (0 errors)
  • Test ran locally 5 times back-to-back — all pass (~500-600ms each)
  • Full EventSourcingTests — 1333 pass, 5 pre-existing skips, 0 fail (net10.0) — Bug_4441 is back in the green count
  • CI matrix (the fix is most relevant to the slower Event Sourcing lanes where the original flake fired)

Closes #4462.

🤖 Generated with Claude Code

… a daemon error

ForceAllMartenDaemonActivityToCatchUpAsync's contract is "catch all
exceptions and return them as a list so callers can assert against them."
Pre-fix, an OperationCanceledException raised by the user-supplied
CancellationToken (test CTS firing mid-catch-up) flowed through the
catch-all block and landed in the returned list, then surfaced as a fake
"exceptions should be empty but had 1 item" assertion failure in tests
that wrap the call in a Shouldly assertion. The real signal here is
caller-initiated cancellation, not a daemon error.

Tighten both the primary-store and per-T variants to re-throw an OCE
when `cancellation.IsCancellationRequested` is true — caller's CT
cancellations propagate as a normal cancellation signal (or
TaskCanceledException on the test side, which fails the test with a
clear "timed out" message rather than a misleading assertion).
Non-cancellation exceptions and OCEs from sources other than the
caller's CT still flow into the returned list as before.

The Bug_4441 test that caught this in #4461's CI run gets the skip
removed, plus headroom — outer Fact timeout 30s→60s, CT 15s→45s — so
slow CI runners working through the custom-outbox lifecycle path
(CreateBatch → BeforeCommit → AfterCommit per shard, on top of the
per-shard catch-up loop) don't pre-cancel.

Closes #4462.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jeremydmiller jeremydmiller merged commit 47be076 into master May 18, 2026
6 checks passed
@jeremydmiller jeremydmiller deleted the fix/4462-force-catch-up-cancellation-contract branch May 18, 2026 12:00
jeremydmiller added a commit that referenced this pull request May 18, 2026
…nc's cancellation guard

#4463's OCE-rethrow guard only matched bare OperationCanceledException.
JasperFxAsyncDaemon.CatchUpAsync bundles per-shard cancellation
exceptions into an AggregateException before throwing, so a CI run
where the test's CTS fires partway through the catch-up still surfaces
as a misleading "exceptions should be empty but had 1 item" assertion
failure — the AggregateException wraps a single OperationCanceledException
(wrapping a Postgres 57014 'canceling statement due to user request')
but the outer type didn't match the guard.

Tighten the guard with an IsCallerCancellation helper that recursively
treats AggregateException-of-cancellations as caller cancellation too.
On the cancelled-CT path the helper paves over both the bare OCE and
the AggregateException cases, replaces both with a single
OperationCanceledException re-throw, and leaves genuine daemon
exceptions flowing into the list as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jeremydmiller added a commit that referenced this pull request May 18, 2026
#4463 added an OCE-rethrow guard, and `4f715820` extended it to unwrap
AggregateException-of-OCEs. Neither helped the latest CI flake:
https://github.com/JasperFx/marten/actions/runs/26035890428/job/76533776264?pr=4464

The new evidence: failure lands in 176 ms with the test's CTS set to 45s,
so `cancellation.IsCancellationRequested` is false in the catch handler
— the OCE isn't caller cancellation. It originates at
GroupedProjectionExecution.processRangeAsync line 192 →
range.Agent.ReportCriticalFailureAsync(e), where the per-shard internal
CTS gets cancelled during the batch build. The daemon's Recorder
captures it and JasperFxAsyncDaemon.CatchUpAsync line 716 re-throws as
AggregateException — by which point ForceAllMartenDaemonActivityToCatchUpAsync's
guards can't tell apart "user cancelled" from "shard's own state got
cancelled by its own lifecycle."

The real fix lives in the JasperFx.Events daemon's internal cancellation
contract (StopAllAsync followed by CatchUpAsync leaking a cancelled
per-shard CTS into the new agents under timing pressure). Tracking that
in #4462. The AggregateException unwrap stays in place as defense in
depth for the caller-cancellation path — it's still correct, just
insufficient on its own.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jeremydmiller added a commit that referenced this pull request May 18, 2026
…ody (#4464)

* [#4409] JSON-encode scalar projections when streaming to a response body

WriteArray / StreamJsonArray / WriteOne / StreamJsonOne against a scalar
projection (`.Select(x => x.SomeEnum)` under `EnumStorage.AsString`, or any
`.Select(x => x.StringProperty)`) used to emit raw bytes from Postgres
between the array brackets and commas. Postgres returns `data->>'X'` as
text (no JSON quoting), so the body landed as `[FooValue,BarValue]` —
invalid JSON, and downstream `System.Text.Json` reads blew up with
"'F' is an invalid start of a value". Numeric / boolean projections
happened to look like valid JSON, masking the bug.

Centralize per-row writes through a new `WriteJsonValueAsync` on
`NpgsqlDataReader` that branches on `GetDataTypeName`:
- `jsonb` / `json` columns — copy the field stream byte-for-byte with the
  existing SOH-skip (unchanged document-streaming behavior).
- DBNull — write the JSON `null` literal.
- Everything else — materialize the .NET value via
  `reader.GetFieldValueAsync<object>` and round-trip through
  `JsonSerializer.SerializeAsync` so strings get JSON-quoted and escaped
  (handling embedded `"`, `\`, control characters), numerics / booleans /
  dates get their JSON literal representation, and enums-as-string come
  out as `"FooValue"` instead of `FooValue`.

Routes the existing `StreamMany` / `StreamOne` extensions and the inline
copy in `OneResultHandler.StreamJson` through the new helper. Document
streaming paths (whole-doc jsonb column) keep the previous fast-path
verbatim — `WriteJsonValueAsync`'s first branch is the same SOH-skip
copy that was there before.

Closes #4409.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Unwrap AggregateException in ForceAllMartenDaemonActivityToCatchUpAsync's cancellation guard

#4463's OCE-rethrow guard only matched bare OperationCanceledException.
JasperFxAsyncDaemon.CatchUpAsync bundles per-shard cancellation
exceptions into an AggregateException before throwing, so a CI run
where the test's CTS fires partway through the catch-up still surfaces
as a misleading "exceptions should be empty but had 1 item" assertion
failure — the AggregateException wraps a single OperationCanceledException
(wrapping a Postgres 57014 'canceling statement due to user request')
but the outer type didn't match the guard.

Tighten the guard with an IsCallerCancellation helper that recursively
treats AggregateException-of-cancellations as caller cancellation too.
On the cancelled-CT path the helper paves over both the bare OCE and
the AggregateException cases, replaces both with a single
OperationCanceledException re-throw, and leaves genuine daemon
exceptions flowing into the list as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Re-skip Bug_4441 outbox-lifecycle test pending the deeper #4462 fix

#4463 added an OCE-rethrow guard, and `4f715820` extended it to unwrap
AggregateException-of-OCEs. Neither helped the latest CI flake:
https://github.com/JasperFx/marten/actions/runs/26035890428/job/76533776264?pr=4464

The new evidence: failure lands in 176 ms with the test's CTS set to 45s,
so `cancellation.IsCancellationRequested` is false in the catch handler
— the OCE isn't caller cancellation. It originates at
GroupedProjectionExecution.processRangeAsync line 192 →
range.Agent.ReportCriticalFailureAsync(e), where the per-shard internal
CTS gets cancelled during the batch build. The daemon's Recorder
captures it and JasperFxAsyncDaemon.CatchUpAsync line 716 re-throws as
AggregateException — by which point ForceAllMartenDaemonActivityToCatchUpAsync's
guards can't tell apart "user cancelled" from "shard's own state got
cancelled by its own lifecycle."

The real fix lives in the JasperFx.Events daemon's internal cancellation
contract (StopAllAsync followed by CatchUpAsync leaking a cancelled
per-shard CTS into the new agents under timing pressure). Tracking that
in #4462. The AggregateException unwrap stays in place as defense in
depth for the caller-cancellation path — it's still correct, just
insufficient on its own.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jeremydmiller added a commit that referenced this pull request May 19, 2026
JasperFx PR #285 ("Stop daemon-internal cancellations from leaking
into CatchUpAsync") shipped in JasperFx.Events 2.0.0-alpha.11 and
is consumed by Marten's current alpha.13 pin (#4475 Phase 2). The
daemon-internal cancellation flake that motivated the re-skip in
#4462/#4463 is gone — confirmed locally 5/5 on both net9.0 and
net10.0.

Test-only change. Closes #4466.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: Bug_4441_force_catch_up_with_outbox.force_catch_up_invokes_message_batch_lifecycle_with_custom_outbox

1 participant