Skip to content

Revert Task.Yield() from AsyncWriteJournal and SnapshotStore#8188

Merged
Aaronontheweb merged 2 commits into
akkadotnet:v1.5from
Aaronontheweb:fix/revert-task-yield-8163
Apr 26, 2026
Merged

Revert Task.Yield() from AsyncWriteJournal and SnapshotStore#8188
Aaronontheweb merged 2 commits into
akkadotnet:v1.5from
Aaronontheweb:fix/revert-task-yield-8163

Conversation

@Aaronontheweb
Copy link
Copy Markdown
Member

Summary

Why

PR #8163 added await Task.Yield() before calling WriteMessagesAsync/SaveAsync inside their circuit breaker lambdas to move serialization off the actor thread (~45% throughput gain in benchmarks). However, this silently broke the implicit contract that persistence plugins rely on — that the synchronous preamble of these methods executes in actor context.

Affected plugins:

  • Akka.Persistence.Sql / Akka.Persistence.EventStore — access Self inside WriteMessagesAsyncNotSupportedException off actor thread; use Dictionary<string, Task> for write tracking → concurrent access race conditions
  • Akka.Persistence.Redis — sends Tell to subscribers after writes complete → accesses shared actor state off-thread

Irony: The plugins that benefit most from off-thread serialization (MongoDB, Azure Table Storage — they serialize on the actor thread before any async boundary) don't access actor context at all. The plugins that break (SQL, EventStore, Redis) already do serialization off-thread in their async pipelines and gain little from the Task.Yield().

A future version may reintroduce this optimization with a more targeted approach (opt-in property, Template Method pattern) that preserves the plugin threading contract.

Test plan

  • dotnet build src/core/Akka.Persistence/Akka.Persistence.csproj -c Release — compiles
  • Akka.Persistence.Tests — 285 passed, 0 failed (net10.0)
  • Akka.Persistence.TestKit.Tests — 37 passed, 0 failed (net10.0)
  • Health check specs still pass (kept from Ensure WriteMessagesAsync/SaveAsync is called asynchronously in Async… #8163, not reverted)
  • API approval tests — 18 pre-existing failures on net48 unrelated to this change (Verify/PublicApiGenerator tooling mismatch with .NET 10 SDK)

…napshotStore (akkadotnet#8163)

This reverts the Task.Yield() additions from PR akkadotnet#8163 in AsyncWriteJournal.ExecuteBatch
and SnapshotStore.ReceiveSnapshotStore, while preserving the health check test
improvements from that same PR.

PR akkadotnet#8163 added `await Task.Yield()` before calling `WriteMessagesAsync` and `SaveAsync`
inside their respective circuit breaker lambdas. The intent was to move expensive byte
serialization off the actor's message-processing thread, which showed ~45% throughput
improvement in benchmarks.

However, this silently broke the implicit contract that persistence plugins relied on:
that the synchronous preamble of `WriteMessagesAsync`/`SaveAsync` executes in actor
context. Moving execution to the thread pool caused:

1. Plugins that access `Self` inside `WriteMessagesAsync` (e.g. Akka.Persistence.Sql,
   Akka.Persistence.EventStore) throw `NotSupportedException` because there is no
   active ActorContext on a thread pool thread.

2. Plugins that use non-thread-safe collections like `Dictionary<string, Task>` for
   write tracking (e.g. Akka.Persistence.Sql, Akka.Persistence.EventStore) are now
   subject to concurrent access from both the actor thread and thread pool threads,
   causing `InvalidOperationException` or silent data corruption.

3. Plugins that send messages to subscribers after writes complete (e.g.
   Akka.Persistence.Redis) access shared actor state off the actor thread.

The change was too blunt an instrument — it applied uniformly to all plugins via the
base class, removing their ability to do any actor-thread setup before async work begins.
Ironically, the plugins that benefit most from off-thread serialization (MongoDB, Azure
Table Storage) don't access actor context at all, while the plugins that break (SQL,
EventStore, Redis) already perform serialization off-thread in their async pipelines.

A future version may reintroduce this optimization with a more surgical approach
(e.g. opt-in property or Template Method pattern) that preserves the plugin threading
contract.
@Aaronontheweb Aaronontheweb enabled auto-merge (rebase) April 25, 2026 23:49
@Aaronontheweb Aaronontheweb disabled auto-merge April 26, 2026 00:24
@Aaronontheweb Aaronontheweb merged commit 2f057e4 into akkadotnet:v1.5 Apr 26, 2026
9 of 11 checks passed
@Aaronontheweb Aaronontheweb deleted the fix/revert-task-yield-8163 branch April 26, 2026 00:25
@Aaronontheweb Aaronontheweb added this to the 1.5.67 milestone Apr 26, 2026
@to11mtm
Copy link
Copy Markdown
Member

to11mtm commented Apr 26, 2026

FWIW I should note that at least on my branch of akka.Persistence.Sql, after applying fixes to make 1.5.66 'work', 1.5.64 is still faster on the local benchmarks for heavy write load....

1.5.64:

Workers: 200, Mean: 494.47 ms, Standard Deviation: 63.64 ms, Min: 352.82 ms, Q1: 449.25 ms, Median: 487.03 ms, Q3: 554.37 ms, Max: 634.85 ms
Mean throughput: 20.22 msg/s/actor, Mean total throughput: 4044.75 msg/s
Median throughput: 20.53 msg/s/actor, Median total throughput: 4106.50 msg/s

Workers: 200, Mean: 515.09 ms, Standard Deviation: 63.79 ms, Min: 382.05 ms, Q1: 452.50 ms, Median: 524.29 ms, Q3: 564.01 ms, Max: 646.60 ms
Mean throughput: 19.41 msg/s/actor, Mean total throughput: 3882.78 msg/s
Median throughput: 19.07 msg/s/actor, Median total throughput: 3814.69 msg/s

1.5.66:

Workers: 200, Mean: 618.94 ms, Standard Deviation: 29.32 ms, Min: 560.64 ms, Q1: 598.94 ms, Median: 620.05 ms, Q3: 636.78 ms, Max: 682.92 ms
Mean throughput: 16.16 msg/s/actor, Mean total throughput: 3231.36 msg/s
Median throughput: 16.13 msg/s/actor, Median total throughput: 3225.55 msg/s

Workers: 200, Mean: 608.23 ms, Standard Deviation: 30.15 ms, Min: 540.70 ms, Q1: 582.21 ms, Median: 612.55 ms, Q3: 629.21 ms, Max: 678.94 ms
Mean throughput: 16.44 msg/s/actor, Mean total throughput: 3288.23 msg/s
Median throughput: 16.33 msg/s/actor, Median total throughput: 3265.04 msg/s

Leaving these as breadcrumbs before I lose them to confirm if the Task.Yield caused the perf regression or something else...

@to11mtm
Copy link
Copy Markdown
Member

to11mtm commented Apr 26, 2026

1.5.57:

Workers: 200, Mean: 463.85 ms, Standard Deviation: 66.68 ms, Min: 314.78 ms, Q1: 420.04 ms, Median: 462.30 ms, Q3: 510.99 ms, Max: 596.72 ms
Mean throughput: 21.56 msg/s/actor, Mean total throughput: 4311.72 msg/s
Median throughput: 21.63 msg/s/actor, Median total throughput: 4326.20 msg/s

@Aaronontheweb
Copy link
Copy Markdown
Member Author

Good data @to11mtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants