Port Pekko ShardStopped handler + handoff safety net by Aaronontheweb · Pull Request #8055 · akkadotnet/akka.net

Aaronontheweb · 2026-02-25T04:21:01Z

Summary

Fixes #7500 - Shards can fail to HandOff indefinitely during scale-up events.

Root cause: When a RebalanceWorker times out (single 60s timer covers both BeginHandOffAck + ShardStopped phases), the ShardStopped message from HandOffStopper goes to the dead RebalanceWorker and is lost. The coordinator never deallocates the shard, so entity traffic triggers GetShardHome → ShardHome → shard recreation → repeat (10-30 minutes of unhandled HandOff messages).

Changes:

- ShardCoordinator.cs: Added ShardStopped handler to Active() (ported from Pekko). Cleans up _unAckedHostShards and performs late deallocation when no RebalanceWorker is active for the shard (!_rebalanceInProgress.ContainsKey). This is the safety net: if the worker already timed out, the coordinator can still deallocate and reallocate the shard.
- ShardRegion.cs: HandleTerminated() now sends a backup ShardStopped to the coordinator when a handoff completes. This ensures the coordinator receives the stop notification even when the RebalanceWorker has already timed out. The coordinator handler is idempotent: if a rebalance is still in progress, the deallocation is skipped (the worker handles it).
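
The guard described in the two changes above can be sketched with a small, self-contained model. This is not the Akka.NET source: `CoordinatorModel`, `OnShardStopped`, and the plain collections are hypothetical stand-ins for the coordinator's state, the `_unAckedHostShards` cleanup is left out, and only the guard condition mirrors the diff in this PR.

```csharp
using System;
using System.Collections.Generic;

// Simplified, hypothetical model of the coordinator's bookkeeping.
// It is NOT the Akka.NET ShardCoordinator; it only mirrors the guard from this PR.
public sealed class CoordinatorModel
{
    // Shard id -> region currently hosting it (stand-in for State.Shards).
    public Dictionary<string, string> Shards { get; } = new();

    // Shards that still have a live RebalanceWorker (stand-in for _rebalanceInProgress).
    public HashSet<string> RebalanceInProgress { get; } = new();

    // Returns true only when the late deallocation (the safety net) actually ran.
    public bool OnShardStopped(string shard)
    {
        // A live worker will deallocate the shard itself, so skip the deallocation.
        if (RebalanceInProgress.Contains(shard)) return false;

        // The worker already timed out: deallocate so the shard can be reallocated.
        // Remove returns false if the shard was already deallocated.
        return Shards.Remove(shard);
    }
}

public static class Program
{
    public static void Main()
    {
        var coordinator = new CoordinatorModel();
        coordinator.Shards["shard-1"] = "region-A";
        coordinator.RebalanceInProgress.Add("shard-1");

        // Worker still alive: the backup ShardStopped does not deallocate.
        Console.WriteLine(coordinator.OnShardStopped("shard-1")); // False

        // Worker timed out: the backup ShardStopped triggers deallocation.
        coordinator.RebalanceInProgress.Remove("shard-1");
        Console.WriteLine(coordinator.OnShardStopped("shard-1")); // True

        // A duplicate ShardStopped is harmless.
        Console.WriteLine(coordinator.OnShardStopped("shard-1")); // False
    }
}
```

In this model the backup message is safe to send on every completed handoff, because the only state change is gated on the absence of a live worker and on the shard still being allocated.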

Test plan

- dotnet build -c Release -warnaserror: 0 warnings, 0 errors
- Akka.Cluster.Sharding.Tests: 188/190 passed (the 2 failures are pre-existing RememberEntitiesStarterSpec flakes, unrelated)
- ShardStopped is an internal message, so there are no public API surface changes
- Manual verification: coordinator correctly deallocates shards that complete handoff after RebalanceWorker timeout

Shards can fail to HandOff indefinitely during scale-up when the RebalanceWorker times out before receiving ShardStopped. The coordinator never deallocates the shard, causing an endless GetShardHome/ShardHome loop.

- Add ShardStopped handler to ShardCoordinator.Active() (Pekko port): cleans up unAckedHostShards and performs late deallocation when no rebalance is in progress for the shard
- ShardRegion sends backup ShardStopped to coordinator on handoff completion, ensuring the coordinator learns about it even when the RebalanceWorker has already timed out

Aaronontheweb

Detailed my changes

Aaronontheweb · 2026-02-25T04:37:45Z

+                    // Safety net: if no rebalance is in progress for this shard (RebalanceWorker
+                    // already timed out), deallocate the shard so it can be reallocated elsewhere.
+                    // This prevents the shard from being endlessly recreated via GetShardHome/ShardHome.
+                    if (!_rebalanceInProgress.ContainsKey(m.Shard) && State.Shards.ContainsKey(m.Shard))


I have seen a ton of GetShardHome / ShardHome spam in my apps and I assumed it was Phobos' sharding metric polling responsible for that. Apparently this bug is also a big contributor.

Aaronontheweb · 2026-02-25T04:39:07Z

+                    // has already timed out and missed the ShardStopped from HandOffStopper.
+                    // The coordinator's Active handler will only deallocate if no rebalance
+                    // is currently in progress for this shard.
+                    _coordinator?.Tell(new ShardCoordinator.ShardStopped(shard));


This lets us double-tap the ShardStopped message handling in case the RebalanceWorker has already died.

* Add EventFilter + semantic logging test coverage (#8046)

  * Add EventFilter + semantic logging test coverage

    Add comprehensive test coverage for EventFilter matching against semantic log messages (named templates). Covers exact match, partial matchers, null handling, decimal precision, type variety, and end-to-end Logger.Info() scenarios through the real adapter path.

    - EventFilterSemanticLoggingTests: 26 integration tests for EventFilter with SemanticLogMessageFormatter including customer-reported scenario where converting from positional to named templates changed output
    - SemanticFormatterNullAndEdgeCaseSpecs: 32 unit tests for formatter edge cases (decimal trailing zeros, nullable types, null rendering, format specifiers, LogValues boxing)
    - Document null rendering asymmetry in SemanticLogMessageFormatter XML docs (named: "null", positional: "" via string.Format)

  * Add decimal precision mismatch reproduction test with workarounds

    Reproduces exact customer scenario where actor arithmetic produces 100.0m but test EventFilter uses 100m via string interpolation. Demonstrates three workarounds: contains: partial match, matching exact precision, and F2/F0 format specifiers in named templates.

  (cherry picked from commit f978022)

* [xUnit3] Bump xUnit version (#8052)

  (cherry picked from commit c5dd0a9)

* Downgrade VirtualPathContainer RemoveChild log from Warning to Debug (#8048)

  The /temp VirtualPathContainer logs "trying to remove non-child" at Warning level when ConcurrentDictionary.TryRemove returns false. This is expected, benign behavior during concurrent Ask operations: it simply means the entry was already removed. Logging it as Warning creates noise during load testing.

  Fixes #8037

  Co-authored-by: Aaron Stannard <aaron@petabridge.com>

  (cherry picked from commit 2226305)

* AppVersion.CompareTo missing else if breaks comparison symmetry (#8051)

  The second condition in the rest-string comparison was an 'if' instead of 'else if', causing the first branch's result (diff = 1) to be immediately overwritten by the else branch. This made release versions appear less than their pre-release counterparts (e.g. '1.2.0' < '1.2.0-M1'), violating IComparable<T> symmetry.

  This could cause non-deterministic shard allocation ordering during rolling updates from pre-release to release versions via AbstractLeastShardAllocationStrategy.

  Added regression test for the reverse comparison direction.

  Co-authored-by: Aaron Stannard <aaron@petabridge.com>

  (cherry picked from commit 71f0350)

* Fix Shard remember-entities flag mismatch causing entity restart failures (#8054)

  The Shard constructor derived _rememberEntities from whether a provider was passed, but passed settings.RememberEntities (from HOCON config) to the Entities class. When a provider was supplied without the config flag, the Entities._remembering set was never populated. This caused OnUpdateDone to see no pending work and overwrite the WaitingForRememberEntitiesStore behavior with Idle, dropping the subsequent UpdateDone from the store and preventing entity restarts.

  (cherry picked from commit d332236)

* Correct self-comparison in ShardCoordinator ResendShardHost handler (#8050)

  region.Equals(region) compared the variable to itself, making the condition always true and the else branch (shard reallocated to another region) unreachable dead code. Changed to region.Equals(m.Region) to compare the current region from state against the original region captured when the ResendShardHost timer was scheduled.

  Co-authored-by: Aaron Stannard <aaron@petabridge.com>

  (cherry picked from commit 8a9e6f0)

* Port Pekko ShardStopped handler + handoff safety net (#7500) (#8055)

  Shards can fail to HandOff indefinitely during scale-up when the RebalanceWorker times out before receiving ShardStopped. The coordinator never deallocates the shard, causing an endless GetShardHome/ShardHome loop.

  - Add ShardStopped handler to ShardCoordinator.Active() (Pekko port): cleans up unAckedHostShards and performs late deallocation when no rebalance is in progress for the shard
  - ShardRegion sends backup ShardStopped to coordinator on handoff completion, ensuring the coordinator learns about it even when the RebalanceWorker has already timed out

  (cherry picked from commit ff3c590)

* fix: correct format index in 3-arg LogInfo overload (#8056)

  The 3-arg LogInfo overload used {2} in its prefix, which resolved to arg3 instead of _selfAddress (at index 3). This caused log messages like 'Cluster Node [1.0.0]' instead of 'Cluster Node [akka://...]'.

  The 1-arg and 2-arg overloads correctly used {1} and {2} respectively to reference _selfAddress as the last argument.

  Also fixed xmldoc on arg3 parameter ('second' -> 'third').

  (cherry picked from commit 9f0cc64)

* fix: remove stray dollar signs from interpolated strings (#8057)

  Fix three string interpolation bugs where a $ prefix was combined with positional {0} formatting or produced literal $ in output:

  - ClusterHeartbeat.cs: [${string.Join...}] -> [{string.Join...}]
  - ShardRegion.cs: $"{0}: {qr}" -> $"{_typeName}: {qr}"
  - SinkRefImpl.cs: [${_promise.Task.Result}] and ${SubscriptionTimeout.Timeout}

  Add regression test for the HeartbeatNodeRing error message.

  Co-authored-by: Aaron Stannard <aaron@petabridge.com>

  (cherry picked from commit 920aad9)

* VectorClock inequality fixes (#8058)

  * fix: VectorClock != operator returns negation of == instead of IsConcurrent

    The != operator incorrectly delegated to IsConcurrentWith instead of being the logical negation of ==. This meant a != b returned false when one clock was Before or After the other, violating the fundamental contract that != is the negation of ==.

    Changed from IsConcurrentWith to !IsSameAs. Added regression assertions in existing comparison tests that cover Before and After relationships.

  * fix: VectorClock == and != operators handle null correctly

    Previously null == null returned false and something ==/!= null threw NullReferenceException. Now follows standard C# equality pattern: ReferenceEquals for both-null and either-null, then delegate to IsSameAs. The != operator delegates to !(left == right) to guarantee consistency.

  ---------

  Co-authored-by: Aaron Stannard <aaron@petabridge.com>

  (cherry picked from commit d8e2b1f)

* Fix wrong randomFactor argument type on RetrySupport.Retry() (#8061)

  ## Changes

  * Change randomFactor type from `int` to `double`
  * Modernize the thrown exception messages

  ---------

  Co-authored-by: Aaron Stannard <aaron@petabridge.com>

  (cherry picked from commit 5f02610)

* Update API Approval list for .net 4.8

---------

Co-authored-by: Aaron Stannard <aaron@petabridge.com>
Co-authored-by: Apoorv Darshan <ad13dtu@gmail.com>
Co-authored-by: Matt Kotsenas <51421+MattKotsenas@users.noreply.github.com>

Aaronontheweb added 2 commits February 24, 2026 22:20

Merge branch 'dev' into fix/shard-handoff-safety-net

0facf90

Aaronontheweb added the akka-cluster-sharding label Feb 25, 2026

Aaronontheweb enabled auto-merge (squash) February 25, 2026 04:39

Merge branch 'dev' into fix/shard-handoff-safety-net

21fb4cf

Aaronontheweb merged commit ff3c590 into akkadotnet:dev Feb 25, 2026
12 checks passed

Aaronontheweb deleted the fix/shard-handoff-safety-net branch February 25, 2026 18:21

Arkatufus mentioned this pull request Feb 26, 2026

Update RELEASE_NOTES.md for 1.5.61 release #8063

Merged

Aaronontheweb merged 3 commits into akkadotnet:dev from Aaronontheweb:fix/shard-handoff-safety-net
