Port Pekko ShardStopped handler + handoff safety net #8055
Merged
Aaronontheweb merged 3 commits on Feb 25, 2026
Shards can fail to HandOff indefinitely during scale-up when the RebalanceWorker times out before receiving ShardStopped. The coordinator never deallocates the shard, causing an endless GetShardHome/ShardHome loop.

- Add ShardStopped handler to ShardCoordinator.Active() (Pekko port): cleans up unAckedHostShards and performs late deallocation when no rebalance is in progress for the shard
- ShardRegion sends backup ShardStopped to coordinator on handoff completion, ensuring the coordinator learns about it even when the RebalanceWorker has already timed out
Aaronontheweb commented Feb 25, 2026
Aaronontheweb left a comment:
Detailed my changes
// Safety net: if no rebalance is in progress for this shard (RebalanceWorker
// already timed out), deallocate the shard so it can be reallocated elsewhere.
// This prevents the shard from being endlessly recreated via GetShardHome/ShardHome.
if (!_rebalanceInProgress.ContainsKey(m.Shard) && State.Shards.ContainsKey(m.Shard))
I have seen a ton of GetShardHome / ShardHome spam in my apps and I assumed it was Phobos' sharding metric polling responsible for that. Apparently this bug is also a big contributor.
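The coordinator-side decision in the reviewed snippet can be sketched without Akka.NET at all. The class below is a hypothetical, simplified model: `CoordinatorModel`, its collections, and `OnShardStopped` are illustrative names, not the real `ShardCoordinator` members. It only mirrors the condition shown in the diff: deallocate late when no rebalance is tracked for the shard and the shard is still allocated.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the coordinator's bookkeeping (not the Akka.NET type).
public sealed class CoordinatorModel
{
    // Shard id -> region currently recorded as hosting it.
    public Dictionary<string, string> Shards { get; } = new Dictionary<string, string>();

    // Shards that still have an active rebalance worker.
    public HashSet<string> RebalanceInProgress { get; } = new HashSet<string>();

    // Shards whose host request has not been acknowledged yet.
    public HashSet<string> UnAckedHostShards { get; } = new HashSet<string>();

    // Returns true only when the late-deallocation (safety net) path ran.
    public bool OnShardStopped(string shard)
    {
        // Always clear the pending acknowledgement for this shard.
        UnAckedHostShards.Remove(shard);

        // Safety net: no worker is tracking this shard, yet it is still allocated.
        if (!RebalanceInProgress.Contains(shard) && Shards.ContainsKey(shard))
        {
            Shards.Remove(shard);
            return true;
        }

        // Either a worker is still responsible, or the shard is already gone.
        return false;
    }
}
```

In this model a second `OnShardStopped` for the same shard is a no-op, which is the property that makes a duplicate notification safe.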
// has already timed out and missed the ShardStopped from HandOffStopper.
// The coordinator's Active handler will only deallocate if no rebalance
// is currently in progress for this shard.
_coordinator?.Tell(new ShardCoordinator.ShardStopped(shard));
This lets us double-tap the ShardStopped message handling in case the RebalanceWorker has already died.
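A minimal sketch of the region-side backup notification, assuming a plain delegate in place of the coordinator's actor reference. `RegionModel`, `BeginHandOff`, and the `Action<string>` callback are hypothetical and exist only for illustration. The null-conditional call mirrors the `_coordinator?.Tell(...)` in the diff: nothing is sent when the coordinator is not known yet.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the region's handoff tracking (not the Akka.NET type).
public sealed class RegionModel
{
    private readonly HashSet<string> _handingOff = new HashSet<string>();

    // Stand-in for the coordinator reference; null means "not known yet".
    private readonly Action<string>? _coordinator;

    public RegionModel(Action<string>? coordinator) => _coordinator = coordinator;

    // Records that a handoff has started for the given shard.
    public void BeginHandOff(string shard) => _handingOff.Add(shard);

    // Called when a shard terminates. Returns true if this completed a handoff.
    public bool HandleTerminated(string shard)
    {
        if (!_handingOff.Remove(shard))
            return false; // Termination was not part of a handoff.

        // Backup notification: sent regardless of what the worker saw.
        _coordinator?.Invoke(shard);
        return true;
    }
}
```

The backup is sent once per completed handoff, and a region without a coordinator reference simply skips it.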
Arkatufus pushed a commit to Arkatufus/akka.net that referenced this pull request on Feb 26, 2026
akkadotnet#8055)

Shards can fail to HandOff indefinitely during scale-up when the RebalanceWorker times out before receiving ShardStopped. The coordinator never deallocates the shard, causing an endless GetShardHome/ShardHome loop.

- Add ShardStopped handler to ShardCoordinator.Active() (Pekko port): cleans up unAckedHostShards and performs late deallocation when no rebalance is in progress for the shard
- ShardRegion sends backup ShardStopped to coordinator on handoff completion, ensuring the coordinator learns about it even when the RebalanceWorker has already timed out

(cherry picked from commit ff3c590)
Aaronontheweb added a commit that referenced this pull request on Feb 26, 2026
* Add EventFilter + semantic logging test coverage (#8046)

  * Add EventFilter + semantic logging test coverage

    Add comprehensive test coverage for EventFilter matching against semantic log messages (named templates). Covers exact match, partial matchers, null handling, decimal precision, type variety, and end-to-end Logger.Info() scenarios through the real adapter path.

    - EventFilterSemanticLoggingTests: 26 integration tests for EventFilter with SemanticLogMessageFormatter including customer-reported scenario where converting from positional to named templates changed output
    - SemanticFormatterNullAndEdgeCaseSpecs: 32 unit tests for formatter edge cases (decimal trailing zeros, nullable types, null rendering, format specifiers, LogValues boxing)
    - Document null rendering asymmetry in SemanticLogMessageFormatter XML docs (named: "null", positional: "" via string.Format)

  * Add decimal precision mismatch reproduction test with workarounds

    Reproduces exact customer scenario where actor arithmetic produces 100.0m but test EventFilter uses 100m via string interpolation. Demonstrates three workarounds: contains: partial match, matching exact precision, and F2/F0 format specifiers in named templates.

  (cherry picked from commit f978022)

* [xUnit3] Bump xUnit version (#8052)

  (cherry picked from commit c5dd0a9)

* Downgrade VirtualPathContainer RemoveChild log from Warning to Debug (#8048)

  The /temp VirtualPathContainer logs "trying to remove non-child" at Warning level when ConcurrentDictionary.TryRemove returns false. This is expected, benign behavior during concurrent Ask operations: it simply means the entry was already removed. Logging it as Warning creates noise during load testing.

  Fixes #8037

  Co-authored-by: Aaron Stannard <aaron@petabridge.com>
  (cherry picked from commit 2226305)

* AppVersion.CompareTo missing else if breaks comparison symmetry (#8051)

  The second condition in the rest-string comparison was an 'if' instead of 'else if', causing the first branch's result (diff = 1) to be immediately overwritten by the else branch. This made release versions appear less than their pre-release counterparts (e.g. '1.2.0' < '1.2.0-M1'), violating IComparable<T> symmetry.

  This could cause non-deterministic shard allocation ordering during rolling updates from pre-release to release versions via AbstractLeastShardAllocationStrategy.

  Added regression test for the reverse comparison direction.

  Co-authored-by: Aaron Stannard <aaron@petabridge.com>
  (cherry picked from commit 71f0350)

* Fix Shard remember-entities flag mismatch causing entity restart failures (#8054)

  The Shard constructor derived _rememberEntities from whether a provider was passed, but passed settings.RememberEntities (from HOCON config) to the Entities class. When a provider was supplied without the config flag, the Entities._remembering set was never populated. This caused OnUpdateDone to see no pending work and overwrite the WaitingForRememberEntitiesStore behavior with Idle, dropping the subsequent UpdateDone from the store and preventing entity restarts.

  (cherry picked from commit d332236)

* Correct self-comparison in ShardCoordinator ResendShardHost handler (#8050)

  region.Equals(region) compared the variable to itself, making the condition always true and the else branch (shard reallocated to another region) unreachable dead code. Changed to region.Equals(m.Region) to compare the current region from state against the original region captured when the ResendShardHost timer was scheduled.

  Co-authored-by: Aaron Stannard <aaron@petabridge.com>
  (cherry picked from commit 8a9e6f0)

* Port Pekko ShardStopped handler + handoff safety net (#7500) (#8055)

  Shards can fail to HandOff indefinitely during scale-up when the RebalanceWorker times out before receiving ShardStopped. The coordinator never deallocates the shard, causing an endless GetShardHome/ShardHome loop.

  - Add ShardStopped handler to ShardCoordinator.Active() (Pekko port): cleans up unAckedHostShards and performs late deallocation when no rebalance is in progress for the shard
  - ShardRegion sends backup ShardStopped to coordinator on handoff completion, ensuring the coordinator learns about it even when the RebalanceWorker has already timed out

  (cherry picked from commit ff3c590)

* fix: correct format index in 3-arg LogInfo overload (#8056)

  The 3-arg LogInfo overload used {2} in its prefix, which resolved to arg3 instead of _selfAddress (at index 3). This caused log messages like 'Cluster Node [1.0.0]' instead of 'Cluster Node [akka://...]'. The 1-arg and 2-arg overloads correctly used {1} and {2} respectively to reference _selfAddress as the last argument.

  Also fixed xmldoc on arg3 parameter ('second' -> 'third').

  (cherry picked from commit 9f0cc64)

* fix: remove stray dollar signs from interpolated strings (#8057)

  Fix three string interpolation bugs where a $ prefix was combined with positional {0} formatting or produced literal $ in output:

  - ClusterHeartbeat.cs: [${string.Join...}] -> [{string.Join...}]
  - ShardRegion.cs: $"{0}: {qr}" -> $"{_typeName}: {qr}"
  - SinkRefImpl.cs: [${_promise.Task.Result}] and ${SubscriptionTimeout.Timeout}

  Add regression test for the HeartbeatNodeRing error message.

  Co-authored-by: Aaron Stannard <aaron@petabridge.com>
  (cherry picked from commit 920aad9)

* VectorClock inequality fixes (#8058)

  * fix: VectorClock != operator returns negation of == instead of IsConcurrent

    The != operator incorrectly delegated to IsConcurrentWith instead of being the logical negation of ==. This meant a != b returned false when one clock was Before or After the other, violating the fundamental contract that != is the negation of ==.

    Changed from IsConcurrentWith to !IsSameAs. Added regression assertions in existing comparison tests that cover Before and After relationships.

  * fix: VectorClock == and != operators handle null correctly

    Previously null == null returned false and something ==/!= null threw NullReferenceException. Now follows standard C# equality pattern: ReferenceEquals for both-null and either-null, then delegate to IsSameAs. The != operator delegates to !(left == right) to guarantee consistency.

  Co-authored-by: Aaron Stannard <aaron@petabridge.com>
  (cherry picked from commit d8e2b1f)

* Fix wrong randomFactor argument type on RetrySupport.Retry() (#8061)

  Changes:

  * Change randomFactor type from `int` to `double`
  * Modernize the thrown exception messages

  Co-authored-by: Aaron Stannard <aaron@petabridge.com>
  (cherry picked from commit 5f02610)

* Update API Approval list for .net 4.8

Co-authored-by: Aaron Stannard <aaron@petabridge.com>
Co-authored-by: Apoorv Darshan <ad13dtu@gmail.com>
Co-authored-by: Matt Kotsenas <51421+MattKotsenas@users.noreply.github.com>
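The null-safe equality pattern described in the VectorClock entry above can be sketched on a small stand-in type. `Stamp` and its `Value` property are hypothetical; only the operator shape (reference check first, then delegate to `IsSameAs`, with `!=` defined as the negation of `==`) follows the commit message.

```csharp
using System;

// Hypothetical stand-in type used only to show the operator pattern.
public sealed class Stamp
{
    public int Value { get; }

    public Stamp(int value) => Value = value;

    // Value comparison; callers must pass a non-null instance.
    public bool IsSameAs(Stamp other) => Value == other.Value;

    public static bool operator ==(Stamp left, Stamp right)
    {
        if (ReferenceEquals(left, right)) return true;    // same instance, or both null
        if (left is null || right is null) return false;  // exactly one side is null
        return left.IsSameAs(right);
    }

    // Defined as the negation of == so the two operators cannot disagree.
    public static bool operator !=(Stamp left, Stamp right) => !(left == right);

    public override bool Equals(object obj) => obj is Stamp other && this == other;

    public override int GetHashCode() => Value.GetHashCode();
}
```

Using `is null` inside the operator avoids recursing into the overloaded `==`.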
Summary

Fixes #7500 - Shards can fail to `HandOff` indefinitely during scale-up events.

Root cause: When a `RebalanceWorker` times out (single 60s timer covers both `BeginHandOffAck` + `ShardStopped` phases), the `ShardStopped` message from `HandOffStopper` goes to the dead `RebalanceWorker` and is lost. The coordinator never deallocates the shard, so entity traffic triggers `GetShardHome` → `ShardHome` → shard recreation → repeat (10-30 minutes of unhandled `HandOff` messages).

Changes:

- ShardCoordinator.cs: Added `ShardStopped` handler to `Active()` (ported from Pekko). Cleans up `_unAckedHostShards` and performs late deallocation when no `RebalanceWorker` is active for the shard (`!_rebalanceInProgress.ContainsKey`). This is the safety net: if the worker already timed out, the coordinator can still deallocate and reallocate the shard.
- ShardRegion.cs: `HandleTerminated()` now sends a backup `ShardStopped` to the coordinator when a handoff completes. This ensures the coordinator receives the stop notification even when the `RebalanceWorker` has already timed out. The coordinator handler is idempotent: if a rebalance is still in progress, the deallocation is skipped (the worker handles it).

Test plan

- `dotnet build -c Release -warnaserror`: 0 warnings, 0 errors
- `Akka.Cluster.Sharding.Tests`: 188/190 passed (2 failures are pre-existing `RememberEntitiesStarterSpec` flakes, unrelated)
- `ShardStopped` is an internal message; no public API surface changes
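The root cause described in the summary, one timeout budget shared by both handoff phases, can be illustrated with a small time-stepped model. This is a sketch under assumptions: `RebalanceWorkerModel`, `WorkerPhase`, and `Advance` are hypothetical names, and simulated time replaces real timers. It shows how a slow second phase makes the worker stop before `ShardStopped` arrives, so the message is lost.

```csharp
using System;

// Hypothetical phases for the simplified worker model.
public enum WorkerPhase { AwaitingBeginHandOffAck, AwaitingShardStopped, Done, TimedOut }

// Hypothetical stand-in for the rebalance worker (not the Akka.NET type).
public sealed class RebalanceWorkerModel
{
    private readonly TimeSpan _timeout;
    private TimeSpan _elapsed = TimeSpan.Zero;

    public WorkerPhase Phase { get; private set; } = WorkerPhase.AwaitingBeginHandOffAck;

    public RebalanceWorkerModel(TimeSpan timeout) => _timeout = timeout;

    // Advances simulated time. The same budget covers both phases.
    public void Advance(TimeSpan delta)
    {
        if (Phase == WorkerPhase.Done || Phase == WorkerPhase.TimedOut) return;
        _elapsed += delta;
        if (_elapsed >= _timeout) Phase = WorkerPhase.TimedOut;
    }

    // First phase finished: start waiting for the shard to stop.
    public void OnBeginHandOffAck()
    {
        if (Phase == WorkerPhase.AwaitingBeginHandOffAck)
            Phase = WorkerPhase.AwaitingShardStopped;
    }

    // Returns false when the message arrives after the worker has stopped,
    // which is the case where the coordinator would never hear about it.
    public bool OnShardStopped()
    {
        if (Phase != WorkerPhase.AwaitingShardStopped) return false;
        Phase = WorkerPhase.Done;
        return true;
    }
}
```

With a 60 second budget, an acknowledgement after 20 seconds followed by a 45 second shard shutdown leaves the worker in `TimedOut`, and the late `OnShardStopped` returns false. That lost notification is the gap the backup message to the coordinator is meant to close.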