Fix Shard remember-entities flag mismatch causing entity restart failures #8054

Merged
Aaronontheweb merged 2 commits into akkadotnet:dev from Aaronontheweb:claude-wt-racy_shard_spec
Feb 25, 2026
Conversation

@Aaronontheweb
Member

Summary

  • Fixed a bug in Shard.cs where the Entities class was initialized with settings.RememberEntities (from HOCON config, default false) instead of _rememberEntities (derived from whether a rememberEntitiesProvider was passed to the constructor)
  • This mismatch caused the Entities._remembering HashSet to never be populated, so OnUpdateDone would see no pending work and incorrectly transition the shard to Idle while a store write was in-flight
  • The dropped UpdateDone prevented entity restarts after transient failures (constructor/PreStart exceptions), causing the ShardEntityFailureSpec test to flake

Root Cause

The Shard constructor had two independent "remember entities enabled" flags:

  1. _rememberEntities = rememberEntitiesProvider != null (true when provider passed)
  2. Entities.RememberingEntities = settings.RememberEntities (from HOCON config)

When a rememberEntitiesProvider was supplied without the config flag being set, PassivateCompleted would trigger a store write and call Context.Become(WaitingForRememberEntitiesStore). OnUpdateDone's pending check then ran against the empty _remembering set and overwrote that behavior with Context.Become(Idle), so subsequent UpdateDone messages were dropped with "Id must not be empty".
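
The mismatch can be illustrated with a small, self-contained model. This is a sketch under stated assumptions, not the Akka.NET source: `EntitiesModel`, `ShardModel`, `StartStoreWrite`, `HasPendingWrites`, the `useFix` switch, and the string-valued `Behavior` are all hypothetical names invented for illustration, and the message flow (PassivateCompleted, store write, OnUpdateDone) is compressed into a single method.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the shard's entity tracker (not the real Entities class).
public sealed class EntitiesModel
{
    private readonly bool _rememberingEntities;
    private readonly HashSet<string> _remembering = new HashSet<string>();

    public EntitiesModel(bool rememberingEntities)
    {
        _rememberingEntities = rememberingEntities;
    }

    // Tracks an in-flight store write, but only when the flag is on.
    public void BeginRemembering(string entityId)
    {
        if (_rememberingEntities) _remembering.Add(entityId);
    }

    public bool HasPendingWrites => _remembering.Count > 0;
}

// Hypothetical stand-in for the shard actor; Behavior models Context.Become(...).
public sealed class ShardModel
{
    private readonly bool _rememberEntities;
    private readonly EntitiesModel _entities;

    public string Behavior { get; private set; } = "Idle";

    public ShardModel(object rememberEntitiesProvider, bool settingsRememberEntities, bool useFix)
    {
        // Flag 1: derived from whether a provider was passed.
        _rememberEntities = rememberEntitiesProvider != null;

        // Flag 2: the buggy wiring takes it from config; the fixed wiring reuses flag 1.
        _entities = new EntitiesModel(useFix ? _rememberEntities : settingsRememberEntities);
    }

    // Compressed model of: start a store write, then run the pending-work check.
    public void StartStoreWrite(string entityId)
    {
        if (!_rememberEntities) return;

        _entities.BeginRemembering(entityId);
        Behavior = "WaitingForRememberEntitiesStore";

        // With an empty tracking set this check sees no pending work and
        // overwrites the waiting behavior while the write is still in flight.
        if (!_entities.HasPendingWrites) Behavior = "Idle";
    }
}

public static class Demo
{
    // Provider supplied, config flag left at its default (false).
    public static string Run(bool useFix)
    {
        var shard = new ShardModel(new object(), settingsRememberEntities: false, useFix: useFix);
        shard.StartStoreWrite("entity-1");
        return shard.Behavior;
    }

    public static void Main()
    {
        Console.WriteLine("buggy wiring: " + Run(useFix: false)); // Idle
        Console.WriteLine("fixed wiring: " + Run(useFix: true));  // WaitingForRememberEntitiesStore
    }
}
```

In this model, deriving both flags from the same source keeps the tracking set and the shard's behavior in agreement, which is the property the fix restores.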

Test plan

  • ShardEntityFailureSpec passes (both ConstructorFailActor and PreStartFailActor variants)
  • All 24 shard-related tests pass
  • Validated with 200 consecutive passes by stress testing

…ures

The Shard constructor derived _rememberEntities from whether a provider
was passed, but passed settings.RememberEntities (from HOCON config) to
the Entities class. When a provider was supplied without the config flag,
the Entities._remembering set was never populated. This caused
OnUpdateDone to see no pending work and overwrite the
WaitingForRememberEntitiesStore behavior with Idle, dropping the
subsequent UpdateDone from the store and preventing entity restarts.
Aaronontheweb enabled auto-merge (squash) February 25, 2026 03:30
Aaronontheweb merged commit d332236 into akkadotnet:dev Feb 25, 2026
12 checks passed
Aaronontheweb deleted the claude-wt-racy_shard_spec branch February 25, 2026 16:19
Arkatufus pushed a commit to Arkatufus/akka.net that referenced this pull request Feb 26, 2026
…ures (akkadotnet#8054)

(cherry picked from commit d332236)
Aaronontheweb added a commit that referenced this pull request Feb 26, 2026
* Add EventFilter + semantic logging test coverage (#8046)

* Add EventFilter + semantic logging test coverage

Add comprehensive test coverage for EventFilter matching against
semantic log messages (named templates). Covers exact match, partial
matchers, null handling, decimal precision, type variety, and
end-to-end Logger.Info() scenarios through the real adapter path.

- EventFilterSemanticLoggingTests: 26 integration tests for EventFilter
  with SemanticLogMessageFormatter including customer-reported scenario
  where converting from positional to named templates changed output
- SemanticFormatterNullAndEdgeCaseSpecs: 32 unit tests for formatter
  edge cases (decimal trailing zeros, nullable types, null rendering,
  format specifiers, LogValues boxing)
- Document null rendering asymmetry in SemanticLogMessageFormatter
  XML docs (named: "null", positional: "" via string.Format)

* Add decimal precision mismatch reproduction test with workarounds

Reproduces exact customer scenario where actor arithmetic produces
100.0m but test EventFilter uses 100m via string interpolation.
Demonstrates three workarounds: contains: partial match, matching
exact precision, and F2/F0 format specifiers in named templates.

(cherry picked from commit f978022)

* [xUnit3] Bump xUnit version (#8052)

(cherry picked from commit c5dd0a9)

* Downgrade VirtualPathContainer RemoveChild log from Warning to Debug (#8048)

The /temp VirtualPathContainer logs "trying to remove non-child" at
Warning level when ConcurrentDictionary.TryRemove returns false. This
is expected, benign behavior during concurrent Ask operations — it
simply means the entry was already removed. Logging it as Warning
creates noise during load testing.

Fixes #8037

Co-authored-by: Aaron Stannard <aaron@petabridge.com>
(cherry picked from commit 2226305)

* AppVersion.CompareTo missing else if breaks comparison symmetry (#8051)

The second condition in the rest-string comparison was an 'if' instead
of 'else if', causing the first branch's result (diff = 1) to be
immediately overwritten by the else branch. This made release versions
appear less than their pre-release counterparts (e.g. '1.2.0' < '1.2.0-M1'),
violating IComparable<T> symmetry.

This could cause non-deterministic shard allocation ordering during
rolling updates from pre-release to release versions via
AbstractLeastShardAllocationStrategy.

Added regression test for the reverse comparison direction.

Co-authored-by: Aaron Stannard <aaron@petabridge.com>
(cherry picked from commit 71f0350)

* Fix Shard remember-entities flag mismatch causing entity restart failures (#8054)

The Shard constructor derived _rememberEntities from whether a provider
was passed, but passed settings.RememberEntities (from HOCON config) to
the Entities class. When a provider was supplied without the config flag,
the Entities._remembering set was never populated. This caused
OnUpdateDone to see no pending work and overwrite the
WaitingForRememberEntitiesStore behavior with Idle, dropping the
subsequent UpdateDone from the store and preventing entity restarts.

(cherry picked from commit d332236)

* Correct self-comparison in ShardCoordinator ResendShardHost handler (#8050)

region.Equals(region) compared the variable to itself, making the
condition always true and the else branch (shard reallocated to another
region) unreachable dead code. Changed to region.Equals(m.Region) to
compare the current region from state against the original region
captured when the ResendShardHost timer was scheduled.

Co-authored-by: Aaron Stannard <aaron@petabridge.com>
(cherry picked from commit 8a9e6f0)

* Port Pekko ShardStopped handler + handoff safety net (#7500) (#8055)

Shards can fail to HandOff indefinitely during scale-up when the
RebalanceWorker times out before receiving ShardStopped. The coordinator
never deallocates the shard, causing an endless GetShardHome/ShardHome loop.

- Add ShardStopped handler to ShardCoordinator.Active() (Pekko port):
  cleans up unAckedHostShards and performs late deallocation when no
  rebalance is in progress for the shard
- ShardRegion sends backup ShardStopped to coordinator on handoff
  completion, ensuring the coordinator learns about it even when the
  RebalanceWorker has already timed out

(cherry picked from commit ff3c590)

* fix: correct format index in 3-arg LogInfo overload (#8056)

The 3-arg LogInfo overload used {2} in its prefix, which resolved to
arg3 instead of _selfAddress (at index 3). This caused log messages
like 'Cluster Node [1.0.0]' instead of 'Cluster Node [akka://...]'.

The 1-arg and 2-arg overloads correctly used {1} and {2} respectively
to reference _selfAddress as the last argument.

Also fixed xmldoc on arg3 parameter ('second' -> 'third').

(cherry picked from commit 9f0cc64)

* fix: remove stray dollar signs from interpolated strings (#8057)

Fix three string interpolation bugs where a $ prefix was combined with
positional {0} formatting or produced literal $ in output:

- ClusterHeartbeat.cs: [${string.Join...}] -> [{string.Join...}]
- ShardRegion.cs: $"{0}: {qr}" -> $"{_typeName}: {qr}"
- SinkRefImpl.cs: [${_promise.Task.Result}] and ${SubscriptionTimeout.Timeout}

Add regression test for the HeartbeatNodeRing error message.

Co-authored-by: Aaron Stannard <aaron@petabridge.com>
(cherry picked from commit 920aad9)

* VectorClock inequality fixes (#8058)

* fix: VectorClock != operator returns negation of == instead of IsConcurrent

The != operator incorrectly delegated to IsConcurrentWith instead of
being the logical negation of ==. This meant a != b returned false when
one clock was Before or After the other, violating the fundamental
contract that != is the negation of ==.

Changed from IsConcurrentWith to !IsSameAs. Added regression assertions
in existing comparison tests that cover Before and After relationships.

* fix: VectorClock == and != operators handle null correctly

Previously null == null returned false and something ==/!= null threw
NullReferenceException. Now follows standard C# equality pattern:
ReferenceEquals for both-null and either-null, then delegate to
IsSameAs. The != operator delegates to !(left == right) to guarantee
consistency.

---------

Co-authored-by: Aaron Stannard <aaron@petabridge.com>
(cherry picked from commit d8e2b1f)

* Fix wrong randomFactor argument type on RetrySupport.Retry() (#8061)

## Changes

* Change randomFactor type from `int` to `double`
* Modernize the thrown exception messages
---------

Co-authored-by: Aaron Stannard <aaron@petabridge.com>
(cherry picked from commit 5f02610)

* Update API Approval list for .net 4.8

---------

Co-authored-by: Aaron Stannard <aaron@petabridge.com>
Co-authored-by: Apoorv Darshan <ad13dtu@gmail.com>
Co-authored-by: Matt Kotsenas <51421+MattKotsenas@users.noreply.github.com>
This was referenced Feb 27, 2026
This was referenced May 21, 2026