Fix version-aware grain directory membership cleanup by ReubenBond · Pull Request #10085 · dotnet/orleans

ReubenBond · 2026-05-11T22:20:31Z

Problem

LocalGrainDirectory membership reconciliation and grain-directory entry validation were using lossy status events or approximate terminating-state checks. That could remove, reject, or stop returning registrations for silos which were ShuttingDown or Stopping, even though those registrations should remain valid until the silo is known to be Dead.

Solution

Reconcile LocalGrainDirectory membership from IClusterMembershipService snapshots.
Add a version-aware ClusterMembershipSnapshot.GetSiloStatus(silo, seenAtVersion) overload so stale unknown silos can be treated as dead only when the snapshot is newer than the registration.
Use that version-aware dead check consistently across LocalGrainDirectory cleanup, local/distributed directory return paths, cache validation, and handoff filtering.
Keep handoff operations retryable and transfer registrations for ShuttingDown/Stopping silos while excluding only entries resolved as Dead.
Move dead-silo outstanding-message breaking out of LocalGrainDirectory and into networking ownership.
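
To make the intended semantics concrete, here is a minimal sketch of the version-aware lookup described above. It is a model only: `SnapshotModel`, `StatusModel`, and `GetStatus` are hypothetical stand-ins rather than the Orleans types, silos are plain strings, and versions are plain integers. The rule it assumes is the one stated in the Solution list: an unknown silo resolves to Dead only when the snapshot is strictly newer than the version at which the registration was seen.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of a membership snapshot at version 10.
var snapshot = new SnapshotModel(
    version: 10,
    members: new Dictionary<string, StatusModel>
    {
        ["silo-a"] = StatusModel.Active,
        ["silo-b"] = StatusModel.ShuttingDown,
        ["silo-c"] = StatusModel.Dead,
    });

// A known silo reports its recorded status; ShuttingDown is not Dead.
Console.WriteLine(snapshot.GetStatus("silo-b", seenAtVersion: 3));   // ShuttingDown
Console.WriteLine(snapshot.IsDead("silo-b", seenAtVersion: 3));      // False

// Unknown silo, registration older than the snapshot: resolved as Dead.
Console.WriteLine(snapshot.GetStatus("silo-x", seenAtVersion: 7));   // Dead

// Unknown silo, registration seen at a newer version: status stays unknown.
Console.WriteLine(snapshot.GetStatus("silo-x", seenAtVersion: 12));  // None

enum StatusModel { None, Active, ShuttingDown, Stopping, Dead }

sealed class SnapshotModel
{
    private readonly Dictionary<string, StatusModel> _members;

    public SnapshotModel(long version, Dictionary<string, StatusModel> members)
    {
        Version = version;
        _members = members;
    }

    public long Version { get; }

    public StatusModel GetStatus(string silo, long seenAtVersion)
    {
        // Silos present in the snapshot keep whatever status it records.
        if (_members.TryGetValue(silo, out var status))
        {
            return status;
        }

        // Assumption: an absent silo is only "stale" (and therefore Dead)
        // if this snapshot is newer than the registration that named it.
        return Version > seenAtVersion ? StatusModel.Dead : StatusModel.None;
    }

    // Dead-only filter: ShuttingDown and Stopping registrations remain valid.
    public bool IsDead(string silo, long seenAtVersion) =>
        GetStatus(silo, seenAtVersion) == StatusModel.Dead;
}
```

Under this model, cleanup, cache validation, and handoff filtering would all call the same `IsDead` predicate, which is the consistency the Review focus section asks reviewers to check.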

Review focus

Please focus on the GetSiloStatus(..., seenAtVersion) semantics and whether the dead-only filtering is applied consistently across directory, cache, and handoff paths.


Process cluster membership as snapshots in LocalGrainDirectory so directory state can be reconciled and retried after failures. Move silo-removal activation cleanup out of Catalog and keep handoff operations retrying until success, obsolescence, or shutdown. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Suppress expected shutdown failures while stopping membership processing and disposing the directory cache. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove redundant locking and cancellation from snapshot application, publish the latest directory membership before side effects, and simplify defunct entry cleanup against the current membership. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Evict LocalGrainDirectory directory and cache entries for terminating silos immediately, and only evict unknown-silo entries when the address was registered before the applied membership snapshot. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Refresh and apply cluster membership snapshots when grain directory RPCs receive GrainAddress values from a newer membership version before making ownership decisions. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Apply the latest cluster membership snapshot directly, remove the unused IsSiloInCluster contract, and simplify related LocalGrainDirectory expressions. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Clarify why LocalGrainDirectory removes dead-silo activations and stale unknown-silo activations during membership reconciliation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR updates Orleans.Runtime’s local grain directory membership handling to reconcile from IClusterMembershipService snapshots (instead of silo-status events), moves silo-removal activation cleanup responsibility from Catalog into LocalGrainDirectory, and changes directory handoff operations to keep retrying until they succeed or the silo shuts down.

Changes:

Replace ISiloStatusListener-driven membership updates in LocalGrainDirectory with a background loop applying IClusterMembershipService snapshots.
Move “directory owner removed ⇒ deactivate local activations” logic from Catalog into LocalGrainDirectory using the previous membership view for ownership checks.
Adjust handoff manager operation processing to continuously retry queued operations (and add additional filtering of defunct registrations).

Show a summary per file

| File | Description |
| --- | --- |
| test/Orleans.Core.Tests/Directory/CachedGrainLocatorTests.cs | Updates unit tests to provide a cluster membership service dependency. |
| src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs | Implements snapshot-based membership reconciliation and migrates activation cleanup logic into the directory. |
| src/Orleans.Runtime/GrainDirectory/ILocalGrainDirectory.cs | Removes `IsSiloInCluster` from the local directory interface. |
| src/Orleans.Runtime/GrainDirectory/GrainDirectoryHandoffManager.cs | Changes enqueue/retry behavior for handoff operations and filters defunct registrations. |
| src/Orleans.Runtime/Catalog/Catalog.cs | Removes silo-removal activation cleanup logic now handled by `LocalGrainDirectory`. |

Copilot's findings

Comments suppressed due to low confidence (1)

src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:1055

This log message says "failure of silo", but OnSiloStatusChange runs for any terminating status (e.g. ShuttingDown/Stopping), which can be a planned shutdown rather than a failure. Consider adjusting the wording (or including the status) so operational logs are accurate.

        [LoggerMessage(
            Level = LogLevel.Information,
            EventId = (int)ErrorCode.Catalog_SiloStatusChangeNotification,
            Message = "LocalGrainDirectory is deactivating {Count} activations due to a failure of silo {Silo}, since it is a primary directory partition to these grain ids."
        )]
        private partial void LogInfoSiloStatusChangeNotification(int count, SiloAddressLogValue silo);

Files reviewed: 5/5 changed files
Comments generated: 3

            lock (this)
            {
                this.pendingOperations.Enqueue((name, state, action));
-                if (this.pendingOperations.Count <= 2)
-                {
-                    this.localDirectory.RemoteGrainDirectory.WorkItemGroup.QueueTask(ExecutePendingOperations, localDirectory.RemoteGrainDirectory);
-                }
+                this.localDirectory.RemoteGrainDirectory.WorkItemGroup.QueueTask(ExecutePendingOperations, localDirectory.RemoteGrainDirectory);
            }


Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Use ClusterMembershipSnapshot.GetSiloStatus with registration membership versions when deciding whether grain directory entries are dead. This keeps shutting down and stopping silos valid until they are marked dead while still filtering old unknown or replaced silos. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Only delay and recheck ownership after a request has already been forwarded once, matching the single-operation forwarding behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Copilot's findings

Comments suppressed due to low confidence (4)

src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:674

Same off-by-one issue as in RegisterAsync: with forwarding implemented as hopCount + 1, hopCount == 1 is already a forwarded request. Using hopCount > 1 skips the retry delay/re-check on the first re-forward and may cause unnecessary extra hops during membership churn. Consider hopCount > 0 (or adjust the comment/logic consistently).

            // see if the owner is somewhere else (returns null if we are owner)
            var forwardAddress = this.CheckIfShouldForward(address.GrainId, hopCount, "UnregisterAsync");

            // After the first forward, we insert a retry delay and recheck owner before forwarding again
            if (hopCount > 1 && forwardAddress != null)

src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:743

The retry-delay gating uses hopCount > 1. Since forwarded calls use hopCount + 1, the first forwarded UnregisterManyAsync call will have hopCount == 1, and with the current condition it can immediately re-forward without the intended stabilization delay. Consider using hopCount > 0 if the delay is meant to apply after the first hop.

            UnregisterOrPutInForwardList(addresses, cause, hopCount, ref forwardlist, "UnregisterManyAsync");

            // After the first forward, we insert a retry delay and recheck owner before forwarding again
            if (hopCount > 1 && forwardlist != null)
            {

src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:856

The retry-delay condition uses hopCount > 1, which means a request forwarded once (hopCount == 1) can be re-forwarded immediately if ownership changes again. If the goal is to pause before any re-forward after the initial hop, consider hopCount > 0 to match the comment and reduce forwarding churn.

            // see if the owner is somewhere else (returns null if we are owner)
            var forwardAddress = this.CheckIfShouldForward(grainId, hopCount, "LookUpAsync");

            // After the first forward, we insert a retry delay and recheck owner before forwarding again
            if (hopCount > 1 && forwardAddress != null)

src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:917

DeleteGrainAsync uses the same hopCount > 1 retry-delay gating. Since forwarded calls increment hopCount, the first forwarded invocation has hopCount == 1 and will skip the delay/re-check before potentially forwarding again. Consider hopCount > 0 if you intended to delay on the first re-forward.

            // see if the owner is somewhere else (returns null if we are owner)
            var forwardAddress = this.CheckIfShouldForward(grainId, hopCount, "DeleteGrainAsync");

            // After the first forward, we insert a retry delay and recheck owner before forwarding again
            if (hopCount > 1 && forwardAddress != null)

Files reviewed: 16/16 changed files
Comments generated: 3

-            // on all silos other than first, we insert a retry delay and recheck owner before forwarding
-            if (hopCount > 0 && forwardAddress != null)
+            // After the first forward, we insert a retry delay and recheck owner before forwarding again
+            if (hopCount > 1 && forwardAddress != null)


            await foreach (var snapshot in updates)
            {
-                // Active filtering: detect silos that went down and try to clean proactively the directory
+                // Active filtering: detect dead silos and try to clean proactively the directory
                var changes = snapshot.CreateUpdate(previousSnapshot).Changes;
                var deadSilos = changes


+            if (this.cache.LookUp(grainId, out address, out _))
            {


ReubenBond · 2026-05-12T04:17:11Z

Split PR stack created from this PR. Merge in order:

#10086 - Fix LocalGrainDirectory membership reconciliation
#10087 - Use membership versions for stale directory cleanup
#10088 - Refresh membership before directory RPC routing
#10089 - Simplify LocalGrainDirectory membership processing
#10090 - Move dead-silo message break to connection maintainer
#10091 - Use version-aware silo status checks
#10092 - Refine directory forwarding retry checks

The final split branch (split/pr10085-07-hopcount-guard) has the same tree as this PR branch, so merging the stack in order is source-equivalent to merging #10085.

Advance the previous membership snapshot after each cache cleanup pass and stamp manually cached entries with the current membership version so version-aware cache validation does not evict them immediately. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
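
The first half of that change can be illustrated with a small sketch. It assumes a simplified model in which each membership view is a version plus a map of silo name to status string; `View` and `WasDead` are hypothetical names, not Orleans APIs. The point is only that advancing `previous` at the end of every pass stops a silo that died in version 2 from being cleaned up again when version 3 arrives.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Three successive membership views; silo-b dies in version 2.
var updates = new[]
{
    new View(1, new() { ["silo-a"] = "Active", ["silo-b"] = "Active" }),
    new View(2, new() { ["silo-a"] = "Active", ["silo-b"] = "Dead" }),
    new View(3, new() { ["silo-a"] = "Active", ["silo-b"] = "Dead" }),
};

View? previous = null;
foreach (var current in updates)
{
    // Only silos that are Dead now and were not already Dead in the
    // previous view need a cleanup pass.
    var newlyDead = current.Members
        .Where(m => m.Value == "Dead" && !WasDead(previous, m.Key))
        .Select(m => m.Key)
        .ToList();

    Console.WriteLine($"v{current.Version}: clean up [{string.Join(", ", newlyDead)}]");

    // Advance the baseline after each pass. Without this line, silo-b
    // would be reported as newly dead again for version 3.
    previous = current;
}

// Prints:
// v1: clean up []
// v2: clean up [silo-b]
// v3: clean up []

static bool WasDead(View? view, string silo) =>
    view is not null
    && view.Members.TryGetValue(silo, out var status)
    && status == "Dead";

record View(long Version, Dictionary<string, string> Members);
```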

Clarify that hopCount == 1 is allowed to re-forward immediately and the retry delay applies only after the request has already bounced through another owner. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ReubenBond · 2026-05-12T04:29:34Z

Addressed the latest review feedback:

Fixed CachedGrainLocator membership update processing so previousSnapshot advances after each update, avoiding repeated dead-silo cleanup for already-processed membership changes.
Stamped UpdateCache entries with the current membership version so version-aware cache validation does not immediately evict manually cached entries with MembershipVersion.MinValue.
Kept the hopCount > 1 behavior, but clarified the comments: the first forwarded owner (hopCount == 1) can re-forward immediately; the stabilization delay applies once the request has already been re-forwarded/bounced again.

Pushed updates to #10085 and the affected split branches (#10091 and #10092). The final split branch still matches #10085.
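
For readers following the hopCount discussion, the guard can be tabulated with a tiny sketch. `ShouldDelay` is a hypothetical helper that mirrors the `hopCount > 1 && forwardAddress != null` condition quoted in the review, with the null check replaced by a boolean; it is not code from the PR.

```csharp
using System;

for (var hopCount = 0; hopCount <= 3; hopCount++)
{
    var delay = ShouldDelay(hopCount, ownerIsElsewhere: true);
    Console.WriteLine($"hopCount={hopCount}: delay before forwarding = {delay}");
}

// Prints:
// hopCount=0: delay before forwarding = False
// hopCount=1: delay before forwarding = False
// hopCount=2: delay before forwarding = True
// hopCount=3: delay before forwarding = True

// The original request (0) and the first forwarded owner (1) forward
// immediately; the retry delay and ownership recheck start at 2.
static bool ShouldDelay(int hopCount, bool ownerIsElsewhere) =>
    hopCount > 1 && ownerIsElsewhere;
```

With the earlier `hopCount > 0` guard, the hopCount=1 row would also be True, which is the difference the review comments were pointing at.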

ReubenBond and others added 7 commits May 11, 2026 15:20

Harden LocalGrainDirectory shutdown cleanup

9f9b13d

Suppress expected shutdown failures while stopping membership processing and disposing the directory cache. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Document defunct activation cleanup

5e0fd90

Clarify why LocalGrainDirectory removes dead-silo activations and stale unknown-silo activations during membership reconciliation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ReubenBond requested a review from Copilot May 12, 2026 00:34

Copilot started reviewing on behalf of ReubenBond May 12, 2026 00:35

Copilot AI reviewed May 12, 2026

ReubenBond and others added 5 commits May 11, 2026 17:56

Address PR review feedback

6bddbc3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Refine silo death cleanup handling

05f0408

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Preserve membership version guard for cleanup

4ed7dfe

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Correct defunct activation predicate

1e9f516

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ReubenBond changed the title ~~Fix LocalGrainDirectory membership reconciliation~~ Fix version-aware grain directory membership cleanup May 12, 2026

Refine directory forwarding retry checks

3564900

Only delay and recheck ownership after a request has already been forwarded once, matching the single-operation forwarding behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ReubenBond requested a review from Copilot May 12, 2026 04:04

Copilot started reviewing on behalf of ReubenBond May 12, 2026 04:04

Copilot AI reviewed May 12, 2026

ReubenBond and others added 2 commits May 11, 2026 21:28

ReubenBond closed this May 12, 2026

github-actions Bot locked and limited conversation to collaborators Jun 12, 2026

Fix version-aware grain directory membership cleanup #10085
ReubenBond wants to merge 15 commits into dotnet:main from ReubenBond:fix-local-grain-directory-membership-reconcile
