Fix version-aware grain directory membership cleanup#10085

Closed
ReubenBond wants to merge 15 commits into dotnet:main from ReubenBond:fix-local-grain-directory-membership-reconcile

ReubenBond commented May 11, 2026

Problem

LocalGrainDirectory membership reconciliation and grain-directory entry validation were using lossy status events or approximate terminating-state checks. That could remove, reject, or stop returning registrations for silos which were ShuttingDown or Stopping, even though those registrations should remain valid until the silo is known to be Dead.

Solution

  • Reconcile LocalGrainDirectory membership from IClusterMembershipService snapshots.
  • Add a version-aware ClusterMembershipSnapshot.GetSiloStatus(silo, seenAtVersion) overload so stale unknown silos can be treated as dead only when the snapshot is newer than the registration.
  • Use that version-aware dead check consistently across LocalGrainDirectory cleanup, local/distributed directory return paths, cache validation, and handoff filtering.
  • Keep handoff operations retryable and transfer registrations for ShuttingDown/Stopping silos while excluding only entries resolved as Dead.
  • Move dead-silo outstanding-message breaking out of LocalGrainDirectory and into networking ownership.
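
To make the review question concrete, here is a minimal sketch of the version-aware lookup as described in the bullets above. The types `SketchSiloStatus` and `SketchMembershipSnapshot` are hypothetical stand-ins, not the Orleans types, and the rule for unknown silos (Dead only when the snapshot version is strictly newer than the version at which the registration was seen) is an assumption read from this description rather than the actual implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration; these are NOT the Orleans types.
public enum SketchSiloStatus { None, Active, ShuttingDown, Stopping, Dead }

public sealed class SketchMembershipSnapshot
{
    private readonly IReadOnlyDictionary<string, SketchSiloStatus> _members;

    public SketchMembershipSnapshot(long version, IReadOnlyDictionary<string, SketchSiloStatus> members)
    {
        Version = version;
        _members = members;
    }

    public long Version { get; }

    // Status of a silo as of this snapshot, given the membership version at
    // which the caller's registration was recorded (assumed semantics).
    public SketchSiloStatus GetSiloStatus(string silo, long seenAtVersion)
    {
        if (_members.TryGetValue(silo, out var status))
        {
            // Known silo: report exactly what the snapshot says.
            return status;
        }

        // Unknown silo: only a snapshot newer than the registration can
        // show that the silo is gone; otherwise the status stays unknown.
        return Version > seenAtVersion ? SketchSiloStatus.Dead : SketchSiloStatus.None;
    }

    // Dead-only filter: ShuttingDown and Stopping registrations stay valid.
    public bool IsDead(string silo, long seenAtVersion) =>
        GetSiloStatus(silo, seenAtVersion) == SketchSiloStatus.Dead;
}
```

Under these assumptions, a registration on a ShuttingDown silo is never filtered, and a registration on an unknown silo is filtered only once the snapshot has moved past the version stamped on the registration.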

Review focus

Please focus on the GetSiloStatus(..., seenAtVersion) semantics and whether the dead-only filtering is applied consistently across directory, cache, and handoff paths.

ReubenBond and others added 7 commits May 11, 2026 15:20
Process cluster membership as snapshots in LocalGrainDirectory so directory state can be reconciled and retried after failures. Move silo-removal activation cleanup out of Catalog and keep handoff operations retrying until success, obsolescence, or shutdown.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Suppress expected shutdown failures while stopping membership processing and disposing the directory cache.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove redundant locking and cancellation from snapshot application, publish the latest directory membership before side effects, and simplify defunct entry cleanup against the current membership.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Evict LocalGrainDirectory directory and cache entries for terminating silos immediately, and only evict unknown-silo entries when the address was registered before the applied membership snapshot.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Refresh and apply cluster membership snapshots when grain directory RPCs receive GrainAddress values from a newer membership version before making ownership decisions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Apply the latest cluster membership snapshot directly, remove the unused IsSiloInCluster contract, and simplify related LocalGrainDirectory expressions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Clarify why LocalGrainDirectory removes dead-silo activations and stale unknown-silo activations during membership reconciliation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

This PR updates Orleans.Runtime’s local grain directory membership handling to reconcile from IClusterMembershipService snapshots (instead of silo-status events), moves silo-removal activation cleanup responsibility from Catalog into LocalGrainDirectory, and changes directory handoff operations to keep retrying until success or shutdown.

Changes:

  • Replace ISiloStatusListener-driven membership updates in LocalGrainDirectory with a background loop applying IClusterMembershipService snapshots.
  • Move “directory owner removed ⇒ deactivate local activations” logic from Catalog into LocalGrainDirectory using the previous membership view for ownership checks.
  • Adjust handoff manager operation processing to continuously retry queued operations (and add additional filtering of defunct registrations).
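
As an illustration of the snapshot-driven approach in the first bullet, the sketch below diffs each applied snapshot against the previously applied one and reports only silos that newly became Dead. `SketchStatus`, `SketchSnapshot`, and `SketchReconciler` are hypothetical names, not Orleans types, and the version check and diff logic are assumptions for illustration, not the code in this PR.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for illustration; these are NOT the Orleans types.
public enum SketchStatus { Active, ShuttingDown, Stopping, Dead }

public sealed record SketchSnapshot(long Version, IReadOnlyDictionary<string, SketchStatus> Members);

public sealed class SketchReconciler
{
    // Start from an empty view so the first real snapshot is always applied.
    private SketchSnapshot _previous = new(0, new Dictionary<string, SketchStatus>());

    // Returns the silos that became Dead between the previously applied
    // snapshot and this one.
    public IReadOnlyList<string> Apply(SketchSnapshot snapshot)
    {
        if (snapshot.Version <= _previous.Version)
        {
            // Stale or duplicate snapshot: nothing new to reconcile.
            return Array.Empty<string>();
        }

        var newlyDead = snapshot.Members
            .Where(kv => kv.Value == SketchStatus.Dead)
            .Where(kv => !_previous.Members.TryGetValue(kv.Key, out var old) || old != SketchStatus.Dead)
            .Select(kv => kv.Key)
            .OrderBy(s => s, StringComparer.Ordinal)
            .ToList();

        // Advance the baseline so the same change is not processed again.
        _previous = snapshot;
        return newlyDead;
    }
}
```

Because the whole membership view is compared each time, a missed intermediate update does not lose information: the next snapshot applied still yields every silo that is Dead now and was not Dead in the last applied view.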
Show a summary per file

| File | Description |
| --- | --- |
| test/Orleans.Core.Tests/Directory/CachedGrainLocatorTests.cs | Updates unit tests to provide a cluster membership service dependency. |
| src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs | Implements snapshot-based membership reconciliation and migrates activation cleanup logic into the directory. |
| src/Orleans.Runtime/GrainDirectory/ILocalGrainDirectory.cs | Removes IsSiloInCluster from the local directory interface. |
| src/Orleans.Runtime/GrainDirectory/GrainDirectoryHandoffManager.cs | Changes enqueue/retry behavior for handoff operations and filters defunct registrations. |
| src/Orleans.Runtime/Catalog/Catalog.cs | Removes silo-removal activation cleanup logic now handled by LocalGrainDirectory. |

Copilot's findings

Comments suppressed due to low confidence (1)

src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:1055

  • This log message says "failure of silo", but OnSiloStatusChange runs for any terminating status (e.g. ShuttingDown/Stopping), which can be a planned shutdown rather than a failure. Consider adjusting the wording (or including the status) so operational logs are accurate.
        [LoggerMessage(
            Level = LogLevel.Information,
            EventId = (int)ErrorCode.Catalog_SiloStatusChangeNotification,
            Message = "LocalGrainDirectory is deactivating {Count} activations due to a failure of silo {Silo}, since it is a primary directory partition to these grain ids."
        )]
        private partial void LogInfoSiloStatusChangeNotification(int count, SiloAddressLogValue silo);
  • Files reviewed: 5/5 changed files
  • Comments generated: 3

Comment on lines 204 to 208
  lock (this)
  {
      this.pendingOperations.Enqueue((name, state, action));
-     if (this.pendingOperations.Count <= 2)
-     {
-         this.localDirectory.RemoteGrainDirectory.WorkItemGroup.QueueTask(ExecutePendingOperations, localDirectory.RemoteGrainDirectory);
-     }
+     this.localDirectory.RemoteGrainDirectory.WorkItemGroup.QueueTask(ExecutePendingOperations, localDirectory.RemoteGrainDirectory);
  }
Comment thread src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs Outdated
Comment thread src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs Outdated
ReubenBond and others added 5 commits May 11, 2026 17:56
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use ClusterMembershipSnapshot.GetSiloStatus with registration membership versions when deciding whether grain directory entries are dead. This keeps shutting down and stopping silos valid until they are marked dead while still filtering old unknown or replaced silos.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ReubenBond changed the title from "Fix LocalGrainDirectory membership reconciliation" to "Fix version-aware grain directory membership cleanup" May 12, 2026
Only delay and recheck ownership after a request has already been forwarded once, matching the single-operation forwarding behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copilot's findings

Comments suppressed due to low confidence (4)

src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:674

  • Same off-by-one issue as in RegisterAsync: with forwarding implemented as hopCount + 1, hopCount == 1 is already a forwarded request. Using hopCount > 1 skips the retry delay/re-check on the first re-forward and may cause unnecessary extra hops during membership churn. Consider hopCount > 0 (or adjust the comment/logic consistently).
            // see if the owner is somewhere else (returns null if we are owner)
            var forwardAddress = this.CheckIfShouldForward(address.GrainId, hopCount, "UnregisterAsync");

            // After the first forward, we insert a retry delay and recheck owner before forwarding again
            if (hopCount > 1 && forwardAddress != null)

src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:743

  • The retry-delay gating uses hopCount > 1. Since forwarded calls use hopCount + 1, the first forwarded UnregisterManyAsync call will have hopCount == 1, and with the current condition it can immediately re-forward without the intended stabilization delay. Consider using hopCount > 0 if the delay is meant to apply after the first hop.
            UnregisterOrPutInForwardList(addresses, cause, hopCount, ref forwardlist, "UnregisterManyAsync");

            // After the first forward, we insert a retry delay and recheck owner before forwarding again
            if (hopCount > 1 && forwardlist != null)
            {

src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:856

  • The retry-delay condition uses hopCount > 1, which means a request forwarded once (hopCount == 1) can be re-forwarded immediately if ownership changes again. If the goal is to pause before any re-forward after the initial hop, consider hopCount > 0 to match the comment and reduce forwarding churn.
            // see if the owner is somewhere else (returns null if we are owner)
            var forwardAddress = this.CheckIfShouldForward(grainId, hopCount, "LookUpAsync");

            // After the first forward, we insert a retry delay and recheck owner before forwarding again
            if (hopCount > 1 && forwardAddress != null)

src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:917

  • DeleteGrainAsync uses the same hopCount > 1 retry-delay gating. Since forwarded calls increment hopCount, the first forwarded invocation has hopCount == 1 and will skip the delay/re-check before potentially forwarding again. Consider hopCount > 0 if you intended to delay on the first re-forward.
            // see if the owner is somewhere else (returns null if we are owner)
            var forwardAddress = this.CheckIfShouldForward(grainId, hopCount, "DeleteGrainAsync");

            // After the first forward, we insert a retry delay and recheck owner before forwarding again
            if (hopCount > 1 && forwardAddress != null)
  • Files reviewed: 16/16 changed files
  • Comments generated: 3

- // on all silos other than first, we insert a retry delay and recheck owner before forwarding
- if (hopCount > 0 && forwardAddress != null)
+ // After the first forward, we insert a retry delay and recheck owner before forwarding again
+ if (hopCount > 1 && forwardAddress != null)
Comment on lines 170 to 174
  await foreach (var snapshot in updates)
  {
-     // Active filtering: detect silos that went down and try to clean proactively the directory
+     // Active filtering: detect dead silos and try to clean proactively the directory
      var changes = snapshot.CreateUpdate(previousSnapshot).Changes;
      var deadSilos = changes
Comment on lines +215 to 216
if (this.cache.LookUp(grainId, out address, out _))
{
@ReubenBond

Split PR stack created from this PR. Merge in order:

  1. #10086 - Fix LocalGrainDirectory membership reconciliation
  2. #10087 - Use membership versions for stale directory cleanup
  3. #10088 - Refresh membership before directory RPC routing
  4. #10089 - Simplify LocalGrainDirectory membership processing
  5. #10090 - Move dead-silo message break to connection maintainer
  6. #10091 - Use version-aware silo status checks
  7. #10092 - Refine directory forwarding retry checks

The final split branch (split/pr10085-07-hopcount-guard) has the same tree as this PR branch, so merging the stack in order is source-equivalent to merging #10085.

ReubenBond and others added 2 commits May 11, 2026 21:28
Advance the previous membership snapshot after each cache cleanup pass and stamp manually cached entries with the current membership version so version-aware cache validation does not evict them immediately.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Clarify that hopCount == 1 is allowed to re-forward immediately and the retry delay applies only after the request has already bounced through another owner.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ReubenBond

Addressed the latest review feedback:

  • Fixed CachedGrainLocator membership update processing so previousSnapshot advances after each update, avoiding repeated dead-silo cleanup for already-processed membership changes.
  • Stamped UpdateCache entries with the current membership version so version-aware cache validation does not immediately evict manually cached entries with MembershipVersion.MinValue.
  • Kept the hopCount > 1 behavior, but clarified the comments: the first forwarded owner (hopCount == 1) can re-forward immediately; the stabilization delay applies once the request has already been re-forwarded/bounced again.

Pushed updates to #10085 and the affected split branches (#10091 and #10092). The final split branch still matches #10085.
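
For readers following the hopCount discussion, the retained rule can be written as a small predicate. This is a sketch only: `SketchForwarding` and `ShouldDelayBeforeForward` are hypothetical names, and the interpretation of the hop counts (0 for the original request, 1 for the first forwarded owner, 2 or more once the request has bounced again) is taken from the comments in this thread.

```csharp
// Hypothetical helper for illustration; not part of Orleans.
public static class SketchForwarding
{
    // hopCount 0: original request, forwards immediately.
    // hopCount 1: first forwarded owner, may re-forward immediately.
    // hopCount 2 or more: the request has already bounced again, so delay
    // and recheck ownership before forwarding once more.
    public static bool ShouldDelayBeforeForward(int hopCount, bool hasForwardAddress) =>
        hopCount > 1 && hasForwardAddress;
}
```

With the `hopCount > 0` variant suggested in review, the hopCount 1 case would also return true; that single case is the behavioral difference under discussion.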

ReubenBond closed this May 12, 2026
github-actions bot locked and limited conversation to collaborators Jun 12, 2026