Fix version-aware grain directory membership cleanup#10085
Conversation
Process cluster membership as snapshots in LocalGrainDirectory so directory state can be reconciled and retried after failures. Move silo-removal activation cleanup out of Catalog and keep handoff operations retrying until success, obsolescence, or shutdown. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Suppress expected shutdown failures while stopping membership processing and disposing the directory cache. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove redundant locking and cancellation from snapshot application, publish the latest directory membership before side effects, and simplify defunct entry cleanup against the current membership. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Evict LocalGrainDirectory directory and cache entries for terminating silos immediately, and only evict unknown-silo entries when the address was registered before the applied membership snapshot. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Refresh and apply cluster membership snapshots when grain directory RPCs receive GrainAddress values from a newer membership version before making ownership decisions. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Apply the latest cluster membership snapshot directly, remove the unused IsSiloInCluster contract, and simplify related LocalGrainDirectory expressions. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Clarify why LocalGrainDirectory removes dead-silo activations and stale unknown-silo activations during membership reconciliation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR updates Orleans.Runtime’s local grain directory membership handling to reconcile from IClusterMembershipService snapshots (instead of silo-status events), moves silo-removal activation cleanup responsibility from Catalog into LocalGrainDirectory, and changes directory handoff operations to keep retrying until they succeed or shutdown.
Changes:
- Replace
ISiloStatusListener-driven membership updates inLocalGrainDirectorywith a background loop applyingIClusterMembershipServicesnapshots. - Move “directory owner removed ⇒ deactivate local activations” logic from
CatalogintoLocalGrainDirectoryusing the previous membership view for ownership checks. - Adjust handoff manager operation processing to continuously retry queued operations (and add additional filtering of defunct registrations).
Show a summary per file
| File | Description |
|---|---|
| test/Orleans.Core.Tests/Directory/CachedGrainLocatorTests.cs | Updates unit tests to provide a cluster membership service dependency. |
| src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs | Implements snapshot-based membership reconciliation and migrates activation cleanup logic into the directory. |
| src/Orleans.Runtime/GrainDirectory/ILocalGrainDirectory.cs | Removes IsSiloInCluster from the local directory interface. |
| src/Orleans.Runtime/GrainDirectory/GrainDirectoryHandoffManager.cs | Changes enqueue/retry behavior for handoff operations and filters defunct registrations. |
| src/Orleans.Runtime/Catalog/Catalog.cs | Removes silo-removal activation cleanup logic now handled by LocalGrainDirectory. |
Copilot's findings
Comments suppressed due to low confidence (1)
src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:1055
- This log message says "failure of silo" but OnSiloStatusChange runs for any terminating status (eg ShuttingDown/Stopping) which can be a planned shutdown rather than a failure. Consider adjusting the wording (or include the status) so operational logs are accurate.
[LoggerMessage(
Level = LogLevel.Information,
EventId = (int)ErrorCode.Catalog_SiloStatusChangeNotification,
Message = "LocalGrainDirectory is deactivating {Count} activations due to a failure of silo {Silo}, since it is a primary directory partition to these grain ids."
)]
private partial void LogInfoSiloStatusChangeNotification(int count, SiloAddressLogValue silo);
- Files reviewed: 5/5 changed files
- Comments generated: 3
| lock (this) | ||
| { | ||
| this.pendingOperations.Enqueue((name, state, action)); | ||
| if (this.pendingOperations.Count <= 2) | ||
| { | ||
| this.localDirectory.RemoteGrainDirectory.WorkItemGroup.QueueTask(ExecutePendingOperations, localDirectory.RemoteGrainDirectory); | ||
| } | ||
| this.localDirectory.RemoteGrainDirectory.WorkItemGroup.QueueTask(ExecutePendingOperations, localDirectory.RemoteGrainDirectory); | ||
| } |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use ClusterMembershipSnapshot.GetSiloStatus with registration membership versions when deciding whether grain directory entries are dead. This keeps shutting down and stopping silos valid until they are marked dead while still filtering old unknown or replaced silos. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Only delay and recheck ownership after a request has already been forwarded once, matching the single-operation forwarding behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Copilot's findings
Comments suppressed due to low confidence (4)
src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:674
- Same off-by-one issue as in
RegisterAsync: with forwarding implemented ashopCount + 1,hopCount == 1is already a forwarded request. UsinghopCount > 1skips the retry delay/re-check on the first re-forward and may cause unnecessary extra hops during membership churn. ConsiderhopCount > 0(or adjust the comment/logic consistently).
// see if the owner is somewhere else (returns null if we are owner)
var forwardAddress = this.CheckIfShouldForward(address.GrainId, hopCount, "UnregisterAsync");
// After the first forward, we insert a retry delay and recheck owner before forwarding again
if (hopCount > 1 && forwardAddress != null)
src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:743
- The retry-delay gating uses
hopCount > 1. Since forwarded calls usehopCount + 1, the first forwardedUnregisterManyAsynccall will havehopCount == 1, and with the current condition it can immediately re-forward without the intended stabilization delay. Consider usinghopCount > 0if the delay is meant to apply after the first hop.
UnregisterOrPutInForwardList(addresses, cause, hopCount, ref forwardlist, "UnregisterManyAsync");
// After the first forward, we insert a retry delay and recheck owner before forwarding again
if (hopCount > 1 && forwardlist != null)
{
src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:856
- The retry-delay condition uses
hopCount > 1, which means a request forwarded once (hopCount == 1) can be re-forwarded immediately if ownership changes again. If the goal is to pause before any re-forward after the initial hop, considerhopCount > 0to match the comment and reduce forwarding churn.
// see if the owner is somewhere else (returns null if we are owner)
var forwardAddress = this.CheckIfShouldForward(grainId, hopCount, "LookUpAsync");
// After the first forward, we insert a retry delay and recheck owner before forwarding again
if (hopCount > 1 && forwardAddress != null)
src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:917
DeleteGrainAsyncuses the samehopCount > 1retry-delay gating. Since forwarded calls incrementhopCount, the first forwarded invocation hashopCount == 1and will skip the delay/re-check before potentially forwarding again. ConsiderhopCount > 0if you intended to delay on the first re-forward.
// see if the owner is somewhere else (returns null if we are owner)
var forwardAddress = this.CheckIfShouldForward(grainId, hopCount, "DeleteGrainAsync");
// After the first forward, we insert a retry delay and recheck owner before forwarding again
if (hopCount > 1 && forwardAddress != null)
- Files reviewed: 16/16 changed files
- Comments generated: 3
| // on all silos other than first, we insert a retry delay and recheck owner before forwarding | ||
| if (hopCount > 0 && forwardAddress != null) | ||
| // After the first forward, we insert a retry delay and recheck owner before forwarding again | ||
| if (hopCount > 1 && forwardAddress != null) |
| await foreach (var snapshot in updates) | ||
| { | ||
| // Active filtering: detect silos that went down and try to clean proactively the directory | ||
| // Active filtering: detect dead silos and try to clean proactively the directory | ||
| var changes = snapshot.CreateUpdate(previousSnapshot).Changes; | ||
| var deadSilos = changes |
| if (this.cache.LookUp(grainId, out address, out _)) | ||
| { |
|
Split PR stack created from this PR. Merge in order:
The final split branch ( |
Advance the previous membership snapshot after each cache cleanup pass and stamp manually cached entries with the current membership version so version-aware cache validation does not evict them immediately. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Clarify that hopCount == 1 is allowed to re-forward immediately and the retry delay applies only after the request has already bounced through another owner. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Addressed the latest review feedback:
Pushed updates to #10085 and the affected split branches (#10091 and #10092). The final split branch still matches #10085. |
Problem
LocalGrainDirectory membership reconciliation and grain-directory entry validation were using lossy status events or approximate terminating-state checks. That could remove, reject, or stop returning registrations for silos which were
ShuttingDownorStopping, even though those registrations should remain valid until the silo is known to beDead.Solution
IClusterMembershipServicesnapshots.ClusterMembershipSnapshot.GetSiloStatus(silo, seenAtVersion)overload so stale unknown silos can be treated as dead only when the snapshot is newer than the registration.ShuttingDown/Stoppingsilos while excluding only entries resolved asDead.Review focus
Please focus on the
GetSiloStatus(..., seenAtVersion)semantics and whether the dead-only filtering is applied consistently across directory, cache, and handoff paths.Microsoft Reviewers: Open in CodeFlow