Use version-aware silo status checks #10091

Closed
ReubenBond wants to merge 7 commits into dotnet:main from ReubenBond:split/pr10085-06-version-aware-status

@ReubenBond (Member) commented May 12, 2026

Part 6 of 7 split from #10085.

Problem:
Directory and cache validation need to treat entries as invalid only when the referenced silo is Dead, or when the silo is unknown in a snapshot newer than the entry's membership version. ShuttingDown and Stopping silos should remain valid until they are Dead.

Solution:
Add ClusterMembershipSnapshot.GetSiloStatus(silo, seenAtVersion), update LocalGrainDirectory, handoff, cached locator, and partition validation to use it, and add coverage for the version-aware semantics.
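
The semantics described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `SketchSiloStatus`, `SketchSnapshot`, and `IsEntryValid` are hypothetical stand-ins, silos are plain strings, membership versions are plain `long` values, and the rule for unknown silos (Dead only when the snapshot is newer than the version at which the entry was recorded) is inferred from the description.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the runtime's silo status enum.
public enum SketchSiloStatus { None, Active, ShuttingDown, Stopping, Dead }

// Hypothetical stand-in for a cluster membership snapshot.
public sealed class SketchSnapshot
{
    private readonly Dictionary<string, SketchSiloStatus> members;

    public SketchSnapshot(long version, Dictionary<string, SketchSiloStatus> members)
    {
        Version = version;
        this.members = members;
    }

    public long Version { get; }

    // Assumed semantics: a known silo reports its recorded status. An unknown silo
    // is treated as Dead only if this snapshot is newer than the version at which
    // the entry was recorded; otherwise the snapshot may simply be behind.
    public SketchSiloStatus GetSiloStatus(string silo, long seenAtVersion)
    {
        if (members.TryGetValue(silo, out var status))
        {
            return status;
        }

        return Version > seenAtVersion ? SketchSiloStatus.Dead : SketchSiloStatus.None;
    }

    // Dead-only filtering: ShuttingDown and Stopping silos keep their entries.
    public bool IsEntryValid(string silo, long seenAtVersion)
        => GetSiloStatus(silo, seenAtVersion) != SketchSiloStatus.Dead;
}
```

Under these assumptions, an entry pointing at a ShuttingDown silo stays valid, while an entry recorded at version 3 for a silo missing from a version 5 snapshot is treated as dead.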

Stack:
Merge after #10090. This branch is stacked on split/pr10085-05-dead-silo-message-break; until earlier PRs merge, GitHub may show earlier stack changes. Incremental compare: ReubenBond/orleans@split/pr10085-05-dead-silo-message-break...split/pr10085-06-version-aware-status

Review focus:
Version-aware GetSiloStatus semantics, Dead-only filtering, and preserving ShuttingDown/Stopping registrations.

Copilot AI (Contributor) left a comment

Pull request overview

This PR updates Orleans grain-directory and locator logic to use version-aware silo status checks, treating entries as invalid only when the referenced silo is Dead (or provably stale/unknown relative to a newer membership snapshot), while keeping ShuttingDown/Stopping registrations valid until death.

Changes:

  • Add ClusterMembershipSnapshot.GetSiloStatus(SiloAddress, MembershipVersion) and switch directory/cache/handoff validation to use it for dead-only filtering with version awareness.
  • Rework LocalGrainDirectory membership processing to apply IClusterMembershipService snapshots and use dead-only invalidation for directory entries and cache.
  • Add/extend tests to cover version-aware “unknown vs dead” semantics and terminating-but-not-dead behavior.

| File | Description |
| --- | --- |
| test/Orleans.Runtime.Internal.Tests/LocalGrainDirectoryTests.cs | Adds unit tests for LocalGrainDirectory.IsDefunctActivation version-aware semantics. |
| test/Orleans.Runtime.Internal.Tests/GrainDirectoryPartitionTests.cs | Switches partition tests to IClusterMembershipService and adds coverage for terminating-but-not-dead behavior. |
| test/Orleans.Runtime.Internal.Tests/GrainDirectoryHandoffManagerTests.cs | Adds tests ensuring handoff transferability is dead-only and version-aware. |
| test/Orleans.Runtime.Internal.Tests/ClusterMembershipSnapshotTests.cs | Adds tests for GetSiloStatus(silo, seenAtVersion) semantics (unknown/older => Dead). |
| test/Orleans.Core.Tests/Directory/CachedGrainLocatorTests.cs | Updates wiring to include membership service and adds cache validation tests for ShuttingDown/Stopping silos. |
| src/Orleans.Runtime/Networking/SiloConnectionMaintainer.cs | Breaks outstanding messages to Dead silos and closes connections on death. |
| src/Orleans.Runtime/MembershipService/SiloStatusListenerManager.cs | Minor type change (sealed). |
| src/Orleans.Runtime/MembershipService/ClusterMembershipSnapshot.cs | Introduces the version-aware GetSiloStatus overload. |
| src/Orleans.Runtime/GrainDirectory/LocalGrainDirectoryPartition.cs | Replaces terminating-based checks with snapshot-based dead-only defunct detection. |
| src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs | Moves to snapshot-driven membership application; dead-only invalidation; refreshes membership when entries are newer than applied snapshot. |
| src/Orleans.Runtime/GrainDirectory/ILocalGrainDirectory.cs | Removes IsSiloInCluster from the internal interface. |
| src/Orleans.Runtime/GrainDirectory/GrainDirectoryPartition.Interface.cs | Uses version-aware dead detection when deciding whether an entry is dead. |
| src/Orleans.Runtime/GrainDirectory/GrainDirectoryHandoffManager.cs | Uses snapshot dead-only filtering for transferable registrations; changes pending-operation retry behavior. |
| src/Orleans.Runtime/GrainDirectory/CachedGrainLocator.cs | Changes proactive cleanup to dead-only and uses version-aware status checks for cached entries. |
| src/Orleans.Runtime/Catalog/Catalog.cs | Removes directory-owned silo-status-change deactivation logic (moved elsewhere). |
| src/api/Orleans.Runtime/Orleans.Runtime.cs | Updates generated public API surface for the new GetSiloStatus overload. |

Copilot's findings

Comments suppressed due to low confidence (2)

src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:677

  • UnregisterAsync: the retry-delay/recheck gate uses hopCount > 1, so the first forwarded request (hopCount == 1) will forward again without delay/revalidation. This is inconsistent with LookupAsync/DeleteGrainAsync (hopCount > 0) and may reintroduce fast hop chains when directory ownership is unstable. Consider using hopCount > 0 (or otherwise aligning the hop-count semantics) so forwarded unregisters also get a stabilization delay.
            await RefreshMembershipIfNewer(address.MembershipVersion);

            // see if the owner is somewhere else (returns null if we are owner)
            var forwardAddress = this.CheckIfShouldForward(address.GrainId, hopCount, "UnregisterAsync");

            // After the first forward, we insert a retry delay and recheck owner before forwarding again
            if (hopCount > 1 && forwardAddress != null)
            {
                await Task.Delay(RETRY_DELAY);
                forwardAddress = this.CheckIfShouldForward(address.GrainId, hopCount, "UnregisterAsync");

src/Orleans.Runtime/GrainDirectory/CachedGrainLocator.cs:190

  • ListenToClusterChange computes snapshot.CreateUpdate(previousSnapshot) but never updates previousSnapshot inside the loop. As a result, every iteration diffs against the initial snapshot, which can repeatedly re-process the same dead silos and grow the change set over time. Updating previousSnapshot = snapshot at the end of each loop will make the processing incremental and avoid redundant UnregisterSilos calls.
            var previousSnapshot = this.clusterMembershipService.CurrentSnapshot;

            ((ITestAccessor)this).LastMembershipVersion = previousSnapshot.Version;

            var updates = this.clusterMembershipService.MembershipUpdates.WithCancellation(this.shutdownToken.Token);
            await foreach (var snapshot in updates)
            {
                // Active filtering: detect dead silos and try to clean proactively the directory
                var changes = snapshot.CreateUpdate(previousSnapshot).Changes;
                var deadSilos = changes
                    .Where(member => member.Status == SiloStatus.Dead)
                    .Select(member => member.SiloAddress)
                    .ToList();

                if (deadSilos.Count > 0)
                {
                    var tasks = new List<Task>();
                    foreach (var directory in this.grainDirectoryResolver.Directories)
                    {
                        tasks.Add(directory.UnregisterSilos(deadSilos));
                    }
                    await Task.WhenAll(tasks).WaitAsync(this.shutdownToken.Token);
                }

                ((ITestAccessor)this).LastMembershipVersion = snapshot.Version;
            }
  • Files reviewed: 16/16 changed files
  • Comments generated: 2
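
The incremental processing suggested in the CachedGrainLocator finding can be sketched in isolation. This is a simplified illustration under stated assumptions, not the runtime's code: `IncrementalCleanupSketch`, `NewlyDead`, and `Process` are hypothetical names, and each snapshot is modelled as a plain dictionary from silo name to a status string. Each snapshot is diffed against the one before it, and the previous snapshot is advanced at the end of every iteration so a dead silo is reported only once.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class IncrementalCleanupSketch
{
    // Returns silos that are "Dead" in the current snapshot and were not
    // already "Dead" in the previous one.
    public static List<string> NewlyDead(
        IReadOnlyDictionary<string, string> previous,
        IReadOnlyDictionary<string, string> current)
    {
        return current
            .Where(kv => kv.Value == "Dead"
                && (!previous.TryGetValue(kv.Key, out var old) || old != kv.Value))
            .Select(kv => kv.Key)
            .OrderBy(silo => silo, StringComparer.Ordinal)
            .ToList();
    }

    // Processes a stream of snapshots, producing one batch of newly dead silos
    // per snapshot.
    public static List<List<string>> Process(
        IReadOnlyDictionary<string, string> initial,
        IEnumerable<IReadOnlyDictionary<string, string>> updates)
    {
        var previous = initial;
        var batches = new List<List<string>>();
        foreach (var snapshot in updates)
        {
            batches.Add(NewlyDead(previous, snapshot));

            // Advance the baseline so the next diff is incremental.
            previous = snapshot;
        }

        return batches;
    }
}
```

Without the `previous = snapshot` assignment, every batch would be computed against the initial snapshot, so a silo that died early would reappear in every later batch.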

Comment thread src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs
Comment thread src/Orleans.Runtime/GrainDirectory/GrainDirectoryHandoffManager.cs
@ReubenBond force-pushed the split/pr10085-06-version-aware-status branch 5 times, most recently from 77123df to c845b97, May 12, 2026 20:13
ReubenBond and others added 7 commits May 12, 2026 13:15
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When background Orleans work logs after xUnit has cleared the current test context, ITestOutputHelper throws InvalidOperationException with "There is no currently active test." That exception can escape through Microsoft.Extensions.Logging and abort the test host.

Fall back to stderr for that specific late-log case so the original runtime log is still emitted without crashing the test process.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
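
The stderr fallback described in the commit message above might look roughly like the sketch below. It is an assumption-laden illustration rather than the committed code: `ITestOutputSketch` is a hypothetical stand-in for xUnit's `ITestOutputHelper` (so the example compiles without xUnit), and `SafeTestOutput` and `ThrowingOutput` are invented names. Only the late-log exception named in the commit message is caught; other exceptions propagate.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for xUnit's ITestOutputHelper.
public interface ITestOutputSketch
{
    void WriteLine(string message);
}

// Simulates a test output helper after the test context has been cleared.
public sealed class ThrowingOutput : ITestOutputSketch
{
    public void WriteLine(string message)
        => throw new InvalidOperationException("There is no currently active test.");
}

public sealed class SafeTestOutput
{
    private readonly ITestOutputSketch output;
    private readonly TextWriter fallback;

    // Pass Console.Error as the fallback to mirror the behavior described above.
    public SafeTestOutput(ITestOutputSketch output, TextWriter fallback)
    {
        this.output = output;
        this.fallback = fallback;
    }

    public void WriteLine(string message)
    {
        try
        {
            output.WriteLine(message);
        }
        catch (InvalidOperationException ex)
            when (ex.Message.Contains("There is no currently active test"))
        {
            // Late log: still emit the message, but do not let the exception escape.
            fallback.WriteLine(message);
        }
    }
}
```

Taking the fallback writer as a constructor parameter keeps the sketch testable: a `StringWriter` can be substituted for `Console.Error` to check that late messages are preserved.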
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use ClusterMembershipSnapshot.GetSiloStatus with registration membership versions when deciding whether grain directory entries are dead. This keeps shutting down and stopping silos valid until they are marked dead while still filtering old unknown or replaced silos.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Advance the previous membership snapshot after each cache cleanup pass and stamp manually cached entries with the current membership version so version-aware cache validation does not evict them immediately.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ReubenBond (Member, Author) commented

Closing as obsolete: the later stacked PR #10092 has already merged and its merge commit includes the #10090/#10091 changes, including the xUnit logger crash fix. Current main also includes the follow-up directory retry PRs #10094 and #10095.

@ReubenBond ReubenBond closed this May 12, 2026
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 12, 2026