
fix(cluster.tools): fix race condition in DistributedPubSubRestartSpec actor creation #7980

Merged
Aaronontheweb merged 3 commits into akkadotnet:dev from Aaronontheweb:claude-wt-MNTR_DistributedPubSubRestartSpecs
Dec 23, 2025


@Aaronontheweb

Summary

  • Fix race condition where the First node times out waiting for the Third node's shutdown actor
  • Move shutdown actor creation to execute before the 5-second gossip isolation delay
  • Migrate sync-over-async methods to proper async patterns

Root Cause

The test was flaky because the Third node created the shutdown actor after a 5-second ExpectNoMsgAsync delay, but the First node started polling for it immediately after TestConductor.Shutdown. On slow CI machines, the 20-second timeout would expire before the actor was created.
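
The ordering problem can be modeled with simple arithmetic. The sketch below is Python rather than the C# test code, and it is only a model: the 5-second delay and the 20-second timeout come from the description above, while the join times, the `Timeline` type, and the function names are assumptions chosen for illustration, not measurements or real APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timeline:
    """Hypothetical timings in seconds for the Third node, measured from
    the moment the First node starts polling for the shutdown actor."""
    join_done: float            # assumed: when the restarted system finishes joining
    gossip_delay: float = 5.0   # the ExpectNoMsgAsync isolation check
    poll_timeout: float = 20.0  # the First node's original polling window


def actor_ready_at(t: Timeline, create_before_delay: bool) -> float:
    """Time at which the shutdown actor exists on the Third node."""
    if create_before_delay:
        # Fixed ordering: create the actor right after the join.
        return t.join_done
    # Original ordering: the actor is created only after the delay.
    return t.join_done + t.gossip_delay


def first_node_finds_actor(t: Timeline, create_before_delay: bool) -> bool:
    """True if the actor exists before the polling window closes."""
    return actor_ready_at(t, create_before_delay) <= t.poll_timeout


# Assumed values for illustration only, not measured on any machine.
slow_ci = Timeline(join_done=17.0)
fast_dev = Timeline(join_done=3.0)

if __name__ == "__main__":
    for name, t in (("fast_dev", fast_dev), ("slow_ci", slow_ci)):
        print(name,
              "old order:", first_node_finds_actor(t, create_before_delay=False),
              "new order:", first_node_finds_actor(t, create_before_delay=True))
```

In this model the old ordering only fails when the join itself is slow, which is consistent with a test that passes locally and fails intermittently on loaded CI machines. Moving the creation ahead of the delay removes the fixed 5 seconds from the critical path rather than relying on a longer timeout alone.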

Changes

  1. Race condition fix: Create shutdown actor immediately after JoinAsync, before the gossip verification
  2. Async migration: ExpectMsg<T>() → ExpectMsgAsync<T>(), RunOn() → RunOnAsync()
  3. Timeout margins: Increased WithinAsync from 20s→25s and inner ExpectMsgAsync from 1s→2s

Test plan

  • Multi-node tests pass on CI
  • DistributedPubSubRestartSpec no longer flaky under load

…c actor creation

Move shutdown actor creation to execute immediately after the new
ActorSystem joins, before the 5-second gossip isolation verification.
This eliminates the race where First node times out waiting for Third
node to create the shutdown actor.

Changes:
- Create shutdown actor before ExpectNoMsgAsync delay on Third node
- Migrate sync-over-async methods to proper async (ExpectMsgAsync,
  RunOnAsync)
- Increase timeout margins (WithinAsync 20s→25s, ExpectMsgAsync 1s→2s)
  for better CI stability
Aaronontheweb enabled auto-merge (squash) December 23, 2025 21:20
@Aaronontheweb

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

- Use dedicated TestProbe for Identify messages in AwaitAssertAsync
  to avoid polluting TestActor mailbox with stray ActorIdentity
  responses from retry attempts
- Use dedicated TestProbes for DeltaCount queries on First and Second
  nodes with explicit 5-second timeouts
- This prevents message queue pollution from causing timeout failures
  when ExpectMsgAsync<long> was receiving messages meant for TestActor
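
The mailbox pollution described in this commit can be illustrated with a toy queue model. This is a Python sketch, not Akka.NET code: `Mailbox`, `run`, the retry count of three, and the reply value 42 are all hypothetical, and the model fails on a type mismatch where the real test reportedly failed on a timeout.

```python
from collections import deque


class Mailbox:
    """Minimal stand-in for a test actor's or probe's message queue."""

    def __init__(self):
        self._queue = deque()

    def deliver(self, msg):
        self._queue.append(msg)

    def expect(self, expected_type):
        """Pop the next message; fail if none is queued or the type is wrong."""
        if not self._queue:
            raise AssertionError("no message received")
        msg = self._queue.popleft()
        if not isinstance(msg, expected_type):
            raise AssertionError(
                f"expected {expected_type.__name__}, got {type(msg).__name__}")
        return msg


class ActorIdentity:
    """Stand-in for the reply to an Identify request."""


def run(shared: bool) -> int:
    test_actor = Mailbox()
    # With shared=True every reply lands in the one test actor mailbox;
    # with shared=False each kind of query gets its own dedicated probe.
    identify_probe = test_actor if shared else Mailbox()
    count_probe = test_actor if shared else Mailbox()

    # Assume three Identify retries, each of which eventually gets a reply.
    for _ in range(3):
        identify_probe.deliver(ActorIdentity())
    # The assertion succeeds on the first reply; two strays stay queued.
    identify_probe.expect(ActorIdentity)

    # Reply to the count query (42 is an arbitrary illustrative value).
    count_probe.deliver(42)
    return count_probe.expect(int)


if __name__ == "__main__":
    print("dedicated probes:", run(shared=False))
    try:
        run(shared=True)
    except AssertionError as err:
        print("shared mailbox:", err)
```

With dedicated probes, the stray identity replies stay in a queue that the count query never reads, so the numeric reply is the first message the count probe sees.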
Aaronontheweb merged commit bd97d56 into akkadotnet:dev Dec 23, 2025
11 checks passed
Arkatufus pushed a commit to Arkatufus/akka.net that referenced this pull request Jan 7, 2026
…c actor creation (akkadotnet#7980)

Aaronontheweb added a commit that referenced this pull request Jan 8, 2026
…c actor creation (#7980)
