
Fix race condition in multi-producer sharding delivery test #7975

Merged
Aaronontheweb merged 3 commits into akkadotnet:dev from Aaronontheweb:claude-wt-ReliableDelivery_with_Sharding_must_illustrate_Sharding_usage_with_several_producers
Dec 23, 2025

@Aaronontheweb (Member)

Summary

Fixes a race condition in the test ReliableDelivery_with_Sharding_must_illustrate_Sharding_usage_with_several_producers that was causing intermittent failures.

Root Cause

When multiple producers send messages to the same sharded entities:

  • Each producer-entity pair maintains independent sequence numbers (1-42)
  • Producer IDs are formatted as {producerId}-{entityId} (e.g., p1-entity-0, p2-entity-0)
  • The end condition seqNr >= 42 would trigger when any producer reached seqNr 42
  • This caused entities to stop immediately, before other producers could complete delivery
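The sketch below, a plain C# model rather than Akka.NET code, enumerates the deliveries implied by the bullets above. It assumes two producers (p1, p2) and three entities (entity-0 to entity-2), with every producer-entity pair numbered 1-42; `Delivery` and `DeliveryModel` are hypothetical names. Counting the deliveries at one entity that satisfy seqNr >= 42 shows that the condition fires once per producer, not once per entity.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one delivered message: which producer-entity
// stream it belongs to and its sequence number within that stream.
public record Delivery(string ProducerId, string EntityId, long SeqNr);

public static class DeliveryModel
{
    // Builds every delivery for the assumed layout: each producer sends
    // seqNr 1..lastSeqNr to each entity, and the combined ProducerId is
    // formatted as {producerId}-{entityId}, e.g. "p1-entity-0".
    public static List<Delivery> AllDeliveries(
        IEnumerable<string> producers, IEnumerable<string> entities, long lastSeqNr)
    {
        var result = new List<Delivery>();
        foreach (var p in producers)
            foreach (var e in entities)
                for (long seqNr = 1; seqNr <= lastSeqNr; seqNr++)
                    result.Add(new Delivery($"{p}-{e}", e, seqNr));
        return result;
    }

    // Number of deliveries arriving at one entity that satisfy the original
    // end condition (seqNr >= endSeqNr): one per producer sending to it.
    public static int EndConditionHits(List<Delivery> deliveries, string entityId, long endSeqNr) =>
        deliveries.Count(d => d.EntityId == entityId && d.SeqNr >= endSeqNr);
}
```

For the assumed layout, `EndConditionHits(deliveries, "entity-0", 42)` returns 2, so whichever producer's seqNr 42 reaches entity-0 first is enough to end it.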

The Race

  1. Producer1 sends messages 1-42 to entity-0 (producerId: p1-entity-0)
  2. Producer2 sends messages 1-42 to entity-0 (producerId: p2-entity-0)
  3. Whichever producer completes first triggers the end condition
  4. Entity-0 stops and sends Collected message with only one producer ID
  5. The other producer's messages are lost (sent to a terminated actor)
  6. The test expects 6 producer IDs across 3 entities but receives incomplete data and times out
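A minimal deterministic model of the race, assuming the simplest interleaving in which p1-entity-0 delivers all 42 messages before p2-entity-0 delivers any. `OriginalEntityModel` and `RaceDemo` are hypothetical stand-ins: the real TestConsumer is an actor, and the "dropped" counter here corresponds to messages sent to the terminated actor.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical model of the ORIGINAL end condition: the entity stops as soon
// as any delivery has seqNr >= endSeqNr, reporting the producer IDs it has seen.
public sealed class OriginalEntityModel
{
    private readonly long _endSeqNr;
    private readonly HashSet<string> _seenProducers = new();

    public OriginalEntityModel(long endSeqNr) => _endSeqNr = endSeqNr;

    public bool Stopped { get; private set; }
    public int Dropped { get; private set; }              // deliveries after the stop
    public IReadOnlyCollection<string> Collected => _seenProducers;

    public void Deliver(string producerId, long seqNr)
    {
        if (Stopped) { Dropped++; return; }               // would be a dead letter in the real system
        _seenProducers.Add(producerId);
        if (seqNr >= _endSeqNr) Stopped = true;           // fires for whichever producer gets there first
    }
}

public static class RaceDemo
{
    // Assumed interleaving at entity-0: p1 finishes before p2 starts.
    public static OriginalEntityModel Run()
    {
        var entity = new OriginalEntityModel(endSeqNr: 42);
        for (long s = 1; s <= 42; s++) entity.Deliver("p1-entity-0", s);
        for (long s = 1; s <= 42; s++) entity.Deliver("p2-entity-0", s);
        return entity;
    }
}
```

`RaceDemo.Run()` ends with `Stopped == true`, `Collected` containing only p1-entity-0, and all 42 of p2-entity-0's deliveries counted as dropped, which matches steps 4 and 5 above.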

Solution

Modified TestConsumer to track per-producer completion:

  • Added expectedProducerCount parameter (defaults to 1 for backward compatibility)
  • Track which producers have met the end condition in _completedProducers set
  • Only stop the entity when all expected producers have completed
  • Updated the multi-producer test to pass expectedProducerCount: 2
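The per-producer tracking above can be modeled the same way. This is a sketch of the idea, not the actual TestConsumer code: `FixedEntityModel` is a hypothetical name, while `expectedProducerCount` (defaulting to 1) and `_completedProducers` mirror the names used in this PR.

```csharp
using System.Collections.Generic;

// Hypothetical model of the FIXED end condition: each producer that reaches
// the end seqNr is recorded, and the entity stops only once
// expectedProducerCount distinct producers have done so.
public sealed class FixedEntityModel
{
    private readonly long _endSeqNr;
    private readonly int _expectedProducerCount;
    private readonly HashSet<string> _seenProducers = new();
    private readonly HashSet<string> _completedProducers = new();

    // Defaulting to 1 keeps single-producer callers behaving as before.
    public FixedEntityModel(long endSeqNr, int expectedProducerCount = 1)
    {
        _endSeqNr = endSeqNr;
        _expectedProducerCount = expectedProducerCount;
    }

    public bool Stopped { get; private set; }
    public int Dropped { get; private set; }
    public IReadOnlyCollection<string> Collected => _seenProducers;

    public void Deliver(string producerId, long seqNr)
    {
        if (Stopped) { Dropped++; return; }
        _seenProducers.Add(producerId);
        if (seqNr >= _endSeqNr)
        {
            _completedProducers.Add(producerId);          // the set ignores repeats
            if (_completedProducers.Count >= _expectedProducerCount)
                Stopped = true;                           // the real actor would reply Collected and stop here
        }
    }
}
```

Replaying the race ordering (all of p1-entity-0, then all of p2-entity-0) with `expectedProducerCount: 2`, the model keeps running after p1-entity-0 reaches 42, stops only after p2-entity-0 does, and drops nothing. With the default of 1 it stops on the first producer to finish, as the single-producer tests expect.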

Changes

src/core/Akka.Tests/Delivery/TestConsumer.cs

  • Added _expectedProducerCount and _completedProducers fields
  • Modified end condition handler to wait for all producers before stopping
  • Updated constructor and factory methods to accept expectedProducerCount

src/contrib/cluster/Akka.Cluster.Sharding.Tests/Delivery/ReliableDeliveryShardingSpec.cs

  • Updated multi-producer test to pass expectedProducerCount: 2 on line 91

Test Plan

  • Test ran successfully 20 consecutive times without failures
  • All 8 tests in ReliableDeliveryShardingSpec pass
  • No regressions in test suite
  • Backward compatibility maintained (single-producer tests unaffected)

Impact

This is a test-only fix with no production code changes. The fix ensures that multi-producer reliable delivery tests properly validate the scenario they're designed to test, rather than failing due to race conditions in the test infrastructure.

The test ReliableDelivery_with_Sharding_must_illustrate_Sharding_usage_with_several_producers
was experiencing intermittent failures due to a premature entity termination race condition.

Root cause: When multiple producers send messages to the same sharded entities, each
producer-entity pair maintains independent sequence numbers (1-42). The test's end condition
(seqNr >= 42) would trigger when ANY producer reached seqNr 42, causing the entity to stop
immediately, before other producers could complete their message delivery.

The test expects 3 Collected messages (one per entity) that together contain all 6
producer IDs (p1-entity-{0,1,2} and p2-entity-{0,1,2}), but each entity stopped after
receiving the full sequence from only the first producer to complete, so the test timed out.
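The expectation in the previous paragraph can be phrased as a set check. The sketch below is hypothetical (`CollectedCheck` is not part of the test suite), models each Collected reply as just its set of producer IDs, and assumes .NET 5 or later for `IReadOnlySet<T>`. It passes only when there is one reply per entity and their union equals p1-entity-{0,1,2} and p2-entity-{0,1,2}.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class CollectedCheck
{
    // Expected IDs in the {producerId}-{entityId} format, e.g. "p2-entity-1".
    public static IReadOnlySet<string> ExpectedProducerIds(
        IEnumerable<string> producers, int entityCount) =>
        producers
            .SelectMany(p => Enumerable.Range(0, entityCount).Select(i => $"{p}-entity-{i}"))
            .ToHashSet();

    // Each element of `replies` is the producer-ID set from one Collected reply.
    // True only for one reply per entity whose combined IDs match exactly.
    public static bool AllProducersCollected(
        IReadOnlyList<IReadOnlySet<string>> replies, IReadOnlySet<string> expected, int entityCount) =>
        replies.Count == entityCount
        && replies.SelectMany(r => r).ToHashSet().SetEquals(expected);
}
```

A run in which an entity stopped early yields a reply missing one of its producers' IDs, so the union falls short of the 6 expected IDs and the check fails.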

Solution: Modified TestConsumer to track per-producer completion:
- Added expectedProducerCount parameter (defaults to 1 for backward compatibility)
- Track which producers have met the end condition in _completedProducers set
- Only stop the entity when ALL expected producers have completed
- Updated multi-producer test to pass expectedProducerCount: 2

This ensures entities properly wait for all producers to complete message delivery before
terminating, eliminating the race condition without relying on timeout adjustments.

Verified with 20 consecutive successful test runs and full test suite validation.

Review excerpt from src/core/Akka.Tests/Delivery/TestConsumer.cs (diff context):
EndReplyTo.Tell(new Collected(_processed.Select(c => c.Item1).ToImmutableHashSet(), _messageCount + 1));
Context.Stop(Self);
// Track that this producer has completed
if (!_completedProducers.Contains(job.ProducerId))

@Aaronontheweb (Member, Author)

This is the key fix: completions need to be tracked individually per producer when more than one producer is messaging the same TestConsumer.
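A possible simplification of the tracking in the excerpt, offered as a sketch rather than what the PR merged: `HashSet<T>.Add` already returns false for an element that is present, so the `Contains` check and the insertion can be one call. `CompletionTracker` and `MarkCompleted` are hypothetical names.

```csharp
using System.Collections.Generic;

// Hypothetical helper: the Contains-then-Add pattern collapsed into a single
// HashSet<T>.Add call, which returns false when the producer is already recorded.
public sealed class CompletionTracker
{
    private readonly HashSet<string> _completedProducers = new();
    private readonly int _expectedProducerCount;

    public CompletionTracker(int expectedProducerCount = 1) =>
        _expectedProducerCount = expectedProducerCount;

    // Returns true exactly once: on the call that brings the number of
    // distinct completed producers up to the expected count.
    public bool MarkCompleted(string producerId) =>
        _completedProducers.Add(producerId)
        && _completedProducers.Count == _expectedProducerCount;
}
```

With `new CompletionTracker(2)`, `MarkCompleted` returns false for p1-entity-0 (including any repeat) and true for the first p2-entity-0 call, which is the point where the entity would reply with Collected and stop.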

Aaronontheweb enabled auto-merge (squash) on December 22, 2025 19:07
Aaronontheweb merged commit 1e51fc4 into akkadotnet:dev on Dec 23, 2025
11 checks passed
Aaronontheweb deleted the claude-wt-ReliableDelivery_with_Sharding_must_illustrate_Sharding_usage_with_several_producers branch on December 23, 2025 20:14
Arkatufus pushed a commit to Arkatufus/akka.net that referenced this pull request Jan 7, 2026
…et#7975)

Aaronontheweb added a commit that referenced this pull request Jan 8, 2026


1 participant