Merged
7 changes: 5 additions & 2 deletions src/Orleans.Journaling/StateMachineManager.cs
@@ -303,9 +303,12 @@ private async Task RecoverAsync(CancellationToken cancellationToken)
}
}

-        foreach (var stateMachine in _stateMachines.Values)
+        lock (_lock)
         {
-            stateMachine.OnRecoveryCompleted();
+            foreach (var stateMachine in _stateMachines.Values)

Member

This could exhibit the same issue if OnRecoveryCompleted is able to register another state machine.

Contributor Author

True, enumerating over a copy would be the safer choice. With the lock, at least it can no longer occur from concurrency (which comes as a surprise to users).
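To illustrate the "enumerate over a copy" idea, here is a minimal sketch. `SketchManager`, `ISketchStateMachine`, `RegisteringStateMachine`, and `NoOpStateMachine` are hypothetical stand-ins, not the Orleans.Journaling types, and the real `StateMachineManager` may be structured differently. Because a C# `lock` is reentrant on the owning thread, a callback that registers a state machine passes straight through the lock, so holding the lock alone does not protect the enumeration from same-thread modification; a snapshot does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a journaled state machine.
interface ISketchStateMachine
{
    void OnRecoveryCompleted();
}

// Hypothetical stand-in for the manager; not the actual Orleans type.
sealed class SketchManager
{
    private readonly object _lock = new();
    private readonly Dictionary<string, ISketchStateMachine> _stateMachines = new();

    public int Count
    {
        get { lock (_lock) { return _stateMachines.Count; } }
    }

    public void RegisterStateMachine(string name, ISketchStateMachine stateMachine)
    {
        lock (_lock)
        {
            _stateMachines.Add(name, stateMachine);
        }
    }

    public void NotifyRecoveryCompleted()
    {
        ISketchStateMachine[] snapshot;
        lock (_lock)
        {
            // Copy the values while holding the lock.
            snapshot = _stateMachines.Values.ToArray();
        }

        // Enumerate the copy, so a callback may register without
        // invalidating the enumerator.
        foreach (var stateMachine in snapshot)
        {
            stateMachine.OnRecoveryCompleted();
        }
    }
}

// Registers another state machine from inside the callback.
sealed class RegisteringStateMachine : ISketchStateMachine
{
    private readonly SketchManager _manager;
    public RegisteringStateMachine(SketchManager manager) => _manager = manager;

    public void OnRecoveryCompleted() =>
        _manager.RegisterStateMachine("child", new NoOpStateMachine());
}

sealed class NoOpStateMachine : ISketchStateMachine
{
    public void OnRecoveryCompleted() { }
}

static class Program
{
    static void Main()
    {
        var manager = new SketchManager();
        manager.RegisterStateMachine("parent", new RegisteringStateMachine(manager));
        manager.NotifyRecoveryCompleted(); // does not throw
        Console.WriteLine(manager.Count);  // prints 2
    }
}
```

One trade-off in this sketch: a state machine registered during the notification pass is not in the snapshot, so it is not notified in that pass.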

Member

Do you know where the concurrency is coming from?

Contributor Author

The recovery process itself is done safely, but as new activations are created, so are the state machine instances, as part of the grain's ctor. Each of them registers itself with the state machine manager as its ctor runs. That happens outside the manager's work loop, and it can modify the collection of state machines while it is being enumerated to notify all state machines that recovery has completed. The two operations run concurrently on different threads.

Contributor Author

Locking around the recovery notification means concurrent attempts by state machines to register themselves have to wait until all SMs are notified. That is, RegisterStateMachine(name, sm) correctly takes the lock, but that protects nothing if the enumeration does not hold the same lock (which it now does in this PR).
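The blocking behavior described here can be sketched in isolation. Everything in this example (`Gate`, `Items`, `Register`) is a hypothetical illustration and not the PR's code; the `Thread.Sleep` only gives the second thread time to attempt its registration while the enumeration is still in progress.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static class Program
{
    // Hypothetical shared state guarded by a single lock object.
    static readonly object Gate = new();
    static readonly Dictionary<string, int> Items = new() { ["a"] = 1, ["b"] = 2 };

    static void Register(string name, int value)
    {
        lock (Gate)
        {
            Items.Add(name, value);
        }
    }

    static void Main()
    {
        using var enumerating = new ManualResetEventSlim();
        Task writer;
        var seen = 0;

        lock (Gate)
        {
            writer = Task.Run(() =>
            {
                enumerating.Wait();
                // Blocks here until the enumerating thread releases Gate.
                Register("c", 3);
            });

            foreach (var value in Items.Values)
            {
                enumerating.Set();
                Thread.Sleep(50); // let the writer attempt its registration
                seen++;
            }
        }

        // The lock is released, so the pending registration can proceed.
        writer.Wait();
        Console.WriteLine($"{seen} {Items.Count}"); // prints "2 3"
    }
}
```

The enumeration sees only the two original entries, and the third is added after the lock is released. Without the `lock` around the `foreach`, the same program could fail with the "Collection was modified" exception.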

Contributor Author

That's how I see it, hope it makes sense!

Member

StateMachineManager is supposed to be created per-activation, though, so all of this should be happening on the same thread. Do you have a repro somewhere that I could look at?

Contributor Author

You are right, how did I miss that! Yeah, it happens sometimes in the "automatic job scenario":

https://github.com/ledjon-behluli/DurableStateMachines/blob/main/playground/DurableStateMachines.CTS/Program.cs#L49

 Orleans.Journaling.StateMachineManager[2114651837]
      Error processing work items.
      System.InvalidOperationException: Collection was modified; enumeration operation may not execute.
         at System.Collections.Generic.Dictionary`2.ValueCollection.Enumerator.MoveNext()
         at Orleans.Journaling.StateMachineManager.RecoverAsync(CancellationToken cancellationToken) in /_/src/Orleans.Journaling/StateMachineManager.cs:line 306
         at Orleans.Journaling.StateMachineManager.WorkLoop() in /_/src/Orleans.Journaling/StateMachineManager.cs:line 104
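
For reference, the same exception type can be reproduced on a single thread with a plain `Dictionary`, which fits the reentrancy explanation raised earlier in this thread. This is a generic sketch with a hypothetical `machines` dictionary, not a repro of the linked program:

```csharp
using System;
using System.Collections.Generic;

static class Program
{
    static void Main()
    {
        // Hypothetical stand-in for the manager's state machine dictionary.
        var machines = new Dictionary<string, string> { ["a"] = "first" };

        try
        {
            foreach (var value in machines.Values)
            {
                // Adding a new key mid-enumeration invalidates the enumerator;
                // the next MoveNext call throws.
                machines["b"] = "second";
            }
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine(ex.GetType().Name); // prints "InvalidOperationException"
        }
    }
}
```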

Contributor Author

I did not look too closely into that, as I was under the assumption that the SM manager was per silo, so I got fooled by the lack of a lock.

Member

The lock is good anyway.

+            {
+                stateMachine.OnRecoveryCompleted();
+            }
         }
     }
