Description
Commit hash: dcd618bcf93b60bdf19ca1b29c24c4a08d30615a
(HEAD of release/0.x, behavior should be same on HEAD of main branch)
Context:
If a node in an actors cluster is killed ungracefully (say, via SIGKILL) and its replacement is respawned immediately, the previous node is still considered up because SWIM has not yet reached consensus that it is down. In this scenario, processing the .joining MembershipChange for the replacement node triggers a call to Cluster.Membership.removeCompletely(), which immediately removes the previous node as a cluster member. However, if the cluster subsequently receives a gossip update that still lists the previous node as .up, the Cluster.Membership.mergeFrom() function creates a MembershipChange directive that adds the previous node back.
Subsequently, the replacement node gets marked .down by the downing strategy (unclear exactly how this happens) while the previous node stays in the cluster membership marked as .up. This persists after additional restarts.
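The sequence described above can be sketched with a toy model. This is not the library's implementation: `ToyMembership`, `Node`, `join(_:)`, and the address used are hypothetical stand-ins, and the merge rule is deliberately naive. Only the names removeCompletely and mergeFrom mirror the functions mentioned in this issue. The sketch assumes that a complete removal leaves no record behind, so a later merge cannot tell a stale member from a new one.

```swift
// Toy model only: hypothetical types, far simpler than the real Cluster.Membership.
enum Status { case joining, up, down }

struct Node: Hashable {
    let address: String  // host:port, reused by the replacement node
    let uid: Int         // differs for every process incarnation
}

struct ToyMembership {
    var members: [Node: Status] = [:]

    // Drops a node and keeps no record that it was ever a member.
    mutating func removeCompletely(_ node: Node) {
        members.removeValue(forKey: node)
    }

    // Handles a joining node: older incarnations on the same address are removed.
    mutating func join(_ node: Node) {
        let stale = members.keys.filter { $0.address == node.address && $0.uid != node.uid }
        for old in stale {
            removeCompletely(old)
        }
        members[node] = .joining
    }

    // Naive merge: any node in the incoming gossip that is unknown locally is added.
    mutating func mergeFrom(gossip: [Node: Status]) {
        for (node, status) in gossip where members[node] == nil {
            members[node] = status
        }
    }
}

let previous = Node(address: "10.0.0.3:7337", uid: 1)     // killed with SIGKILL
let replacement = Node(address: "10.0.0.3:7337", uid: 2)  // respawned on the same IP + port

var membership = ToyMembership()
membership.members[previous] = .up

membership.join(replacement)                   // previous node is removed completely
membership.mergeFrom(gossip: [previous: .up])  // stale gossip from a peer arrives

print(membership.members[previous] == .up)     // the previous node is a member again
```

In this model the previous node ends up back in the membership as .up next to the joining replacement, which matches the first half of the behavior reported here. The later .down marking of the replacement node is not modeled.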
Steps to reproduce:
- Run any actors cluster with 3 nodes using a static IP + port combination for each node
- kill -9 the process for one node
- Immediately restart that node using the same IP + port