Gossip re-adds nodes replaced after ungraceful shutdown #866

@budde

Commit hash: dcd618bcf93b60bdf19ca1b29c24c4a08d30615a

(HEAD of release/0.x; behavior should be the same on HEAD of the main branch)

Context:

If a node in an actors cluster is killed ungracefully (say, via SIGKILL) and its replacement is respawned immediately, the previous node is still considered up because SWIM has not yet reached consensus that it is down. In this scenario, processing the .joining MembershipChange for the replacement node calls Cluster.Membership.removeCompletely(), which immediately removes the previous node as a cluster member. However, if the cluster subsequently receives a gossip update that still lists the previous node as .up, Cluster.Membership.mergeFrom() creates a MembershipChange directive that adds the node back.
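The sequence above can be illustrated with a toy model. This is only a sketch of the suspected interaction, not the library's code: ToyNode, ToyStatus, ToyMembership, the incarnation field, and the 10.0.0.3:7337 endpoint are all hypothetical names made up for illustration, and the real Cluster.Membership logic is more involved.

```swift
// Toy model only: every type here is hypothetical and is NOT the library's API.
struct ToyNode: Hashable {
    let endpoint: String   // host:port, reused by the replacement process
    let incarnation: Int   // hypothetical stand-in for a per-process node ID
}

enum ToyStatus { case joining, up, down }

struct ToyMembership {
    var members: [ToyNode: ToyStatus] = [:]

    // A replacement joins: any other node on the same endpoint is dropped
    // entirely, leaving no record that it was ever replaced.
    mutating func join(_ node: ToyNode) {
        members = members.filter { $0.key.endpoint != node.endpoint || $0.key == node }
        members[node] = .joining
    }

    // Naive merge: any node in the incoming gossip that is unknown locally
    // is added with whatever status the gossip carries.
    mutating func mergeFrom(_ incoming: [ToyNode: ToyStatus]) {
        for (node, status) in incoming where members[node] == nil {
            members[node] = status
        }
    }
}

let previous = ToyNode(endpoint: "10.0.0.3:7337", incarnation: 1)
let replacement = ToyNode(endpoint: "10.0.0.3:7337", incarnation: 2)

var membership = ToyMembership()
membership.members[previous] = .up

// The replacement joins before the previous node has been declared down.
membership.join(replacement)
let removedAfterJoin = membership.members[previous] == nil

// A peer that has not yet seen the replacement gossips its stale view.
membership.mergeFrom([previous: .up])
let readdedAfterGossip = membership.members[previous] != nil
```

In this model the previous node is gone right after the join and present again after the merge, because nothing remembers that it was removed.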

Subsequently, the replacement node gets marked .down by the downing strategy (it is unclear exactly how this happens), while the previous node stays in the cluster membership marked as .up. This state persists after additional restarts.

Steps to reproduce:

  1. Run any actors cluster with 3 nodes using a static IP + port combination for each node
  2. kill -9 the process for one node
  3. Immediately restart that node using the same IP + port
