Description
Currently, the message that data nodes send to master when they are done snapshotting a shard is not exempt from the circuit breaker.
That means that if a data node sends an [internal:cluster/snapshot/update_snapshot_status] message and the master drops it because of the circuit breaker, the data node will not resend it unless the master fails over.
This in turn leads to the master never finalising the shard snapshot in the cluster state and thus never finalising the snapshot overall, making it stuck until a master failover occurs.
We need to fix this by either exempting these messages from the circuit breaker or retrying them if they run into the circuit breaker.
Note: fortunately, due to the small size of the shard status update request, this is fairly unlikely to happen.
EDIT: while discussing this with @ywelsch, he pointed out that this is a general problem with TransportMasterNodeAction. We retry requests to the master on master fail-overs, but we don't retry on circuit breaker exceptions. So ideally we need a fix here that adds back-off retries to TransportMasterNodeAction.
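
To make the back-off retry idea concrete, here is a minimal, self-contained sketch in plain Java. It is not the TransportMasterNodeAction code and does not use any Elasticsearch API: `BackoffRetrySketch`, `BreakerTrippedException`, `backoffDelayMillis` and `sendWithRetry` are all hypothetical names invented for illustration, and the blocking `Thread.sleep` only stands in for whatever scheduling a real fix would use (a real implementation would presumably retry asynchronously rather than block a thread).

```java
import java.util.concurrent.Callable;

// Hypothetical illustration only; none of these names exist in Elasticsearch.
final class BackoffRetrySketch {

    /** Stand-in for the failure a sender sees when the receiver's circuit breaker rejects a request. */
    public static final class BreakerTrippedException extends RuntimeException {
        public BreakerTrippedException(String message) {
            super(message);
        }
    }

    /** Delay before retry number {@code attempt} (0-based): base * 2^attempt, capped at maxDelayMillis. */
    public static long backoffDelayMillis(int attempt, long baseDelayMillis, long maxDelayMillis) {
        long delay = baseDelayMillis;
        for (int i = 0; i < attempt; i++) {
            if (delay > maxDelayMillis / 2) {
                return maxDelayMillis; // doubling would exceed the cap (and could overflow)
            }
            delay *= 2;
        }
        return Math.min(delay, maxDelayMillis);
    }

    /**
     * Calls {@code request} up to {@code maxAttempts} times, sleeping with exponential
     * back-off between attempts, but only when the failure is a breaker rejection.
     * Any other exception propagates immediately.
     */
    public static <T> T sendWithRetry(Callable<T> request, int maxAttempts,
                                      long baseDelayMillis, long maxDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        BreakerTrippedException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return request.call();
            } catch (BreakerTrippedException e) {
                last = e;
                if (attempt + 1 < maxAttempts) {
                    // Blocking sleep keeps the sketch simple; see the caveat in the lead-in.
                    Thread.sleep(backoffDelayMillis(attempt, baseDelayMillis, maxDelayMillis));
                }
            }
        }
        throw last; // every attempt was rejected by the breaker
    }
}
```

With this shape, a shard status update that is rejected once or twice by the breaker would still reach the master on a later attempt instead of being lost until the next fail-over. Capping the delay and the number of attempts keeps a persistently overloaded master from being hammered indefinitely.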