Snapshot Can Become Stuck if Master Circuit-Breaks on Shard Snapshot Update Message (TransportMasterNodeAction does not retry) #54714

@original-brownbear

Currently, the message that data nodes send to master when they are done snapshotting a shard is not exempt from the circuit breaker.
That means that if a data node sends an [internal:cluster/snapshot/update_snapshot_status] message and the master drops it because of the circuit breaker, the data node will not resend it unless the master fails over.
This in turn leads to the master never finalising the shard snapshot in the cluster state and thus never finalising the snapshot overall, making it stuck until a master failover occurs.

We need to fix this by either exempting these messages from the circuit breaker or retrying them if they run into the circuit breaker.

Note: fortunately, due to the small size of the shard status update request, this is fairly unlikely to happen.

EDIT: I was just discussing this with @ywelsch, and he pointed out that this is a general problem with TransportMasterNodeAction. We retry requests to the master on master fail-overs, but we don't retry on circuit breaker exceptions. So ideally we need a fix here that adds back-off retries to TransportMasterNodeAction.
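
To make the proposed behaviour concrete, here is a minimal, self-contained sketch of retrying with exponential back-off when a request is rejected by a circuit breaker. It is not the actual TransportMasterNodeAction code: `BackoffRetrySketch`, `BreakerTrippedException`, `sendWithRetry` and its parameters are hypothetical names made up for illustration, and the blocking `Thread.sleep` stands in for whatever scheduling a real fix would presumably use.

```java
import java.util.concurrent.Callable;

public class BackoffRetrySketch {

    /** Hypothetical stand-in for the exception seen when the master's circuit breaker rejects a request. */
    static class BreakerTrippedException extends RuntimeException {
        BreakerTrippedException(String message) {
            super(message);
        }
    }

    /**
     * Runs the request and retries it only when the circuit breaker rejected it,
     * doubling the wait between attempts. Any other exception propagates immediately.
     */
    static <T> T sendWithRetry(Callable<T> request, int maxAttempts, long initialDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return request.call();
            } catch (BreakerTrippedException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the rejection to the caller
                }
                Thread.sleep(delay); // simplification: a real implementation would not block a thread
                delay *= 2;          // exponential back-off
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated master that rejects the first two shard status updates, then accepts.
        String result = sendWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new BreakerTrippedException("data too large");
            }
            return "acked";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts"); // prints "acked after 3 attempts"
    }
}
```

The retry is bounded (`maxAttempts`) so a persistently overloaded master still produces a visible failure rather than an endless loop; whether the real fix should be bounded or keep retrying until the master changes is an open design question for this issue.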
