Snapshot Can Become Stuck if Master Circuit-Breaks on Shard Snapshot Update Message (TransportMasterNodeAction does not retry)

Currently, the message that data nodes send to the master when they are done snapshotting a shard is not exempt from the circuit breaker.
That means that if a data node sends an `[internal:cluster/snapshot/update_snapshot_status]` message and the master drops it because of the circuit breaker, the data node will not resend it unless the master fails over.
As a result, the master never finalises the shard snapshot in the cluster state and therefore never finalises the snapshot overall, leaving it stuck until a master failover occurs.

We need to fix this by either exempting these messages from the circuit breaker or retrying them if they run into the circuit breaker.

Note: because the shard status update request is small, this is fortunately fairly unlikely to happen.

EDIT: while discussing this with @ywelsch, he pointed out that this is a general problem with `TransportMasterNodeAction`: we retry requests to the master on master failovers, but we don't retry on circuit breaker exceptions. Ideally, the fix here adds back-off retries to `TransportMasterNodeAction`.
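
To make the proposed behaviour concrete, here is a minimal, self-contained sketch of a capped exponential back-off retry. It is not Elasticsearch code and does not use any Elasticsearch API: `BreakerTrippedException`, `BackoffRetry`, `delayMillis` and `call` are hypothetical names invented for illustration, and the blocking `Thread.sleep` stands in for whatever non-blocking scheduling a real fix in `TransportMasterNodeAction` would presumably use.

```java
import java.util.concurrent.Callable;

// Hypothetical stand-in for the failure a sender sees when the master's
// circuit breaker rejects its request. Not an Elasticsearch class.
class BreakerTrippedException extends RuntimeException {
    BreakerTrippedException(String message) {
        super(message);
    }
}

// Illustrative helper: retries a request with capped exponential back-off,
// but only when it fails with BreakerTrippedException.
final class BackoffRetry {
    private BackoffRetry() {}

    // Delay before the retry that follows failed attempt number `attempt`
    // (0-based): initialMillis, doubled once per earlier attempt, capped at maxMillis.
    static long delayMillis(int attempt, long initialMillis, long maxMillis) {
        long delay = initialMillis;
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    // Runs the request up to maxAttempts times. Any other exception type
    // propagates immediately; the last breaker failure is rethrown once
    // the attempts are exhausted.
    static <T> T call(Callable<T> request, int maxAttempts, long initialMillis, long maxMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 0; ; attempt++) {
            try {
                return request.call();
            } catch (BreakerTrippedException e) {
                if (attempt + 1 >= maxAttempts) {
                    throw e;
                }
                // Assumption: blocking sleep keeps the sketch simple.
                Thread.sleep(delayMillis(attempt, initialMillis, maxMillis));
            }
        }
    }
}
```

With a helper of this shape, a dropped shard status update would be resent after a short delay instead of waiting for a master failover. The cap on the delay and the limit on attempts are design choices for the sketch, not values taken from this issue.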

Snapshot Can Become Stuck if Master Circuit-Breaks on Shard Snapshot Update Message (TransportMasterNodeAction does not retry) #54714
