Make Snapshot Deletes Less Racy (#54765) #55226

original-brownbear · 2020-04-15T11:20:59Z

Snapshot deletes should first check the cluster state for an in-progress snapshot
and try to abort it before checking the repository contents. This allows for atomically
checking and aborting a snapshot in the same cluster state update, removing all possible
races where a snapshot that is in-progress could not be found if it finishes between
checking the repository contents and the cluster state.
Also removes confusing races, where checking the cluster state off of the cluster state thread
finds an in-progress snapshot that is then not found in the cluster state update to abort it.
Finally, the logic to use the repository generation of the in-progress snapshot + 1 was error
prone because it would always fail the delete when the repository had a pending generation different from its safe generation when a snapshot started (leading to the snapshot finalizing at a
higher generation).

These issues (particularly that last point) can easily be reproduced by running SLMSnapshotBlockingIntegTests in a loop with current master (see #54766).

The snapshot resiliency test for concurrent snapshot creation and deletion was made to more
aggressively start the delete operation so that the above races would become visible.
Previously, the fact that deletes would never coincide with initializing snapshots resulted
in a number of the above races not reproducing.

This PR is the most consistent I could get snapshot deletes without changes to the state machine. The fact that aborted deletes will not put the delete operation in the cluster state before waiting for the snapshot to abort still allows for some possible (though practically very unlikely) races. These will be fixed by a state-machine change in upcoming work in #54705 (which will have a much simpler and clearer diff after this change).

Closes #54766

backport of #54765

Snapshot deletes should first check the cluster state for an in-progress snapshot and try to abort it before checking the repository contents. This allows for atomically checking and aborting a snapshot in the same cluster state update, removing all possible races where a snapshot that is in-progress could not be found if it finishes between checking the repository contents and the cluster state. Also removes confusing races, where checking the cluster state off of the cluster state thread finds an in-progress snapshot that is then not found in the cluster state update to abort it. Finally, the logic to use the repository generation of the in-progress snapshot + 1 was error prone because it would always fail the delete when the repository had a pending generation different from its safe generation when a snapshot started (leading to the snapshot finalizing at a higher generation). These issues (particularly that last point) can easily be reproduced by running `SLMSnapshotBlockingIntegTests` in a loop with current `master` (see elastic#54766). The snapshot resiliency test for concurrent snapshot creation and deletion was made to more aggressively start the delete operation so that the above races would become visible. Previously, the fact that deletes would never coincide with initializing snapshots resulted in a number of the above races not reproducing. This PR is the most consistent I could get snapshot deletes without changes to the state machine. The fact that aborted deletes will not put the delete operation in the cluster state before waiting for the snapshot to abort still allows for some possible (though practically very unlikely) races. These will be fixed by a state-machine change in upcoming work in elastic#54705 (which will have a much simpler and clearer diff after this change). Closes elastic#54766

elasticmachine · 2020-04-15T11:21:01Z

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

original-brownbear added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs backport labels Apr 15, 2020

Merge remote-tracking branch 'elastic/7.x' into 54765-7.x

6c248b7

original-brownbear merged commit d8b43c6 into elastic:7.x Apr 15, 2020

original-brownbear deleted the 54765-7.x branch April 15, 2020 12:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Make Snapshot Deletes Less Racy (#54765) #55226

Make Snapshot Deletes Less Racy (#54765) #55226

Uh oh!

original-brownbear commented Apr 15, 2020

Uh oh!

elasticmachine commented Apr 15, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Make Snapshot Deletes Less Racy (#54765) #55226

Make Snapshot Deletes Less Racy (#54765) #55226

Uh oh!

Conversation

original-brownbear commented Apr 15, 2020

Uh oh!

elasticmachine commented Apr 15, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants