Make Snapshot Deletes Less Racy #54765
Conversation
Snapshot deletes should first check the cluster state for an in-progress snapshot and try to abort it before checking the repository contents. This allows for atomically checking and aborting a snapshot in the same cluster state update, removing all possible races where an in-progress snapshot could not be found because it finished between checking the repository contents and the cluster state. It also removes confusing races where checking the cluster state off of the cluster state thread finds an in-progress snapshot that is then no longer found by the cluster state update that tries to abort it.

Finally, the logic of using the repository generation of the in-progress snapshot + 1 was error-prone: it would always fail the delete when, at the time the snapshot started, the repository had a pending generation different from its safe generation (leading to the snapshot finalizing at a higher generation).

These issues (particularly the last point) were shaken out by removing workarounds from SLM tests that retried snapshot delete operations on the repository exceptions thrown when hitting the unexpected-generation issue. The snapshot resiliency test for concurrent snapshot creation and deletion was also made to start the delete operation more aggressively so that the above races become visible. Previously, the fact that deletes never coincided with initializing snapshots meant that a number of the above races did not reproduce.
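To make the atomic check-and-abort concrete, here is a minimal sketch, assuming a hypothetical `findEntryByName` helper and the surrounding `clusterService`, `snapshotName`, and `listener` variables; it is illustrative only and not the merged code (which, among other things, also rewrites the per-shard statuses when aborting):

```java
// Illustrative only: look up and abort an in-progress snapshot within one
// ClusterStateUpdateTask, so the lookup and the abort cannot be interleaved with the
// snapshot finishing. findEntryByName is a hypothetical helper.
clusterService.submitStateUpdateTask("delete snapshot [" + snapshotName + "]",
    new ClusterStateUpdateTask(Priority.NORMAL) {

        @Override
        public ClusterState execute(ClusterState currentState) {
            SnapshotsInProgress snapshots = currentState.custom(SnapshotsInProgress.TYPE);
            SnapshotsInProgress.Entry snapshotEntry =
                snapshots == null ? null : findEntryByName(snapshots, snapshotName);
            if (snapshotEntry == null) {
                // no snapshot by that name is running: the repository contents are
                // inspected later, outside of this cluster state update
                return currentState;
            }
            // abort the running snapshot atomically in this same update
            return ClusterState.builder(currentState).putCustom(SnapshotsInProgress.TYPE,
                new SnapshotsInProgress(new SnapshotsInProgress.Entry(
                    snapshotEntry, SnapshotsInProgress.State.ABORTED,
                    snapshotEntry.shards(), snapshotEntry.failure()))).build();
        }

        @Override
        public void onFailure(String source, Exception e) {
            listener.onFailure(e);
        }
    });
```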
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
```diff
 @Override
 protected String executor() {
-    return ThreadPool.Names.GENERIC;
+    return ThreadPool.Names.SAME;
```
No need to fork to the generic pool here anymore; we instantly move to the cluster state thread in the snapshots service now. We'll use the generic pool there once we have to inspect the repository data.
Jenkins run elasticsearch-ci/2 (unrelated x-pack failure)
```java
this.snapshot = snapshot;
this.startTime = startTime;
this.repositoryStateId = repositoryStateId;
assert repositoryStateId > RepositoryData.EMPTY_REPO_GEN :
```
We would incorrectly get a -1 (empty repo gen) here when we deleted/aborted an initializing snapshot (because the repo gen is at -2 during the snapshot INIT stage). This wouldn't have caused any corruption and only made deletes fail, but still, this assertion would have caught this race much sooner :)
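For context, the sentinel generations involved look roughly like this (constant names and values as I recall them from `RepositoryData`; the arithmetic simply restates the comment above):

```java
// Sentinel repository generations defined in RepositoryData (shown for context):
public static final long EMPTY_REPO_GEN = -1L;    // repository holds no index-N blob yet
public static final long UNKNOWN_REPO_GEN = -2L;  // generation not resolved yet, e.g. while a snapshot is still in INIT

// With the old "+ 1" logic, deleting an INIT-stage snapshot fed UNKNOWN_REPO_GEN + 1 == -1
// (i.e. EMPTY_REPO_GEN) into the delete, which is exactly what the new assertion rejects.
```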
```diff
-// Derive repository generation if a snapshot is in progress because it will increment the generation when it finishes
-repoGenId = matchedInProgress.get().repositoryStateId() + 1L;
 public void deleteSnapshot(final String repositoryName, final String snapshotName, final ActionListener<Void> listener) {
+    logger.info("deleting snapshot [{}] from repository [{}]", snapshotName, repositoryName);
```
should we also log an info-level message once we have successfully completed the deletion?
Yup, added one for the abort and one for the actual delete if it happens right away.
```java
SnapshotsInProgress snapshots = currentState.custom(SnapshotsInProgress.TYPE);
SnapshotsInProgress.Entry snapshotEntry = null;
if (snapshots != null) {
    for (SnapshotsInProgress.Entry entry : snapshots.entries()) {
```
maybe add this as a helper method to SnapshotsInProgress, similar to `public Entry snapshot(final Snapshot snapshot)`
Sure, extracted that logic :)
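For illustration, the extracted helper could look something like the sketch below; the method name and exact shape are hypothetical and shown only to make the suggestion concrete:

```java
// Hypothetical shape of the extracted lookup (name and placement may differ in the PR):
// find an in-progress entry by snapshot name, mirroring the existing
// "public Entry snapshot(final Snapshot snapshot)" accessor on SnapshotsInProgress.
@Nullable
public Entry findByName(final String snapshotName) {
    for (Entry entry : entries()) {
        if (entry.snapshot().getSnapshotId().getName().equals(snapshotName)) {
            return entry;
        }
    }
    return null;
}
```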
```java
    failure = snapshotEntry.failure();
}
return ClusterState.builder(currentState).putCustom(SnapshotsInProgress.TYPE,
    new SnapshotsInProgress(new SnapshotsInProgress.Entry(snapshotEntry, State.ABORTED, shards, failure))).build();
```
I know that this was already the case before, but if we were to allow concurrent snapshots, this line here would be silently removing other snapshots. Let's make sure to preserve other snapshots so that we don't get nasty surprises in the future.
Sure, this logic looks somewhat different in the concurrent deletes branch already but at least this keeps the diff smaller logically later on :)
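A rough sketch of the reviewer's suggestion, preserving unrelated entries when rebuilding `SnapshotsInProgress` (constructor shapes are from memory and may differ slightly from the code at the time):

```java
// Sketch only: rebuild SnapshotsInProgress from all existing entries, replacing just the
// aborted one instead of constructing the custom from a single entry.
final List<SnapshotsInProgress.Entry> updatedEntries = new ArrayList<>();
for (SnapshotsInProgress.Entry entry : snapshots.entries()) {
    if (entry.equals(snapshotEntry)) {
        updatedEntries.add(new SnapshotsInProgress.Entry(entry, State.ABORTED, shards, failure));
    } else {
        updatedEntries.add(entry); // keep unrelated snapshots untouched
    }
}
return ClusterState.builder(currentState)
    .putCustom(SnapshotsInProgress.TYPE, new SnapshotsInProgress(updatedEntries))
    .build();
```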
```diff
-    deleteSnapshot(new Snapshot(repositoryName, matchedEntry.get()), listener, repoGenId, immediatePriority);
-}, listener::onFailure));

+private void tryDeleteExisting(Priority priority) {
```
why is this method defined at the level of the preceding cluster state update task? I think I would prefer explicitly passing in the listener and moving it up
I mainly put it here to make it perfectly clear that we're only running this after trying to abort. Also, we have the logic around the `runningSnapshot` field here, so we can still log the "Waited for snapshot ..." warning. To me it just gets more confusing if we pull this up to the top level because, annoyingly enough, it's still somewhat connected logically to what happened during the CS update.
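A structural sketch of the shape being discussed, illustrative only and not the PR's exact code; `clusterService`, `snapshotName`, and `listener` are assumed from the surrounding method:

```java
// Illustrative sketch only: the delete/waiting logic lives on the abort task itself, so it
// provably runs only after the abort attempt and can use state computed during that update.
clusterService.submitStateUpdateTask("delete snapshot [" + snapshotName + "]", new ClusterStateUpdateTask() {

    // set in execute() if an in-progress snapshot was found and aborted
    private SnapshotsInProgress.Entry runningSnapshot;

    @Override
    public ClusterState execute(ClusterState currentState) {
        // look up the in-progress snapshot, abort it and assign runningSnapshot (omitted here)
        return currentState;
    }

    @Override
    public void onFailure(String source, Exception e) {
        listener.onFailure(e);
    }

    @Override
    public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
        if (runningSnapshot == null) {
            // nothing was aborted: go straight to checking the repository contents
            tryDeleteExisting(Priority.NORMAL);
        } else {
            // wait for the aborted snapshot to finish before deleting; the
            // "Waited for snapshot ..." warning can still be logged from here
        }
    }

    private void tryDeleteExisting(Priority priority) {
        // resolve the snapshot from the repository data and delete it at the given priority
    }
});
```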
```diff
     }
-    return clusterStateBuilder.build();
+    // add the snapshot deletion to the cluster state
+    return ClusterState.builder(currentState).putCustom(SnapshotDeletionsInProgress.TYPE,
```
same here as before, let's preserve other deletions
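For illustration, preserving queued deletions might look like the sketch below; `newInstance`, `withAddedEntry`, and the `Entry` constructor are written from memory and may not exactly match the code at the time:

```java
// Sketch only: append the new delete to whatever deletions already exist instead of
// replacing the whole SnapshotDeletionsInProgress custom.
SnapshotDeletionsInProgress deletions = currentState.custom(SnapshotDeletionsInProgress.TYPE);
SnapshotDeletionsInProgress.Entry newDelete =
    new SnapshotDeletionsInProgress.Entry(snapshot, startTime, repositoryStateId);
SnapshotDeletionsInProgress updated = deletions == null
    ? SnapshotDeletionsInProgress.newInstance(newDelete)  // first delete in the cluster state
    : deletions.withAddedEntry(newDelete);                // preserve deletions already queued
return ClusterState.builder(currentState)
    .putCustom(SnapshotDeletionsInProgress.TYPE, updated)
    .build();
```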
Thanks @ywelsch, all addressed I think.
ywelsch left a comment:
LGTM
Thanks Yannick!
This TODO became fixable with #54765
Snapshot deletes should first check the cluster state for an in-progress snapshot
and try to abort it before checking the repository contents. This allows for atomically
checking and aborting a snapshot in the same cluster state update, removing all possible
races where a snapshot that is in-progress could not be found if it finishes between
checking the repository contents and the cluster state.
Also removes confusing races, where checking the cluster state off of the cluster state thread
finds an in-progress snapshot that is then not found in the cluster state update to abort it.
Finally, the logic to use the repository generation of the in-progress snapshot + 1 was error
prone because it would always fail the delete when the repository had a pending generation different from its safe generation when a snapshot started (leading to the snapshot finalizing at a
higher generation).
These issues (particularly that last point) can easily be reproduced by running
`SLMSnapshotBlockingIntegTests` in a loop with current `master` (see #54766).

The snapshot resiliency test for concurrent snapshot creation and deletion was made to more
aggressively start the delete operation so that the above races would become visible.
Previously, the fact that deletes would never coincide with initializing snapshots resulted
in a number of the above races not reproducing.
This PR is the most consistent I could get snapshot deletes without changes to the state machine. The fact that aborted deletes will not put the delete operation in the cluster state before waiting for the snapshot to abort still allows for some possible (though practically very unlikely) races. These will be fixed by a state-machine change in upcoming work in #54705 (which will have a much simpler and clearer diff after this change).
Closes #54766