Skip to content

Commit d8b43c6

Browse files
Make Snapshot Deletes Less Racy (#54765) (#55226)
Snapshot deletes should first check the cluster state for an in-progress snapshot and try to abort it before checking the repository contents. This allows for atomically checking and aborting a snapshot in the same cluster state update, removing all possible races where a snapshot that is in-progress could not be found if it finishes between checking the repository contents and the cluster state. Also removes confusing races, where checking the cluster state off of the cluster state thread finds an in-progress snapshot that is then not found in the cluster state update to abort it. Finally, the logic to use the repository generation of the in-progress snapshot + 1 was error prone because it would always fail the delete when the repository had a pending generation different from its safe generation when a snapshot started (leading to the snapshot finalizing at a higher generation). These issues (particularly that last point) can easily be reproduced by running `SLMSnapshotBlockingIntegTests` in a loop with current `master` (see #54766). The snapshot resiliency test for concurrent snapshot creation and deletion was made to more aggressively start the delete operation so that the above races would become visible. Previously, the fact that deletes would never coincide with initializing snapshots resulted in a number of the above races not reproducing. This PR is the most consistent I could get snapshot deletes without changes to the state machine. The fact that aborted deletes will not put the delete operation in the cluster state before waiting for the snapshot to abort still allows for some possible (though practically very unlikely) races. These will be fixed by a state-machine change in upcoming work in #54705 (which will have a much simpler and clearer diff after this change). Closes #54766
1 parent 70616cd commit d8b43c6

File tree

4 files changed

+198
-169
lines changed

4 files changed

+198
-169
lines changed

server/src/main/java/org/elasticsearch/action/admin/cluster/snapshots/delete/TransportDeleteSnapshotAction.java

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -53,7 +53,7 @@ public TransportDeleteSnapshotAction(TransportService transportService, ClusterS
5353

5454
@Override
5555
protected String executor() {
56-
return ThreadPool.Names.GENERIC;
56+
return ThreadPool.Names.SAME;
5757
}
5858

5959
@Override
@@ -71,6 +71,6 @@ protected ClusterBlockException checkBlock(DeleteSnapshotRequest request, Cluste
7171
protected void masterOperation(final DeleteSnapshotRequest request, ClusterState state,
7272
final ActionListener<AcknowledgedResponse> listener) {
7373
snapshotsService.deleteSnapshot(request.repository(), request.snapshot(),
74-
ActionListener.map(listener, v -> new AcknowledgedResponse(true)), false);
74+
ActionListener.map(listener, v -> new AcknowledgedResponse(true)));
7575
}
7676
}

server/src/main/java/org/elasticsearch/cluster/SnapshotDeletionsInProgress.java

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -26,6 +26,7 @@
2626
import org.elasticsearch.common.io.stream.Writeable;
2727
import org.elasticsearch.common.unit.TimeValue;
2828
import org.elasticsearch.common.xcontent.XContentBuilder;
29+
import org.elasticsearch.repositories.RepositoryData;
2930
import org.elasticsearch.repositories.RepositoryOperation;
3031
import org.elasticsearch.snapshots.Snapshot;
3132

@@ -174,6 +175,8 @@ public Entry(Snapshot snapshot, long startTime, long repositoryStateId) {
174175
this.snapshot = snapshot;
175176
this.startTime = startTime;
176177
this.repositoryStateId = repositoryStateId;
178+
assert repositoryStateId > RepositoryData.EMPTY_REPO_GEN :
179+
"Can't delete based on an empty or unknown repository generation but saw [" + repositoryStateId + "]";
177180
}
178181

179182
public Entry(StreamInput in) throws IOException {

0 commit comments

Comments
 (0)