Make Snapshot Deletes Less Racy #54765
Conversation
Snapshot deletes should first check the cluster state for an in-progress snapshot and try to abort it before checking the repository contents. This allows for atomically checking and aborting a snapshot in the same cluster state update, removing all possible races where an in-progress snapshot could not be found because it finished between checking the repository contents and the cluster state. It also removes confusing races where checking the cluster state off of the cluster state thread finds an in-progress snapshot that is then no longer found by the cluster state update that tries to abort it.

Finally, the logic of using the repository generation of the in-progress snapshot + 1 was error-prone: it would always fail the delete when, at the time the snapshot started, the repository had a pending generation different from its safe generation (leading to the snapshot finalizing at a higher generation).

These issues (particularly the last point) were shaken out by removing workarounds from SLM tests that retried snapshot delete operations on the repository exceptions thrown when hitting the unexpected-generation issue. The snapshot resiliency test for concurrent snapshot creation and deletion was also made to start the delete operation more aggressively so that the above races become visible. Previously, the fact that deletes never coincided with initializing snapshots meant that a number of the above races did not reproduce.
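To make the atomic check-and-abort concrete, here is a minimal sketch, assuming a hypothetical `findEntryByName` helper and the surrounding `clusterService`, `snapshotName`, and `listener` variables; it is illustrative only and not the merged code (which, among other things, also rewrites the per-shard statuses when aborting):

```java
// Illustrative only: look up and abort an in-progress snapshot within one
// ClusterStateUpdateTask, so the lookup and the abort cannot be interleaved with the
// snapshot finishing. findEntryByName is a hypothetical helper.
clusterService.submitStateUpdateTask("delete snapshot [" + snapshotName + "]",
    new ClusterStateUpdateTask(Priority.NORMAL) {

        @Override
        public ClusterState execute(ClusterState currentState) {
            SnapshotsInProgress snapshots = currentState.custom(SnapshotsInProgress.TYPE);
            SnapshotsInProgress.Entry snapshotEntry =
                snapshots == null ? null : findEntryByName(snapshots, snapshotName);
            if (snapshotEntry == null) {
                // no snapshot by that name is running: the repository contents are
                // inspected later, outside of this cluster state update
                return currentState;
            }
            // abort the running snapshot atomically in this same update
            return ClusterState.builder(currentState).putCustom(SnapshotsInProgress.TYPE,
                new SnapshotsInProgress(new SnapshotsInProgress.Entry(
                    snapshotEntry, SnapshotsInProgress.State.ABORTED,
                    snapshotEntry.shards(), snapshotEntry.failure()))).build();
        }

        @Override
        public void onFailure(String source, Exception e) {
            listener.onFailure(e);
        }
    });
```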
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
```diff
 @Override
 protected String executor() {
-    return ThreadPool.Names.GENERIC;
+    return ThreadPool.Names.SAME;
```
No need to fork to the generic pool here anymore; we instantly move to the cluster state thread in the snapshots service now. We'll use the generic pool there once we have to inspect the repository data.
Jenkins run elasticsearch-ci/2 (unrelated x-pack failure)
```java
this.snapshot = snapshot;
this.startTime = startTime;
this.repositoryStateId = repositoryStateId;
assert repositoryStateId > RepositoryData.EMPTY_REPO_GEN :
```
We would incorrectly get a -1 (empty repo gen) here when we deleted/aborted an initializing snapshot (because the repo gen is at -2 during the snapshot INIT stage). This wouldn't have caused any corruption and only made deletes fail, but still, this assertion would have caught this race much sooner :)
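For context, the sentinel generations involved look roughly like this (constant names and values as I recall them from `RepositoryData`; the arithmetic simply restates the comment above):

```java
// Sentinel repository generations defined in RepositoryData (shown for context):
public static final long EMPTY_REPO_GEN = -1L;    // repository holds no index-N blob yet
public static final long UNKNOWN_REPO_GEN = -2L;  // generation not resolved yet, e.g. while a snapshot is still in INIT

// With the old "+ 1" logic, deleting an INIT-stage snapshot fed UNKNOWN_REPO_GEN + 1 == -1
// (i.e. EMPTY_REPO_GEN) into the delete, which is exactly what the new assertion rejects.
```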
```diff
-// Derive repository generation if a snapshot is in progress because it will increment the generation when it finishes
-repoGenId = matchedInProgress.get().repositoryStateId() + 1L;
 public void deleteSnapshot(final String repositoryName, final String snapshotName, final ActionListener<Void> listener) {
+    logger.info("deleting snapshot [{}] from repository [{}]", snapshotName, repositoryName);
```
should we also log an info-level message once we have successfully completed the deletion?
Yup, added one for the abort and one for the actual delete if it happens right away.
```java
SnapshotsInProgress snapshots = currentState.custom(SnapshotsInProgress.TYPE);
SnapshotsInProgress.Entry snapshotEntry = null;
if (snapshots != null) {
    for (SnapshotsInProgress.Entry entry : snapshots.entries()) {
```
maybe add this as a helper method to SnapshotsInProgress, similar to `public Entry snapshot(final Snapshot snapshot)`
Sure, extracted that logic :)
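For illustration, the extracted helper could look something like the sketch below; the method name and exact shape are hypothetical and shown only to make the suggestion concrete:

```java
// Hypothetical shape of the extracted lookup (name and placement may differ in the PR):
// find an in-progress entry by snapshot name, mirroring the existing
// "public Entry snapshot(final Snapshot snapshot)" accessor on SnapshotsInProgress.
@Nullable
public Entry findByName(final String snapshotName) {
    for (Entry entry : entries()) {
        if (entry.snapshot().getSnapshotId().getName().equals(snapshotName)) {
            return entry;
        }
    }
    return null;
}
```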
```java
    failure = snapshotEntry.failure();
}
return ClusterState.builder(currentState).putCustom(SnapshotsInProgress.TYPE,
    new SnapshotsInProgress(new SnapshotsInProgress.Entry(snapshotEntry, State.ABORTED, shards, failure))).build();
```
I know that this was already the case before, but if we were to allow concurrent snapshots, this line here would be silently removing other snapshots. Let's make sure to preserve other snapshots so that we don't get nasty surprises in the future.
Sure, this logic looks somewhat different in the concurrent deletes branch already but at least this keeps the diff smaller logically later on :)
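A rough sketch of the reviewer's suggestion, preserving unrelated entries when rebuilding `SnapshotsInProgress` (constructor shapes are from memory and may differ slightly from the code at the time):

```java
// Sketch only: rebuild SnapshotsInProgress from all existing entries, replacing just the
// aborted one instead of constructing the custom from a single entry.
final List<SnapshotsInProgress.Entry> updatedEntries = new ArrayList<>();
for (SnapshotsInProgress.Entry entry : snapshots.entries()) {
    if (entry.equals(snapshotEntry)) {
        updatedEntries.add(new SnapshotsInProgress.Entry(entry, State.ABORTED, shards, failure));
    } else {
        updatedEntries.add(entry); // keep unrelated snapshots untouched
    }
}
return ClusterState.builder(currentState)
    .putCustom(SnapshotsInProgress.TYPE, new SnapshotsInProgress(updatedEntries))
    .build();
```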
```diff
-    deleteSnapshot(new Snapshot(repositoryName, matchedEntry.get()), listener, repoGenId, immediatePriority);
-}, listener::onFailure));

+private void tryDeleteExisting(Priority priority) {
```
why is this method defined at the level of the preceding cluster state update task? I think I would prefer explicitly passing in the listener and moving it up
I mainly put it here to make it perfectly clear that we're only running this after trying to abort. Also, we have the logic around the `runningSnapshot` field here, so we can still log the "Waited for snapshot ..." warning. To me it just gets more confusing if we pull this up to the top level because, annoyingly enough, it's still somewhat connected logically to what happened during the CS update.
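A structural sketch of the shape being discussed, illustrative only and not the PR's exact code; `clusterService`, `snapshotName`, and `listener` are assumed from the surrounding method:

```java
// Illustrative sketch only: the delete/waiting logic lives on the abort task itself, so it
// provably runs only after the abort attempt and can use state computed during that update.
clusterService.submitStateUpdateTask("delete snapshot [" + snapshotName + "]", new ClusterStateUpdateTask() {

    // set in execute() if an in-progress snapshot was found and aborted
    private SnapshotsInProgress.Entry runningSnapshot;

    @Override
    public ClusterState execute(ClusterState currentState) {
        // look up the in-progress snapshot, abort it and assign runningSnapshot (omitted here)
        return currentState;
    }

    @Override
    public void onFailure(String source, Exception e) {
        listener.onFailure(e);
    }

    @Override
    public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
        if (runningSnapshot == null) {
            // nothing was aborted: go straight to checking the repository contents
            tryDeleteExisting(Priority.NORMAL);
        } else {
            // wait for the aborted snapshot to finish before deleting; the
            // "Waited for snapshot ..." warning can still be logged from here
        }
    }

    private void tryDeleteExisting(Priority priority) {
        // resolve the snapshot from the repository data and delete it at the given priority
    }
});
```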
```diff
     }
-    return clusterStateBuilder.build();
+    // add the snapshot deletion to the cluster state
+    return ClusterState.builder(currentState).putCustom(SnapshotDeletionsInProgress.TYPE,
```
same here as before, let's preserve other deletions
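For illustration, preserving queued deletions might look like the sketch below; `newInstance`, `withAddedEntry`, and the `Entry` constructor are written from memory and may not exactly match the code at the time:

```java
// Sketch only: append the new delete to whatever deletions already exist instead of
// replacing the whole SnapshotDeletionsInProgress custom.
SnapshotDeletionsInProgress deletions = currentState.custom(SnapshotDeletionsInProgress.TYPE);
SnapshotDeletionsInProgress.Entry newDelete =
    new SnapshotDeletionsInProgress.Entry(snapshot, startTime, repositoryStateId);
SnapshotDeletionsInProgress updated = deletions == null
    ? SnapshotDeletionsInProgress.newInstance(newDelete)  // first delete in the cluster state
    : deletions.withAddedEntry(newDelete);                // preserve deletions already queued
return ClusterState.builder(currentState)
    .putCustom(SnapshotDeletionsInProgress.TYPE, updated)
    .build();
```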
Thanks @ywelsch, all addressed I think.
ywelsch left a comment:
LGTM
Thanks Yannick!
This TODO became fixable with #54765
Snapshot deletes should first check the cluster state for an in-progress snapshot
and try to abort it before checking the repository contents. This allows for atomically
checking and aborting a snapshot in the same cluster state update, removing all possible
races where a snapshot that is in-progress could not be found if it finishes between
checking the repository contents and the cluster state.
Also removes confusing races, where checking the cluster state off of the cluster state thread
finds an in-progress snapshot that is then not found in the cluster state update to abort it.
Finally, the logic to use the repository generation of the in-progress snapshot + 1 was error
prone because it would always fail the delete when the repository had a pending generation different from its safe generation when a snapshot started (leading to the snapshot finalizing at a
higher generation).
These issues (particularly that last point) can easily be reproduced by running
`SLMSnapshotBlockingIntegTests` in a loop with current `master` (see #54766).

The snapshot resiliency test for concurrent snapshot creation and deletion was made to more
aggressively start the delete operation so that the above races would become visible.
Previously, the fact that deletes would never coincide with initializing snapshots resulted
in a number of the above races not reproducing.
This PR is the most consistent I could get snapshot deletes without changes to the state machine. The fact that aborted deletes will not put the delete operation in the cluster state before waiting for the snapshot to abort still allows for some possible (though practically very unlikely) races. These will be fixed by a state-machine change in upcoming work in #54705 (which will have a much simpler and clearer diff after this change).
Closes #54766