Fix PAUSED shard deletion blocking QUEUED promotion by ywangd · Pull Request #142637 · elastic/elasticsearch

ywangd · 2026-02-18T05:39:56Z

When a snapshot with a PAUSED_FOR_NODE_REMOVAL shard is deleted, the abort previously transitioned it directly to FAILED (#141408). This bypassed the normal state propagation that promotes QUEUED shards, allowing a subsequently created snapshot to incorrectly receive INIT instead of QUEUED for the same shard, violating the ordering invariant.

Change abort to transition PAUSED_FOR_NODE_REMOVAL to ABORTED so that new snapshots correctly get QUEUED. The data node detects the PAUSED local status on an ABORTED entry and reports FAILED to the master, which triggers QUEUED promotion through the existing state propagation.

Relates #141408

When a snapshot with a PAUSED_FOR_NODE_REMOVAL shard is deleted, the abort previously transitioned it directly to FAILED. This bypassed the normal state propagation that promotes QUEUED shards, allowing a subsequently created snapshot to incorrectly receive INIT instead of QUEUED for the same shard, violating the ordering invariant. Change abort to transition PAUSED_FOR_NODE_REMOVAL to ABORTED (an active state) so that new snapshots correctly get QUEUED. The data node detects the PAUSED local status on an ABORTED entry and reports FAILED to the master, which triggers QUEUED promotion through the normal SHARD_STATE_EXECUTOR path. Co-authored-by: Cursor <cursoragent@cursor.com>

elasticsearchmachine · 2026-02-18T05:40:22Z

Pinging @elastic/es-distributed (Team:Distributed)

ywangd · 2026-02-18T05:58:08Z

@DaveCTurner I labelled this as >non-issue because previous change #141408 is not yet released. Both v9.2.6 and v9.3.1 will have build candidate tomorrow (2026-02-19). So this PR can be a non-issue if we make it in time.

I pondered on the fix and at the end decided to let data node handle the ABORTED status even when it is already PAUSED. It is not the most efficient. But I think it is overall the simplest option and robust. Other alternatives considered are:

In SnapshotsInProgress.abort, use SnapshotsService.createAndSubmitRequestToUpdateSnapshotState to send master update tasks directly. This was my initial thinking. But the updates can be lost if the master fails over and there is nothing to retry it.
When creating a new snapshot, put the shard in INIT if there is already a QUEUED shard. Basically augumenting this if check. Once the deletion completes, it will kick off the next queued shard with removeSnapshotDeletionFromClusterState. I feel a bit uneasy with this option since it sorta alters some assumption of the states. Such change is needed for creating new snapshot or clone but not when propagate existing entries. Overall I am a bit concerned that it might introduce new bugs.

ywangd · 2026-02-18T23:08:22Z

@DaveCTurner sorry to ping you again. There is still chance to get this into the upcoming releases if you are OK with the changes. Thanks a lot!

DaveCTurner

Agh sorry I missed this ping and I have some concerns.

When a shard is in state PAUSED_FOR_NODE_REMOVAL it is known not to be writing to the repository, but the ABORTED state loses this fact and requires the data node to re-confirm that it is not working on the shard which isn't really necessary. Moving it from PAUSED_FOR_NODE_REMOVAL to FAILED allowing the snapshot to complete was a deliberate decision.

I also worry that in a mixed-version cluster a newer-version master node could perform this strange transition to which an older-version data node wouldn't react properly - on an older version, it wouldn't send the move-to-FAILED notification back and we would be stuck again.

The ideal fix here would be for the abort CS update itself to move the next QUEUED shard to INIT, requiring some changes to SnapshotDeletionStartBatcher.Executor#abortRunningSnapshots. I know we're kinda splitting this up into two updates so we can re-use the ABORTED -> FAILED handling in SnapshotTaskExecutor but this isn't really what ABORTED is for. If we were to come along and fix this later we'd almost certainly forget that some versions misuse ABORTED in this way which would be a much trickier bug to track down.

I'd rather spend a little longer getting this right. It's going to be quite rare to hit this I think, and although the resulting snapshot will be stuck with a shard in state QUEUED, it will still be abortable at least.

DaveCTurner

Wait, sorry, I see that we only added the PAUSED_FOR_NODE_REMOVAL -> FAILED transition in #141408. Prior to that we were already doing PAUSED -> ABORTED, we just weren't handling it right on the data node.

Ok, let's go with this then. I still don't think PAUSED -> ABORTED is correct, but fundamentally this is just fixing the data-node handling of that pre-existing transition.

ywangd · 2026-02-19T04:03:36Z

I see that we only added the PAUSED_FOR_NODE_REMOVAL -> FAILED transition in #141408. Prior to that we were already doing PAUSED -> ABORTED, we just weren't handling it right on the data node.

Yes exactly.

The ideal fix here would be for the abort CS update itself to move the next QUEUED shard to INIT

Yeah I agree this is the long term fix. We need to consolidate the state propagation code and reuse a single implementation in all places. I didn't list it as an alternative since it is a rather big change. And based on my understanding of previous slack conversation, we are ok to defer it and not tie it to this fix. Again, I do agree we should work on the consolidation as a general improvement.

It's going to be quite rare to hit

Indeed unlikely in production. But it makes running the stress test in a loop less useful since it runs into this issue from time to time which prevents discovering other potential issues.

Thanks a lot for the review! 🙏

…142637) When a snapshot with a PAUSED_FOR_NODE_REMOVAL shard is deleted, the abort previously transitioned it directly to FAILED. This bypassed the normal state propagation that promotes QUEUED shards, allowing a subsequently created snapshot to incorrectly receive INIT instead of QUEUED for the same shard, violating the ordering invariant. Change abort to transition PAUSED_FOR_NODE_REMOVAL to ABORTED (an active state) so that new snapshots correctly get QUEUED. The data node detects the PAUSED local status on an ABORTED entry and reports FAILED to the master, which triggers QUEUED promotion through the existing state propagation. (cherry picked from commit 3393f3a)

ywangd · 2026-02-19T04:05:39Z

💚 All backports created successfully

Status	Branch	Result
✅	9.3
✅	9.2

Questions ?

Please refer to the Backport tool documentation

…142637) When a snapshot with a PAUSED_FOR_NODE_REMOVAL shard is deleted, the abort previously transitioned it directly to FAILED. This bypassed the normal state propagation that promotes QUEUED shards, allowing a subsequently created snapshot to incorrectly receive INIT instead of QUEUED for the same shard, violating the ordering invariant. Change abort to transition PAUSED_FOR_NODE_REMOVAL to ABORTED (an active state) so that new snapshots correctly get QUEUED. The data node detects the PAUSED local status on an ABORTED entry and reports FAILED to the master, which triggers QUEUED promotion through the existing state propagation. (cherry picked from commit 3393f3a)

…42637) (#142673) * Fix PAUSED_FOR_NODE_REMOVAL shard blocking QUEUED promotion (#142637) When a snapshot with a PAUSED_FOR_NODE_REMOVAL shard is deleted, the abort previously transitioned it directly to FAILED. This bypassed the normal state propagation that promotes QUEUED shards, allowing a subsequently created snapshot to incorrectly receive INIT instead of QUEUED for the same shard, violating the ordering invariant. Change abort to transition PAUSED_FOR_NODE_REMOVAL to ABORTED (an active state) so that new snapshots correctly get QUEUED. The data node detects the PAUSED local status on an ABORTED entry and reports FAILED to the master, which triggers QUEUED promotion through the existing state propagation. (cherry picked from commit 3393f3a) * fix imports

…42637) (#142672) * Fix PAUSED_FOR_NODE_REMOVAL shard blocking QUEUED promotion (#142637) When a snapshot with a PAUSED_FOR_NODE_REMOVAL shard is deleted, the abort previously transitioned it directly to FAILED. This bypassed the normal state propagation that promotes QUEUED shards, allowing a subsequently created snapshot to incorrectly receive INIT instead of QUEUED for the same shard, violating the ordering invariant. Change abort to transition PAUSED_FOR_NODE_REMOVAL to ABORTED (an active state) so that new snapshots correctly get QUEUED. The data node detects the PAUSED local status on an ABORTED entry and reports FAILED to the master, which triggers QUEUED promotion through the existing state propagation. (cherry picked from commit 3393f3a) * fix imports

…on-sliced-reindex * upstream/main: (120 commits) [Fleet] Add OpAMP field mappings to fleet-agents (elastic#142550) Clarify `expectedSize` behaviour of `ReleasableBytesStreamOutput` (elastic#142451) Refactor KnnIndexTester to tidy up some options (elastic#142651) Fixed with elastic#142638 already (elastic#142655) Change *OverTimeTests to extend AbstractAggregationTestCase (elastic#142659) Fix byteRefBlockHashSize for release mode (elastic#142668) Mute org.elasticsearch.xpack.esql.tree.EsqlNodeSubclassTests testTransform {class org.elasticsearch.xpack.esql.plan.logical.MMR} elastic#142674 Fix PAUSED_FOR_NODE_REMOVAL shard blocking QUEUED promotion (elastic#142637) Mute org.elasticsearch.xpack.logsdb.RandomizedRollingUpgradeIT testIndexingStandardSource elastic#142670 Revert "[ESQL] Introduce pluggable external datasource framework (elastic#141678) (elastic#142663) Mute org.elasticsearch.xpack.esql.spatial.SpatialPushDownGeoShapeIT testQuantizedXY elastic#141234 PromQL: infer start/end from query DSL filters (elastic#142580) Add GPU vector indexing monitoring to _xpack/usage (elastic#141932) Fix testTrackerClearShutdown: use non-zero startTimeMillis for DONE status (elastic#142646) Mute org.elasticsearch.xpack.esql.qa.single_node.GenerativeIT test elastic#142426 ESQL_ Move time_zone to GA (elastic#142287) Mute org.elasticsearch.xpack.esql.qa.multi_node.GenerativeIT test elastic#142426 DOCS: Convert Painless diagrams to mermaid (elastic#141851) ES|QL: fix validation in generative tests (elastic#142638) Unmute tests that do not reproduce failures (elastic#141712) ...

ywangd requested a review from DaveCTurner February 18, 2026 05:39

ywangd added >non-issue :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v9.3.1 v9.4.0 v9.2.6 labels Feb 18, 2026

elasticsearchmachine added the Team:Distributed Meta label for distributed team. label Feb 18, 2026

ywangd changed the title ~~Fix PAUSED_FOR_NODE_REMOVAL shard blocking QUEUED promotion~~ Fix PAUSED shard deletion blocking QUEUED promotion Feb 18, 2026

DaveCTurner reviewed Feb 19, 2026

View reviewed changes

DaveCTurner approved these changes Feb 19, 2026

View reviewed changes

ywangd merged commit 3393f3a into elastic:main Feb 19, 2026
35 checks passed

ywangd mentioned this pull request Feb 19, 2026

[9.3] Fix PAUSED_FOR_NODE_REMOVAL shard blocking QUEUED promotion (#142637) #142672

Merged

ywangd mentioned this pull request Feb 19, 2026

[9.2] Fix PAUSED_FOR_NODE_REMOVAL shard blocking QUEUED promotion (#142637) #142673

Merged

joshua-adams-1 mentioned this pull request Feb 23, 2026

[CI] SnapshotShutdownIT class failing #142805

Closed

ywangd mentioned this pull request Feb 24, 2026

Fix testAbortSnapshotWhileRemovingNode #142852

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix PAUSED shard deletion blocking QUEUED promotion#142637

Fix PAUSED shard deletion blocking QUEUED promotion#142637
ywangd merged 1 commit intoelastic:mainfrom
ywangd:propagate-state-after-paused-snapshot-deletion

ywangd commented Feb 18, 2026

Uh oh!

elasticsearchmachine commented Feb 18, 2026

Uh oh!

ywangd commented Feb 18, 2026

Uh oh!

ywangd commented Feb 18, 2026

Uh oh!

DaveCTurner left a comment

Uh oh!

DaveCTurner left a comment

Uh oh!

ywangd commented Feb 19, 2026

Uh oh!

Uh oh!

ywangd commented Feb 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ywangd commented Feb 18, 2026

Uh oh!

elasticsearchmachine commented Feb 18, 2026

Uh oh!

ywangd commented Feb 18, 2026

Uh oh!

ywangd commented Feb 18, 2026

Uh oh!

DaveCTurner left a comment

Choose a reason for hiding this comment

Uh oh!

DaveCTurner left a comment

Choose a reason for hiding this comment

Uh oh!

ywangd commented Feb 19, 2026

Uh oh!

Uh oh!

ywangd commented Feb 19, 2026

💚 All backports created successfully

Questions ?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants