
Fix PAUSED shard deletion blocking QUEUED promotion #142637

Merged
ywangd merged 1 commit into elastic:main from ywangd:propagate-state-after-paused-snapshot-deletion
Feb 19, 2026

Conversation

Member

@ywangd ywangd commented Feb 18, 2026

When a snapshot with a PAUSED_FOR_NODE_REMOVAL shard is deleted, the abort previously transitioned it directly to FAILED (#141408). This bypassed the normal state propagation that promotes QUEUED shards, allowing a subsequently created snapshot to incorrectly receive INIT instead of QUEUED for the same shard, violating the ordering invariant.

Change abort to transition PAUSED_FOR_NODE_REMOVAL to ABORTED so that new snapshots correctly get QUEUED. The data node detects the PAUSED local status on an ABORTED entry and reports FAILED to the master, which triggers QUEUED promotion through the existing state propagation.

Relates #141408
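The broken versus fixed flow can be sketched with a toy model. All class and method names below are hypothetical illustrations of the description above, not the actual SnapshotsInProgress API:

```java
// Toy model of the shard-state transitions involved in this fix.
// Names and logic are a simplified illustration, NOT the real
// SnapshotsInProgress API.
enum ShardState {
    INIT, QUEUED, PAUSED_FOR_NODE_REMOVAL, ABORTED, FAILED;

    boolean active() {
        // FAILED is terminal; ABORTED still counts as in-progress.
        return this != FAILED;
    }
}

class AbortModel {
    // Before this fix (behaviour introduced in #141408): abort moved PAUSED
    // shards straight to terminal FAILED, skipping the propagation step
    // that promotes QUEUED shards.
    static ShardState abortBefore(ShardState s) {
        return s == ShardState.PAUSED_FOR_NODE_REMOVAL ? ShardState.FAILED : ShardState.ABORTED;
    }

    // After the fix: abort yields ABORTED (an active state); the data node
    // later reports FAILED, which runs the normal promotion.
    static ShardState abortAfter(ShardState s) {
        return ShardState.ABORTED;
    }

    // Initial state of the same shard in a newly created snapshot: it must
    // be QUEUED while an earlier snapshot still holds the shard.
    static ShardState initialStateForNewSnapshot(ShardState existing) {
        return existing.active() ? ShardState.QUEUED : ShardState.INIT;
    }
}
```

In this model, the old transition leaves the PAUSED shard in terminal FAILED without the promotion step ever running, so a newly created snapshot sees no active holder and incorrectly starts at INIT; with the fix the shard stays in the active ABORTED state and the new snapshot correctly queues.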

Co-authored-by: Cursor <cursoragent@cursor.com>
@ywangd ywangd requested a review from DaveCTurner February 18, 2026 05:39
@ywangd ywangd added the >non-issue, :Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs), v9.3.1, v9.4.0, and v9.2.6 labels Feb 18, 2026
@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Meta label for distributed team) label Feb 18, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@ywangd ywangd changed the title Fix PAUSED_FOR_NODE_REMOVAL shard blocking QUEUED promotion Fix PAUSED shard deletion blocking QUEUED promotion Feb 18, 2026
Member Author

ywangd commented Feb 18, 2026

@DaveCTurner I labelled this as >non-issue because the previous change #141408 is not yet released. Both v9.2.6 and v9.3.1 will have a build candidate tomorrow (2026-02-19), so this PR can be a non-issue if we make it in time.

I pondered the fix and in the end decided to let the data node handle the ABORTED status even when the shard is already PAUSED. It is not the most efficient, but I think it is overall the simplest and most robust option. Other alternatives considered were:

  1. In SnapshotsInProgress.abort, use SnapshotsService.createAndSubmitRequestToUpdateSnapshotState to send master update tasks directly. This was my initial thinking, but the updates can be lost if the master fails over, and there is nothing to retry them.
  2. When creating a new snapshot, put the shard in INIT if there is already a QUEUED shard, basically augmenting this if check. Once the deletion completes, it will kick off the next queued shard with removeSnapshotDeletionFromClusterState. I feel a bit uneasy with this option since it sorta alters some assumptions about the states: such a change is needed when creating a new snapshot or clone, but not when propagating existing entries. Overall I am a bit concerned that it might introduce new bugs.
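For context, the report-driven promotion that the chosen fix funnels everything through might look roughly like this (a hypothetical sketch, not the real SHARD_STATE_EXECUTOR code):

```java
import java.util.List;

// Toy sketch of the QUEUED-promotion step the chosen fix relies on.
// entries.get(i) is the state of ONE shard across snapshot entries,
// ordered oldest first. Hypothetical names throughout.
class PromotionModel {
    enum State { INIT, QUEUED, ABORTED, FAILED }

    // When the data node reports an ABORTED shard as FAILED, the master
    // finalizes it and starts the next snapshot queued on that shard.
    static void onDataNodeReportsFailed(List<State> entries, int aborted) {
        entries.set(aborted, State.FAILED);
        for (int i = aborted + 1; i < entries.size(); i++) {
            if (entries.get(i) == State.QUEUED) {
                entries.set(i, State.INIT); // promote the next waiter
                break;
            }
        }
    }
}
```

Unlike option 1 above, keeping the promotion inside this regular report-driven cluster-state update means a master failover does not lose it: the data node simply reports again to the new master.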

Member Author

ywangd commented Feb 18, 2026

@DaveCTurner sorry to ping you again. There is still a chance to get this into the upcoming releases if you are OK with the changes. Thanks a lot!

Member

@DaveCTurner DaveCTurner left a comment


Agh sorry I missed this ping and I have some concerns.

When a shard is in state PAUSED_FOR_NODE_REMOVAL it is known not to be writing to the repository, but the ABORTED state loses this fact and requires the data node to re-confirm that it is not working on the shard, which isn't really necessary. Moving it from PAUSED_FOR_NODE_REMOVAL to FAILED, allowing the snapshot to complete, was a deliberate decision.

I also worry that in a mixed-version cluster a newer-version master node could perform this strange transition to which an older-version data node wouldn't react properly - on an older version, it wouldn't send the move-to-FAILED notification back and we would be stuck again.

The ideal fix here would be for the abort CS update itself to move the next QUEUED shard to INIT, requiring some changes to SnapshotDeletionStartBatcher.Executor#abortRunningSnapshots. I know we're kinda splitting this up into two updates so we can re-use the ABORTED -> FAILED handling in SnapshotTaskExecutor but this isn't really what ABORTED is for. If we were to come along and fix this later we'd almost certainly forget that some versions misuse ABORTED in this way which would be a much trickier bug to track down.

I'd rather spend a little longer getting this right. It's going to be quite rare to hit this I think, and although the resulting snapshot will be stuck with a shard in state QUEUED, it will still be abortable at least.

Member

@DaveCTurner DaveCTurner left a comment


Wait, sorry, I see that we only added the PAUSED_FOR_NODE_REMOVAL -> FAILED transition in #141408. Prior to that we were already doing PAUSED -> ABORTED, we just weren't handling it right on the data node.

Ok, let's go with this then. I still don't think PAUSED -> ABORTED is correct, but fundamentally this is just fixing the data-node handling of that pre-existing transition.
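That pre-existing data-node handling can be sketched as follows (hypothetical names and a deliberately simplified model of the real data-node snapshot service):

```java
// Toy sketch of the data-node side of the fix. On observing an ABORTED
// entry in the cluster state, the node checks the shard's local stage
// and, if it is merely PAUSED (no repository writes in flight),
// immediately notifies the master that the shard is FAILED.
// Hypothetical names; not the actual Elasticsearch classes.
class DataNodeModel {
    enum LocalStage { RUNNING, PAUSED, DONE }

    /** Status to report to the master for an ABORTED entry, or null for none. */
    static String onMasterEntryAborted(LocalStage local) {
        switch (local) {
            case PAUSED:
                return "FAILED"; // nothing running locally; report right away
            case RUNNING:
                return "FAILED"; // reported once the in-flight work is cancelled
            default:
                return null;     // already finished; final status was already sent
        }
    }
}
```

The bug fixed by #141408 and refined here was, in terms of this model, the PAUSED branch: before, an already-paused shard did not react to the ABORTED entry at all, so the master never received the FAILED report that drives promotion.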

Member Author

ywangd commented Feb 19, 2026

I see that we only added the PAUSED_FOR_NODE_REMOVAL -> FAILED transition in #141408. Prior to that we were already doing PAUSED -> ABORTED, we just weren't handling it right on the data node.

Yes exactly.

The ideal fix here would be for the abort CS update itself to move the next QUEUED shard to INIT

Yeah, I agree this is the long-term fix. We need to consolidate the state-propagation code and reuse a single implementation in all places. I didn't list it as an alternative since it is a rather big change, and based on my understanding of the previous Slack conversation, we are OK deferring it rather than tying it to this fix. Again, I do agree we should work on the consolidation as a general improvement.

It's going to be quite rare to hit

Indeed, it is unlikely in production. But it makes running the stress test in a loop less useful, since the test runs into this issue from time to time, which prevents discovering other potential issues.

Thanks a lot for the review! 🙏

@ywangd ywangd merged commit 3393f3a into elastic:main Feb 19, 2026
35 checks passed
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Feb 19, 2026
…142637)

(cherry picked from commit 3393f3a)
Member Author

ywangd commented Feb 19, 2026

💚 All backports created successfully

Status Branch Result
9.3
9.2

Questions ?

Please refer to the Backport tool documentation

ywangd added a commit to ywangd/elasticsearch that referenced this pull request Feb 19, 2026
…142637)

(cherry picked from commit 3393f3a)
elasticsearchmachine pushed a commit that referenced this pull request Feb 19, 2026
…42637) (#142673)

* Fix PAUSED_FOR_NODE_REMOVAL shard blocking QUEUED promotion (#142637)

(cherry picked from commit 3393f3a)

* fix imports
elasticsearchmachine pushed a commit that referenced this pull request Feb 19, 2026
…42637) (#142672)

* Fix PAUSED_FOR_NODE_REMOVAL shard blocking QUEUED promotion (#142637)

(cherry picked from commit 3393f3a)

* fix imports
szybia added a commit to szybia/elasticsearch that referenced this pull request Feb 19, 2026
…on-sliced-reindex

* upstream/main: (120 commits)
  Fix PAUSED_FOR_NODE_REMOVAL shard blocking QUEUED promotion (elastic#142637)
  ...