Fix PAUSED shard deletion blocking QUEUED promotion#142637
Fix PAUSED shard deletion blocking QUEUED promotion#142637ywangd merged 1 commit intoelastic:mainfrom
Conversation
When a snapshot with a PAUSED_FOR_NODE_REMOVAL shard is deleted, the abort previously transitioned it directly to FAILED. This bypassed the normal state propagation that promotes QUEUED shards, allowing a subsequently created snapshot to incorrectly receive INIT instead of QUEUED for the same shard, violating the ordering invariant. Change abort to transition PAUSED_FOR_NODE_REMOVAL to ABORTED (an active state) so that new snapshots correctly get QUEUED. The data node detects the PAUSED local status on an ABORTED entry and reports FAILED to the master, which triggers QUEUED promotion through the normal SHARD_STATE_EXECUTOR path. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Pinging @elastic/es-distributed (Team:Distributed) |
|
@DaveCTurner I labelled this as I pondered on the fix and at the end decided to let data node handle the
|
|
@DaveCTurner sorry to ping you again. There is still chance to get this into the upcoming releases if you are OK with the changes. Thanks a lot! |
DaveCTurner
left a comment
There was a problem hiding this comment.
Agh sorry I missed this ping and I have some concerns.
When a shard is in state PAUSED_FOR_NODE_REMOVAL it is known not to be writing to the repository, but the ABORTED state loses this fact and requires the data node to re-confirm that it is not working on the shard which isn't really necessary. Moving it from PAUSED_FOR_NODE_REMOVAL to FAILED allowing the snapshot to complete was a deliberate decision.
I also worry that in a mixed-version cluster a newer-version master node could perform this strange transition to which an older-version data node wouldn't react properly - on an older version, it wouldn't send the move-to-FAILED notification back and we would be stuck again.
The ideal fix here would be for the abort CS update itself to move the next QUEUED shard to INIT, requiring some changes to SnapshotDeletionStartBatcher.Executor#abortRunningSnapshots. I know we're kinda splitting this up into two updates so we can re-use the ABORTED -> FAILED handling in SnapshotTaskExecutor but this isn't really what ABORTED is for. If we were to come along and fix this later we'd almost certainly forget that some versions misuse ABORTED in this way which would be a much trickier bug to track down.
I'd rather spend a little longer getting this right. It's going to be quite rare to hit this I think, and although the resulting snapshot will be stuck with a shard in state QUEUED, it will still be abortable at least.
DaveCTurner
left a comment
There was a problem hiding this comment.
Wait, sorry, I see that we only added the PAUSED_FOR_NODE_REMOVAL -> FAILED transition in #141408. Prior to that we were already doing PAUSED -> ABORTED, we just weren't handling it right on the data node.
Ok, let's go with this then. I still don't think PAUSED -> ABORTED is correct, but fundamentally this is just fixing the data-node handling of that pre-existing transition.
Yes exactly.
Yeah I agree this is the long term fix. We need to consolidate the state propagation code and reuse a single implementation in all places. I didn't list it as an alternative since it is a rather big change. And based on my understanding of previous slack conversation, we are ok to defer it and not tie it to this fix. Again, I do agree we should work on the consolidation as a general improvement.
Indeed unlikely in production. But it makes running the stress test in a loop less useful since it runs into this issue from time to time which prevents discovering other potential issues. Thanks a lot for the review! 🙏 |
…142637) When a snapshot with a PAUSED_FOR_NODE_REMOVAL shard is deleted, the abort previously transitioned it directly to FAILED. This bypassed the normal state propagation that promotes QUEUED shards, allowing a subsequently created snapshot to incorrectly receive INIT instead of QUEUED for the same shard, violating the ordering invariant. Change abort to transition PAUSED_FOR_NODE_REMOVAL to ABORTED (an active state) so that new snapshots correctly get QUEUED. The data node detects the PAUSED local status on an ABORTED entry and reports FAILED to the master, which triggers QUEUED promotion through the existing state propagation. (cherry picked from commit 3393f3a)
💚 All backports created successfully
Questions ?Please refer to the Backport tool documentation |
…142637) When a snapshot with a PAUSED_FOR_NODE_REMOVAL shard is deleted, the abort previously transitioned it directly to FAILED. This bypassed the normal state propagation that promotes QUEUED shards, allowing a subsequently created snapshot to incorrectly receive INIT instead of QUEUED for the same shard, violating the ordering invariant. Change abort to transition PAUSED_FOR_NODE_REMOVAL to ABORTED (an active state) so that new snapshots correctly get QUEUED. The data node detects the PAUSED local status on an ABORTED entry and reports FAILED to the master, which triggers QUEUED promotion through the existing state propagation. (cherry picked from commit 3393f3a)
…42637) (#142673) * Fix PAUSED_FOR_NODE_REMOVAL shard blocking QUEUED promotion (#142637) When a snapshot with a PAUSED_FOR_NODE_REMOVAL shard is deleted, the abort previously transitioned it directly to FAILED. This bypassed the normal state propagation that promotes QUEUED shards, allowing a subsequently created snapshot to incorrectly receive INIT instead of QUEUED for the same shard, violating the ordering invariant. Change abort to transition PAUSED_FOR_NODE_REMOVAL to ABORTED (an active state) so that new snapshots correctly get QUEUED. The data node detects the PAUSED local status on an ABORTED entry and reports FAILED to the master, which triggers QUEUED promotion through the existing state propagation. (cherry picked from commit 3393f3a) * fix imports
…42637) (#142672) * Fix PAUSED_FOR_NODE_REMOVAL shard blocking QUEUED promotion (#142637) When a snapshot with a PAUSED_FOR_NODE_REMOVAL shard is deleted, the abort previously transitioned it directly to FAILED. This bypassed the normal state propagation that promotes QUEUED shards, allowing a subsequently created snapshot to incorrectly receive INIT instead of QUEUED for the same shard, violating the ordering invariant. Change abort to transition PAUSED_FOR_NODE_REMOVAL to ABORTED (an active state) so that new snapshots correctly get QUEUED. The data node detects the PAUSED local status on an ABORTED entry and reports FAILED to the master, which triggers QUEUED promotion through the existing state propagation. (cherry picked from commit 3393f3a) * fix imports
…on-sliced-reindex * upstream/main: (120 commits) [Fleet] Add OpAMP field mappings to fleet-agents (elastic#142550) Clarify `expectedSize` behaviour of `ReleasableBytesStreamOutput` (elastic#142451) Refactor KnnIndexTester to tidy up some options (elastic#142651) Fixed with elastic#142638 already (elastic#142655) Change *OverTimeTests to extend AbstractAggregationTestCase (elastic#142659) Fix byteRefBlockHashSize for release mode (elastic#142668) Mute org.elasticsearch.xpack.esql.tree.EsqlNodeSubclassTests testTransform {class org.elasticsearch.xpack.esql.plan.logical.MMR} elastic#142674 Fix PAUSED_FOR_NODE_REMOVAL shard blocking QUEUED promotion (elastic#142637) Mute org.elasticsearch.xpack.logsdb.RandomizedRollingUpgradeIT testIndexingStandardSource elastic#142670 Revert "[ESQL] Introduce pluggable external datasource framework (elastic#141678) (elastic#142663) Mute org.elasticsearch.xpack.esql.spatial.SpatialPushDownGeoShapeIT testQuantizedXY elastic#141234 PromQL: infer start/end from query DSL filters (elastic#142580) Add GPU vector indexing monitoring to _xpack/usage (elastic#141932) Fix testTrackerClearShutdown: use non-zero startTimeMillis for DONE status (elastic#142646) Mute org.elasticsearch.xpack.esql.qa.single_node.GenerativeIT test elastic#142426 ESQL_ Move time_zone to GA (elastic#142287) Mute org.elasticsearch.xpack.esql.qa.multi_node.GenerativeIT test elastic#142426 DOCS: Convert Painless diagrams to mermaid (elastic#141851) ES|QL: fix validation in generative tests (elastic#142638) Unmute tests that do not reproduce failures (elastic#141712) ...
When a snapshot with a PAUSED_FOR_NODE_REMOVAL shard is deleted, the abort previously transitioned it directly to FAILED (#141408). This bypassed the normal state propagation that promotes QUEUED shards, allowing a subsequently created snapshot to incorrectly receive INIT instead of QUEUED for the same shard, violating the ordering invariant.
Change abort to transition PAUSED_FOR_NODE_REMOVAL to ABORTED so that new snapshots correctly get QUEUED. The data node detects the PAUSED local status on an ABORTED entry and reports FAILED to the master, which triggers QUEUED promotion through the existing state propagation.
Relates #141408