
Fix testAbortSnapshotWhileRemovingNode#142852

Merged
joshua-adams-1 merged 8 commits into elastic:main from joshua-adams-1:snapshot-shutdown-it-failure-23-feb
Feb 23, 2026

Conversation

@joshua-adams-1
Contributor

Clear the mock transport rule in testAbortSnapshotWhileRemovingNode after releasing the single update_snapshot_status request we coordinate on, so any further requests are handled normally and complete before teardown. This prevents the test from failing in assertAfterTest() with "All incoming requests on node [X] should have finished".

Closes #142805

@joshua-adams-1 joshua-adams-1 self-assigned this Feb 23, 2026
@joshua-adams-1 joshua-adams-1 added >test Issues or PRs that are addressing/adding tests :Distributed/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Feb 23, 2026
Member

@DaveCTurner DaveCTurner left a comment


This looks like a good change but are you sure it fixes the test failure completely?

@joshua-adams-1
Contributor Author

@ywangd Hey, tell me if I'm wrong here, but AFAICT, the test blocks all internal:cluster/snapshot/update_snapshot_status requests on the master via addRequestHandlingBehavior and a CyclicBarrier(2). It only releases the barrier twice (once for "data node sent pause", once for "process that pause after abort"), so exactly one request is allowed through.

When the snapshot is aborted, the data node can send a second status update (e.g. FAILED/ABORTED or completion). That second request hits the same handler and blocks on the first safeAwait(barrier) with no further barrier releases, so it never completes. After the test method returns, InternalTestCluster.assertAfterTest() runs and calls assertRequestsFinished(), which uses the in-flight-requests circuit breaker. The stuck request keeps non-zero bytes in flight and triggers:

AssertionError: All incoming requests on node [node_s0] should have finished.
Expected 0 bytes for requests in-flight but got 189 bytes; pending tasks [
  internal:cluster/snapshot/update_snapshot_status, ...
]

I believe removing the masterTransportAction rules should prevent this from happening by letting all subsequent requests occur as intended.
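The failure mode described above can be reproduced in miniature with plain java.util.concurrent, outside the test entirely (this is an illustrative sketch, not the actual test code): a CyclicBarrier(2) that the test trips exactly once leaves any later arrival parked on await() forever.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StuckBarrierDemo {
    public static void main(String[] args) throws Exception {
        CyclicBarrier barrier = new CyclicBarrier(2);

        // First update_snapshot_status request: the test thread trips the
        // barrier from the other side, so this one gets through.
        Thread firstRequest = new Thread(() -> {
            try {
                barrier.await();
            } catch (InterruptedException | BrokenBarrierException e) {
                throw new RuntimeException(e);
            }
        });
        firstRequest.start();
        barrier.await();          // the test releases the single coordinated request
        firstRequest.join();
        System.out.println("first request completed");

        // Second status update (e.g. after the abort): nobody trips the
        // barrier again, so a plain await() would hang; a timed await shows it.
        boolean stuck = false;
        try {
            barrier.await(200, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            stuck = true;         // the request assertRequestsFinished() trips over
        }
        System.out.println("second request stuck: " + stuck);
    }
}
```

In the real test the "stuck" thread is a transport worker holding in-flight-request bytes on the circuit breaker, which is exactly what assertAfterTest() asserts down to zero.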

@DaveCTurner
Member

I see, thanks, yes the second status update is going to cause problems here. I'd rather not just let it through like this tho, instead I think we need to use a pair of CountDownLatch instances to block all updates until we're ready to release them.

There look to be several other spots where we modify the master's update handling and don't revert it before the end of the test. We should fix them too.


// Release the master node to respond
snapshotStatusUpdateLatch.countDown();
masterTransportService.clearAllRules();
Contributor Author


Lmk if this is the wrong place to have this. I put it here since it's the line following snapshotStatusUpdateLatch being counted down (and therefore the update snapshot requests are allowed to be processed, and we can remove the rules).

@joshua-adams-1
Contributor Author

@DaveCTurner Hey, thank you for the pointers. I've implemented two CountDownLatches, but let me know if this isn't what you had in mind.
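The two-latch shape suggested above can be sketched as follows (an assumed, simplified shape, not the actual PR diff; the thread stands in for the mocked update_snapshot_status handler on master): every blocked update counts down an "arrived" latch and then parks on a shared "release" latch, so the test can release all of them at once before clearing the rules.

```java
import java.util.concurrent.CountDownLatch;

public class TwoLatchDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch updateArrivedLatch = new CountDownLatch(1);
        CountDownLatch releaseUpdatesLatch = new CountDownLatch(1);

        // Stand-in for the mocked update_snapshot_status handler on master.
        Thread update = new Thread(() -> {
            updateArrivedLatch.countDown();        // tell the test we got here
            try {
                releaseUpdatesLatch.await();       // park until the test is ready
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
            System.out.println("update processed");
        });
        update.start();

        updateArrivedLatch.await();                // test waits for the update to arrive
        // ... abort the snapshot, make assertions, etc. ...
        releaseUpdatesLatch.countDown();           // release the master node to respond
        // masterTransportService.clearAllRules(); // then drop the mock behavior
        update.join();
        System.out.println("teardown safe: no in-flight requests");
    }
}
```

Unlike a CyclicBarrier(2), a counted-down CountDownLatch stays open, so any number of late status updates sail through without further coordination from the test thread.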

Member

@DaveCTurner DaveCTurner left a comment


LGTM (I haven't actually reproduced the failure tho, it's quite rare)

@joshua-adams-1 joshua-adams-1 marked this pull request as ready for review February 23, 2026 15:28
@joshua-adams-1 joshua-adams-1 merged commit c8d36f0 into elastic:main Feb 23, 2026
35 checks passed
@joshua-adams-1 joshua-adams-1 deleted the snapshot-shutdown-it-failure-23-feb branch February 23, 2026 15:29
@ywangd
Member

ywangd commented Feb 24, 2026

Thanks for fixing this. This is a new failure since #142637 which adds a 2nd shard snapshot update for PAUSED shard when it is deleted.

ywangd added a commit to ywangd/elasticsearch that referenced this pull request Feb 24, 2026
burqen pushed a commit that referenced this pull request Feb 24, 2026
Same issue as #142805 and fixed by #142852.

Resolves #142868
Resolves #142869
Resolves #142870
Resolves #142871
sidosera pushed a commit to sidosera/elasticsearch that referenced this pull request Feb 24, 2026
Clear the mock transport rule in testAbortSnapshotWhileRemovingNode
after releasing the single update_snapshot_status request we
coordinate on, so any further requests are handled normally and
complete before teardown. This prevents the test from failing in
assertAfterTest() with "All incoming requests on node [X] should
have finished".

Closes elastic#142805

* Fix testAbortSnapshotWhileRemovingNode

Clear the mock transport rule in testAbortSnapshotWhileRemovingNode
after releasing the single update_snapshot_status request we
coordinate on, so any further requests are handled normally and
complete before teardown. This prevents the test from failing in
assertAfterTest() with "All incoming requests on node [X] should
have finished".

Closes elastic#142805

* Update comment

* Remove second masterTransportService.clearAllRules();

* Use two CountDownLatches

* Add extra masterTransportService.clearAllRules();

* [CI] Auto commit changes from spotless

---------

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
sidosera pushed a commit to sidosera/elasticsearch that referenced this pull request Feb 24, 2026
ywangd pushed a commit to ywangd/elasticsearch that referenced this pull request Feb 24, 2026
@ywangd
Member

ywangd commented Feb 24, 2026

💚 All backports created successfully

Branches: 9.3, 9.2

Questions?

Please refer to the Backport tool documentation

ywangd pushed a commit to ywangd/elasticsearch that referenced this pull request Feb 24, 2026
elasticsearchmachine pushed a commit that referenced this pull request Feb 24, 2026
elasticsearchmachine pushed a commit that referenced this pull request Feb 24, 2026
@joshua-adams-1
Contributor Author

Thanks for this @ywangd!


Labels

:Distributed/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. >test Issues or PRs that are addressing/adding tests v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] SnapshotShutdownIT class failing

4 participants