
Fix testAbortSnapshotWhileRemovingNode#142852

Merged
joshua-adams-1 merged 8 commits into elastic:main from joshua-adams-1:snapshot-shutdown-it-failure-23-feb
Feb 23, 2026

Conversation

@joshua-adams-1
Contributor

Clear the mock transport rule in testAbortSnapshotWhileRemovingNode after releasing the single update_snapshot_status request we coordinate on, so any further requests are handled normally and complete before teardown. This prevents the test from failing in assertAfterTest() with "All incoming requests on node [X] should have finished".

Closes #142805

@joshua-adams-1 joshua-adams-1 self-assigned this Feb 23, 2026
@joshua-adams-1 joshua-adams-1 added >test Issues or PRs that are addressing/adding tests :Distributed/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Feb 23, 2026
Member

@DaveCTurner DaveCTurner left a comment


This looks like a good change but are you sure it fixes the test failure completely?

@joshua-adams-1
Contributor Author

@ywangd Hey, tell me if I'm wrong here, but AFAICT, the test blocks all internal:cluster/snapshot/update_snapshot_status requests on the master via addRequestHandlingBehavior and a CyclicBarrier(2). It only releases the barrier twice (once for "data node sent pause", once for "process that pause after abort"), so exactly one request is allowed through.

When the snapshot is aborted, the data node can send a second status update (e.g. FAILED/ABORTED or completion). That second request hits the same handler and blocks on the first safeAwait(barrier) with no further barrier releases, so it never completes. After the test method returns, InternalTestCluster.assertAfterTest() runs and calls assertRequestsFinished(), which uses the in-flight-requests circuit breaker. The stuck request keeps non-zero bytes in flight and triggers:

AssertionError: All incoming requests on node [node_s0] should have finished.
Expected 0 bytes for requests in-flight but got 189 bytes; pending tasks [
  internal:cluster/snapshot/update_snapshot_status, ...
]

I believe removing the masterTransportAction rules should prevent this from happening by letting all subsequent requests occur as intended.
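The failure mode described above can be reproduced in miniature with plain java.util.concurrent, outside the test entirely (this is an illustrative sketch, not the actual test code): a CyclicBarrier(2) that the test trips exactly once leaves any later arrival parked on await() forever.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StuckBarrierDemo {
    public static void main(String[] args) throws Exception {
        CyclicBarrier barrier = new CyclicBarrier(2);

        // First update_snapshot_status request: the test thread trips the
        // barrier from the other side, so this one gets through.
        Thread firstRequest = new Thread(() -> {
            try {
                barrier.await();
            } catch (InterruptedException | BrokenBarrierException e) {
                throw new RuntimeException(e);
            }
        });
        firstRequest.start();
        barrier.await();          // the test releases the single coordinated request
        firstRequest.join();
        System.out.println("first request completed");

        // Second status update (e.g. after the abort): nobody trips the
        // barrier again, so a plain await() would hang; a timed await shows it.
        boolean stuck = false;
        try {
            barrier.await(200, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            stuck = true;         // the request assertRequestsFinished() trips over
        }
        System.out.println("second request stuck: " + stuck);
    }
}
```

In the real test the "stuck" thread is a transport worker holding in-flight-request bytes on the circuit breaker, which is exactly what assertAfterTest() asserts down to zero.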

@DaveCTurner
Member

I see, thanks, yes the second status update is going to cause problems here. I'd rather not just let it through like this tho, instead I think we need to use a pair of CountDownLatch instances to block all updates until we're ready to release them.

There look to be several other spots where we modify the master's update handling and don't revert it before the end of the test. We should fix them too.


// Release the master node to respond
snapshotStatusUpdateLatch.countDown();
masterTransportService.clearAllRules();
Contributor Author


Lmk if this is the wrong place to have this. I put it here since it's the line following snapshotStatusUpdateLatch being counted down (and therefore the update snapshot requests are allowed to be processed, and we can remove the rules).

@joshua-adams-1
Contributor Author

@DaveCTurner Hey, thank you for the pointers. I've implemented two CountDownLatches, but let me know if this isn't what you had in mind.
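The two-latch shape suggested above can be sketched as follows (an assumed, simplified shape, not the actual PR diff; the thread stands in for the mocked update_snapshot_status handler on master): every blocked update counts down an "arrived" latch and then parks on a shared "release" latch, so the test can release all of them at once before clearing the rules.

```java
import java.util.concurrent.CountDownLatch;

public class TwoLatchDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch updateArrivedLatch = new CountDownLatch(1);
        CountDownLatch releaseUpdatesLatch = new CountDownLatch(1);

        // Stand-in for the mocked update_snapshot_status handler on master.
        Thread update = new Thread(() -> {
            updateArrivedLatch.countDown();        // tell the test we got here
            try {
                releaseUpdatesLatch.await();       // park until the test is ready
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
            System.out.println("update processed");
        });
        update.start();

        updateArrivedLatch.await();                // test waits for the update to arrive
        // ... abort the snapshot, make assertions, etc. ...
        releaseUpdatesLatch.countDown();           // release the master node to respond
        // masterTransportService.clearAllRules(); // then drop the mock behavior
        update.join();
        System.out.println("teardown safe: no in-flight requests");
    }
}
```

Unlike a CyclicBarrier(2), a counted-down CountDownLatch stays open, so any number of late status updates sail through without further coordination from the test thread.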

Member

@DaveCTurner DaveCTurner left a comment


LGTM (I haven't actually reproduced the failure tho, it's quite rare)

@joshua-adams-1 joshua-adams-1 marked this pull request as ready for review February 23, 2026 15:28
@joshua-adams-1 joshua-adams-1 merged commit c8d36f0 into elastic:main Feb 23, 2026
35 checks passed
@joshua-adams-1 joshua-adams-1 deleted the snapshot-shutdown-it-failure-23-feb branch February 23, 2026 15:29
@ywangd
Member

ywangd commented Feb 24, 2026

Thanks for fixing this. This is a new failure since #142637 which adds a 2nd shard snapshot update for PAUSED shard when it is deleted.

ywangd added a commit to ywangd/elasticsearch that referenced this pull request Feb 24, 2026
burqen pushed a commit that referenced this pull request Feb 24, 2026
Same issue as #142805 and fixed by #142852.

Resolves #142868
Resolves #142869
Resolves #142870
Resolves #142871
sidosera pushed a commit to sidosera/elasticsearch that referenced this pull request Feb 24, 2026
Clear the mock transport rule in testAbortSnapshotWhileRemovingNode
after releasing the single update_snapshot_status request we
coordinate on, so any further requests are handled normally and
complete before teardown. This prevents the test from failing in
assertAfterTest() with "All incoming requests on node [X] should
have finished".

Closes elastic#142805

* Fix testAbortSnapshotWhileRemovingNode

Clear the mock transport rule in testAbortSnapshotWhileRemovingNode
after releasing the single update_snapshot_status request we
coordinate on, so any further requests are handled normally and
complete before teardown. This prevents the test from failing in
assertAfterTest() with "All incoming requests on node [X] should
have finished".

Closes elastic#142805

* Update comment

* Remove second masterTransportService.clearAllRules();

* Use two CountDownLatches

* Add extra masterTransportService.clearAllRules();

* [CI] Auto commit changes from spotless

---------

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
sidosera pushed a commit to sidosera/elasticsearch that referenced this pull request Feb 24, 2026
ywangd pushed a commit to ywangd/elasticsearch that referenced this pull request Feb 24, 2026
@ywangd
Member

ywangd commented Feb 24, 2026

💚 All backports created successfully

Branches: 9.3, 9.2

Questions?

Please refer to the Backport tool documentation

ywangd pushed a commit to ywangd/elasticsearch that referenced this pull request Feb 24, 2026
elasticsearchmachine pushed a commit that referenced this pull request Feb 24, 2026
elasticsearchmachine pushed a commit that referenced this pull request Feb 24, 2026
@joshua-adams-1
Contributor Author

Thanks for this @ywangd!


Labels

:Distributed/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. >test Issues or PRs that are addressing/adding tests v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] SnapshotShutdownIT class failing

4 participants