
Conversation

nomoa (Contributor) commented May 5, 2025

The default AWS client may create its own ScheduledExecutorService, which is closed via ExecutorService.shutdown() when the client is closed. Calling shutdown() does not ensure that all threads are gone by the time ThreadLeakControl is run.
Provide our own ScheduledExecutorService during this test so that it is shut down through our tearDown method, which calls ThreadPool#terminate and waits up to 5 seconds.

Description

Seen in build https://build.ci.opensearch.org/job/gradle-check/57676/:

  2> May 05, 2025 12:38:10 PM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
  2> WARNING: Will linger awaiting termination of 2 leaked thread(s).
  2> May 05, 2025 12:38:16 PM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
  2> SEVERE: 1 thread leaked from SUITE scope at org.opensearch.repositories.s3.S3BlobContainerRetriesTests: 
  2>    1) Thread[id=99, name=sdk-ScheduledExecutor-4-0, state=TIMED_WAITING, group=TGRP-S3BlobContainerRetriesTests]
  2>         at java.****/jdk.internal.misc.Unsafe.park(Native Method)
  2>         at java.****/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:269)
  2>         at java.****/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1763)
  2>         at java.****/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1182)
  2>         at java.****/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:899)
  2>         at java.****/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1070)
  2>         at java.****/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
  2>         at java.****/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
  2>         at java.****/java.lang.Thread.run(Thread.java:1583)
  2> May 05, 2025 12:38:16 PM com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll

I believe the only threads remaining from a ScheduledThreadPoolExecutor should be the ones created by the default S3 client.
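
A rough sketch of the approach (not the exact PR code: region/credential setup is omitted, the builder call relies on the AWS SDK v2 ClientOverrideConfiguration API, and class/variable names are illustrative):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import org.opensearch.threadpool.ThreadPool;
    import software.amazon.awssdk.services.s3.S3Client;

    public class OwnSchedulerSketch {
        public static void main(String[] args) {
            // Scheduler owned by the test rather than created internally by the SDK client.
            ScheduledExecutorService testScheduler = Executors.newSingleThreadScheduledExecutor();

            // AWS SDK v2 lets callers supply the scheduler used for retry/timeout tasks.
            S3Client client = S3Client.builder()
                .overrideConfiguration(o -> o.scheduledExecutorService(testScheduler))
                .build();

            client.close();

            // In the test this happens in tearDown(): ThreadPool#terminate waits up to 5 seconds
            // for the scheduler's worker threads to exit, so ThreadLeakControl no longer sees them.
            ThreadPool.terminate(testScheduler, 5, TimeUnit.SECONDS);
        }
    }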

Related Issues

Resolves #17551

Check List

  • [ ] Functionality includes testing.
  • [ ] API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@nomoa nomoa requested a review from a team as a code owner May 5, 2025 17:10
@github-actions github-actions bot added the bug (Something isn't working) and Build (Build Tasks/Gradle Plugin, groovy scripts, build tools, Javadoc enforcement) labels May 5, 2025
github-actions bot commented May 5, 2025

❌ Gradle check result for 9f2e077: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

andrross (Member) commented May 5, 2025

Thanks for looking into this @nomoa!

I shared some findings here: #17551 (comment). It's true that ExecutorService.shutdown() won't ensure all threads are stopped when it returns, but the leak control logic seems to wait 5 seconds, then forcibly interrupt the running threads, wait 3 more seconds, and only then fail the test. I think if the client had actually been closed, then waiting/interrupting would have succeeded in stopping the threads. The only way I can explain this behavior is if we're somehow leaking a client and not closing it. What do you think? Am I missing something here?

Also I'm going to close #17540 in favor of #17551 so that we don't have duplicate issues.

@github-actions github-actions bot added the >test-failure (Test failure from CI, local build, etc.), autocut, and flaky-test (Random test failure that succeeds on second run) labels May 5, 2025
nomoa (Contributor, Author) commented May 6, 2025

@andrross thanks for taking a look at this PR!
It hadn't occurred to me to look at what's happening in ThreadLeakControl. What I understand from it is that it will indeed wait 5000 ms (the @ThreadLeakLingering(linger = 5000) annotation on OpenSearchTestCase) and, because of @ThreadLeakAction({Action.WARN, Action.INTERRUPT}) on LuceneTestCase, it will then attempt to interrupt and warn about the leaked threads.
The reason I believe this does not work is that ExecutorService threads are effectively immune to Thread#interrupt(), probably because these threads might run user code that can itself be interrupted, and the outer loop (fetching tasks) cannot distinguish an interruption meant to stop the user task from one meant to stop the executor service.
A simple test case showing that you cannot stop an executor service thread using interrupt:

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.Collections;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ThreadFactory;
    import java.util.concurrent.TimeUnit;

    public class ExecutorInterruptDemo {
        public static void main(String[] args) throws InterruptedException {
            // Track the worker threads created by the executor so we can interrupt them directly.
            Collection<Thread> threads = Collections.synchronizedCollection(new ArrayList<>());
            ThreadFactory factory = runnable -> {
                Thread thread = new Thread(runnable);
                threads.add(thread);
                return thread;
            };
            ScheduledExecutorService service = Executors.newScheduledThreadPool(1, factory);
            // Schedule a no-op task 10 seconds out, then request a graceful shutdown.
            service.schedule(() -> {}, 10L, TimeUnit.SECONDS);
            service.shutdown();
            // Wait until the executor has actually created its worker thread.
            while (threads.isEmpty()) {
                Thread.sleep(10);
            }
            System.out.println("Thread started");
            // Keep interrupting the worker: it goes back to waiting for the delayed task,
            // so this loop only ends once the task has fired roughly 10 seconds later.
            while (true) {
                boolean oneAlive = false;
                for (Thread t : threads) {
                    System.out.println("Interrupting thread " + t.getName() + " is alive: " + t.isAlive());
                    t.interrupt();
                    t.join(100);
                    oneAlive |= t.isAlive();
                }
                if (!oneAlive) {
                    break;
                }
            }
            System.out.println("10 seconds later");
        }
    }

So relying on Thread#interrupt in ThreadLeakControl to take care of leaked threads from an ExecutorService does not work.

My initial assumption (the one implemented in this PR) is that the default SDK client simply calls #shutdown() when the client is closed, and any scheduled tasks have to be run or cancelled before the scheduler actually stops its threads.
I thought that possibly some tasks got scheduled and the scheduler did not have time to shut down gracefully within the 5 s grace period allowed by ThreadLeakControl.

On further inspection, while this is still theoretically possible, I'm not sure it is what is happening; if it is, this PR would have to be adapted to do the same for S3AsyncService.

The other possibility, as you said, is that a client simply gets leaked and its executor service is never shut down.

Possible reasons I can think of could be:

  • a race and improper ref counting in AmazonAsyncS3Reference causing client.close() to never happen (from S3AsyncService)
  • a client is leaked in the client() method of S3Service and/or S3AsyncService
  • a client reference is not closed by its user (not used with a try-with-resources block?)

@nomoa nomoa marked this pull request as draft May 6, 2025 11:54
@andrross
Copy link
Member

andrross commented May 6, 2025

My initial assumption (the one implemented in this PR) is that the default SDK client simply runs #shutdown() on client close and any scheduled tasks will have to be run/cancelled before the scheduler actually stops its threads.

Ah, I think you might be right. I thought that ExecutorService::shutdown would cancel any scheduled-but-not-yet-started tasks, but that is not the case (shutdownNow() does that). If the AWS SDK were submitting a scheduled task more than about 8 seconds in the future right before the test finishes, then I think it would explain the behavior we're seeing. It would also explain the flakiness because the timing might have to line up just right for that task to get scheduled and then not finish.
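
To make that distinction concrete (this is plain java.util.concurrent behavior, not PR code): by default a ScheduledThreadPoolExecutor keeps already-queued delayed tasks after shutdown(), so its worker thread stays alive until the task fires, whereas shutdownNow() cancels them.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class ShutdownVsShutdownNow {
        public static void main(String[] args) throws InterruptedException {
            ScheduledExecutorService service = Executors.newScheduledThreadPool(1);
            // A task scheduled well in the future, mimicking an SDK timeout/retry task.
            service.schedule(() -> System.out.println("task ran"), 10L, TimeUnit.SECONDS);

            service.shutdown();
            // The delayed task is still pending, so the pool does not terminate yet.
            System.out.println("terminated after shutdown(): "
                + service.awaitTermination(2, TimeUnit.SECONDS));   // prints false

            service.shutdownNow();
            // shutdownNow() drains and cancels the pending task; the worker thread exits.
            System.out.println("terminated after shutdownNow(): "
                + service.awaitTermination(2, TimeUnit.SECONDS));   // prints true
        }
    }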

@nomoa nomoa force-pushed the fix-thread-leak-in-S3BlobContainerRetriesTests branch from 9f2e077 to 77df22a on May 7, 2025 16:11
nomoa (Contributor, Author) commented May 7, 2025

The latest version of this PR applies the same strategy to both S3Service and S3AsyncService, based on the assumption that the leaked threads come from the client's own ScheduledExecutorService, which cannot shut down in time because some scheduled tasks are still waiting to be run or cancelled.
I was hoping to find a way to detect whether a client was leaked (another reason this test might fail on ThreadLeakControl) but unfortunately was not able to do so.
Please let me know what you think. Also, please feel free to close this PR if you believe this is headed in the wrong direction.
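
For reference, the test-lifecycle side of that strategy would look roughly like the following (illustrative names, JUnit 4 style, not the exact PR code): the scheduler created in setup is the one passed to the S3Service/S3AsyncService client configuration and is then terminated in tearDown.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import org.junit.After;
    import org.junit.Before;
    import org.opensearch.threadpool.ThreadPool;

    public class SchedulerLifecycleSketch {
        private ScheduledExecutorService sdkScheduler;

        @Before
        public void createScheduler() {
            // One scheduler shared by the sync and async client configurations under test.
            sdkScheduler = Executors.newSingleThreadScheduledExecutor();
        }

        @After
        public void stopScheduler() {
            // Waits up to 5 seconds for pending scheduled tasks and worker threads to finish,
            // so the suite does not end with lingering sdk-ScheduledExecutor threads.
            ThreadPool.terminate(sdkScheduler, 5, TimeUnit.SECONDS);
        }
    }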

@nomoa nomoa marked this pull request as ready for review May 7, 2025 16:14
github-actions bot commented May 7, 2025

❌ Gradle check result for 77df22a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@nomoa nomoa force-pushed the fix-thread-leak-in-S3BlobContainerRetriesTests branch from 77df22a to 7171696 on May 8, 2025 19:10
github-actions bot commented May 8, 2025

✅ Gradle check result for 7171696: SUCCESS

codecov bot commented May 8, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.58%. Comparing base (560ac10) to head (d7dd4b0).
Report is 23 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #18201      +/-   ##
============================================
+ Coverage     72.56%   72.58%   +0.01%     
- Complexity    67261    67353      +92     
============================================
  Files          5476     5482       +6     
  Lines        310478   310684     +206     
  Branches      45133    45157      +24     
============================================
+ Hits         225313   225506     +193     
- Misses        66840    66850      +10     
- Partials      18325    18328       +3     


nomoa (Contributor, Author) left a comment
thanks for the review!

@nomoa nomoa force-pushed the fix-thread-leak-in-S3BlobContainerRetriesTests branch from 7171696 to d7dd4b0 on May 9, 2025 13:23
@github-actions
Copy link
Contributor

github-actions bot commented May 9, 2025

✅ Gradle check result for d7dd4b0: SUCCESS

@andrross andrross merged commit 60fee58 into opensearch-project:main May 12, 2025
34 checks passed
@nomoa nomoa deleted the fix-thread-leak-in-S3BlobContainerRetriesTests branch May 13, 2025 07:19
nomoa added a commit to nomoa/OpenSearch that referenced this pull request May 16, 2025
Similar to opensearch-project#18201 but applied via the TestPlugin used in
S3BlobStoreRepositoryTests.

Closes opensearch-project#17551

Signed-off-by: David Causse <[email protected]>
nomoa added a commit to nomoa/OpenSearch that referenced this pull request May 16, 2025
Similar to opensearch-project#18201 but applied via the TestPlugin used in
S3BlobStoreRepositoryTests.

Closes opensearch-project#14299

Signed-off-by: David Causse <[email protected]>
nomoa added a commit to nomoa/OpenSearch that referenced this pull request May 16, 2025
Similar to opensearch-project#18201 but applied via the TestPlugin used in
S3BlobStoreRepositoryTests.

Closes opensearch-project#14299

Signed-off-by: David Causse <[email protected]>
nomoa added a commit to nomoa/OpenSearch that referenced this pull request May 19, 2025
Similar to opensearch-project#18201 but applied via the TestPlugin used in
S3BlobStoreRepositoryTests.

Closes opensearch-project#14299

Signed-off-by: David Causse <[email protected]>
andrross pushed a commit that referenced this pull request May 19, 2025
…#18317)

Similar to #18201 but applied via the TestPlugin used in
S3BlobStoreRepositoryTests.

Closes #14299

Signed-off-by: David Causse <[email protected]>
tandonks pushed a commit to tandonks/OpenSearch that referenced this pull request Jun 1, 2025
…opensearch-project#18317)

Similar to opensearch-project#18201 but applied via the TestPlugin used in
S3BlobStoreRepositoryTests.

Closes opensearch-project#14299

Signed-off-by: David Causse <[email protected]>
neuenfeldttj pushed a commit to neuenfeldttj/OpenSearch that referenced this pull request Jun 26, 2025
opensearch-project#18201)

The default AWS client may create its own ScheduledExecutorService
which is closed via ExecutorService.shutdown() when the client is
closed. Calling shutdown does not ensure that all threads are gone
when ThreadLeakControl is run.
Provide our own ScheduledExecutorService during this test so that it
is shut down through our tearDown method, which calls ThreadPool#terminate
and waits for 5 seconds.

Signed-off-by: David Causse <[email protected]>
Signed-off-by: TJ Neuenfeldt <[email protected]>
neuenfeldttj pushed a commit to neuenfeldttj/OpenSearch that referenced this pull request Jun 26, 2025
…opensearch-project#18317)

Similar to opensearch-project#18201 but applied via the TestPlugin used in
S3BlobStoreRepositoryTests.

Closes opensearch-project#14299

Signed-off-by: David Causse <[email protected]>
Signed-off-by: TJ Neuenfeldt <[email protected]>
neuenfeldttj pushed a commit to neuenfeldttj/OpenSearch that referenced this pull request Jun 26, 2025
opensearch-project#18201)

The default AWS client may create its own ScheduledExecutorService
which is closed via ExecutorService.shutdown() when the client is
closed. Calling shutdown does not ensure that all threads are gone
when ThreadLeakControl is run.
Provide our own ScheduledExecutorService during this test so that it
is shut down through our tearDown method, which calls ThreadPool#terminate
and waits for 5 seconds.

Signed-off-by: David Causse <[email protected]>
neuenfeldttj pushed a commit to neuenfeldttj/OpenSearch that referenced this pull request Jun 26, 2025
…opensearch-project#18317)

Similar to opensearch-project#18201 but applied via the TestPlugin used in
S3BlobStoreRepositoryTests.

Closes opensearch-project#14299

Signed-off-by: David Causse <[email protected]>

Labels

autocut, bug (Something isn't working), Build (Build Tasks/Gradle Plugin, groovy scripts, build tools, Javadoc enforcement), flaky-test (Random test failure that succeeds on second run), skip-changelog, >test-failure (Test failure from CI, local build, etc.)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for S3BlobContainerRetriesTests

2 participants