
Conversation

nomoa (Contributor) commented May 5, 2025

The default AWS client may create its own ScheduledExecutorService, which is closed via ExecutorService.shutdown() when the client is closed. Calling shutdown() does not ensure that all threads are gone by the time ThreadLeakControl is run.
Provide our own ScheduledExecutorService during this test so that it is shut down through our tearDown method, which calls ThreadPool#terminate and waits up to 5 seconds.

Description

Seen in build https://build.ci.opensearch.org/job/gradle-check/57676/:

  2> May 05, 2025 12:38:10 PM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
  2> WARNING: Will linger awaiting termination of 2 leaked thread(s).
  2> May 05, 2025 12:38:16 PM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
  2> SEVERE: 1 thread leaked from SUITE scope at org.opensearch.repositories.s3.S3BlobContainerRetriesTests: 
  2>    1) Thread[id=99, name=sdk-ScheduledExecutor-4-0, state=TIMED_WAITING, group=TGRP-S3BlobContainerRetriesTests]
  2>         at java.****/jdk.internal.misc.Unsafe.park(Native Method)
  2>         at java.****/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:269)
  2>         at java.****/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1763)
  2>         at java.****/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1182)
  2>         at java.****/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:899)
  2>         at java.****/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1070)
  2>         at java.****/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
  2>         at java.****/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
  2>         at java.****/java.lang.Thread.run(Thread.java:1583)
  2> May 05, 2025 12:38:16 PM com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll

I believe the only threads remaining from a ScheduledThreadPoolExecutor should be the ones created by the default S3 client.
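
A rough sketch of the approach (not the exact PR code: region/credential setup is omitted, the builder call relies on the AWS SDK v2 ClientOverrideConfiguration API, and class/variable names are illustrative):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import org.opensearch.threadpool.ThreadPool;
    import software.amazon.awssdk.services.s3.S3Client;

    public class OwnSchedulerSketch {
        public static void main(String[] args) {
            // Scheduler owned by the test rather than created internally by the SDK client.
            ScheduledExecutorService testScheduler = Executors.newSingleThreadScheduledExecutor();

            // AWS SDK v2 lets callers supply the scheduler used for retry/timeout tasks.
            S3Client client = S3Client.builder()
                .overrideConfiguration(o -> o.scheduledExecutorService(testScheduler))
                .build();

            client.close();

            // In the test this happens in tearDown(): ThreadPool#terminate waits up to 5 seconds
            // for the scheduler's worker threads to exit, so ThreadLeakControl no longer sees them.
            ThreadPool.terminate(testScheduler, 5, TimeUnit.SECONDS);
        }
    }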

Related Issues

Resolves #17551

Check List

  • [ ] Functionality includes testing.
  • [ ] API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@nomoa nomoa requested a review from a team as a code owner May 5, 2025 17:10
@github-actions github-actions bot added the bug (Something isn't working) and Build (Build Tasks/Gradle Plugin, groovy scripts, build tools, Javadoc enforcement) labels May 5, 2025
github-actions bot commented May 5, 2025

❌ Gradle check result for 9f2e077: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

andrross (Member) commented May 5, 2025

Thanks for looking into this @nomoa!

I shared some findings here: #17551 (comment). It's true that ExecutorService.shutdown() won't ensure all threads are stopped when it returns, but the leak control logic seems to wait 5 seconds, then forcibly interrupt the running threads, wait 3 more seconds, and only then fail the test. I think if the client had actually been closed, then waiting/interrupting would have succeeded in stopping the threads. The only way I can explain this behavior is if we're somehow leaking a client and not closing it. What do you think? Am I missing something here?

Also I'm going to close #17540 in favor of #17551 so that we don't have duplicate issues.

@github-actions github-actions bot added the >test-failure (Test failure from CI, local build, etc.), autocut, and flaky-test (Random test failure that succeeds on second run) labels May 5, 2025
nomoa (Contributor, Author) commented May 6, 2025

@andrross thanks for taking a look at this PR!
It hadn't occurred to me to look at what's happening in ThreadLeakControl. What I understand from it is that it will indeed wait 5000 ms (the @ThreadLeakLingering(linger = 5000) annotation on OpenSearchTestCase) and, because of @ThreadLeakAction({Action.WARN, Action.INTERRUPT}) on LuceneTestCase, it will then attempt to interrupt and warn about the leaked threads.
The reason I believe this does not work is that ExecutorService threads are effectively immune to Thread#interrupt(), probably because these threads might run user code that can itself be interrupted, and the outer loop (fetching tasks) cannot distinguish an interruption meant to stop the user task from one meant to stop the executor service.
A simple test case showing that you cannot stop an executor service thread using interrupt:

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.Collections;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ThreadFactory;
    import java.util.concurrent.TimeUnit;

    public class ExecutorInterruptDemo {
        public static void main(String[] args) throws InterruptedException {
            // Track the worker threads created by the executor so we can interrupt them directly.
            Collection<Thread> threads = Collections.synchronizedCollection(new ArrayList<>());
            ThreadFactory factory = runnable -> {
                Thread thread = new Thread(runnable);
                threads.add(thread);
                return thread;
            };
            ScheduledExecutorService service = Executors.newScheduledThreadPool(1, factory);
            // Schedule a no-op task 10 seconds out, then request a graceful shutdown.
            service.schedule(() -> {}, 10L, TimeUnit.SECONDS);
            service.shutdown();
            // Wait until the executor has actually created its worker thread.
            while (threads.isEmpty()) {
                Thread.sleep(10);
            }
            System.out.println("Thread started");
            // Keep interrupting the worker: it goes back to waiting for the delayed task,
            // so this loop only ends once the task has fired roughly 10 seconds later.
            while (true) {
                boolean oneAlive = false;
                for (Thread t : threads) {
                    System.out.println("Interrupting thread " + t.getName() + " is alive: " + t.isAlive());
                    t.interrupt();
                    t.join(100);
                    oneAlive |= t.isAlive();
                }
                if (!oneAlive) {
                    break;
                }
            }
            System.out.println("10 seconds later");
        }
    }

So relying on Thread#interrupt in ThreadLeakControl to take care of leaked threads from an ExecutorService does not work.

My initial assumption (the one implemented in this PR) is that the default SDK client simply calls #shutdown() when the client is closed, and any scheduled tasks have to be run or cancelled before the scheduler actually stops its threads.
I thought that possibly some tasks got scheduled and the scheduler did not have time to shut down gracefully within the 5 s grace period allowed by ThreadLeakControl.

On further inspection, while this is still theoretically possible, I'm not sure it is what is happening; if it is, this PR would have to be adapted to do the same for S3AsyncService.

The other possibility, as you said, is that a client simply gets leaked and its executor service is never shut down.

Possible reasons I can think of could be:

  • a race and improper ref counting in AmazonAsyncS3Reference causing client.close() to never happen (from S3AsyncService)
  • a client is leaked in the client() method of S3Service and/or S3AsyncService
  • a client reference is not closed by its user (not used with a try-with-resources block?)

@nomoa nomoa marked this pull request as draft May 6, 2025 11:54
@andrross
Copy link
Member

andrross commented May 6, 2025

My initial assumption (the one implemented in this PR) is that the default SDK client simply runs #shutdown() on client close and any scheduled tasks will have to be run/cancelled before the scheduler actually stops its threads.

Ah, I think you might be right. I thought that ExecutorService::shutdown would cancel any scheduled-but-not-yet-started tasks, but that is not the case (shutdownNow() does that). If the AWS SDK were submitting a scheduled task more than about 8 seconds in the future right before the test finishes, then I think it would explain the behavior we're seeing. It would also explain the flakiness because the timing might have to line up just right for that task to get scheduled and then not finish.
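
To make that distinction concrete (this is plain java.util.concurrent behavior, not PR code): by default a ScheduledThreadPoolExecutor keeps already-queued delayed tasks after shutdown(), so its worker thread stays alive until the task fires, whereas shutdownNow() cancels them.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class ShutdownVsShutdownNow {
        public static void main(String[] args) throws InterruptedException {
            ScheduledExecutorService service = Executors.newScheduledThreadPool(1);
            // A task scheduled well in the future, mimicking an SDK timeout/retry task.
            service.schedule(() -> System.out.println("task ran"), 10L, TimeUnit.SECONDS);

            service.shutdown();
            // The delayed task is still pending, so the pool does not terminate yet.
            System.out.println("terminated after shutdown(): "
                + service.awaitTermination(2, TimeUnit.SECONDS));   // prints false

            service.shutdownNow();
            // shutdownNow() drains and cancels the pending task; the worker thread exits.
            System.out.println("terminated after shutdownNow(): "
                + service.awaitTermination(2, TimeUnit.SECONDS));   // prints true
        }
    }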

@nomoa nomoa force-pushed the fix-thread-leak-in-S3BlobContainerRetriesTests branch from 9f2e077 to 77df22a on May 7, 2025 16:11
nomoa (Contributor, Author) commented May 7, 2025

The latest version of this PR applies the same strategy to both S3Service and S3AsyncService, based on the assumption that the leaked threads come from the client's own ScheduledExecutorService, which cannot shut down in time because some scheduled tasks are still waiting to be run or cancelled.
I was hoping to find a way to detect whether a client was leaked (another reason this test might fail on ThreadLeakControl) but unfortunately was not able to do so.
Please let me know what you think. Also, please feel free to close this PR if you believe this is headed in the wrong direction.
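
For reference, the test-lifecycle side of that strategy would look roughly like the following (illustrative names, JUnit 4 style, not the exact PR code): the scheduler created in setup is the one passed to the S3Service/S3AsyncService client configuration and is then terminated in tearDown.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import org.junit.After;
    import org.junit.Before;
    import org.opensearch.threadpool.ThreadPool;

    public class SchedulerLifecycleSketch {
        private ScheduledExecutorService sdkScheduler;

        @Before
        public void createScheduler() {
            // One scheduler shared by the sync and async client configurations under test.
            sdkScheduler = Executors.newSingleThreadScheduledExecutor();
        }

        @After
        public void stopScheduler() {
            // Waits up to 5 seconds for pending scheduled tasks and worker threads to finish,
            // so the suite does not end with lingering sdk-ScheduledExecutor threads.
            ThreadPool.terminate(sdkScheduler, 5, TimeUnit.SECONDS);
        }
    }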

@nomoa nomoa marked this pull request as ready for review May 7, 2025 16:14
github-actions bot commented May 7, 2025

❌ Gradle check result for 77df22a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@nomoa nomoa force-pushed the fix-thread-leak-in-S3BlobContainerRetriesTests branch from 77df22a to 7171696 on May 8, 2025 19:10
github-actions bot commented May 8, 2025

✅ Gradle check result for 7171696: SUCCESS

codecov bot commented May 8, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.58%. Comparing base (560ac10) to head (d7dd4b0).
Report is 23 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #18201      +/-   ##
============================================
+ Coverage     72.56%   72.58%   +0.01%     
- Complexity    67261    67353      +92     
============================================
  Files          5476     5482       +6     
  Lines        310478   310684     +206     
  Branches      45133    45157      +24     
============================================
+ Hits         225313   225506     +193     
- Misses        66840    66850      +10     
- Partials      18325    18328       +3     


nomoa (Contributor, Author) left a comment
thanks for the review!

@nomoa nomoa force-pushed the fix-thread-leak-in-S3BlobContainerRetriesTests branch from 7171696 to d7dd4b0 on May 9, 2025 13:23
@github-actions
Copy link
Contributor

github-actions bot commented May 9, 2025

✅ Gradle check result for d7dd4b0: SUCCESS

@andrross andrross merged commit 60fee58 into opensearch-project:main May 12, 2025
34 checks passed
@nomoa nomoa deleted the fix-thread-leak-in-S3BlobContainerRetriesTests branch May 13, 2025 07:19
nomoa added a commit to nomoa/OpenSearch that referenced this pull request May 16, 2025
Similar to opensearch-project#18201 but applied via the TestPlugin used in
S3BlobStoreRepositoryTests.

Closes opensearch-project#17551

Signed-off-by: David Causse <[email protected]>
nomoa added a commit to nomoa/OpenSearch that referenced this pull request May 16, 2025
Similar to opensearch-project#18201 but applied via the TestPlugin used in
S3BlobStoreRepositoryTests.

Closes opensearch-project#14299

Signed-off-by: David Causse <[email protected]>
nomoa added a commit to nomoa/OpenSearch that referenced this pull request May 16, 2025
Similar to opensearch-project#18201 but applied via the TestPlugin used in
S3BlobStoreRepositoryTests.

Closes opensearch-project#14299

Signed-off-by: David Causse <[email protected]>
nomoa added a commit to nomoa/OpenSearch that referenced this pull request May 19, 2025
Similar to opensearch-project#18201 but applied via the TestPlugin used in
S3BlobStoreRepositoryTests.

Closes opensearch-project#14299

Signed-off-by: David Causse <[email protected]>
andrross pushed a commit that referenced this pull request May 19, 2025
…#18317)

Similar to #18201 but applied via the TestPlugin used in
S3BlobStoreRepositoryTests.

Closes #14299

Signed-off-by: David Causse <[email protected]>
tandonks pushed a commit to tandonks/OpenSearch that referenced this pull request Jun 1, 2025
…opensearch-project#18317)

Similar to opensearch-project#18201 but applied via the TestPlugin used in
S3BlobStoreRepositoryTests.

Closes opensearch-project#14299

Signed-off-by: David Causse <[email protected]>
neuenfeldttj pushed a commit to neuenfeldttj/OpenSearch that referenced this pull request Jun 26, 2025
opensearch-project#18201)

The default AWS client may create its own ScheduledExecutorService
which is closed via ExecutorService.shutdown() when the client is
closed. Calling shutdown does not ensure that all threads are gone
when ThreadLeakControl is run.
Provide our own ScheduledExecutorService during this test so that it
is shut down through our tearDown method, which calls ThreadPool#terminate
and waits for 5 seconds.

Signed-off-by: David Causse <[email protected]>
Signed-off-by: TJ Neuenfeldt <[email protected]>
neuenfeldttj pushed a commit to neuenfeldttj/OpenSearch that referenced this pull request Jun 26, 2025
…opensearch-project#18317)

Similar to opensearch-project#18201 but applied via the TestPlugin used in
S3BlobStoreRepositoryTests.

Closes opensearch-project#14299

Signed-off-by: David Causse <[email protected]>
Signed-off-by: TJ Neuenfeldt <[email protected]>
neuenfeldttj pushed a commit to neuenfeldttj/OpenSearch that referenced this pull request Jun 26, 2025
opensearch-project#18201)

The default AWS client may create its own ScheduledExecutorService
which is closed via ExecutorService.shutdown() when the client is
closed. Calling shutdown does not ensure that all threads are gone
when ThreadLeakControl is run.
Provide our own ScheduledExecutorService during this test so that it
is shut down through our tearDown method, which calls ThreadPool#terminate
and waits for 5 seconds.

Signed-off-by: David Causse <[email protected]>
neuenfeldttj pushed a commit to neuenfeldttj/OpenSearch that referenced this pull request Jun 26, 2025
…opensearch-project#18317)

Similar to opensearch-project#18201 but applied via the TestPlugin used in
S3BlobStoreRepositoryTests.

Closes opensearch-project#14299

Signed-off-by: David Causse <[email protected]>

Labels

autocut, bug (Something isn't working), Build (Build Tasks/Gradle Plugin, groovy scripts, build tools, Javadoc enforcement), flaky-test (Random test failure that succeeds on second run), skip-changelog, >test-failure (Test failure from CI, local build, etc.)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for S3BlobContainerRetriesTests

2 participants