Conversation

@ashking94
Member

Description

This PR adds timeout handling for async operations in the S3 blob container so that snapshot deletion threads cannot block indefinitely. Previously, when an S3 operation hung or never completed, the snapshot deletion threadpool could remain stuck with a non-zero active thread count.

The changes are summarised below:

  • Added a 30-second timeout for async operations in getFutureValue()
  • Added proper cancellation of futures on timeout
  • Added new test case testDeleteTimeoutWithNeverCompletingAsyncDeletionFuture to verify timeout behavior
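
For readers following along, the behaviour listed above can be sketched roughly as follows. This is a minimal illustration and not the code merged in this PR: the class name FutureTimeoutSketch, the method signature, and the choice to surface the timeout as an IOException are assumptions made for the example. Only the getFutureValue() name and the 30-second value come from the description.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical holder class, used only for this sketch.
public class FutureTimeoutSketch {

    // The description mentions a 30-second timeout; callers of this sketch
    // pass the timeout explicitly so it can be shortened in tests.
    static final long DEFAULT_TIMEOUT_SECONDS = 30;

    // Waits for the future for at most the given timeout.
    // On timeout the future is cancelled and an IOException is thrown
    // (the exception type is an assumption, not taken from the PR).
    static <T> T getFutureValue(CompletableFuture<T> future, long timeout, TimeUnit unit) throws IOException {
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Mark the future as cancelled so dependants stop waiting on it.
            // For CompletableFuture the boolean argument has no effect.
            future.cancel(true);
            throw new IOException("Async operation timed out after " + timeout + " " + unit, e);
        } catch (InterruptedException e) {
            // Restore the interrupt flag before propagating.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for future", e);
        } catch (ExecutionException e) {
            Throwable cause = e.getCause();
            if (cause instanceof IOException) {
                throw (IOException) cause;
            }
            throw new RuntimeException(cause);
        }
    }
}
```

A caller on the deletion path would then wait with something like getFutureValue(deleteFuture, DEFAULT_TIMEOUT_SECONDS, TimeUnit.SECONDS), where deleteFuture is a placeholder name, so that a future which never completes fails the operation instead of holding the thread.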

Related Issues

Resolves #18314

Check List

  • Functionality includes testing.
  • [ ] API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

❕ Gradle check result for 805f91b: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@codecov

codecov bot commented Jun 23, 2025

Codecov Report

Attention: Patch coverage is 97.91667% with 1 line in your changes missing coverage. Please review.

Project coverage is 72.65%. Comparing base (d404f33) to head (6e073dd).
Report is 5 commits behind head on main.

Files with missing lines | Patch % | Lines
...rg/opensearch/repositories/s3/S3BlobContainer.java | 95.23% | 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #18583      +/-   ##
============================================
- Coverage     72.76%   72.65%   -0.11%     
+ Complexity    68167    68117      -50     
============================================
  Files          5541     5541              
  Lines        313426   313462      +36     
  Branches      45479    45479              
============================================
- Hits         228062   227748     -314     
- Misses        66709    67165     +456     
+ Partials      18655    18549     -106     

@ashking94
Member Author

Planning to add some more logs to be able to debug better.

@github-actions
Contributor

❌ Gradle check result for 1ff0fff: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❕ Gradle check result for 67e653b: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@github-project-automation github-project-automation bot moved this to 👀 In review in Storage Project Board Jun 23, 2025
@github-actions
Contributor

✅ Gradle check result for 6e073dd: SUCCESS

@ashking94 ashking94 merged commit 2581c58 into opensearch-project:main Jun 24, 2025
31 checks passed
@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in Storage Project Board Jun 24, 2025
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jun 24, 2025
* Add timeout handling for S3 blob container async operations

Signed-off-by: Ashish Singh <[email protected]>

* Add logs to debug never completing future

Signed-off-by: Ashish Singh <[email protected]>

* Incorporate PR review comments

Signed-off-by: Ashish Singh <[email protected]>

---------

Signed-off-by: Ashish Singh <[email protected]>
(cherry picked from commit 2581c58)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
ashking94 pushed a commit that referenced this pull request Jun 24, 2025
…18596)

* Add timeout handling for S3 blob container async operations

* Add logs to debug never completing future

* Incorporate PR review comments

---------

(cherry picked from commit 2581c58)

Signed-off-by: Ashish Singh <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
neuenfeldttj pushed a commit to neuenfeldttj/OpenSearch that referenced this pull request Jun 26, 2025
…ch-project#18583)

* Add timeout handling for S3 blob container async operations

Signed-off-by: Ashish Singh <[email protected]>

* Add logs to debug never completing future

Signed-off-by: Ashish Singh <[email protected]>

* Incorporate PR review comments

Signed-off-by: Ashish Singh <[email protected]>

---------

Signed-off-by: Ashish Singh <[email protected]>
Signed-off-by: TJ Neuenfeldt <[email protected]>
neuenfeldttj pushed a commit to neuenfeldttj/OpenSearch that referenced this pull request Jun 26, 2025
…ch-project#18583)

* Add timeout handling for S3 blob container async operations

Signed-off-by: Ashish Singh <[email protected]>

* Add logs to debug never completing future

Signed-off-by: Ashish Singh <[email protected]>

* Incorporate PR review comments

Signed-off-by: Ashish Singh <[email protected]>

---------

Signed-off-by: Ashish Singh <[email protected]>
tandonks pushed a commit to tandonks/OpenSearch that referenced this pull request Aug 5, 2025
…ch-project#18583)

* Add timeout handling for S3 blob container async operations

Signed-off-by: Ashish Singh <[email protected]>

* Add logs to debug never completing future

Signed-off-by: Ashish Singh <[email protected]>

* Incorporate PR review comments

Signed-off-by: Ashish Singh <[email protected]>

---------

Signed-off-by: Ashish Singh <[email protected]>

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

Snapshot_deletion threadpool active thread count stuck at 1 after encountering failure

3 participants