@varunbharadwaj varunbharadwaj commented Jul 30, 2025

Description

Transient consumer errors (such as Kafka connection issues) currently fail shard/engine initialization. This can trigger cascading failures in which all primary and replica shards go down. This PR moves the consumer initialization logic into the poller, which retries initialization when it hits a transient error.
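
To illustrate the idea for reviewers, here is a minimal sketch of deferring consumer creation to the poller and retrying on failure. It is not the code in this PR: the class `LazyConsumerPoller`, its method names, the bounded attempt count, and the fixed backoff are all assumptions made for the example, and the actual `DefaultStreamPoller` may handle retries differently.

```java
import java.util.function.Supplier;

// Hypothetical sketch; names and retry policy are assumptions, not the PR's implementation.
final class LazyConsumerPoller<C> {
    private final Supplier<C> consumerFactory; // may throw on transient errors (e.g. broker unreachable)
    private final int maxAttempts;             // assumed bound; a real poller might retry until closed
    private final long backoffMillis;          // assumed fixed delay between attempts
    private volatile C consumer;               // stays null until initialization succeeds

    LazyConsumerPoller(Supplier<C> consumerFactory, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.consumerFactory = consumerFactory;
        this.maxAttempts = maxAttempts;
        this.backoffMillis = backoffMillis;
    }

    // Intended to run on the poller thread, so a failure here does not fail engine construction.
    C initializeConsumerWithRetries() {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                consumer = consumerFactory.get();
                return consumer;
            } catch (RuntimeException e) {
                last = e; // remember the most recent failure for the final error
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    throw new IllegalStateException("interrupted while initializing consumer", ie);
                }
            }
        }
        throw new IllegalStateException(
            "consumer initialization failed after " + maxAttempts + " attempts", last);
    }

    boolean isConsumerReady() {
        return consumer != null;
    }
}
```

With this shape, the engine can construct the poller without touching the streaming source, and only the poller thread observes connection errors.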

Related Issues

Follow up for #16929

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@varunbharadwaj varunbharadwaj force-pushed the vb/consumerdisconnection branch from db4a5ad to 4a73eab Compare July 30, 2025 23:35
@varunbharadwaj varunbharadwaj changed the title [Pull-based Ingestion] Prevent shard initialization failures due to streaming source errors [Pull-based Ingestion] Prevent shard initialization failures due to streaming consumer errors Jul 30, 2025
@varunbharadwaj varunbharadwaj changed the title [Pull-based Ingestion] Prevent shard initialization failures due to streaming consumer errors [Pull-based Ingestion] Prevent shard initialization failures due to transient consumer errors Jul 30, 2025
@varunbharadwaj varunbharadwaj force-pushed the vb/consumerdisconnection branch from 4a73eab to f58323d Compare July 31, 2025 00:45
@github-actions
Contributor

❌ Gradle check result for f58323d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❌ Gradle check result for af82f9b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

✅ Gradle check result for af82f9b: SUCCESS

@codecov

codecov bot commented Jul 31, 2025

Codecov Report

❌ Patch coverage is 87.50000% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.82%. Comparing base (9b22c9b) to head (032764e).
⚠️ Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
...rch/indices/pollingingest/DefaultStreamPoller.java 88.00% 8 Missing and 1 partial ⚠️
...a/org/opensearch/index/engine/IngestionEngine.java 84.61% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #18877      +/-   ##
============================================
+ Coverage     72.77%   72.82%   +0.04%     
+ Complexity    68690    68673      -17     
============================================
  Files          5582     5582              
  Lines        315456   315508      +52     
  Branches      45778    45779       +1     
============================================
+ Hits         229568   229760     +192     
+ Misses        67290    67099     -191     
- Partials      18598    18649      +51     

@varunbharadwaj varunbharadwaj marked this pull request as ready for review July 31, 2025 18:18
@github-actions
Contributor

❌ Gradle check result for 04ace5d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❌ Gradle check result for 21c0daf: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

github-actions bot commented Aug 1, 2025

✅ Gradle check result for 21c0daf: SUCCESS

@github-actions
Contributor

github-actions bot commented Aug 1, 2025

❌ Gradle check result for 032764e: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

github-actions bot commented Aug 2, 2025

✅ Gradle check result for 032764e: SUCCESS

@yupeng9 yupeng9 merged commit 84b46d7 into opensearch-project:main Aug 2, 2025
58 of 60 checks passed
sunqijun1 pushed a commit to sunqijun1/OpenSearch that referenced this pull request Aug 4, 2025
…ransient consumer errors (opensearch-project#18877)

* Move consumer initialization to the poller to prevent engine failure

Signed-off-by: Varun Bharadwaj <[email protected]>

* Rename log messages and update exception

Signed-off-by: Varun Bharadwaj <[email protected]>

* update default poller to use private constructor

Signed-off-by: Varun Bharadwaj <[email protected]>

---------

Signed-off-by: Varun Bharadwaj <[email protected]>
Signed-off-by: sunqijun.jun <[email protected]>
tandonks pushed a commit to tandonks/OpenSearch that referenced this pull request Aug 5, 2025
vinaykpud pushed a commit to vinaykpud/OpenSearch that referenced this pull request Sep 26, 2025