Skip to content

Conversation

@guojialiang92
Copy link
Contributor

@guojialiang92 guojialiang92 commented May 12, 2025

Description

This PR is based on [17881]'s follow-up work. It implements the core process of local merged segment warmer, and the main modifications are as follows:

  • During the IndexReaderWarmer#warm, the primary shard will send send a PublishMergedSegmentRequest request containing information about the merged segment (ReplicationSegmentCheckpoint is introduced to represent the information of a merged segment) to all active state replicas.
  • The replica shard introduces MergedSegmentReplicationTarget to pull segment from the primary shard. It has to reuse most of the capabilities in SegmentReplicationTarget. The IndexReaderWarmer#warm process will not end until all active state replicas have completed pulling merged segment.
  • Use separate rate limiter and timeout controls for the merged segments pre-copy.
  • Added relevant UT and IT testing.

There are also some orthogonal work that needs to be supported in the future, summarized as follows:

  • Repilica shard cleaning up residual merged segments. For example, after the primary shard fail, the pre-copy merged segment of the previous term needs to be cleaned up, or some segments are no longer referenced after continuous merging.
  • Remote store support.

Related Issues

Resolves #[17528]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

❌ Gradle check result for aa13d96: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <[email protected]>
@github-actions
Copy link
Contributor

❌ Gradle check result for 1c1f802: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

❌ Gradle check result for 78795e3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <[email protected]>
@guojialiang92 guojialiang92 force-pushed the dev/support-local-merged-segment-warmer branch from 78795e3 to 58528e9 Compare May 12, 2025 09:15
@github-actions
Copy link
Contributor

✅ Gradle check result for 58528e9: SUCCESS

@github-actions
Copy link
Contributor

❌ Gradle check result for d2df443: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <[email protected]>
@guojialiang92 guojialiang92 force-pushed the dev/support-local-merged-segment-warmer branch from d2df443 to a0b4e3a Compare June 25, 2025 16:13
@github-actions
Copy link
Contributor

❌ Gradle check result for a0b4e3a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

❌ Gradle check result for cc5ea0b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <[email protected]>
@github-actions
Copy link
Contributor

❌ Gradle check result for dedad5e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <[email protected]>
@github-actions
Copy link
Contributor

❌ Gradle check result for 3224f35: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@ashking94
Copy link
Member

Re-triggered the PR build.

@github-actions
Copy link
Contributor

❌ Gradle check result for 3224f35: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@guojialiang92
Copy link
Contributor Author

@guojialiang92 Do share once you have have addressed all the comments.

Hi, @ashking94
I have addressed all the comments.
But there's a gradle test that fails frequently, and it doesn't seem to be related to this PR. I haven't reproduced it locally. I'm further troubleshooting.

@guojialiang92
Copy link
Contributor Author

guojialiang92 commented Jun 26, 2025

I think I found the reason why the test fails. It seems that the test is not robust.
If you print the log in IngestService#executePipelinesInBatchRequests, it will throw an NPE because indexRequest is empty.
The problem can be reproduced stably by changing debug to info locally.

I copied some exception logs from gradle test, which also proves this point.

[testExecuteBulkRequestInBatchWithExceptionAndDropInCallback_requestsWithMatchingChildSlots] Test testExecuteBulkRequestInBatchWithExceptionAndDropInCallback_requestsWithMatchingChildSlots(org.opensearch.ingest.IngestServiceTests) started
[2025-06-26T12:22:53,523][INFO ][o.o.i.IngestServiceTests ] [testExecuteBulkRequestInBatchWithExceptionAndDropInCallback_requestsWithMatchingChildSlots] before test
[2025-06-26T12:22:53,533][DEBUG][o.o.i.IngestService      ] [testExecuteBulkRequestInBatchWithExceptionAndDropInCallback_requestsWithMatchingChildSlots] batchSize: 5, batches: 1
[2025-06-26T12:22:53,535][DEBUG][o.o.i.IngestService      ] [testExecuteBulkRequestInBatchWithExceptionAndDropInCallback_requestsWithMatchingChildSlots] failed to execute pipeline [_id] for documents [_index/_id1, _index/_id2, _index/_id2, _index/_id3, _index/_id3]
java.lang.NullPointerException: Cannot invoke "org.opensearch.action.index.IndexRequest.index()" because "indexRequest" is null
	at org.opensearch.ingest.IngestService.lambda$executePipelinesInBatchRequests$1(IngestService.java:1012) ~[main/:?]
	at org.apache.logging.log4j.util.LambdaUtil.get(LambdaUtil.java:76) ~[log4j-api-2.21.0.jar:?]
	at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:1959) [log4j-api-2.21.0.jar:?]
	at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1804) [log4j-api-2.21.0.jar:?]
	at org.apache.logging.log4j.spi.AbstractLogger.debug(AbstractLogger.java:380) [log4j-api-2.21.0.jar:?]
	at org.opensearch.ingest.IngestService.lambda$executePipelinesInBatchRequests$0(IngestService.java:1008) ~[main/:?]
	at org.opensearch.ingest.IngestService.lambda$innerBatchExecute$0(IngestService.java:1431) ~[main/:?]
	at org.opensearch.ingest.Pipeline.lambda$batchExecute$0(Pipeline.java:277) ~[main/:?]
	at org.opensearch.ingest.CompoundProcessor.lambda$innerBatchExecute$0(CompoundProcessor.java:240) ~[main/:?]
	at 

Copy link
Member

@ashking94 ashking94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, Thanks!

@ashking94
Copy link
Member

I have re-triggered the test. We can merge it after it becomes green.

@github-actions
Copy link
Contributor

❌ Gradle check result for 3224f35: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@guojialiang92
Copy link
Contributor Author

❌ Gradle check result for 3224f35: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Hi, @ashking94

It seems to be the same issue. Can we merge bugfix PR first and then run the gradle test?

[testExecuteBulkRequestInBatchWithExceptionAndDropInCallback_requestsWithMatchingChildSlots] failed to execute pipeline [_id] for documents [_index/_id1, _index/_id2, _index/_id2, _index/_id3, _index/_id3]
java.lang.NullPointerException: Cannot invoke "org.opensearch.action.index.IndexRequest.index()" because "indexRequest" is null
	at org.opensearch.ingest.IngestService.lambda$executePipelinesInBatchRequests$1(IngestService.java:1012) ~[main/:?]
	at org.apache.logging.log4j.util.LambdaUtil.get(LambdaUtil.java:76) ~[log4j-api-2.21.0.jar:?]
	at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:1959) [log4j-api-2.21.0.jar:?]
	at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1804) [log4j-api-2.21.0.jar:?]
	at org.apache.logging.log4j.spi.AbstractLogger.debug(AbstractLogger.java:380) [log4j-api-2.21.0.jar:?]
	at org.opensearch.ingest.IngestService.lambda$executePipelinesInBatchRequests$0(IngestService.java:1008) ~[main/:?]
	at org.opensearch.ingest.IngestService.lambda$innerBatchExecute$0(IngestService.java:1431) ~[main/:?]
	at org.opensearch.ingest.Pipeline.lambda$batchExecute$0(Pipeline.java:277) ~[main/:?]
	at org.opensearch.ingest.CompoundProcessor.lambda$innerBatchExecute$0(CompoundProcessor.java:240) ~[main/:?]
	at org.opensearch.ingest.IngestServiceTests.lambda$testExecuteBulkRequestInBatchWithExceptionAndDropInCallback_requestsWithMatchingChildSlots$1(IngestServiceTests.java:2208) ~[test/:?]

@ashking94
Copy link
Member

@guojialiang92 can you rebase with main now?

@guojialiang92
Copy link
Contributor Author

@guojialiang92 can you rebase with main now?

Yeah

@github-actions
Copy link
Contributor

✅ Gradle check result for f5111ec: SUCCESS

@guojialiang92
Copy link
Contributor Author

guojialiang92 commented Jun 26, 2025

@ashking94
Thanks for the patient review. All the tests have passed.

@ashking94 ashking94 merged commit 5ced78c into opensearch-project:main Jun 26, 2025
31 checks passed
tandonks pushed a commit to tandonks/OpenSearch that referenced this pull request Aug 5, 2025
…ch-project#18255)

* support local merged segment warmer

Signed-off-by: guojialiang <[email protected]>

* update log

Signed-off-by: guojialiang <[email protected]>

* fix test

Signed-off-by: guojialiang <[email protected]>

* add tests

Signed-off-by: guojialiang <[email protected]>

* IndexShard support getActiveReplicaNodes

Signed-off-by: guojialiang <[email protected]>

* add test

Signed-off-by: guojialiang <[email protected]>

* add pre-verification.

Signed-off-by: guojialiang <[email protected]>

* extend ReplicationAction instead of ActionRequest

Signed-off-by: guojialiang <[email protected]>

* reuse Store#getSegmentMetadataMap

Signed-off-by: guojialiang <[email protected]>

* refactor updateReplicationRateLimiter

Signed-off-by: guojialiang <[email protected]>

* extract an abstract base class (AbstractSegmentReplicationTarget) from SegmentReplicationTarget

Signed-off-by: guojialiang <[email protected]>

* reduce unnecessary exception judgment

Signed-off-by: guojialiang <[email protected]>

* fix UT

Signed-off-by: guojialiang <[email protected]>

* add PublishMergedSegmentActionTests

Signed-off-by: guojialiang <[email protected]>

* gradle spotlessApply

Signed-off-by: guojialiang <[email protected]>

* add test

Signed-off-by: guojialiang <[email protected]>

* rename ReplicationSegmentCheckpoint to MergeSegmentCheckpoint

Signed-off-by: guojialiang <[email protected]>

* add some description

Signed-off-by: guojialiang <[email protected]>

* refactor code

Signed-off-by: guojialiang <[email protected]>

* extract an abstract base class (AbstractPublishCheckpointAction) from PublishCheckpointAction

Signed-off-by: guojialiang <[email protected]>

* fix :server:japicmp

Signed-off-by: guojialiang <[email protected]>

* update

Signed-off-by: guojialiang <[email protected]>

* update

Signed-off-by: guojialiang <[email protected]>

---------

Signed-off-by: guojialiang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants