Added timing data and more granular stages to SegmentReplicationState #4222
Conversation
Gradle Check (Jenkins) Run Completed with:
Looks unrelated, refiring.
Gradle Check (Jenkins) Run Completed with:
Looks like a legit failure. @kartg you may want to take a look.
server/src/main/java/org/opensearch/indices/replication/SegmentReplicationState.java
server/src/main/java/org/opensearch/indices/replication/SegmentReplicationState.java
Gradle Check (Jenkins) Run Completed with:
server/src/main/java/org/opensearch/indices/replication/SegmentReplicationTarget.java
server/src/main/java/org/opensearch/indices/replication/SegmentReplicationTarget.java
Looks like two flaky test failures:
and one legitimate test failure related to segment replication -
Gradle Check (Jenkins) Run Completed with:
One failing test, unrelated to segment replication, but I can consistently reproduce the failure locally with the given seed.
@kartg could you please create an issue for this flaky test? I suspect it was introduced in [1], first time spotted there [2]. [1] #4183
This change introduces instrumentation logging that measures the latency of the various stages of segment replication as seen by each replica. Logs have also been added to the source node for checkpoint publishing and checkpoint metadata responses. All logging is currently at the TRACE level. Signed-off-by: Kartik Ganesh <[email protected]>
Force-pushed from 9b6cbff to c395ac1.
@reta @dreamer-89 @mch2 test issues resolved, ready for your 👀
 * Segments are immutable. So if the replica has any segments with the same name that differ from the one in the incoming
 * snapshot from source that means the local copy of the segment has been corrupted/changed in some way and we throw an
 * IllegalStateException to fail the shard
 */
if (diff.different.isEmpty() == false) {
    getFilesListener.onFailure(
Certainly existing code, but it seems like there is a bug here: we don't throw an exception but call the listener directly and move on ... @kartg wdyt?
@reta Good catch! We shouldn't be proceeding if the failure listener is invoked. Would you mind if I incorporated the fix for this as a follow-up PR? I'd like to keep the scope of this PR focused on the seg-rep stages and timing data.
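A minimal sketch of the follow-up fix discussed above, using hypothetical stand-in types (the `ActionListener` here is a local interface, not the real `org.opensearch.action.ActionListener`, and `getFiles` is a simplified stand-in for the method in `SegmentReplicationTarget`): after invoking the failure listener for a corrupted segment, the method must return early instead of falling through to the copy.

```java
import java.util.List;

// Hypothetical stand-ins, not the actual OpenSearch classes.
public class Main {

    interface ActionListener<T> {
        void onResponse(T result);
        void onFailure(Exception e);
    }

    // Simplified stand-in for the relevant part of SegmentReplicationTarget#getFiles.
    static void getFiles(List<String> differentSegments, ActionListener<Void> getFilesListener) {
        if (differentSegments.isEmpty() == false) {
            getFilesListener.onFailure(
                new IllegalStateException("local segment(s) differ from source: " + differentSegments)
            );
            return; // without this early return, execution would fall through and start the copy anyway
        }
        // ... proceed with the multi-file copy on the happy path ...
        getFilesListener.onResponse(null);
    }

    public static void main(String[] args) {
        getFiles(List.of("_0.si"), new ActionListener<Void>() {
            public void onResponse(Void v) { System.out.println("copy started"); }
            public void onFailure(Exception e) { System.out.println("shard failed: " + e.getMessage()); }
        });
    }
}
```

The key point is the `return` after `onFailure`: invoking a failure listener does not interrupt control flow by itself, so the caller must stop explicitly.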
// Get list of files to copy from this checkpoint.
state.setStage(SegmentReplicationState.Stage.GET_CHECKPOINT_INFO);
source.getCheckpointMetadata(getId(), checkpoint, checkpointInfoListener);

checkpointInfoListener.whenComplete(checkpointInfo -> getFiles(checkpointInfo, getFilesListener), listener::onFailure);
I think there is an issue between stage demarcation and handling the unhappy path. @kartg please correct me if I am wrong, but in case of failure checkpointInfoListener (and others) would delegate to listener::onFailure without marking the current stage as "finished", correct?
That's fine (I think), the dangling stage is what concerns me: since we publish timing on success only, we do nothing on failure (we don't even know if stage was run, which could be even more important than success for troubleshooting). In any case, leaving the decision up to you.
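A hypothetical sketch of one way to address the dangling-stage concern raised above (the names and structure here are illustrative, not the actual OpenSearch API): wrap the delegate failure handler so the in-flight stage is closed out, and its timing recorded, before the failure propagates. A failed replication event would then still report which stage it reached.

```java
import java.util.function.Consumer;

// Illustrative names only; not the real SegmentReplicationState API.
public class Main {

    private static String currentStage = "GET_CHECKPOINT_INFO";

    // Wraps a failure handler so the current stage is marked finished first.
    static Consumer<Exception> closingCurrentStage(Consumer<Exception> delegate) {
        return e -> {
            // In real code, this is where the stage's elapsed time would be
            // captured and logged, even though the stage did not succeed.
            System.out.println("stage " + currentStage + " ended with failure");
            delegate.accept(e);
        };
    }

    public static void main(String[] args) {
        Consumer<Exception> onFailure =
            closingCurrentStage(e -> System.out.println("replication failed: " + e.getMessage()));
        onFailure.accept(new RuntimeException("simulated checkpoint metadata fetch failure"));
    }
}
```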
LGTM
…#4222)
* Added timing data and more granular stages to SegmentReplicationState
  This change introduces instrumentation logging that measures the latency of the various stages of segment replication as seen by each replica. Logs have also been added to the source node for checkpoint publishing and checkpoint metadata responses. All logging is currently at the TRACE level.
  Signed-off-by: Kartik Ganesh <[email protected]>
* Fixing SegmentReplicationTarget tests
  Signed-off-by: Kartik Ganesh <[email protected]>
* Incorporated PR feedback
  Signed-off-by: Kartik Ganesh <[email protected]>
* Fixing SegmentReplicationTargetService tests
  Signed-off-by: Kartik Ganesh <[email protected]>
Signed-off-by: Kartik Ganesh <[email protected]>
(cherry picked from commit a2ba3a8)
…eplicationState (#4367)
* Added timing data and more granular stages to SegmentReplicationState (#4222)
  * Added timing data and more granular stages to SegmentReplicationState
    This change introduces instrumentation logging that measures the latency of the various stages of segment replication as seen by each replica. Logs have also been added to the source node for checkpoint publishing and checkpoint metadata responses. All logging is currently at the TRACE level.
    Signed-off-by: Kartik Ganesh <[email protected]>
  * Fixing SegmentReplicationTarget tests
    Signed-off-by: Kartik Ganesh <[email protected]>
  * Incorporated PR feedback
    Signed-off-by: Kartik Ganesh <[email protected]>
  * Fixing SegmentReplicationTargetService tests
    Signed-off-by: Kartik Ganesh <[email protected]>
  Signed-off-by: Kartik Ganesh <[email protected]>
  (cherry picked from commit a2ba3a8)
* Update changelog for backport pr.
  Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Marc Handalian <[email protected]>
Co-authored-by: Kartik Ganesh <[email protected]>
Co-authored-by: Marc Handalian <[email protected]>
Description
This change introduces instrumentation logging that measures the latency of the various stages of segment replication as seen by each replica. Logs have also been added to the source node for checkpoint publishing and checkpoint metadata responses. All logging is currently at the TRACE level.
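The kind of per-stage timing this change describes can be sketched as follows. The enum values and method names here are illustrative stand-ins, not the actual SegmentReplicationState API: record a start timestamp when a stage begins and compute the elapsed time when it completes.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Illustrative stand-in for per-stage timing; not the real SegmentReplicationState.
public class Main {

    enum Stage { GET_CHECKPOINT_INFO, GET_FILES, FINALIZE_REPLICATION }

    private final Map<Stage, Long> startNanos = new EnumMap<>(Stage.class);
    private final Map<Stage, Long> elapsedMillis = new EnumMap<>(Stage.class);

    void setStage(Stage stage) {
        // Monotonic clock, so elapsed time is unaffected by wall-clock changes.
        startNanos.put(stage, System.nanoTime());
    }

    void completeStage(Stage stage) {
        Long start = startNanos.get(stage);
        if (start != null) {
            // In the real change, this latency is emitted at TRACE level.
            elapsedMillis.put(stage, TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Main state = new Main();
        state.setStage(Stage.GET_CHECKPOINT_INFO);
        Thread.sleep(20); // stand-in for the actual work of the stage
        state.completeStage(Stage.GET_CHECKPOINT_INFO);
        System.out.println(state.elapsedMillis.get(Stage.GET_CHECKPOINT_INFO) >= 15);
    }
}
```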
Signed-off-by: Kartik Ganesh [email protected]
Issues Resolved
Part of #2583
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.