[Segment Replication] [Bug] Replica shard failure due to different files during get checkpoint info #4295

Closed
Tracked by #3969
dreamer-89 opened this issue Aug 24, 2022 · 3 comments
Labels: bug, distributed framework

Comments

@dreamer-89
Member

dreamer-89 commented Aug 24, 2022

This issue is reproducible with the testDropPrimaryDuringReplication test when run with 6 replicas.

Failure trace

[2022-08-25T01:08:54,846][INFO ][o.o.i.r.SegmentReplicationTarget] [node_t4] [test-idx-1][0] Replication diff RecoveryDiff{identical=[], different=[name [_0.cfe], length [479], checksum [emydke], writtenBy [9.4.0], name [_0.si], length [318], checksum [kopt3c], writtenBy [9.4.0], name [_0.cfs], length [79844], checksum [1ctjaoz], writtenBy [9.4.0], name [_1.cfs], length [2830], checksum [1xwufkz], writtenBy [9.4.0], name [_1.cfe], length [479], checksum [400p57], writtenBy [9.4.0], name [_1.si], length [318], checksum [1iagmq9], writtenBy [9.4.0]], missing=[name [_0_1_Lucene90_0.dvm], length [160], checksum [17y84gu], writtenBy [9.4.0], name [_0_1_Lucene90_0.dvd], length [87], checksum [1gqmtlc], writtenBy [9.4.0], name [_0_1.fnm], length [1205], checksum [8ihdnr], writtenBy [9.4.0], name [segments_6], length [455], checksum [o2ni19], writtenBy [9.4.0]]}
[2022-08-25T01:08:54,847][ERROR][o.o.i.r.SegmentReplicationTargetService] [node_t4] replication failure
org.opensearch.OpenSearchException: Segment Replication failed
	at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:251) [main/:?]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [main/:?]
	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) [main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [main/:?]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [main/:?]
	at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [main/:?]
	at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) [main/:?]
	at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) [main/:?]
	at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) [main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTarget.getFiles(SegmentReplicationTarget.java:168) [main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$1(SegmentReplicationTarget.java:147) [main/:?]
	at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) [main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [main/:?]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [main/:?]
	at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [main/:?]
	at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) [main/:?]
	at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:77) [main/:?]
	at org.opensearch.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:55) [main/:?]
	at org.opensearch.action.ActionListener$4.onResponse(ActionListener.java:180) [main/:?]
	at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299) [main/:?]
	at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:161) [main/:?]
	at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:69) [main/:?]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1369) [main/:?]
	at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:393) [main/:?]
	at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:387) [main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.lang.IllegalStateException: Shard [test-idx-1][0] has local copies of segments that differ from the primary
	at org.opensearch.indices.replication.SegmentReplicationTarget.getFiles(SegmentReplicationTarget.java:171) ~[main/:?]
	... 24 more
[2022-08-25T01:08:54,857][WARN ][o.o.i.e.Engine           ] [node_t4] [test-idx-1][0] failed engine [replication failure]
org.opensearch.OpenSearchException: Segment Replication failed
	at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:251) [main/:?]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [main/:?]
	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) [main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [main/:?]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [main/:?]
	at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [main/:?]
	at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) [main/:?]
	at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) [main/:?]
	at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) [main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTarget.getFiles(SegmentReplicationTarget.java:168) [main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$1(SegmentReplicationTarget.java:147) [main/:?]
	at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) [main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [main/:?]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [main/:?]
	at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [main/:?]
	at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) [main/:?]
	at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:77) [main/:?]
	at org.opensearch.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:55) [main/:?]
	at org.opensearch.action.ActionListener$4.onResponse(ActionListener.java:180) [main/:?]
	at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299) [main/:?]
	at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:161) [main/:?]
	at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:69) [main/:?]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1369) [main/:?]
	at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:393) [main/:?]
	at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:387) [main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.lang.IllegalStateException: Shard [test-idx-1][0] has local copies of segments that differ from the primary
	at org.opensearch.indices.replication.SegmentReplicationTarget.getFiles(SegmentReplicationTarget.java:171) ~[main/:?]
	... 24 more
[2022-08-25T01:08:54,859][WARN ][o.o.i.c.IndicesClusterStateService] [node_t4] [test-idx-1][0] marking and sending shard failed due to [shard failure, reason [replication failure]]
org.opensearch.OpenSearchException: Segment Replication failed
	at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:251) ~[main/:?]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) ~[main/:?]
	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) ~[main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) ~[main/:?]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) ~[main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) ~[main/:?]
	at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) ~[main/:?]
	at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) ~[main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) ~[main/:?]
	at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) ~[main/:?]
	at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) ~[main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTarget.getFiles(SegmentReplicationTarget.java:168) ~[main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$1(SegmentReplicationTarget.java:147) ~[main/:?]
	at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) ~[main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) ~[main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[main/:?]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) ~[main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) ~[main/:?]
	at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) ~[main/:?]
	at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) ~[main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) ~[main/:?]
	at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:77) ~[main/:?]
	at org.opensearch.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:55) ~[main/:?]
	at org.opensearch.action.ActionListener$4.onResponse(ActionListener.java:180) ~[main/:?]
	at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299) ~[main/:?]
	at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:161) ~[main/:?]
	at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:69) ~[main/:?]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1369) ~[main/:?]
	at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:393) ~[main/:?]
	at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:387) ~[main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.lang.IllegalStateException: Shard [test-idx-1][0] has local copies of segments that differ from the primary
	at org.opensearch.indices.replication.SegmentReplicationTarget.getFiles(SegmentReplicationTarget.java:171) ~[main/:?]
	... 24 more
@dreamer-89 dreamer-89 self-assigned this Aug 24, 2022
@dreamer-89 dreamer-89 changed the title Replica shard failure due to different files during get checkpoint info [Segment Replication] [Bug] Replica shard failure due to different files during get checkpoint info Aug 24, 2022
@dreamer-89
Member Author

dreamer-89 commented Aug 24, 2022

This is a typical failover case where the newly chosen primary does not have the files written by the previous primary. When this newly promoted primary writes its own files (with the same names as files already copied by replicas), it conflicts with the replicas' copies because the checksums do not match.


Consider the scenario below (a simplified sketch of the checksum comparison follows the list):

  1. node_1 is the initial primary with 6 replicas (node_1, ..., node_7).
  2. node_5 copies the _0.cfe, _0.si, _0.cfs files from node_1, while the other nodes were not able to copy any files.
  3. node_1 is shut down and node_3 is selected as the new primary (note that it does not have the _0.cfe... files).
  4. node_3 writes new files _0.cfe, ... which have different checksums from the files already present on node_5. This breaks the segment replication flow and the shard is marked as failed.
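The sketch below illustrates, in simplified form, how a per-file comparison of name, length, and checksum classifies same-named files with mismatched checksums as "different" rather than "identical". The FileMeta and ReplicaDiffSketch types are hypothetical and do not reproduce the actual Store.MetadataSnapshot.recoveryDiff implementation.

// Hypothetical, simplified sketch of the per-file comparison that puts same-named
// files with mismatched checksums into the "different" bucket. Names and types are
// illustrative only.
import java.util.*;

record FileMeta(String name, long length, String checksum) {}

final class ReplicaDiffSketch {
    // Compare the primary's metadata snapshot against the replica's local files.
    static Map<String, List<FileMeta>> diff(Map<String, FileMeta> primary,
                                            Map<String, FileMeta> local) {
        List<FileMeta> identical = new ArrayList<>();
        List<FileMeta> different = new ArrayList<>();
        List<FileMeta> missing = new ArrayList<>();
        for (FileMeta remote : primary.values()) {
            FileMeta existing = local.get(remote.name());
            if (existing == null) {
                missing.add(remote);                        // replica has no copy yet
            } else if (existing.checksum().equals(remote.checksum())
                    && existing.length() == remote.length()) {
                identical.add(remote);                      // safe to reuse the local copy
            } else {
                different.add(remote);                      // same name, conflicting content
            }
        }
        return Map.of("identical", identical, "different", different, "missing", missing);
    }
}

Files landing in the "different" bucket are what triggers the IllegalStateException ("has local copies of segments that differ from the primary") seen in the failure trace above.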

Log trace

node_5 starts replication from the previous primary, node_1.

[2022-08-24T15:44:28,049][INFO ][o.o.i.r.SegmentReplicationTarget] [node_t5] [test-idx-1][0] Replication diff RecoveryDiff{identical=[name [segments_2], length [208], checksum [1dxroa4], writtenBy [9.4.0]],
 different=[],
 missing=[name [_0.cfe], length [479], checksum [35v82n], writtenBy [9.4.0], name [_0.si], length [318], checksum [1xqgvy1], writtenBy [9.4.0], name [_0.cfs], length [15172], checksum [1w1e2qr], writtenBy [9.4.0], name [_1.cfs], length [86047], checksum [1vmp06p], writtenBy [9.4.0], name [_1.cfe], length [479], checksum [m510co], writtenBy [9.4.0], name [_1.si], length [318], checksum [13hu3cn], writtenBy [9.4.0]]}
...

During primary promotion, node_3 is selected as the new primary; it does not have the _0.cfe, ... files copied from node_1 (the previous primary). It commits new files to disk with different checksums, which conflict with the files on node_5 (copied from the previous primary, node_1).

[2022-08-24T15:44:30,455][INFO ][o.o.i.s.Store            ] [node_t5] storeFileMetadata name [_0.cfe], length [479], checksum [35v82n], writtenBy [9.4.0]
...
[2022-08-24T15:44:30,456][INFO ][o.o.i.r.SegmentReplicationTarget] [node_t5] [test-idx-1][0] Replication diff RecoveryDiff{identical=[],
 different=[name [_0.cfe], length [479], checksum [5st6v1], writtenBy [9.4.0], name [_0.si], length [318], checksum [95dia], writtenBy [9.4.0], name [_0.cfs], length [108596], checksum [trx4we], writtenBy [9.4.0], name [_1.cfs], length [2830], checksum [16nrfy6], writtenBy [9.4.0], name [_1.cfe], length [479], checksum [9ywxpq], writtenBy [9.4.0], name [_1.si], length [318], checksum [1gue2je], writtenBy [9.4.0]],
 missing=[name [_0_1_Lucene90_0.dvm], length [160], checksum [x78dqk], writtenBy [9.4.0], name [_0_1_Lucene90_0.dvd], length [87], checksum [nkyf4i], writtenBy [9.4.0], name [_0_1.fnm], length [1205], checksum [ifl3bw], writtenBy [9.4.0], name [segments_6], length [455], checksum [7y71n8], writtenBy [9.4.0]]}
...

After node_t3 is promoted as primary, all shards are updated:

[2022-08-25T03:12:41,631][INFO ][o.o.i.s.IndexShard       ] [node_t4] [test-idx-1][0] detected new primary with primary term [2], global checkpoint [125], max_seq_no [125]
[2022-08-25T03:12:41,631][INFO ][o.o.i.s.IndexShard       ] [node_t2] [test-idx-1][0] detected new primary with primary term [2], global checkpoint [125], max_seq_no [125]
[2022-08-25T03:12:41,631][INFO ][o.o.i.s.IndexShard       ] [node_t7] [test-idx-1][0] detected new primary with primary term [2], global checkpoint [125], max_seq_no [125]
[2022-08-25T03:12:41,631][INFO ][o.o.i.s.IndexShard       ] [node_t6] [test-idx-1][0] detected new primary with primary term [2], global checkpoint [125], max_seq_no [125]
[2022-08-25T03:12:41,631][INFO ][o.o.i.s.IndexShard       ] [node_t5] [test-idx-1][0] detected new primary with primary term [2], global checkpoint [125], max_seq_no [125]

@dreamer-89
Member Author

#4304 also needs to handle the conflict that happens on the replica during the file copy operation from the new primary. The failure happens at the verifyChecksum step, where the checksum of the file copied from the previous primary is used. Added the timeseries sample below (based on the log trace further down) for better understanding. Converting to draft for now.

Timeseries

node_t1, node_t2, ..., node_t7 form the cluster with node_t1 as primary
node_t1 -> replicates files to node_t4 (the only node up to date with node_t1)
node_t1 -> dropped
node_t7 -> promoted as primary
node_t7 -> starts replication with node_t4
node_t4 -> checksum error on the _1.si file. This first fails the primary and then fails the replica.

Logically, the failure should not happen: the local checksum() in IndexOutput should build the checksum from the copied bytes rather than reusing the checksum of the older file (copied from the previous primary). This checksum is then compared with the StoreFileMetadata received from the new primary. A simplified sketch of this verification step follows.
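Below is a hypothetical sketch of that verification, assuming a CRC32-style running checksum and base-36 formatting; it does not mirror the actual Store$LuceneVerifyingIndexOutput, only the idea that the checksum must be accumulated from the bytes actually copied and compared against the new primary's advertised metadata.

// Hypothetical sketch of checksum verification during file copy. The checksum is
// accumulated from the received bytes only and compared with the checksum the new
// primary advertised for the file. Names are illustrative, not the real classes.
import java.util.zip.CRC32;

final class VerifyingCopySketch {
    private final CRC32 crc = new CRC32();
    private final String expectedChecksum;   // from the new primary's StoreFileMetadata

    VerifyingCopySketch(String expectedChecksum) {
        this.expectedChecksum = expectedChecksum;
    }

    void writeChunk(byte[] chunk, int offset, int length) {
        crc.update(chunk, offset, length);    // checksum built from the copied bytes only
    }

    void verify() {
        String actual = Long.toString(crc.getValue(), Character.MAX_RADIX);
        if (!actual.equals(expectedChecksum)) {
            // corresponds to the checksum mismatch seen in the log trace below
            throw new IllegalStateException(
                "checksum mismatch: expected=" + expectedChecksum + " actual=" + actual);
        }
    }
}

If the stale local file's checksum is used instead of one built from the copied bytes, the comparison fails exactly as in the CorruptIndexException in the log trace below.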

Alternative

The probable solutions are to remove the existing files on the replica (the ones in diff.different) or to rename them (better for search availability); a rough sketch of the removal option follows. There is also a workaround that prevents this situation from happening, captured in PR #4365.
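As a rough illustration of the first alternative (dropping the replica's conflicting local copies before fetching them again from the new primary), here is a hypothetical sketch. The ConflictCleanupSketch class, path handling, and method names are illustrative only; this is not the change made in #4365.

// Rough sketch: delete the replica's stale copies of the diff.different files so the
// incoming bytes from the new primary are written to fresh files.
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

final class ConflictCleanupSketch {
    static void deleteConflictingFiles(Path shardDataPath, List<String> differentFiles) throws IOException {
        for (String fileName : differentFiles) {
            // Remove the stale local copy; the file will be re-fetched from the new primary.
            Files.deleteIfExists(shardDataPath.resolve(fileName));
        }
    }
}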

Log trace

[2022-08-31T05:41:11,036][INFO ][o.o.i.r.SegmentReplicationTarget] [node_t4] [test-idx-1][0] --> requesting files [name [_0.cfe], length [479], checksum [xuaex7], writtenBy [9.4.0], name [_0.si], length [320], checksum [12na9tm], writtenBy [9.4.0], name [_0.cfs], length [4796], checksum [5na8fq], writtenBy [9.4.0], name [_1.cfs], length [24887], checksum [1p7u2kb], writtenBy [9.4.0], name [_1.cfe], length [479], checksum [fxa4j1], writtenBy [9.4.0], name [_1.si], length [320], checksum [s04ei6], writtenBy [9.4.0], name [_2.si], length [320], checksum [1ezotng], writtenBy [9.4.0], name [_2.cfe], length [479], checksum [1i0oyv5], writtenBy [9.4.0], name [_2.cfs], length [25883], checksum [x503c6], writtenBy [9.4.0]]
[2022-08-31T05:41:11,262][INFO ][o.o.i.r.SegmentReplicationTarget] [node_t4] [test-idx-1][0] --> finalize replication
...
...
[2022-08-31T05:41:11,440][INFO ][o.o.i.e.Engine           ] [node_t7] [test-idx-1][0] --> Opened IE with infos [segments_4] - disk [segments_4, write.lock]

[2022-08-31T05:41:12,039][INFO ][o.o.i.e.Engine           ] [node_t7] [test-idx-1][0] --> IE.getSegmentInfosSnapshot with infos [_1.cfs, _0.cfe, _0.si, _0_1_Lucene90_0.dvm, _1.cfe, _1.si, _0.cfs, _0_1_Lucene90_0.dvd, _0_1.fnm, segments_6] - disk [_0.cfe, _0.cfs, _0.si, _0_1.fnm, _0_1_Lucene90_0.dvd, _0_1_Lucene90_0.dvm, _1.cfe, _1.cfs, _1.si, segments_6, write.lock]
...


[2022-08-31T05:41:12,107][INFO ][o.o.i.s.S.MetadataSnapshot] [node_t4] --> Different file name [_1.si], length [320], checksum [s04ei6], writtenBy [9.4.0]
[2022-08-31T05:41:12,107][INFO ][o.o.i.s.S.MetadataSnapshot] [node_t4] --> Local Copy name [_1.si], length [320], checksum [1d5tur9], writtenBy [9.4.0]

...

[2022-08-31T05:41:12,121][WARN ][o.o.i.r.SegmentReplicationSourceHandler] [node_t7] [test-idx-1][0][sending segments to node_t4] [test-idx-1][0] Corrupted file detected name [_1.si], length [320], checksum [s04ei6], writtenBy [9.4.0] checksum mismatch
[2022-08-31T05:41:12,124][WARN ][o.o.i.e.Engine           ] [node_t7] [test-idx-1][0] failed engine [error sending files]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=s04ei6 actual=1d5tur9 (resource=name [_1.si], length [320], checksum [s04ei6], writtenBy [9.4.0]) (resource=VerifyingIndexOutput(_1.si))
	at org.opensearch.index.store.Store$LuceneVerifyingIndexOutput.readAndCompareChecksum(Store.java:1486) ~[main/:?]
	at org.opensearch.index.store.Store$LuceneVerifyingIndexOutput.writeByte(Store.java:1459) ~[main/:?]
	at org.opensearch.index.store.Store$LuceneVerifyingIndexOutput.writeBytes(Store.java:1495) ~[main/:?]
	at org.opensearch.indices.recovery.MultiFileWriter.innerWriteFileChunk(MultiFileWriter.java:144) ~[main/:?]
	at org.opensearch.indices.recovery.MultiFileWriter$FileChunkWriter.writeChunk(MultiFileWriter.java:239) ~[main/:?]
	at org.opensearch.indices.recovery.MultiFileWriter.writeFileChunk(MultiFileWriter.java:92) ~[main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTarget.writeFileChunk(SegmentReplicationTarget.java:130) ~[main/:?]
	at org.opensearch.indices.replication.common.ReplicationTarget.handleFileChunk(ReplicationTarget.java:260) ~[main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTargetService$FileChunkTransportRequestHandler.messageReceived(SegmentReplicationTargetService.java:291) ~[main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTargetService$FileChunkTransportRequestHandler.messageReceived(SegmentReplicationTargetService.java:280) ~[main/:?]
	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[main/:?]
	at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
[2022-08-31T05:41:12,129][WARN ][o.o.i.c.IndicesClusterStateService] [node_t7] [test-idx-1][0] marking and sending shard failed due to [shard failure, reason [error sending files]]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=s04ei6 actual=1d5tur9 (resource=name [_1.si], length [320], checksum [s04ei6], writtenBy [9.4.0]) (resource=VerifyingIndexOutput(_1.si))
	at org.opensearch.index.store.Store$LuceneVerifyingIndexOutput.readAndCompareChecksum(Store.java:1486) ~[main/:?]
	at org.opensearch.index.store.Store$LuceneVerifyingIndexOutput.writeByte(Store.java:1459) ~[main/:?]
	at org.opensearch.index.store.Store$LuceneVerifyingIndexOutput.writeBytes(Store.java:1495) ~[main/:?]
	at org.opensearch.indices.recovery.MultiFileWriter.innerWriteFileChunk(MultiFileWriter.java:144) ~[main/:?]
	at org.opensearch.indices.recovery.MultiFileWriter$FileChunkWriter.writeChunk(MultiFileWriter.java:239) ~[main/:?]
	at org.opensearch.indices.recovery.MultiFileWriter.writeFileChunk(MultiFileWriter.java:92) ~[main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTarget.writeFileChunk(SegmentReplicationTarget.java:130) ~[main/:?]
	at org.opensearch.indices.replication.common.ReplicationTarget.handleFileChunk(ReplicationTarget.java:260) ~[main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTargetService$FileChunkTransportRequestHandler.messageReceived(SegmentReplicationTargetService.java:291) ~[main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTargetService$FileChunkTransportRequestHandler.messageReceived(SegmentReplicationTargetService.java:280) ~[main/:?]
	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[main/:?]
	at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
[2022-08-31T05:41:12,130][ERROR][o.o.i.r.SegmentReplicationTargetService] [node_t4] replication failure
org.opensearch.OpenSearchException: Segment Replication failed
	at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:274) [main/:?]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [main/:?]
	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) [main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [main/:?]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [main/:?]
	at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [main/:?]
	at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) [main/:?]
	at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) [main/:?]
	at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) [main/:?]
	at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:190) [main/:?]
	at org.opensearch.action.ActionListener$6.onFailure(ActionListener.java:309) [main/:?]
	at org.opensearch.action.support.RetryableAction$RetryingListener.onFinalFailure(RetryableAction.java:201) [main/:?]
	at org.opensearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:193) [main/:?]
	at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74) [main/:?]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1379) [main/:?]
	at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420) [main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.transport.RemoteTransportException: [node_t7][127.0.0.1:50795][internal:index/shard/replication/get_segment_files]
Caused by: org.opensearch.common.util.concurrent.UncategorizedExecutionException: Failed execution
	at org.opensearch.common.util.concurrent.FutureUtils.rethrowExecutionException(FutureUtils.java:109) ~[main/:?]
	at org.opensearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:101) ~[main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:125) ~[main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[main/:?]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) ~[main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) ~[main/:?]
	at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) ~[main/:?]
	at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) ~[main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) ~[main/:?]
	at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) ~[main/:?]
	at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) ~[main/:?]
	at org.opensearch.indices.recovery.MultiChunkTransfer.onCompleted(MultiChunkTransfer.java:172) ~[main/:?]
	at org.opensearch.indices.recovery.MultiChunkTransfer.handleItems(MultiChunkTransfer.java:160) ~[main/:?]
	at org.opensearch.indices.recovery.MultiChunkTransfer$1.write(MultiChunkTransfer.java:98) ~[main/:?]
	at org.opensearch.common.util.concurrent.AsyncIOProcessor.processList(AsyncIOProcessor.java:123) ~[main/:?]
	at ...

@dreamer-89
Member Author

Discussed this separately with @mch2; we are moving forward with the alternative in #4365.
