[Segment Replication] [Bug] Replica shard failure due to different files during get checkpoint info #4295
Comments
This is a typical failover case in which the newly chosen primary does not have all the files from the previous primary. When the newly promoted primary writes files under names that replicas have already copied from the previous primary, they conflict with the replicas' copies because of a checksum mismatch. Consider the scenario below:
Log trace
node_5 starts replication from previous primary
During primary promotion,
After node_t3 is promoted to primary, all shards are updated
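The scenario above can be sketched in a few lines. This is only an illustration under stated assumptions, not OpenSearch code: `FileMeta`, `meta`, and `recovery_diff` are hypothetical helpers, the file names and contents are made up, and `zlib.crc32` stands in for the per-file checksum. The replica has already copied `_1.cfs` from the previous primary, the new primary never received that file and writes its own `_1.cfs`, and a name-by-name metadata comparison then reports the file as different.

```python
import zlib
from typing import Dict, List, NamedTuple


class FileMeta(NamedTuple):
    """Hypothetical stand-in for per-file store metadata."""
    length: int
    checksum: int


def meta(content: bytes) -> FileMeta:
    # Assumption: CRC32 over the whole content plays the role of the file checksum.
    return FileMeta(len(content), zlib.crc32(content))


class Diff(NamedTuple):
    identical: List[str]
    different: List[str]
    missing: List[str]


def recovery_diff(source: Dict[str, FileMeta], target: Dict[str, FileMeta]) -> Diff:
    """Compare source (primary) metadata with target (replica) metadata by file name."""
    identical, different, missing = [], [], []
    for name in sorted(source):
        if name not in target:
            missing.append(name)      # replica has to fetch this file
        elif source[name] == target[name]:
            identical.append(name)    # same name, same length and checksum
        else:
            different.append(name)    # same name, conflicting content
    return Diff(identical, different, missing)


# Files the replica already copied from the previous primary.
replica = {
    "_0.cfs": meta(b"segment 0"),
    "_1.cfs": meta(b"segment 1 written by old primary"),
}

# The promoted primary never had the old _1.cfs and reuses the name.
new_primary = {
    "_0.cfs": meta(b"segment 0"),
    "_1.cfs": meta(b"segment 1 rewritten by new primary"),
    "_2.cfs": meta(b"segment 2"),
}

diff = recovery_diff(new_primary, replica)
print(diff)
```

In this sketch `_1.cfs` is the only entry in `diff.different`, which is the kind of file that triggers the replica failure described in this issue.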
#4304 also needs to handle the conflict that occurs on a replica during the file copy operation from the new primary. The failure happens at the verifyChecksum step, where the checksum of the file copied from the previous primary is used. I added the time series sample below (based on the log trace below) for better understanding. Converting to draft for now.

Timeseries
node_t1, node_t2, ..., node_t7 build up a cluster with node_t1 as primary

Logically, the failure should not happen, because the local checksum() in IndexOutput should build the checksum from the copied bytes rather than fetch the checksum of the older file (copied from the primary). This checksum is compared with the StoreFileMetadata copied from the new primary.

Alternative
The probable solutions are to remove the existing files on the replica (those that are part of diff.different) or to rename them (better for search availability). One workaround that prevents this situation from happening is captured in PR #4365.

Log trace
This issue is reproducible with the testDropPrimaryDuringReplication test when it runs with 6 replicas.
Failure trace