Fix replica-primary inconsistencies when indexing during primary relocation with ongoing replica recoveries #19287
Conversation
Force-pushed from d50aaa0 to 6bdf79d
we use this implementation a lot - maybe we should make a utility base class for it (different PR). A default method implementation will be tricky because of the local node. Maybe it should be a parameter to onClusterServiceClose. Food for thought.
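For context, here is a minimal sketch of the recurring listener pattern this comment appears to refer to, assuming the ClusterStateObserver.Listener interface as it existed around this change; the wrapper class and its parameters are stand-ins for whatever the enclosing call site provides, not part of the actual patch.

```java
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.ClusterStateObserver;
import org.elasticsearch.cluster.service.ClusterService;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.node.NodeClosedException;

// Hypothetical helper showing the anonymous-listener shape that gets
// reimplemented at many call sites.
final class WaitForState {
    static void waitForNextState(ClusterStateObserver observer,
                                 ClusterService clusterService,
                                 ActionListener<Void> listener) {
        observer.waitForNextChange(new ClusterStateObserver.Listener() {
            @Override
            public void onNewClusterState(ClusterState state) {
                listener.onResponse(null); // the awaited state arrived
            }

            @Override
            public void onClusterServiceClose() {
                // the branch that needs the local node, which is why a
                // default implementation in a utility base class is tricky
                listener.onFailure(new NodeClosedException(clusterService.localNode()));
            }

            @Override
            public void onTimeout(TimeValue timeout) {
                listener.onFailure(new IllegalStateException(
                        "timed out after " + timeout + " waiting for cluster state"));
            }
        });
    }
}
```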
thx @ywelsch. I left some comments. Can we also maybe add some tests to RecoverySourceHandlerTests to test the new relocation behavior (waiting for a cluster state version and failing on move to relocated)?
@bleskes I've pushed a new change addressing the comments. I've also added the unit tests that you asked for, but I'm not convinced they are all that useful.
@ywelsch I'm wondering if changing the title to …
nit: "block" suggests "block the thread", as in having it wait on a lock. Don't you like "delay"?
delayNewRecoveries sounds good!
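Purely for illustration, one way the agreed-upon name could read in code; the interface and names below are hypothetical sketches of the semantics, not the actual patch.

```java
import org.elasticsearch.common.lease.Releasable;

// Hypothetical sketch: "delay" captures that recoveries attempting to
// start during the relocation hand-off are held back and resumed once
// the returned handle is closed, rather than rejected outright.
interface RecoveryDelayer {
    /** Delay starting new recoveries until the returned handle is closed. */
    Releasable delayNewRecoveries(String reason);
}

// Illustrative usage at the hand-off point:
// try (Releasable ignored = delayer.delayNewRecoveries("primary relocation hand-off")) {
//     // ... perform the relocation hand-off ...
// }
```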
LGTM. Thanks @ywelsch
Force-pushed from 9dfbfe9 to 2bc4000
Fix replica-primary inconsistencies when indexing during primary relocation with ongoing replica recoveries

Primary relocation violates two invariants that ensure proper interaction between document replication and peer recoveries, ultimately leading to documents not being properly replicated.

Invariant 1: Document writes must be replicated based on the routing table of a cluster state that includes all shards which have ongoing or finished recoveries. This is ensured by the fact that we do not start a recovery that is not reflected by the cluster state available on the primary node, and that we always sample a fresh cluster state before starting to replicate write operations.

Invariant 2: Every operation that is not part of the snapshot taken for phase 2 must be successfully indexed on the target replica (pending shard-level errors, which will cause the target shard to be failed). To ensure this, we start replicating to the target shard as soon as the recovery starts, and open its engine before we take the snapshot. All operations that are indexed after the snapshot was taken are guaranteed to arrive at the shard when it is ready to index them. Note that this also means that replication does not fail a shard if it is not yet ready to receive operations; that is a normal part of a recovering shard.

With primary relocations, both invariants can be violated. Consider a primary relocating while another replica shard is recovering from that primary.

Invariant 1 can be violated if the target of the primary relocation is so far behind on cluster state processing that it does not even know about the new initializing replica. This is very rare in practice, as replica recoveries take time to copy all the index files, but it is a theoretical gap that surfaces in testing scenarios.

Invariant 2 can be violated even if the target primary knows about the initializing replica. This can happen if the target primary replicates an operation to the initializing shard, and that operation arrives at the initializing shard before it opens its engine, but arrives at the primary source after it has taken the snapshot of the translog. Such operations are currently missed on the new initializing replica.

The fix to reestablish invariant 1 is to ensure that the primary relocation target has a cluster state with all replica recoveries that were successfully started on the primary relocation source. The fix to reestablish invariant 2 is to check, after opening the engine on the replica, whether the primary has been relocated in the meanwhile and, if so, fail the recovery.
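A hedged sketch of the two checks described above, not the exact patch: the helper class, RelocationTarget interface, and ensureClusterStateVersion name are illustrative assumptions, while IndexShardState.RELOCATED and IndexShardRelocatedException are real types from this era of the codebase.

```java
import org.elasticsearch.index.shard.IndexShard;
import org.elasticsearch.index.shard.IndexShardRelocatedException;
import org.elasticsearch.index.shard.IndexShardState;

// Illustrative sketch only; the real changes live in the recovery and
// relocation code paths, not in a standalone class.
final class RelocationSafetyChecks {

    /**
     * Fix for invariant 1 (sketch): before the relocation target takes
     * over as primary, make sure it has processed a cluster state at
     * least as new as the one covering all recoveries started from the
     * relocation source. The target handler and its method are
     * hypothetical names.
     */
    static void ensureTargetKnowsAllRecoveries(RelocationTarget target,
                                               long clusterStateVersionOnSource) {
        target.ensureClusterStateVersion(clusterStateVersionOnSource);
    }

    /**
     * Fix for invariant 2 (sketch): once the recovering replica has
     * opened its engine, fail the recovery if the primary it recovers
     * from has relocated in the meantime; the replica then retries
     * against a cluster state that knows the new primary.
     */
    static void failRecoveryIfPrimaryRelocated(IndexShard sourceShard) {
        if (sourceShard.state() == IndexShardState.RELOCATED) {
            throw new IndexShardRelocatedException(sourceShard.shardId());
        }
    }

    /** Hypothetical stand-in for the relocation target's recovery handler. */
    interface RelocationTarget {
        void ensureClusterStateVersion(long clusterStateVersion);
    }
}
```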
Force-pushed from 2bc4000 to d7f99a4
Primary relocation violates two invariants that ensure proper interaction between document replication and peer recoveries, ultimately leading to documents not being properly replicated (see #19248 for more details).
Closes #19248