
Conversation

@ywelsch ywelsch (Contributor) commented Jul 6, 2016

Primary relocation violates two invariants that ensure proper interaction between document replication and peer recoveries, ultimately leading to documents not being properly replicated (see #19248 for more details).

Closes #19248

@ywelsch ywelsch added >bug resiliency :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. v5.0.0-alpha5 labels Jul 6, 2016
@ywelsch ywelsch force-pushed the fix/relocation-replica-data-loss branch from d50aaa0 to 6bdf79d Compare July 15, 2016 16:47
Contributor (review comment)
we use this implementation a lot - maybe we should make a utility base class for it (different PR). A default method implementation will be tricky because of the local node. Maybe it could be a parameter to onClusterServiceClose. Food for thought.

@bleskes bleskes (Contributor) commented Jul 18, 2016

thx @ywelsch. I left some comments. Can we also add some tests to RecoverySourceHandlerTests covering the new relocation behavior (waiting for a cluster state version, and failing on the move to relocated)?

@ywelsch ywelsch (Contributor, Author) commented Jul 18, 2016

@bleskes I've pushed a new change addressing the comments. I've also added the unit tests that you asked for, though I'm not convinced they are all that useful.

@bleskes bleskes (Contributor) commented Jul 18, 2016

@ywelsch I'm wondering if changing the title to "Fix replica-primary inconsistencies when indexing during primary relocation with ongoing replica recoveries" would better describe the situation? Or even taking over the issue's title? (We use PR titles to drive change lists.)

@ywelsch ywelsch changed the title Fix data loss when indexing during primary relocation with ongoing replica recoveries Fix replica-primary inconsistencies when indexing during primary relocation with ongoing replica recoveries Jul 18, 2016
Contributor (review comment)
nit: "block" suggests blocking the thread, like having it wait on a lock. Wouldn't "delay" work better?

Contributor Author (review comment)
delayNewRecoveries sounds good!

@bleskes bleskes (Contributor) commented Jul 18, 2016

LGTM. Thanks @ywelsch

@ywelsch ywelsch force-pushed the fix/relocation-replica-data-loss branch from 9dfbfe9 to 2bc4000 Compare July 19, 2016 11:09
…cation with ongoing replica recoveries

Primary relocation violates two invariants that ensure proper interaction between document replication and peer recoveries,
ultimately leading to documents not being properly replicated.

Invariant 1: Document writes must be replicated based on the routing table of a cluster state that includes all shards which
have ongoing or finished recoveries. This is ensured by the fact that we do not start a recovery that is not reflected by the
cluster state available on the primary node, and that we always sample a fresh cluster state before starting to replicate
write operations.
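The sampling described above can be illustrated with a toy model (invented names and structures, not Elasticsearch code): the set of replication targets is re-read from the current cluster state for every write, so any copy whose recovery has started is fanned out to as soon as the primary's cluster state reflects it.

```python
# Toy sketch of invariant 1: replication targets come from a freshly
# sampled routing table, so initializing (recovering) copies are included.
# All names here are illustrative, not the actual Elasticsearch API.

def replication_targets(cluster_state):
    # Copies that are active or have an ongoing recovery receive writes.
    return [s for s in cluster_state["routing_table"]
            if s["state"] in ("STARTED", "INITIALIZING")]

def replicate(op, cluster_state, send):
    # The routing table is consulted per write, never cached.
    for shard in replication_targets(cluster_state):
        send(shard, op)

state = {"routing_table": [
    {"id": "replica-0", "state": "STARTED"},
    {"id": "replica-1", "state": "INITIALIZING"},  # recovering copy
    {"id": "replica-2", "state": "UNASSIGNED"},    # not yet recovering
]}
sent = []
replicate("doc-1", state, lambda shard, op: sent.append(shard["id"]))
# sent now contains replica-0 and replica-1, but not replica-2.
```

The invariant holds only as long as the cluster state that is sampled actually lists the recovering copy, which is exactly what breaks in the relocation scenario described below.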

Invariant 2: Every operation that is not part of the snapshot taken for phase 2 must be successfully indexed on the target
replica (pending shard-level errors, which will cause the target shard to be failed). To ensure this, we start replicating to
the target shard as soon as the recovery starts and open its engine before we take the snapshot. All operations that are
indexed after the snapshot was taken are guaranteed to arrive at the shard when it is ready to index them. Note that this also
means that replication doesn't fail a shard that is not yet ready to receive operations - that is a normal part of a
recovering shard.

With primary relocations, both invariants can be violated. Consider a primary that is relocating while another replica shard
is recovering from that primary.

Invariant 1 can be violated if the target of the primary relocation lags so far behind on cluster state processing that it
doesn't even know about the new initializing replica. This is very rare in practice, as replica recoveries take time to copy
all the index files, but it is a theoretical gap that surfaces in testing scenarios.

Invariant 2 can be violated even if the target primary knows about the initializing replica. This can happen if the target
primary replicates an operation to the initializing shard and that operation arrives at the initializing shard before it opens
its engine, but arrives at the primary source after it has taken the snapshot of the translog. Such operations are currently
missed on the new initializing replica.
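The race can be reduced to a few lines of toy code (invented names, not Elasticsearch code; the relocation source and target are collapsed into a single primary object to keep the sketch short): an operation that is indexed after the translog snapshot but delivered before the replica's engine opens ends up on neither path.

```python
# Toy model of the invariant-2 race during primary relocation with an
# ongoing replica recovery. Illustrative only.

class RecoveringReplica:
    def __init__(self):
        self.engine_open = False
        self.docs = set()

    def index(self, op):
        # Before the engine opens, replicated operations are dropped on
        # the assumption that the phase-2 translog snapshot covers them.
        if self.engine_open:
            self.docs.add(op)

class Primary:
    def __init__(self):
        self.docs = set()

    def index_and_replicate(self, op, replica):
        self.docs.add(op)
        replica.index(op)

    def translog_snapshot(self):
        # Phase-2 snapshot: everything indexed so far.
        return set(self.docs)

primary, replica = Primary(), RecoveringReplica()
snapshot = primary.translog_snapshot()         # snapshot taken first
primary.index_and_replicate("doc-1", replica)  # op after snapshot, engine still closed
replica.docs |= snapshot                       # replica replays the snapshot
replica.engine_open = True                     # engine opens too late
# "doc-1" is now on the primary but permanently missing from the replica.
```

In the real scenario the snapshot is taken by the relocation source while the operation is replicated by the relocation target, but the effect is the same: the operation falls into the gap between the snapshot and the engine opening.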

The fix to reestablish invariant 1 is to ensure that the primary relocation target has a cluster state with all replica
recoveries that were successfully started on the primary relocation source. The fix to reestablish invariant 2 is to check,
after opening the engine on the replica, whether the primary has been relocated in the meantime, and if so to fail the
recovery.
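The two fixes can be sketched as follows (all names and signatures are invented for illustration; this is not the actual Elasticsearch implementation): the relocation target waits until its cluster state catches up to the source's, and the recovery target fails itself if the primary changed under it, so that the recovery retries against the new primary.

```python
# Hedged sketch of the two fixes. Names are illustrative.

def wait_for_cluster_state(local_version, required_version, observer):
    # Fix 1: the relocation target delays the primary handoff until its
    # cluster state is at least as new as the source's, so every replica
    # recovery the source started is visible to the new primary.
    while local_version < required_version:
        local_version = observer.await_next_version()
    return local_version

def verify_primary_unchanged(recovery_source_id, current_primary_id):
    # Fix 2: after opening the replica's engine, fail the recovery if the
    # primary relocated in the meantime; the retried recovery then runs
    # against the new primary, which replicates in-flight operations to it.
    if current_primary_id != recovery_source_id:
        raise RuntimeError("primary relocated during recovery; failing recovery")

class Observer:
    # Stand-in for a cluster state observer that yields successive versions.
    def __init__(self, versions):
        self.versions = iter(versions)

    def await_next_version(self):
        return next(self.versions)

v = wait_for_cluster_state(3, 5, Observer([4, 5, 6]))   # waits until version 5
verify_primary_unchanged("node-A", "node-A")            # unchanged primary: ok
```

Failing the recovery in fix 2 is safe because recoveries are retryable; restarting against the new primary closes the window in which operations could slip past both the snapshot and the closed engine.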
@ywelsch ywelsch force-pushed the fix/relocation-replica-data-loss branch from 2bc4000 to d7f99a4 Compare July 19, 2016 11:33
@ywelsch ywelsch merged commit c4fe8e7 into elastic:master Jul 19, 2016
dnhatn added a commit that referenced this pull request Mar 18, 2019
We introduced WAIT_CLUSTERSTATE action in #19287 (5.0), but then stopped
using it since #25692 (6.0). This change removes that action and related
code in 7.x and 8.0.

Relates #19287
Relates #25692