Fix replica-primary inconsistencies when indexing during primary relocation with ongoing replica recoveries #19287
Conversation
Force-pushed from d50aaa0 to 6bdf79d
we use this implementation a lot - maybe we should make a utility base class for it (different PR). A default method implementation will be tricky because of the local node. Maybe it should be a parameter to onClusterServiceClose. Food for thought.
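For context, here is a minimal sketch of the recurring listener pattern this comment appears to refer to, assuming the ClusterStateObserver.Listener interface as it existed around this change; the wrapper class and its parameters are stand-ins for whatever the enclosing call site provides, not part of the actual patch.

```java
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.ClusterStateObserver;
import org.elasticsearch.cluster.service.ClusterService;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.node.NodeClosedException;

// Hypothetical helper showing the anonymous-listener shape that gets
// reimplemented at many call sites.
final class WaitForState {
    static void waitForNextState(ClusterStateObserver observer,
                                 ClusterService clusterService,
                                 ActionListener<Void> listener) {
        observer.waitForNextChange(new ClusterStateObserver.Listener() {
            @Override
            public void onNewClusterState(ClusterState state) {
                listener.onResponse(null); // the awaited state arrived
            }

            @Override
            public void onClusterServiceClose() {
                // the branch that needs the local node, which is why a
                // default implementation in a utility base class is tricky
                listener.onFailure(new NodeClosedException(clusterService.localNode()));
            }

            @Override
            public void onTimeout(TimeValue timeout) {
                listener.onFailure(new IllegalStateException(
                        "timed out after " + timeout + " waiting for cluster state"));
            }
        });
    }
}
```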
thx @ywelsch. I left some comments. Can we also maybe add some tests to RecoverySourceHandlerTests to test the new relocation behavior (waiting for a cluster state version and failing on move to relocated)?
@bleskes I've pushed a new change addressing the comments. I've also added the unit tests that you asked for, but I'm not convinced they are all that useful.
@ywelsch I'm wondering if changing the title to …
nit: "block" suggests "block the thread", as in having it wait on a lock. Don't you like "delay"?
delayNewRecoveries sounds good!
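Purely for illustration, one way the agreed-upon name could read in code; the interface and names below are hypothetical sketches of the semantics, not the actual patch.

```java
import org.elasticsearch.common.lease.Releasable;

// Hypothetical sketch: "delay" captures that recoveries attempting to
// start during the relocation hand-off are held back and resumed once
// the returned handle is closed, rather than rejected outright.
interface RecoveryDelayer {
    /** Delay starting new recoveries until the returned handle is closed. */
    Releasable delayNewRecoveries(String reason);
}

// Illustrative usage at the hand-off point:
// try (Releasable ignored = delayer.delayNewRecoveries("primary relocation hand-off")) {
//     // ... perform the relocation hand-off ...
// }
```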
LGTM. Thanks @ywelsch
Force-pushed from 9dfbfe9 to 2bc4000
Fix replica-primary inconsistencies when indexing during primary relocation with ongoing replica recoveries

Primary relocation violates two invariants that ensure proper interaction between document replication and peer recoveries, ultimately leading to documents not being properly replicated.

Invariant 1: Document writes must be replicated based on the routing table of a cluster state that includes all shards which have ongoing or finished recoveries. This is ensured by the fact that we do not start a recovery that is not reflected by the cluster state available on the primary node, and that we always sample a fresh cluster state before starting to replicate write operations.

Invariant 2: Every operation that is not part of the snapshot taken for phase 2 must be successfully indexed on the target replica (pending shard-level errors, which will cause the target shard to be failed). To ensure this, we start replicating to the target shard as soon as the recovery starts, and open its engine before we take the snapshot. All operations that are indexed after the snapshot was taken are guaranteed to arrive at the shard when it is ready to index them. Note that this also means that replication does not fail a shard if it is not yet ready to receive operations; that is a normal part of a recovering shard.

With primary relocations, both invariants can be violated. Consider a primary relocating while another replica shard is recovering from that primary.

Invariant 1 can be violated if the target of the primary relocation is so far behind on cluster state processing that it does not even know about the new initializing replica. This is very rare in practice, as replica recoveries take time to copy all the index files, but it is a theoretical gap that surfaces in testing scenarios.

Invariant 2 can be violated even if the target primary knows about the initializing replica. This can happen if the target primary replicates an operation to the initializing shard, and that operation arrives at the initializing shard before it opens its engine, but arrives at the primary source after it has taken the snapshot of the translog. Such operations are currently missed on the new initializing replica.

The fix to reestablish invariant 1 is to ensure that the primary relocation target has a cluster state with all replica recoveries that were successfully started on the primary relocation source. The fix to reestablish invariant 2 is to check, after opening the engine on the replica, whether the primary has been relocated in the meanwhile and, if so, fail the recovery.
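A hedged sketch of the two checks described above, not the exact patch: the helper class, RelocationTarget interface, and ensureClusterStateVersion name are illustrative assumptions, while IndexShardState.RELOCATED and IndexShardRelocatedException are real types from this era of the codebase.

```java
import org.elasticsearch.index.shard.IndexShard;
import org.elasticsearch.index.shard.IndexShardRelocatedException;
import org.elasticsearch.index.shard.IndexShardState;

// Illustrative sketch only; the real changes live in the recovery and
// relocation code paths, not in a standalone class.
final class RelocationSafetyChecks {

    /**
     * Fix for invariant 1 (sketch): before the relocation target takes
     * over as primary, make sure it has processed a cluster state at
     * least as new as the one covering all recoveries started from the
     * relocation source. The target handler and its method are
     * hypothetical names.
     */
    static void ensureTargetKnowsAllRecoveries(RelocationTarget target,
                                               long clusterStateVersionOnSource) {
        target.ensureClusterStateVersion(clusterStateVersionOnSource);
    }

    /**
     * Fix for invariant 2 (sketch): once the recovering replica has
     * opened its engine, fail the recovery if the primary it recovers
     * from has relocated in the meantime; the replica then retries
     * against a cluster state that knows the new primary.
     */
    static void failRecoveryIfPrimaryRelocated(IndexShard sourceShard) {
        if (sourceShard.state() == IndexShardState.RELOCATED) {
            throw new IndexShardRelocatedException(sourceShard.shardId());
        }
    }

    /** Hypothetical stand-in for the relocation target's recovery handler. */
    interface RelocationTarget {
        void ensureClusterStateVersion(long clusterStateVersion);
    }
}
```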
Force-pushed from 2bc4000 to d7f99a4
Primary relocation violates two invariants that ensure proper interaction between document replication and peer recoveries, ultimately leading to documents not being properly replicated (see #19248 for more details).
Closes #19248