diff --git a/docs/resiliency/index.asciidoc b/docs/resiliency/index.asciidoc index a181bfcc5afee..8c14aca953512 100644 --- a/docs/resiliency/index.asciidoc +++ b/docs/resiliency/index.asciidoc @@ -63,22 +63,6 @@ to create new scenarios. We have currently ported all published Jepsen scenarios framework. As the Jepsen tests evolve, we will continue porting new scenarios that are not covered yet. We are committed to investigating all new scenarios and will report issues that we find on this page and in our GitHub repository. -[float] -=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING) - -During a networking partition, cluster state updates (like mapping changes or shard assignments) -are committed if a majority of the master-eligible nodes received the update correctly. This means that the current master has access -to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch -up with the current state and receive the previously missed changes. However, if a second partition happens while the cluster -is still recovering from the previous one *and* the old master falls on the minority side, it may be that a new master is elected -which has not yet catch up. If that happens, cluster state updates can be lost. - -This problem is mostly fixed by {GIT}20384[#20384] (v5.0.0), which takes committed cluster state updates into account during master -election. This considerably reduces the chance of this rare problem occurring but does not fully mitigate it. If the second partition -happens concurrently with a cluster state update and blocks the cluster state commit message from reaching a majority of nodes, it may be -that the in flight update will be lost. If the now-isolated master can still acknowledge the cluster state update to the client this -will amount to the loss of an acknowledged change. Fixing that last scenario needs considerable work. We are currently working on it but have no ETA yet. - [float] === Better request retry mechanism when nodes are disconnected (STATUS: ONGOING) @@ -170,6 +154,33 @@ shard. == Completed +[float] +=== Repeated network partitions can cause cluster state updates to be lost (STATUS: DONE, v7.0.0) + +During a networking partition, cluster state updates (like mapping changes or +shard assignments) are committed if a majority of the master-eligible nodes +received the update correctly. This means that the current master has access to +enough nodes in the cluster to continue to operate correctly. When the network +partition heals, the isolated nodes catch up with the current state and receive +the previously missed changes. However, if a second partition happens while the +cluster is still recovering from the previous one *and* the old master falls on +the minority side, it may be that a new master is elected which has not yet +catch up. If that happens, cluster state updates can be lost. + +This problem is mostly fixed by {GIT}20384[#20384] (v5.0.0), which takes +committed cluster state updates into account during master election. This +considerably reduces the chance of this rare problem occurring but does not +fully mitigate it. If the second partition happens concurrently with a cluster +state update and blocks the cluster state commit message from reaching a +majority of nodes, it may be that the in flight update will be lost. If the +now-isolated master can still acknowledge the cluster state update to the client +this will amount to the loss of an acknowledged change. + +Fixing this last scenario was one of the goals of {GIT}32006[#32006] and its +sub-issues. See particularly {GIT}32171[#32171] and +https://github.com/elastic/elasticsearch-formal-models/blob/master/ZenWithTerms/tla/ZenWithTerms.tla[the +TLA+ formal model] used to verify these changes. + [float] === Divergence between primary and replica shard copies when documents deleted (STATUS: DONE, V6.3.0)