Fix FollowingEngine#lookupPrimaryTerm when sequence numbers are disabled #143935
lookupPrimaryTerm is called by FollowingEngine when a duplicate operation is detected on the primary (i.e., an operation with a seq_no that was already processed). It retrieves the primary term of the existing operation on the follower primary shard so that TransportBulkShardOperationsAction can rewrite the duplicate with the correct term before replicating it to the follower replicas, ensuring consistency between primary and replicas.
But the method currently only fetches the _seq_no when it is indexed with the SeqNoIndexOptions.POINTS_AND_DOC_VALUES option, not with DOC_VALUES_ONLY.
This is now fixed.
Also, if the _seq_no cannot be fetched because it has been merged away on the follower primary, the method now returns `OptionalLong.empty()`, as it would for duplicates that are past the global checkpoint. Since _seq_no should be retained for operations below the global checkpoint, this is not a production issue but rather a safety net to avoid throwing the IllegalStateException.
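The lookup contract described above can be sketched with a small in-memory model. This is only an illustration, not the actual FollowingEngine code: the class `PrimaryTermLookupSketch`, its map-backed storage, and the helper methods `index`, `advanceGlobalCheckpoint` and `pruneSeqNo` are all hypothetical. It also assumes that "past the global checkpoint" means a seq_no at or below the checkpoint.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical in-memory stand-in for a follower primary shard (not the real engine).
final class PrimaryTermLookupSketch {
    // seq_no -> primary term, for operations whose _seq_no is still retrievable.
    private final Map<Long, Long> retainedTerms = new HashMap<>();
    private long globalCheckpoint = -1L;

    void index(long seqNo, long primaryTerm) {
        retainedTerms.put(seqNo, primaryTerm);
    }

    void advanceGlobalCheckpoint(long checkpoint) {
        globalCheckpoint = Math.max(globalCheckpoint, checkpoint);
    }

    // Simulates a merge dropping the _seq_no of an operation.
    void pruneSeqNo(long seqNo) {
        retainedTerms.remove(seqNo);
    }

    OptionalLong lookupPrimaryTerm(long seqNo) {
        // Assumption: duplicates at or below the global checkpoint need no term.
        if (seqNo <= globalCheckpoint) {
            return OptionalLong.empty();
        }
        Long term = retainedTerms.get(seqNo);
        // Safety net: a missing _seq_no yields empty instead of an IllegalStateException.
        return term == null ? OptionalLong.empty() : OptionalLong.of(term);
    }

    public static void main(String[] args) {
        PrimaryTermLookupSketch shard = new PrimaryTermLookupSketch();
        shard.index(5L, 2L);
        System.out.println(shard.lookupPrimaryTerm(5L)); // term found
        shard.pruneSeqNo(5L);
        System.out.println(shard.lookupPrimaryTerm(5L)); // merged away, so empty
    }
}
```

In this sketch the caller only has to handle a single "no term available" case, whether the operation is covered by the global checkpoint or its _seq_no is simply no longer readable.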
|
Pinging @elastic/es-distributed (Team:Distributed) |
|
Hi @tlrx, I've created a changelog YAML for you. |
Did you mean "beyond" instead of "below"? Also, this manifests in tests only, right? |
Yes, beyond.
I think it can manifest in production too, throwing an IllegalStateException on a time-series/logsdb follower primary shard and therefore failing the engine. |
But in that case we shouldn't be pruning sequence numbers beyond the global checkpoint, should we? I think I'm missing something here. |
|
Thanks Francisco |
…led (elastic#143935)
lookupPrimaryTerm is called by FollowingEngine when a duplicate operation is detected on the primary (i.e., an operation with a seq_no that was already processed). It retrieves the primary term of the existing operation on the follower primary shard so that TransportBulkShardOperationsAction can rewrite the duplicate with the correct term before replicating it to the follower replicas, ensuring consistency between primary and replicas.
But the method currently only fetches the _seq_no when it is indexed with the SeqNoIndexOptions.POINTS_AND_DOC_VALUES option, not with DOC_VALUES_ONLY.
This is now fixed.
Also, if the _seq_no cannot be fetched because it has been merged away on the follower primary, the method now returns OptionalLong.empty(), as it would for duplicates that are past the global checkpoint. Since _seq_no should be retained for operations beyond the global checkpoint, this is not a production issue but rather a safety net to avoid throwing the IllegalStateException.
The only test we have is FollowingEngineTests.testProcessOnPrimary, but making it work for DOC_VALUES_ONLY is already a pain, and even more so with DISABLE_SEQUENCE_NUMBERS.