Fix replica writes after _seq_no doc values are pruned #144180
Merged
tlrx merged 5 commits into elastic:main on Mar 13, 2026
When sequence numbers are disabled, PruningMergePolicy removes _seq_no
doc values from merged segments once the global checkpoint advances past
them.
A subsequent write (update, delete) for the same document on the replica
then fails in compareOpToLuceneDocBasedOnSeqNo because readNumericDocValues
expects the doc value to exist.
With assertions enabled this throws an AssertionError that leaves the
primary waiting for the replica response (see test failure here). In
production (no assertions), this would cause a replica shard failure.
The fix skips loading _seq_no doc values when sequence numbers are
disabled, returning UNASSIGNED_SEQ_NO instead in docAndSeqNo, which
then matches the condition in compareOpToLuceneDocBasedOnSeqNo:
```
} else if (op.seqNo() > docAndSeqNo.seqNo) {
    status = OpVsLuceneDocStatus.OP_NEWER;
}
```
and the operation would be processed normally (OP_NEWER)
on the replica.
Relates elastic#136305
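The behaviour described above can be sketched as a small standalone model. This is a simplified illustration, not the Elasticsearch source: the class `SeqNoPruningSketch`, the `lookup` helper, and the use of `OptionalLong` to stand in for a possibly pruned doc value are all assumptions made for the example. Only the `loadSeqNo` flag, the `UNASSIGNED_SEQ_NO` sentinel, and the `OP_NEWER` comparison come from the description.

```java
import java.util.OptionalLong;

// Hypothetical, simplified model of the fix; names other than loadSeqNo,
// UNASSIGNED_SEQ_NO and OpVsLuceneDocStatus are invented for illustration.
final class SeqNoPruningSketch {
    // Sentinel for "no sequence number"; it is lower than any real
    // sequence number, which start at 0.
    static final long UNASSIGNED_SEQ_NO = -2L;

    enum OpVsLuceneDocStatus { OP_NEWER, OP_STALE_OR_EQUAL, LUCENE_DOC_NOT_FOUND }

    /** Pairs a document id with the sequence number resolved for it. */
    static final class DocIdAndSeqNo {
        final int docId;
        final long seqNo;

        DocIdAndSeqNo(int docId, long seqNo) {
            this.docId = docId;
            this.seqNo = seqNo;
        }
    }

    /**
     * Resolves the sequence number for a document. An empty storedSeqNo
     * models a _seq_no doc value that a merge has already pruned.
     */
    static DocIdAndSeqNo lookup(int docId, OptionalLong storedSeqNo, boolean loadSeqNo) {
        if (loadSeqNo == false) {
            // Sequence numbers disabled: do not touch the doc value at all.
            return new DocIdAndSeqNo(docId, UNASSIGNED_SEQ_NO);
        }
        // Reading unconditionally fails once the value has been pruned.
        long seqNo = storedSeqNo.orElseThrow(
            () -> new IllegalStateException("_seq_no doc value missing for doc " + docId));
        return new DocIdAndSeqNo(docId, seqNo);
    }

    /** Mirrors the shape of the comparison quoted in the description. */
    static OpVsLuceneDocStatus compare(long opSeqNo, DocIdAndSeqNo docAndSeqNo) {
        if (docAndSeqNo == null) {
            return OpVsLuceneDocStatus.LUCENE_DOC_NOT_FOUND;
        } else if (opSeqNo > docAndSeqNo.seqNo) {
            return OpVsLuceneDocStatus.OP_NEWER;
        } else {
            return OpVsLuceneDocStatus.OP_STALE_OR_EQUAL;
        }
    }
}
```

In this model, any replica operation with a real sequence number (0 or greater) compares as greater than the sentinel, so a lookup with `loadSeqNo == false` always resolves to `OP_NEWER`, while a lookup with `loadSeqNo == true` against a pruned value throws.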
Collaborator
Pinging @elastic/es-distributed (Team:Distributed)
romseygeek
approved these changes
Mar 13, 2026
      /** Return null if id is not found. */
    - DocIdAndSeqNo lookupSeqNo(BytesRef id, LeafReaderContext context) throws IOException {
    + DocIdAndSeqNo lookupSeqNo(BytesRef id, LeafReaderContext context, boolean loadSeqNo) throws IOException {
Contributor
Maybe rename this to lookupDocIdAndSeqNo?
Member
Author
Thanks Alan & Francisco!
szybia added a commit to szybia/elasticsearch that referenced this pull request Mar 13, 2026
michalborek pushed a commit to michalborek/elasticsearch that referenced this pull request Mar 23, 2026
When sequence numbers are disabled, PruningMergePolicy removes _seq_no doc values from merged segments once the global checkpoint advances past them.
A subsequent write (update, delete) for the same document on the replica then fails in compareOpToLuceneDocBasedOnSeqNo because readNumericDocValues expects the doc value to exist.
With assertions enabled this throws an AssertionError that leaves the primary waiting for the replica response (see test failure here). In production (no assertions), this would cause a replica shard failure.
The fix skips loading _seq_no doc values when sequence numbers are disabled, returning UNASSIGNED_SEQ_NO instead in docAndSeqNo, which then matches the condition in compareOpToLuceneDocBasedOnSeqNo:
    } else if (op.seqNo() > docAndSeqNo.seqNo) {
        status = OpVsLuceneDocStatus.OP_NEWER;
    }
and the operation would be processed normally (OP_NEWER) on the replica.
Not marking as bug since it's not released yet.
Relates elastic#136305