Expose byte offsets on XContentParser via getCurrentLocation() #143501
Merged
quackaplop merged 5 commits into elastic:main on Mar 5, 2026
Conversation
Collaborator
Pinging @elastic/es-core-infra (Team:Core/Infra)
quackaplop
added a commit
to quackaplop/elasticsearch
that referenced
this pull request
Mar 3, 2026
Contributor
Author
buildkite test this
quackaplop
added a commit
to quackaplop/elasticsearch
that referenced
this pull request
Mar 4, 2026
78ad2d7 to
ef29423
Compare
quackaplop
added a commit
to quackaplop/elasticsearch
that referenced
this pull request
Mar 4, 2026
Contributor
Nit: the description says "returns XContentLocation.UNDEFINED" but it looks like it's actually
smalyshev
reviewed
Mar 5, 2026
 * Sentinel for parsers that have no underlying stream (e.g. {@code MapXContentParser}).
 * Line and column are zero (outside the valid 1-based range), byte offset is {@code -1}.
 */
public static final XContentLocation INVALID = new XContentLocation(0, 0, -1L);
Contributor
Not sure I understand the reason why there's both UNKNOWN and INVALID and what's the difference between them. Maybe add some comment when each is supposed to be used?
Contributor
Author
Honestly, this is a bit of BCW mess. Maybe I should just remove INVALID - it is really just used in tests. Will do this.
smalyshev
reviewed
Mar 5, 2026
libs/x-content/src/main/java/org/elasticsearch/xcontent/XContentLocation.java
smalyshev
approved these changes
Mar 5, 2026
Contributor
smalyshev
left a comment
lgtm, added some nitpicks
Add getCurrentLocation() to XContentParser, returning the parser's cursor position (how far it has consumed into the stream). Together with getTokenLocation() (token start), this enables zero-copy byte-range slicing of JSON values.

- XContentLocation: add byteOffset field, UNKNOWN and UNDEFINED constants
- JsonXContentParser: pass through Jackson's byte offset
- FilterXContentParser: delegate getCurrentLocation()
- DotExpandingXContentParser: return cached location for synthetic tokens
- CompletionFieldMapper.MultiFieldParser: return fixed location
- MapXContentParser: return UNDEFINED (no underlying stream)
- ParameterizableYamlXContentParser: delegate getCurrentLocation()
End-to-end tests that navigate to values by key, nested path, and array index using standard streaming token walking, then extract via byte-offset slicing and verify against copyCurrentStructure(). Covers object/array/scalar extraction, mixed object+array paths, and pretty-printed JSON.
- Remove XContentLocation.INVALID constant; MapXContentParser returns inline new XContentLocation(0, 0) as it did before this PR.
- Add hasValidLineNumber(), hasValidColumnNumber(), hasValidByteOffset() so callers don't need to know the magic sentinel values.
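The validity methods named in this commit could look roughly like the sketch below. It is a hypothetical stand-in rather than the actual XContentLocation source: the record name SketchLocation and the exact thresholds (line and column valid from 1, byte offset valid from 0) are assumptions inferred from the review thread, which describes zero as outside the valid 1-based range and -1 as "not available".

```java
/**
 * Hypothetical stand-in for XContentLocation, NOT the Elasticsearch class.
 * Assumption: line/column are 1-based, and a byte offset of -1 means "not available".
 */
record SketchLocation(int lineNumber, int columnNumber, long byteOffset) {

    /** Two-argument form; the byte offset defaults to -1, as the PR description states. */
    SketchLocation(int lineNumber, int columnNumber) {
        this(lineNumber, columnNumber, -1L);
    }

    /** Assumed rule: any line number below 1 is a sentinel. */
    boolean hasValidLineNumber() {
        return lineNumber >= 1;
    }

    /** Assumed rule: any column number below 1 is a sentinel. */
    boolean hasValidColumnNumber() {
        return columnNumber >= 1;
    }

    /** Assumed rule: negative offsets mean the parser has no byte position. */
    boolean hasValidByteOffset() {
        return byteOffset >= 0;
    }
}
```

With this shape, a caller can guard a slicing fast path with hasValidByteOffset() and fall back to token-by-token copying otherwise, without comparing against 0 or -1 directly.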
913b1c5 to
38501c9
Compare
jfreden
pushed a commit
to jfreden/elasticsearch
that referenced
this pull request
Mar 5, 2026
…ic#143501)

* Expose byte offsets on XContentParser via getCurrentLocation()

Add getCurrentLocation() to XContentParser, returning the parser's cursor position (how far it has consumed into the stream). Together with getTokenLocation() (token start), this enables zero-copy byte-range slicing of JSON values.

- XContentLocation: add byteOffset field, UNKNOWN and UNDEFINED constants
- JsonXContentParser: pass through Jackson's byte offset
- FilterXContentParser: delegate getCurrentLocation()
- DotExpandingXContentParser: return cached location for synthetic tokens
- CompletionFieldMapper.MultiFieldParser: return fixed location
- MapXContentParser: return UNDEFINED (no underlying stream)
- ParameterizableYamlXContentParser: delegate getCurrentLocation()

* Add changelog for elastic#143501

* Add streaming JSON navigation + byte-slicing tests

End-to-end tests that navigate to values by key, nested path, and array index using standard streaming token walking, then extract via byte-offset slicing and verify against copyCurrentStructure(). Covers object/array/scalar extraction, mixed object+array paths, and pretty-printed JSON.

* Rename XContentLocation.UNDEFINED to INVALID

* Address review feedback: remove INVALID, add validity methods

- Remove XContentLocation.INVALID constant; MapXContentParser returns inline new XContentLocation(0, 0) as it did before this PR.
- Add hasValidLineNumber(), hasValidColumnNumber(), hasValidByteOffset() so callers don't need to know the magic sentinel values.
quackaplop
added a commit
to quackaplop/elasticsearch
that referenced
this pull request
Mar 5, 2026
…traction

Replace copyCurrentStructure() re-serialization with zero-copy byte slicing for JSON input. When the extracted value is an object, array, or number, slice bytes directly from the input buffer using XContentLocation.byteOffset() offsets (exposed in elastic#143501). Also refactors navigation from recursive descent to iterative loop, confining raw byte access to the extraction point. Adds JMH benchmarks for JSON_EXTRACT through the full eval pipeline.
spinscale
pushed a commit
to spinscale/elasticsearch
that referenced
this pull request
Mar 6, 2026
quackaplop
added a commit
that referenced
this pull request
Mar 6, 2026
…traction (#143702)

* JSON_EXTRACT: zero-copy byte slicing for object, array, and number extraction

Replace copyCurrentStructure() re-serialization with zero-copy byte slicing for JSON input. When the extracted value is an object, array, or number, slice bytes directly from the input buffer using XContentLocation.byteOffset() offsets (exposed in #143501). Also refactors navigation from recursive descent to iterative loop, confining raw byte access to the extraction point. Adds JMH benchmarks for JSON_EXTRACT through the full eval pipeline.

* Add changelog for #143702

* [CI] Auto commit changes from spotless

* Clean up navigation helpers to avoid threading unused parameters

Navigation methods now only position the parser; they no longer carry builder, segments, depth, rawBytes, or rawOffset.

* Use full variable names instead of abbreviations

---------

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
sidosera
pushed a commit
to sidosera/elasticsearch
that referenced
this pull request
Mar 6, 2026
Summary

Adds byte-offset awareness to XContentParser, enabling callers to determine the exact byte range of any token or sub-structure in the underlying stream. This is a prerequisite for zero-copy byte slicing in JSON_EXTRACT (PR #142375), where extracting an object/array result can become a direct Arrays.copyOfRange() instead of walking every token via copyCurrentStructure().

Changes

New API surface:
- XContentParser.getCurrentLocation(): returns the parser's current byte position (just past the last consumed byte), complementing the existing getTokenLocation() (start of current token). Together they define the byte range [tokenLocation, currentLocation) for slicing.
- XContentLocation record gains a byteOffset field (third component). The existing 2-arg constructor defaults it to -1 (not available). A new UNDEFINED constant (0, 0, -1) is added for parsers with no underlying byte stream (e.g., MapXContentParser).

Implementation across parser hierarchy:
- JsonXContentParser: passes through Jackson's getByteOffset() from both currentTokenLocation() and currentLocation()
- FilterXContentParser: delegates getCurrentLocation() to the wrapped parser (all 11+ decorators inherit this)
- DotExpandingXContentParser: returns cached location for synthetic dot-expanded tokens, delegates for original content
- CompletionFieldMapper.MultiFieldParser: returns fixed locationOffset (no real positions for rewritten metadata)
- MapXContentParser: returns XContentLocation.UNDEFINED (no underlying stream)
- ParameterizableYamlXContentParser: delegates to inner parser

Byte slicing feasibility by format:
[table not recoverable; only the fragment "-1)" survives]

Tests

- XContentLocationTests: record constructors, UNKNOWN vs UNDEFINED, equality including byteOffset
- JsonXContentParserByteOffsetTests: 15 tests covering token offsets, object/array/scalar/boolean/null slicing, multi-byte UTF-8, surrogate pairs (4-byte U+1F389), empty containers, multi-line JSON
- XContentParserTests: cross-format tests (YAML returns -1, CBOR/Smile have offsets), FilterXContentParser delegation
- MapXContentParserTests: both location methods return UNDEFINED
- DotExpandingXContentParserTests: getCurrentLocation() mirrors getTokenLocation() pattern (synthetic vs original tokens)
- CompletionFieldMapperTests: getCurrentLocation() assertions alongside existing getTokenLocation()
- GeoPointFieldMapperTests: getCurrentLocation() delegation verified

Closes #142873
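
To illustrate the half-open byte range [tokenLocation, currentLocation) from the description, here is a minimal standalone sketch of slicing a nested value out of a UTF-8 JSON buffer with Arrays.copyOfRange(). It deliberately does not use XContentParser or Jackson: the class ByteRangeSliceSketch and its hand-rolled endOfContainer() scanner are hypothetical stand-ins for the offsets that getTokenLocation() and getCurrentLocation() would supply, and the start offset is hard-coded for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch; not part of the PR and not the Elasticsearch implementation. */
final class ByteRangeSliceSketch {

    /**
     * Returns the exclusive end offset of the JSON object or array that starts at {@code start}.
     * Stands in for the "current location" a parser would report after consuming the value.
     */
    static int endOfContainer(byte[] json, int start) {
        int depth = 0;
        boolean inString = false;
        for (int i = start; i < json.length; i++) {
            byte b = json[i];
            if (inString) {
                if (b == '\\') {
                    i++; // skip the byte that follows a backslash escape
                } else if (b == '"') {
                    inString = false;
                }
            } else if (b == '"') {
                inString = true;
            } else if (b == '{' || b == '[') {
                depth++;
            } else if (b == '}' || b == ']') {
                depth--;
                if (depth == 0) {
                    return i + 1; // one past the closing bracket
                }
            }
        }
        throw new IllegalArgumentException("unterminated container at offset " + start);
    }

    /** Copies the half-open range [tokenStart, currentEnd) without re-serializing tokens. */
    static byte[] slice(byte[] json, long tokenStart, long currentEnd) {
        return Arrays.copyOfRange(json, Math.toIntExact(tokenStart), Math.toIntExact(currentEnd));
    }

    /** Extracts the value of "a" from a small document containing a multi-byte character. */
    static String demo() {
        byte[] json = "{\"a\":{\"b\":[1,\"\u00e9]\"]},\"c\":2}".getBytes(StandardCharsets.UTF_8);
        int start = 5; // byte offset of the '{' that opens the value of "a"
        int end = endOfContainer(json, start);
        return new String(slice(json, start, end), StandardCharsets.UTF_8);
    }
}
```

The sample input contains a two-byte character and a closing bracket inside a string, which is why the range has to be expressed in bytes rather than characters and why bracket counting must skip string contents. In the real parser, those concerns are handled by the tokenizer that produces the offsets.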