Expose byte offsets on XContentParser via getCurrentLocation()#143501

Merged
quackaplop merged 5 commits into elastic:main from quackaplop:feature/xcontent-location-offsets
Mar 5, 2026

@quackaplop
Contributor

Summary

Adds byte-offset awareness to XContentParser, enabling callers to determine the exact byte range of any token or sub-structure in the underlying stream. This is a prerequisite for zero-copy byte slicing in JSON_EXTRACT (PR #142375), where extracting an object/array result can become a direct Arrays.copyOfRange() instead of walking every token via copyCurrentStructure().

Changes

New API surface:

  • XContentParser.getCurrentLocation() — returns the parser's current byte position (just past the last consumed byte), complementing the existing getTokenLocation() (start of current token). Together they define the byte range [tokenLocation, currentLocation) for slicing.
  • XContentLocation record gains a byteOffset field (third component). The existing 2-arg constructor defaults it to -1 (not available). A new UNDEFINED constant (0, 0, -1) is added for parsers with no underlying byte stream (e.g., MapXContentParser).
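
The half-open range [tokenLocation, currentLocation) can be illustrated with a small self-contained sketch. `Location` below is a hypothetical stand-in for XContentLocation (it is not the Elasticsearch class), and the offsets are hard-coded rather than read from a parser, so the example runs on a plain JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ByteRangeSliceSketch {
    /** Hypothetical stand-in for XContentLocation; not the Elasticsearch class. */
    record Location(int lineNumber, int columnNumber, long byteOffset) {
        /** Two-argument form: the byte offset defaults to -1 (not available). */
        Location(int lineNumber, int columnNumber) {
            this(lineNumber, columnNumber, -1L);
        }
    }

    /** Copies the bytes in [tokenStart.byteOffset, current.byteOffset). */
    static byte[] slice(byte[] source, Location tokenStart, Location current) {
        if (tokenStart.byteOffset() < 0 || current.byteOffset() < 0) {
            throw new IllegalArgumentException("byte offsets are not available");
        }
        int from = Math.toIntExact(tokenStart.byteOffset());
        int to = Math.toIntExact(current.byteOffset());
        return Arrays.copyOfRange(source, from, to);
    }

    public static void main(String[] args) {
        byte[] json = "{\"a\":{\"b\":1},\"c\":2}".getBytes(StandardCharsets.UTF_8);
        // Assumed positions: the inner object starts at byte 5 and the cursor
        // sits at byte 12, just past its closing brace.
        Location tokenStart = new Location(1, 6, 5L);
        Location current = new Location(1, 13, 12L);
        byte[] inner = slice(json, tokenStart, current);
        System.out.println(new String(inner, StandardCharsets.UTF_8)); // prints {"b":1}
    }
}
```

The guard on negative offsets matters because a location built with the two-argument form carries -1 and cannot be used for slicing.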

Implementation across parser hierarchy:

  • JsonXContentParser — passes through Jackson's getByteOffset() from both currentTokenLocation() and currentLocation()
  • FilterXContentParser — delegates getCurrentLocation() to the wrapped parser (all 11+ decorators inherit this)
  • DotExpandingXContentParser — returns cached location for synthetic dot-expanded tokens, delegates for original content
  • CompletionFieldMapper.MultiFieldParser — returns fixed locationOffset (no real positions for rewritten metadata)
  • MapXContentParser — returns XContentLocation.UNDEFINED (no underlying stream)
  • ParameterizableYamlXContentParser — delegates to inner parser
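
The delegation pattern in the list above can be sketched roughly as follows. All type names here (`Parser`, `FilterParser`, `PassThroughDecorator`, `MapBackedParser`, `StreamParser`) are hypothetical, heavily reduced stand-ins for illustration, not the real Elasticsearch classes.

```java
final class LocationDelegationSketch {
    /** Hypothetical stand-in for XContentLocation. */
    record Location(int lineNumber, int columnNumber, long byteOffset) {}

    /** Hypothetical, heavily reduced parser interface. */
    interface Parser {
        Location getTokenLocation();
        Location getCurrentLocation();
    }

    /** Stream-backed stub with fixed positions, standing in for a JSON parser. */
    record StreamParser(Location tokenStart, Location cursor) implements Parser {
        @Override public Location getTokenLocation() { return tokenStart; }
        @Override public Location getCurrentLocation() { return cursor; }
    }

    /** Decorator base: forwards both location calls to the wrapped parser. */
    static class FilterParser implements Parser {
        private final Parser delegate;
        FilterParser(Parser delegate) { this.delegate = delegate; }
        @Override public Location getTokenLocation() { return delegate.getTokenLocation(); }
        @Override public Location getCurrentLocation() { return delegate.getCurrentLocation(); }
    }

    /** A decorator that overrides nothing inherits the delegation for free. */
    static final class PassThroughDecorator extends FilterParser {
        PassThroughDecorator(Parser delegate) { super(delegate); }
    }

    /** Parser over an in-memory map: there is no stream, so no byte offset. */
    static final class MapBackedParser implements Parser {
        private static final Location NO_STREAM = new Location(0, 0, -1L);
        @Override public Location getTokenLocation() { return NO_STREAM; }
        @Override public Location getCurrentLocation() { return NO_STREAM; }
    }

    public static void main(String[] args) {
        Parser wrapped = new PassThroughDecorator(
            new StreamParser(new Location(1, 1, 0L), new Location(1, 8, 7L)));
        System.out.println(wrapped.getCurrentLocation().byteOffset());               // prints 7
        System.out.println(new MapBackedParser().getCurrentLocation().byteOffset()); // prints -1
    }
}
```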

Byte slicing feasibility by format:

| Format | Byte offsets? | Slicing safe? | Notes |
|--------|---------------|---------------|-------|
| JSON | Yes | Yes | Sliced sub-structure is valid standalone JSON |
| CBOR | Yes | Yes | Self-contained data items |
| SMILE | Yes | No | Back-references for repeated field names |
| YAML | No (-1) | No | Whitespace-sensitive, anchors/aliases |
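
A caller could encode this table as a guard before attempting a slice. The sketch below is only one possible shape; the `Format` enum and `canSlice` helper are hypothetical names and are not part of this PR.

```java
final class SliceFeasibilitySketch {
    /** Mirrors the feasibility table above; hypothetical, not an Elasticsearch type. */
    enum Format {
        JSON(true, true),
        CBOR(true, true),
        SMILE(true, false),  // offsets exist, but back-references make slices unsafe
        YAML(false, false);  // offsets are reported as -1

        final boolean reportsByteOffsets;
        final boolean slicingSafe;

        Format(boolean reportsByteOffsets, boolean slicingSafe) {
            this.reportsByteOffsets = reportsByteOffsets;
            this.slicingSafe = slicingSafe;
        }
    }

    /** True only when the format permits slicing and both offsets are usable. */
    static boolean canSlice(Format format, long startOffset, long endOffset) {
        return format.slicingSafe && startOffset >= 0 && endOffset >= startOffset;
    }

    public static void main(String[] args) {
        System.out.println(canSlice(Format.JSON, 5, 12));  // prints true
        System.out.println(canSlice(Format.SMILE, 5, 12)); // prints false
        System.out.println(canSlice(Format.YAML, -1, -1)); // prints false
    }
}
```

When the guard returns false, the caller would fall back to token-by-token copying.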

Tests

  • XContentLocationTests — record constructors, UNKNOWN vs UNDEFINED, equality including byteOffset
  • JsonXContentParserByteOffsetTests — 15 tests covering token offsets, object/array/scalar/boolean/null slicing, multi-byte UTF-8, surrogate pairs (4-byte U+1F389), empty containers, multi-line JSON
  • XContentParserTests — cross-format tests (YAML returns -1, CBOR/Smile have offsets), FilterXContentParser delegation
  • MapXContentParserTests — both location methods return UNDEFINED
  • DotExpandingXContentParserTests: getCurrentLocation() mirrors the getTokenLocation() pattern (synthetic vs original tokens)
  • CompletionFieldMapperTests: getCurrentLocation() assertions alongside existing getTokenLocation()
  • GeoPointFieldMapperTests: getCurrentLocation() delegation verified
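
The multi-byte and surrogate-pair cases matter because byte offsets diverge from Java `char` indexes as soon as non-ASCII text appears. A minimal JDK-only illustration follows; the helper `byteOffsetOf` is a hypothetical name and is not taken from the tests.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8OffsetSketch {
    /**
     * UTF-8 byte offset corresponding to a char index.
     * The index must not split a surrogate pair.
     */
    static int byteOffsetOf(String text, int charIndex) {
        return text.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+1F389 is 2 chars in a Java String but 4 bytes in UTF-8.
        String party = new String(Character.toChars(0x1F389));
        String json = "{\"k\":\"" + party + "\",\"n\":1}";
        byte[] bytes = json.getBytes(StandardCharsets.UTF_8);

        int charIndex = json.lastIndexOf('1');
        int byteOffset = byteOffsetOf(json, charIndex);

        System.out.println(json.length() + " chars, " + bytes.length + " bytes"); // prints 16 chars, 18 bytes
        System.out.println("char index " + charIndex + ", byte offset " + byteOffset); // prints char index 14, byte offset 16

        // Slicing must use the byte offset, not the char index.
        byte[] value = Arrays.copyOfRange(bytes, byteOffset, byteOffset + 1);
        System.out.println(new String(value, StandardCharsets.UTF_8)); // prints 1
    }
}
```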

Closes #142873

@quackaplop quackaplop requested a review from a team as a code owner March 3, 2026 16:42
@quackaplop quackaplop added the >enhancement, :Core/Infra/Core, and Team:Core/Infra labels Mar 3, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

quackaplop added a commit to quackaplop/elasticsearch that referenced this pull request Mar 3, 2026
@quackaplop
Contributor Author

buildkite test this

@quackaplop quackaplop marked this pull request as draft March 3, 2026 18:18
quackaplop added a commit to quackaplop/elasticsearch that referenced this pull request Mar 4, 2026
@quackaplop quackaplop force-pushed the feature/xcontent-location-offsets branch 2 times, most recently from 78ad2d7 to ef29423 March 4, 2026 14:12
quackaplop added a commit to quackaplop/elasticsearch that referenced this pull request Mar 4, 2026
@quackaplop quackaplop marked this pull request as ready for review March 4, 2026 14:13
@quackaplop quackaplop requested review from Mikep86 and smalyshev March 4, 2026 22:57
@smalyshev
Contributor

Nit: the description says "returns XContentLocation.UNDEFINED" but it looks like it's actually XContentLocation.INVALID?

* Sentinel for parsers that have no underlying stream (e.g. {@code MapXContentParser}).
* Line and column are zero (outside the valid 1-based range), byte offset is {@code -1}.
*/
public static final XContentLocation INVALID = new XContentLocation(0, 0, -1L);
Contributor

Not sure I understand the reason why there's both UNKNOWN and INVALID and what's the difference between them. Maybe add some comment when each is supposed to be used?

Contributor Author

Honestly, this is a bit of a BWC mess. Maybe I should just remove INVALID, since it is really only used in tests. Will do this.

Contributor

@smalyshev smalyshev left a comment

lgtm, added some nitpicks

@quackaplop quackaplop enabled auto-merge (squash) March 5, 2026 11:04
@elasticsearchmachine elasticsearchmachine added the serverless-linked label Mar 5, 2026
Add getCurrentLocation() to XContentParser, returning the parser's
cursor position (how far it has consumed into the stream). Together
with getTokenLocation() (token start), this enables zero-copy
byte-range slicing of JSON values.

- XContentLocation: add byteOffset field, UNKNOWN and UNDEFINED constants
- JsonXContentParser: pass through Jackson's byte offset
- FilterXContentParser: delegate getCurrentLocation()
- DotExpandingXContentParser: return cached location for synthetic tokens
- CompletionFieldMapper.MultiFieldParser: return fixed location
- MapXContentParser: return UNDEFINED (no underlying stream)
- ParameterizableYamlXContentParser: delegate getCurrentLocation()

End-to-end tests that navigate to values by key, nested path,
and array index using standard streaming token walking, then
extract via byte-offset slicing and verify against
copyCurrentStructure(). Covers object/array/scalar extraction,
mixed object+array paths, and pretty-printed JSON.

- Remove XContentLocation.INVALID constant; MapXContentParser returns
  inline new XContentLocation(0, 0) as it did before this PR.
- Add hasValidLineNumber(), hasValidColumnNumber(), hasValidByteOffset()
  so callers don't need to know the magic sentinel values.
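
For illustration, validity helpers of this kind might look like the sketch below. The thresholds are assumptions inferred from this thread (line and column are 1-based, a byte offset of -1 means "not available"); the real method bodies are not shown here, and `Location` is a hypothetical stand-in for XContentLocation.

```java
final class LocationValiditySketch {
    /** Hypothetical stand-in for XContentLocation; thresholds are assumptions. */
    record Location(int lineNumber, int columnNumber, long byteOffset) {
        /** Two-argument form: the byte offset defaults to -1 (not available). */
        Location(int lineNumber, int columnNumber) {
            this(lineNumber, columnNumber, -1L);
        }

        // Assumed rule: lines are 1-based, so 0 and negatives are sentinels.
        boolean hasValidLineNumber() { return lineNumber >= 1; }

        // Assumed rule: columns are 1-based as well.
        boolean hasValidColumnNumber() { return columnNumber >= 1; }

        // Assumed rule: offsets are 0-based, and -1 marks "not available".
        boolean hasValidByteOffset() { return byteOffset >= 0; }
    }

    public static void main(String[] args) {
        Location fromJson = new Location(1, 6, 5L);
        Location fromMap = new Location(0, 0); // no underlying stream
        System.out.println(fromJson.hasValidByteOffset()); // prints true
        System.out.println(fromMap.hasValidLineNumber());  // prints false
        System.out.println(fromMap.hasValidByteOffset());  // prints false
    }
}
```

Callers can then branch on `hasValidByteOffset()` instead of comparing against -1 directly.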
@quackaplop quackaplop force-pushed the feature/xcontent-location-offsets branch from 913b1c5 to 38501c9 March 5, 2026 12:26
@quackaplop quackaplop merged commit 71cf379 into elastic:main Mar 5, 2026
34 checks passed
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Mar 5, 2026
…ic#143501)

* Expose byte offsets on XContentParser via getCurrentLocation()

Add getCurrentLocation() to XContentParser, returning the parser's
cursor position (how far it has consumed into the stream). Together
with getTokenLocation() (token start), this enables zero-copy
byte-range slicing of JSON values.

- XContentLocation: add byteOffset field, UNKNOWN and UNDEFINED constants
- JsonXContentParser: pass through Jackson's byte offset
- FilterXContentParser: delegate getCurrentLocation()
- DotExpandingXContentParser: return cached location for synthetic tokens
- CompletionFieldMapper.MultiFieldParser: return fixed location
- MapXContentParser: return UNDEFINED (no underlying stream)
- ParameterizableYamlXContentParser: delegate getCurrentLocation()

* Add changelog for elastic#143501

* Add streaming JSON navigation + byte-slicing tests

End-to-end tests that navigate to values by key, nested path,
and array index using standard streaming token walking, then
extract via byte-offset slicing and verify against
copyCurrentStructure(). Covers object/array/scalar extraction,
mixed object+array paths, and pretty-printed JSON.

* Rename XContentLocation.UNDEFINED to INVALID

* Address review feedback: remove INVALID, add validity methods

- Remove XContentLocation.INVALID constant; MapXContentParser returns
  inline new XContentLocation(0, 0) as it did before this PR.
- Add hasValidLineNumber(), hasValidColumnNumber(), hasValidByteOffset()
  so callers don't need to know the magic sentinel values.
quackaplop added a commit to quackaplop/elasticsearch that referenced this pull request Mar 5, 2026
…traction

Replace copyCurrentStructure() re-serialization with zero-copy byte
slicing for JSON input. When the extracted value is an object, array,
or number, slice bytes directly from the input buffer using
XContentLocation.byteOffset() offsets (exposed in elastic#143501).

Also refactors navigation from recursive descent to iterative loop,
confining raw byte access to the extraction point. Adds JMH benchmarks
for JSON_EXTRACT through the full eval pipeline.
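
As a rough, JDK-only illustration of slicing a value straight out of the input buffer: in the actual change the start and end offsets come from the parser's getTokenLocation() and getCurrentLocation(). The hand-rolled `endOfValue` scanner below is only a stand-in for that so the example runs without Elasticsearch or Jackson, and all names in it are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SliceExtractionSketch {
    /** Returns the exclusive end offset of the object, array, or number at start. */
    static int endOfValue(byte[] json, int start) {
        byte first = json[start];
        if (first == '{' || first == '[') {
            int depth = 0;
            boolean inString = false;
            for (int i = start; i < json.length; i++) {
                byte b = json[i];
                if (inString) {
                    if (b == '\\') {
                        i++; // skip the escaped byte
                    } else if (b == '"') {
                        inString = false;
                    }
                } else if (b == '"') {
                    inString = true;
                } else if (b == '{' || b == '[') {
                    depth++;
                } else if (b == '}' || b == ']') {
                    if (--depth == 0) {
                        return i + 1;
                    }
                }
            }
            throw new IllegalArgumentException("unterminated container");
        }
        int i = start;
        while (i < json.length && "+-.eE0123456789".indexOf(json[i]) >= 0) {
            i++; // consume number characters
        }
        if (i == start) {
            throw new IllegalArgumentException("value is not sliceable");
        }
        return i;
    }

    /** Copies the raw bytes of the value starting at start. */
    static byte[] sliceAt(byte[] json, int start) {
        return Arrays.copyOfRange(json, start, endOfValue(json, start));
    }

    public static void main(String[] args) {
        // ASCII-only document, so char indexes from String.indexOf equal byte offsets.
        String doc = "{\"id\":42,\"tags\":[\"a\",\"]b\"],\"meta\":{\"x\":[1,2]}}";
        byte[] bytes = doc.getBytes(StandardCharsets.UTF_8);
        int tagsStart = doc.indexOf("\"tags\":") + "\"tags\":".length();
        System.out.println(new String(sliceAt(bytes, tagsStart), StandardCharsets.UTF_8)); // prints ["a","]b"]
    }
}
```

Strings are rejected by this sketch rather than sliced, matching the commit message's scope of object, array, and number values.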
spinscale pushed a commit to spinscale/elasticsearch that referenced this pull request Mar 6, 2026
…ic#143501)

quackaplop added a commit that referenced this pull request Mar 6, 2026
…traction (#143702)

* JSON_EXTRACT: zero-copy byte slicing for object, array, and number extraction

Replace copyCurrentStructure() re-serialization with zero-copy byte
slicing for JSON input. When the extracted value is an object, array,
or number, slice bytes directly from the input buffer using
XContentLocation.byteOffset() offsets (exposed in #143501).

Also refactors navigation from recursive descent to iterative loop,
confining raw byte access to the extraction point. Adds JMH benchmarks
for JSON_EXTRACT through the full eval pipeline.

* Add changelog for #143702

* [CI] Auto commit changes from spotless

* Clean up navigation helpers to avoid threading unused parameters

Navigation methods now only position the parser — they no longer carry
builder, segments, depth, rawBytes, or rawOffset.

* Use full variable names instead of abbreviations

---------

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
sidosera pushed a commit to sidosera/elasticsearch that referenced this pull request Mar 6, 2026
…traction (elastic#143702)


Labels

:Core/Infra/Core, >enhancement, serverless-linked, Team:Core/Infra, v9.4.0

Development

Successfully merging this pull request may close these issues.

Expose byte offsets on XContentParser for zero-copy sub-structure extraction

3 participants