Validate individual offset values in BULK_OFFSETS bounds checks#144643

Merged
ChrisHegarty merged 18 commits into elastic:main from
ChrisHegarty:validate-bulk-offset-values
Mar 26, 2026
Conversation

@ChrisHegarty
Contributor

@ChrisHegarty ChrisHegarty commented Mar 20, 2026

While working on bulk sparse scoring (#144557), I noticed that checkBulkOffsets and checkBBQBulkOffsets validated segment sizes but not individual offset values. An out-of-range or negative offset would silently read memory beyond the data segment, risking a crash or silently wrong results.

The solution is to replace the sequential size check with per-offset validation that checks that each offset points to a valid vector within the data segment. The O(count) loop should be negligible relative to the O(count * dims) native call, but we've made the checks conditional on asserts so that they add no cost in production; asserts should be good enough given our testing.
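
A minimal sketch of what a per-offset check of this kind could look like. This is not the code from the PR. It assumes offsets are vector ordinals held in an `int[]`, and that vector `i` starts at byte `i * pitch` and spans `rowBytes` bytes. The class name, method name, and parameters are all hypothetical.

```java
final class BulkOffsetCheckSketch {

    // Hypothetical helper: returns true only if each of the first `count`
    // offsets selects a whole vector inside a data region of `dataBytes` bytes.
    static boolean offsetsInBounds(int[] offsets, int count, long rowBytes, long pitch, long dataBytes) {
        if (count < 0 || count > offsets.length) return false;
        if (rowBytes <= 0 || pitch < rowBytes || dataBytes < rowBytes) return false;
        // Largest ordinal whose vector still ends inside the data region.
        // Dividing avoids overflow in: offset * pitch + rowBytes <= dataBytes.
        long maxOffset = (dataBytes - rowBytes) / pitch;
        for (int i = 0; i < count; i++) {
            long offset = offsets[i];
            if (offset < 0 || offset > maxOffset) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        long rowBytes = 4 * Float.BYTES; // assumed layout: 4 float dimensions per vector
        long dataBytes = 10 * rowBytes;  // 10 tightly packed vectors
        System.out.println(offsetsInBounds(new int[] {0, 3, 9}, 3, rowBytes, rowBytes, dataBytes));
        System.out.println(offsetsInBounds(new int[] {0, 10}, 2, rowBytes, rowBytes, dataBytes));
        System.out.println(offsetsInBounds(new int[] {-1}, 1, rowBytes, rowBytes, dataBytes));
    }
}
```

Because the helper returns a boolean, a caller could gate it with `assert offsetsInBounds(...)`, so the loop would only run when assertions are enabled.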

Note: INT4 skips size=2 (packedLen=1) because checkBulkOffsets computes rowBytes = packedLen * 4 / 8 which truncates to 0 via integer division, making the bounds check trivially pass. This is a pre-existing issue with how INT4 passes packed byte length (not element count) as the length parameter to the generic check formula. We can address this separately, if needed.
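
The truncation described in the note can be reproduced in isolation. The sketch below only illustrates the arithmetic; the helper name `rowBytes` and its parameters are hypothetical and do not mirror the actual signature of checkBulkOffsets.

```java
final class Int4RowBytesSketch {

    // Formula quoted in the note: bytes per row from a length and a bit width.
    static int rowBytes(int length, int bitsPerElement) {
        return length * bitsPerElement / 8; // integer division truncates toward zero
    }

    public static void main(String[] args) {
        // Packed byte length passed as the length (1 byte holds the two 4-bit values of size=2):
        System.out.println(rowBytes(1, 4)); // 1 * 4 / 8 truncates to 0
        // Element count passed as the length instead:
        System.out.println(rowBytes(2, 4)); // 2 * 4 / 8 is 1
    }
}
```

With a row size of 0, any bound that is computed as a multiple of the row size also collapses to 0, which is why the check passes trivially for that case.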

@ChrisHegarty ChrisHegarty requested a review from a team as a code owner March 20, 2026 12:43
@ChrisHegarty ChrisHegarty added the >test, :Search Relevance/Vectors, and Team:Search Relevance labels Mar 20, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

Member

@benwtrent benwtrent left a comment

This would have been a very sneaky bug to track down!

@ldematte
Contributor

> Note: INT4 skips size=2 (packedLen=1) because checkBulkOffsets computes rowBytes = packedLen * 4 / 8 which truncates to 0 via integer division, making the bounds check trivially pass. This is a pre-existing issue with how INT4 passes packed byte length (not element count) as the length parameter to the generic check formula. We can address this separately, if needed.

I think @thecoop already noticed and fixed that?

Contributor

@ldematte ldematte left a comment

I like the change, but I wonder if we should make that an "assert-like" check, running only with assertions enabled (e.g. in tests).

@ChrisHegarty
Contributor Author

> I think @thecoop already noticed and fixed that?

It's still broken in main, but I did not try to fix it here. Just test and avoid it for now.

> I like the change, but I wonder if we should make that a "assert-like" check, running only with assertions enabled (e.g. in tests)

I dunno. Maybe. I am worried about dereferencing the wrong memory, so checks are super important. The existing checks are always present, so I just followed the same pattern - tho this is expanding somewhat. I don't think that it will have any real noticeable effect on performance. However, we could check afterwards and move these to asserts if needed?

@ldematte
Contributor

I was already not 100% sure about the existing checks to be honest :)
I'm not sure how "light" they are, even compared with bulk operations. Remember, even the cost of calling a function is visible: with the small-ish bulk sizes we have and the level of optimization we get with SIMD, we are talking about nanoseconds. But you are probably right and I'm worrying for nothing.

> I am worried about dereferencing the wrong memory, so checks are super important

Well, that ship has sailed the day we decided to go with native code I think :D It's true that in this case the check is meaningful though.

@thecoop
Member

thecoop commented Mar 23, 2026

Do we have info on what performance effect this has? It's changing the check from O(1) to some kind of O(n), so it's going to have some effect.

@thecoop
Member

thecoop commented Mar 23, 2026

Could we have a more in-depth check with assertions, and an O(1) top-level sanity check for production? All the code paths should be covered during tests, so there should be no need to run the full checks in production...right?

@ldematte
Contributor

> Could we have a more in-depth check with assertions, and a O(1) top-level sanity check for production? All the code paths should be covered during tests, so there should be no need to run the full checks in production...right?

That could be a good middle ground; my suggestion was along the same lines, but a bit more radical -- assertions should be enough as tests should cover us, and we should have validation somewhere so that it's not possible to generate invalid data (e.g. ordinals that are negative or > the num of vectors).

@ChrisHegarty
Contributor Author

The O(count) per-offset bounds checks in checkBulkOffsets and checkBBQBulkOffsets are moved into separate validateBulkOffsets / validateBBQBulkOffsets methods, called via assert so they have zero cost in production. The validate methods also add alignment checks on the offsets/result segments and non-negative/positive guards on count, length, and pitch.
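
A rough sketch of the assert-gated pattern described above, not the actual validateBulkOffsets code. It assumes raw addresses stand in for the offsets and result segments, that offsets are 4-byte ints and results are 4-byte floats, and that count must be non-negative while length and pitch must be positive. All names here are hypothetical, and the per-offset loop is left out for brevity.

```java
final class AssertGatedValidationSketch {

    // Full validation; returns a boolean so it can serve as an assert condition.
    static boolean validate(long offsetsAddress, long resultsAddress, int count, int length, int pitch) {
        if (count < 0 || length <= 0 || pitch <= 0) return false;
        if (offsetsAddress % Integer.BYTES != 0) return false; // assumed: int offsets, 4-byte aligned
        if (resultsAddress % Float.BYTES != 0) return false;   // assumed: float results, 4-byte aligned
        return true;
    }

    // Entry point: the validation only runs when the JVM is started with -ea.
    static void check(long offsetsAddress, long resultsAddress, int count, int length, int pitch) {
        assert validate(offsetsAddress, resultsAddress, count, length, pitch) : "invalid bulk arguments";
    }

    public static void main(String[] args) {
        check(64, 128, 8, 384, 384); // valid arguments pass with or without -ea
        boolean assertionsOn = false;
        assert assertionsOn = true;  // idiom: the assignment only happens under -ea
        System.out.println("assertions enabled: " + assertionsOn);
    }
}
```

When assertions are disabled, the JVM does not evaluate the assert condition at all, so `validate` is never called on that path.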

Contributor

@ldematte ldematte left a comment

LGTM!
But what about Int4? Is that fixed already, or do we want to address it here, or in a separate PR?

@ChrisHegarty
Contributor Author

> LGTM! But what about Int4? Is that fixed already, or do we want to address it here, or in a separate PR?

I already added an int4 unit test in this PR, so it should be covered and verified. Is there something else that I'm missing?

@ldematte
Contributor

> I already added an int4 unit test in this PR, so it should be covered and verified. Is there something else that I'm missing?

I don't know.. probably not, I was referring to the note in the PR description. If that's solved, probably you just need to update the description.

@ChrisHegarty
Contributor Author

> > I already added an int4 unit test in this PR, so it should be covered and verified. Is there something else that I'm missing?
>
> I don't know.. probably not, I was referring to the note in the PR description. If that's solved, probably you just need to update the description.

Oh yeah. That's a separate issue. If we want to do it at all.

@ChrisHegarty ChrisHegarty merged commit 0106d3e into elastic:main Mar 26, 2026
35 checks passed
@ChrisHegarty ChrisHegarty deleted the validate-bulk-offset-values branch March 26, 2026 11:17
szybia added a commit to szybia/elasticsearch that referenced this pull request Mar 26, 2026
* upstream/main: (146 commits)
  Revert "[Native] Gradle-related tweaks to improve handling of the simdvec native library (elastic#144539)"
  Fix ArrayIndexOutOfBoundsException in fetch phase with partial results (elastic#144385)
  ESQL: Correctly manage NULL data type for SUM (elastic#144942)
  [ESQL] Fixes GroupedTopNBenchmark not executing (elastic#144944)
  Fix reader context leak when query response serialization fails (elastic#144708)
  Validate individual offset values in BULK_OFFSETS bounds checks (elastic#144643)
  Merge main21 source set into main in simdvec (elastic#144921)
  [TEST] Unmute TsidExtractingIdFieldMapperTests (elastic#144848)
  [Native] Gradle-related tweaks to improve handling of the simdvec native library (elastic#144539)
  Fix `ThreadedActionListenerTests#testRejectionHandling` (elastic#144795)
  Add new DLM Frozen Tier Transition execution plugin and service (elastic#144595)
  Prometheus: execute query_range via parsed EsqlStatement plan (elastic#144416)
  Investigate `testBulkIndexingRequestSplitting` failure (elastic#144766)
  Add test utility for wrapping directories in FilterDirectory layer (elastic#143563)
  Fix ES|QL decay tests with negative scale (elastic#144657)
  Fix circuit breaker leak in percolator query construction (elastic#144827)
  Use XPerFieldDocValuesFormat in AbstractTSDBSyntheticIdCodec (elastic#144744)
  [DOCS] Document how reindex work in CPS (elastic#144016)
  Fix Int4 vector library tests failing on Java 21 (elastic#144830)
  [DiskBBQ] Fix index sorting on flush (elastic#144938)
  ...
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request Mar 26, 2026
…tic#144643)

seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request Mar 27, 2026
…tic#144643)

mamazzol pushed a commit to mamazzol/elasticsearch that referenced this pull request Mar 30, 2026
…tic#144643)


Labels

:Search Relevance/Vectors, Team:Search Relevance, >test, v9.4.0

5 participants