Create ARM bulk sqrI8 implementation #142461

Merged: ldematte merged 4 commits into elastic:main from thecoop:arm-bulk-vector-ops on Feb 27, 2026

@thecoop (Member) commented Feb 13, 2026

Bulk sqrI8 implementation for ARM.

Throughput is generally the same at low vector counts and ~40% higher at high counts (tested on a c8g instance):

VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk          16    1024   EUCLIDEAN            NATIVE           128  thrpt    5  101079.664 ±  85.816  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk          16    1024   EUCLIDEAN            NATIVE          1500  thrpt    5    7767.945 ±  36.184  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk          16    1024   EUCLIDEAN            NATIVE        130000  thrpt    5     288.917 ±  22.621  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk        1024    1024   EUCLIDEAN            NATIVE           128  thrpt    5  101709.353 ± 489.989  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk        1024    1024   EUCLIDEAN            NATIVE          1500  thrpt    5    8107.120 ±  27.128  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk        1024    1024   EUCLIDEAN            NATIVE        130000  thrpt    5     298.933 ±  12.169  ops/s

becomes

VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk          16    1024   EUCLIDEAN            NATIVE           128  thrpt    5   98494.216 ± 220.629  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk          16    1024   EUCLIDEAN            NATIVE          1500  thrpt    5    7889.580 ±  28.657  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk          16    1024   EUCLIDEAN            NATIVE        130000  thrpt    5     416.363 ±   9.919  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk        1024    1024   EUCLIDEAN            NATIVE           128  thrpt    5  101136.627 ± 180.760  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk        1024    1024   EUCLIDEAN            NATIVE          1500  thrpt    5    8262.145 ±  52.086  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk        1024    1024   EUCLIDEAN            NATIVE        130000  thrpt    5     421.630 ±  14.092  ops/s
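
The headline figure can be checked directly from the two tables: the relative change in throughput is after / before - 1. The small C++ sketch below does that arithmetic for every row. The `Row` struct and the function names are hypothetical helpers, and the labels only repeat the unlabeled parameter columns from the paste; the numbers are copied from the tables above.

```cpp
#include <cassert>
#include <cstdio>

// One benchmark row: throughput in ops/s before and after this PR.
// `label` repeats the row's (unlabeled) parameter columns.
struct Row {
    const char* label;
    double before;
    double after;
};

// Relative change in throughput: 0.44 means 44% more ops/s.
double relative_change(double before, double after) {
    assert(before > 0.0);
    return after / before - 1.0;
}

// Values copied from the two JMH tables above.
const Row kRows[] = {
    {"16 1024 128", 101079.664, 98494.216},
    {"16 1024 1500", 7767.945, 7889.580},
    {"16 1024 130000", 288.917, 416.363},
    {"1024 1024 128", 101709.353, 101136.627},
    {"1024 1024 1500", 8107.120, 8262.145},
    {"1024 1024 130000", 298.933, 421.630},
};

// Prints the percentage change in throughput for every row.
void print_changes() {
    for (const Row& r : kRows) {
        std::printf("%-18s %+6.1f%%\n", r.label,
                    100.0 * relative_change(r.before, r.after));
    }
}
```

Run on these numbers, the 128 and 1500 rows all land within about 3% of the old throughput, while the two 130000 rows come out at roughly +44% and +41%, which is where the ~40% figure comes from.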

It also provides a 25% speed boost in KnnIndexTester.
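
The diff itself is not shown on this page. sqrI8 presumably denotes the squared Euclidean distance between int8 vectors (the benchmark above uses the EUCLIDEAN similarity). As a minimal sketch, and not the code from this PR, an AArch64 NEON kernel for that operation can widen int8 differences to int16 and accumulate their squares in int32 lanes. The function name `sqr_i8` is hypothetical, and the sketch assumes the sum fits in an int32 (true for 1024 dimensions).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical sketch: squared Euclidean distance between two int8 vectors.
// Assumes `length` is small enough that the total fits in int32.
int32_t sqr_i8(const int8_t* a, const int8_t* b, size_t length) {
    assert(length == 0 || (a != nullptr && b != nullptr));
    int32_t total = 0;
    size_t i = 0;
#if defined(__aarch64__) && defined(__ARM_NEON)
    int32x4_t acc = vdupq_n_s32(0);
    for (; i + 16 <= length; i += 16) {
        int8x16_t va = vld1q_s8(a + i);
        int8x16_t vb = vld1q_s8(b + i);
        // Subtract while widening to int16, so differences cannot overflow.
        int16x8_t lo = vsubl_s8(vget_low_s8(va), vget_low_s8(vb));
        int16x8_t hi = vsubl_high_s8(va, vb);
        // Square the differences and accumulate into four int32 lanes.
        acc = vmlal_s16(acc, vget_low_s16(lo), vget_low_s16(lo));
        acc = vmlal_high_s16(acc, lo, lo);
        acc = vmlal_s16(acc, vget_low_s16(hi), vget_low_s16(hi));
        acc = vmlal_high_s16(acc, hi, hi);
    }
    total = vaddvq_s32(acc);  // horizontal sum of the four lanes
#endif
    // Scalar tail (and the whole computation on non-NEON targets).
    for (; i < length; ++i) {
        int32_t d = int32_t(a[i]) - int32_t(b[i]);
        total += d * d;
    }
    return total;
}
```

On non-AArch64 builds only the scalar loop runs, which also makes it a convenient reference to check a vectorized version against.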

@ldematte (Contributor)

I see things are improving (by a lot, I see 2x) only for big datasets. I wonder how this will measure after #142480.
Also, which instance type did you use for the benchmarks? (Man, we really need infra for JMH across platforms!)

dotd1q4_inner_bulk<array_mapper>(a, query, length, pitch, offsets, count, results);
}

EXPORT int64_t vec_dotd2q4(

@thecoop (Member, Author)

This just moves the methods around so they're in a consistent order.
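
The review context above shows a bulk entry point called as `dotd1q4_inner_bulk<array_mapper>(a, query, length, pitch, offsets, count, results)`. Its body is not shown here. One plausible reading, which is an assumption, is that a compile-time mapper turns the loop index into a stored-vector ordinal (either the index itself or `offsets[i]`) and that `pitch` is the byte stride between stored vectors. The sketch below illustrates that pattern for int8 squared distance. `sequential_index`, `offset_index`, `sqr_i8_bulk` and the `float` results type are all hypothetical and are not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mappers: turn loop index i into the ordinal of a stored vector.
struct sequential_index {
    static int64_t map(int32_t i, const int32_t* /*offsets*/) { return i; }
};
struct offset_index {
    static int64_t map(int32_t i, const int32_t* offsets) { return offsets[i]; }
};

// Hypothetical bulk kernel: scores `count` stored int8 vectors against one
// query. Stored vectors start at `a` and are `pitch` bytes apart.
template <typename Mapper>
void sqr_i8_bulk(const int8_t* a, const int8_t* query, int32_t length,
                 int64_t pitch, const int32_t* offsets, int32_t count,
                 float* results) {
    assert(pitch >= length);
    for (int32_t i = 0; i < count; ++i) {
        const int8_t* v = a + Mapper::map(i, offsets) * pitch;
        int32_t sum = 0;
        for (int32_t j = 0; j < length; ++j) {
            int32_t d = int32_t(v[j]) - int32_t(query[j]);
            sum += d * d;
        }
        results[i] = float(sum);
    }
}
```

Because the mapper is a template parameter, each instantiation has its index computation inlined, so the sequential and offset-array variants share one loop body without a per-iteration branch.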

@thecoop changed the title from "Create ARM bulk sqr implementation" to "Create ARM bulk sqrI8 implementation" on Feb 25, 2026
@thecoop removed the WIP label on Feb 25, 2026
@elasticsearchmachine added the Team:Search Relevance label (meta label for the Search Relevance team in Elasticsearch) on Feb 25, 2026
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@ldematte (Contributor) left a comment

LGTM

@thecoop added the test-arm label (pull requests that should be tested against ARM agents) on Feb 26, 2026
@ldematte requested a review from a team as a code owner on February 26, 2026 17:02
@ldematte enabled auto-merge (squash) on February 26, 2026 17:03
@ldematte merged commit 9d4c9cd into elastic:main on Feb 27, 2026 (41 checks passed)
PeteGillinElastic pushed a commit to PeteGillinElastic/elasticsearch that referenced this pull request Feb 27, 2026
Bulk sqri8 implementation for ARM.
Generally the same at low vector number, ~40% faster at high numbers (tested on c8g instance)
szybia added a commit to szybia/elasticsearch that referenced this pull request Feb 27, 2026
…cations

* upstream/main: (35 commits)
  Create ARM bulk sqrI8 implementation (elastic#142461)
  Rework get-snapshots predicates (elastic#143161)
  Refactor downsampling fetchers and producers (elastic#140357)
  ESQL: Unmute test and add extra logging to generative test validation (elastic#143168)
  Fix metadata fields being nullified/loaded by unmapped_fields setting (elastic#143155)
  Determine remote cluster version (elastic#142494)
  Populate failure message for aborted clones (elastic#143206)
  Allow kibana_system role to read and manage logs streams (elastic#143053)
  Mute org.elasticsearch.xpack.esql.CsvIT test {csv-spec:eval.DocsLength} elastic#143224
  Mute org.elasticsearch.xpack.esql.CsvIT test {csv-spec:eval.DocsByteLength} elastic#143223
  Mute org.elasticsearch.xpack.esql.CsvIT test {csv-spec:docs.DocsBitLength} elastic#143222
  Fix FloatVectorScorerSupplier bulkScore bug (elastic#143211)
  ESQL: Add data node execution for external sources (elastic#143209)
  [ESQL] Cleanup commands docs (elastic#143058)
  [ML]Fix latest transforms disregarding updates when sort and sync fields are non-monotonic (elastic#142856)
  Mute org.elasticsearch.index.mapper.IpFieldMapperTests testSyntheticSourceInObject elastic#143212
  Tests: Fix StoreDirectoryMetricsIT (elastic#143084)
  ESQL: Add distribution strategy for external sources (elastic#143194)
  CSV IT spec (elastic#142585)
  Fix VectorScorerOSQBenchmark.score to read corrections properly (elastic#143137)
  ...
tballison pushed a commit to tballison/elasticsearch that referenced this pull request Mar 3, 2026
Bulk sqri8 implementation for ARM.
Generally the same at low vector number, ~40% faster at high numbers (tested on c8g instance)
@thecoop deleted the arm-bulk-vector-ops branch on March 5, 2026 15:08

Labels

>non-issue, :Search Relevance/Vectors, Team:Search Relevance, test-arm, v9.4.0

3 participants