Create ARM bulk sqrI8 implementation by thecoop · Pull Request #142461 · elastic/elasticsearch

thecoop · 2026-02-13T11:36:31Z

Bulk sqrI8 implementation for ARM.

Throughput is roughly unchanged at low vector counts and ~40% higher at high vector counts (tested on a c8g instance):

VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk          16    1024   EUCLIDEAN            NATIVE           128  thrpt    5  101079.664 ±  85.816  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk          16    1024   EUCLIDEAN            NATIVE          1500  thrpt    5    7767.945 ±  36.184  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk          16    1024   EUCLIDEAN            NATIVE        130000  thrpt    5     288.917 ±  22.621  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk        1024    1024   EUCLIDEAN            NATIVE           128  thrpt    5  101709.353 ± 489.989  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk        1024    1024   EUCLIDEAN            NATIVE          1500  thrpt    5    8107.120 ±  27.128  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk        1024    1024   EUCLIDEAN            NATIVE        130000  thrpt    5     298.933 ±  12.169  ops/s

becomes

VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk          16    1024   EUCLIDEAN            NATIVE           128  thrpt    5   98494.216 ± 220.629  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk          16    1024   EUCLIDEAN            NATIVE          1500  thrpt    5    7889.580 ±  28.657  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk          16    1024   EUCLIDEAN            NATIVE        130000  thrpt    5     416.363 ±   9.919  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk        1024    1024   EUCLIDEAN            NATIVE           128  thrpt    5  101136.627 ± 180.760  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk        1024    1024   EUCLIDEAN            NATIVE          1500  thrpt    5    8262.145 ±  52.086  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk        1024    1024   EUCLIDEAN            NATIVE        130000  thrpt    5     421.630 ±  14.092  ops/s
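
For reference, the "~40%" figure can be reproduced from the scores above. This is a small sketch, not part of the PR: the helper name `percent_change` and the `Row` struct are made up for illustration, and the two leading parameter columns are carried through unnamed because the JMH header row is not shown here (the column that varies between 128, 1500 and 130000 appears to be the vector count).

```cpp
#include <cassert>
#include <cstdio>

// Relative throughput change in percent, given ops/s before and after.
static double percent_change(double before, double after) {
    assert(before > 0.0);
    return (after / before - 1.0) * 100.0;
}

// One benchmark configuration. The column names are placeholders
// (assumption): the JMH header is not included in the PR description.
struct Row {
    int col_a;      // first parameter column (16 or 1024)
    int col_b;      // varying column (128, 1500, 130000)
    double before;  // ops/s from the first table
    double after;   // ops/s from the second table
};

// Scores copied from the two tables in the PR description.
static const Row kRows[] = {
    {16,   128,    101079.664, 98494.216},
    {16,   1500,   7767.945,   7889.580},
    {16,   130000, 288.917,    416.363},
    {1024, 128,    101709.353, 101136.627},
    {1024, 1500,   8107.120,   8262.145},
    {1024, 130000, 298.933,    421.630},
};

// Prints one line per configuration with the signed percentage change.
static void print_changes() {
    for (const Row& r : kRows) {
        std::printf("%5d %7d %+6.1f%%\n", r.col_a, r.col_b,
                    percent_change(r.before, r.after));
    }
}
```

Calling `print_changes()` shows the 128 and 1500 rows within a few percent of the baseline, and the two 130000 rows at roughly +44% and +41%.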

Provides a 25% speed boost in KnnIndexTester
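
To make concrete what a bulk squared-distance call computes, here is a portable scalar sketch. It is not the NEON code from this PR. The function names are hypothetical, the parameter list only mirrors the shape of the existing `dotd1q4_inner_bulk` call in `vec_1.cpp`, and the offset addressing (`offsets[i] * pitch` bytes from the base pointer) and the `float` result type are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference: squared Euclidean distance between two
// int8 vectors of `length` elements. Differences are widened to int32
// before squaring; the int32 sum is safe for the 1024-dimension vectors
// in the benchmark, but very long vectors would need a wider accumulator.
static inline int32_t sqr_i8_scalar(const int8_t* a, const int8_t* b,
                                    int32_t length) {
    int32_t sum = 0;
    for (int32_t i = 0; i < length; ++i) {
        const int32_t d = static_cast<int32_t>(a[i]) - static_cast<int32_t>(b[i]);
        sum += d * d;
    }
    return sum;
}

// Hypothetical bulk variant: scores `count` stored vectors against one
// query in a single call. Assumed layout: vector i starts at
// a + offsets[i] * pitch, where `pitch` is the stride in bytes.
static void sqr_i8_bulk_scalar(const int8_t* a, const int8_t* query,
                               int32_t length, int32_t pitch,
                               const int32_t* offsets, int32_t count,
                               float* results) {
    assert(a != nullptr && query != nullptr);
    assert(offsets != nullptr && results != nullptr);
    for (int32_t i = 0; i < count; ++i) {
        const int8_t* v = a + static_cast<std::size_t>(offsets[i]) * pitch;
        results[i] = static_cast<float>(sqr_i8_scalar(v, query, length));
    }
}
```

A real bulk kernel would typically process several stored vectors per loop iteration with SIMD instructions, but the results should match this reference element for element.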

ldematte · 2026-02-23T14:33:45Z

I see things are improving (by a lot, I see 2x) only for big datasets. I wonder how this will measure after #142480.
Also, which instance type did you use for the benchmarks? (Man, we really need infra for JMH across platforms!)

thecoop · 2026-02-25T11:28:14Z

libs/simdvec/native/src/vec/c/aarch64/vec_1.cpp

    dotd1q4_inner_bulk<array_mapper>(a, query, length, pitch, offsets, count, results);
 }

+EXPORT int64_t vec_dotd2q4(


This just moves the methods around so they're in a consistent order

elasticsearchmachine · 2026-02-25T17:25:27Z

Pinging @elastic/es-search-relevance (Team:Search Relevance)

ldematte

LGTM

Create custom bulk sqr implementation

1ce7333

thecoop requested a review from ldematte February 13, 2026 11:36

thecoop added the >non-issue, WIP, and :Search Relevance/Vectors (Vector search) labels Feb 13, 2026

elasticsearchmachine added the v9.4.0 label Feb 13, 2026

thecoop commented Feb 25, 2026

Merge branch 'main' into arm-bulk-vector-ops

c2730e6

thecoop changed the title ~~Create ARM bulk sqr implementation~~ Create ARM bulk sqrI8 implementation Feb 25, 2026

thecoop removed the WIP label Feb 25, 2026

elasticsearchmachine added the Team:Search Relevance label (Meta label for the Search Relevance team in Elasticsearch) Feb 25, 2026

ldematte approved these changes Feb 26, 2026

thecoop added the test-arm label (Pull Requests that should be tested against arm agents) Feb 26, 2026

Publish vec binaries + update version

a07c1e0

ldematte requested a review from a team as a code owner February 26, 2026 17:02

ldematte enabled auto-merge (squash) February 26, 2026 17:03

Merge branch 'main' into arm-bulk-vector-ops

1611a5d

ldematte merged commit 9d4c9cd into elastic:main Feb 27, 2026
41 checks passed

prwhelan mentioned this pull request Feb 27, 2026

[Transform] Clean up internal tests #143246

Merged

tballison pushed a commit to tballison/elasticsearch that referenced this pull request Mar 3, 2026

Create ARM bulk sqrI8 implementation (elastic#142461)

3757f95

thecoop deleted the arm-bulk-vector-ops branch March 5, 2026 15:08

ldematte merged 4 commits into elastic:main from thecoop:arm-bulk-vector-ops

3 participants
