perf: use dataset-level scan for indexed vector search to avoid per-fragment redundancy by shmilygkd · Pull Request #432 · lance-format/lance-spark

shmilygkd · 2026-04-14T04:14:21Z

Problem

Lance's IVF index is a global structure: centroids are computed over the full dataset and cannot be restricted to individual fragment boundaries. As a result, each `Fragment.newScan()` with a nearest query executes a complete global index search rather than a fragment-local one. With N fragments this has two consequences:

Incorrect recall: Each Spark partition retrieves globally-optimal candidates. Unioning N such result sets inflates recall relative to a single global search at the same `nprobes`, producing results inconsistent with a direct pylance query.
N-fold overhead: Each task independently loads the same global index metadata and `nprobes` partition files. Index I/O and compute scale as N×.

Fix

Introduce `LanceSplit.isIndexedVectorSearch()` to distinguish indexed vector search (`nearest` with `useIndex=true`) from brute-force KNN (`useIndex=false`).

For indexed search, all fragments are merged into a single `LanceSplit` and `Dataset.newScan()` is used instead of `Fragment.newScan()`, executing one global index search on a single executor. Recall is now consistent with pylance.

For brute-force KNN (`useIndex=false`), per-fragment splits are preserved so each partition physically scans its own fragment in parallel.

Additional fixes:

Guard against empty datasets in the indexed path
Fix SPJ partition key computation: previously skipped for all `nearest` queries; now skipped only for indexed search, preserving valid partition keys for brute-force KNN

Testing

Added `LanceSplitVectorSearchTest` with integration tests verifying that `useIndex=true` produces exactly one split and `useIndex=false` produces one split per fragment.

…ragment redundancy Lance's IVF index is built globally across all fragments. When each fragment maps to a separate Spark partition, indexed vector search runs once per fragment instead of once per query — incurring N-fold task scheduling overhead and lower recall than a single global IVF search. Changes: - Add `LanceSplit.isIndexedVectorSearch()` to distinguish indexed vector search (nearest + useIndex=true) from brute-force KNN (useIndex=false). - For indexed search, merge all fragments into a single split and use `Dataset.newScan()` instead of `Fragment.newScan()` to execute a single global index search. Guard against empty datasets (no fragments). - For brute-force KNN, keep per-fragment splits for parallel scan and set `prefilter=true` on fragment scanners for correctness. - Skip SPJ partition key computation only for indexed vector search; brute-force KNN retains per-fragment splits so its partition key remains valid and SPJ can proceed normally. - Add tests covering planScan() split count: indexed search produces one split; brute-force KNN produces one split per fragment.

github-actions Bot added the performance Features that improves performance label Apr 14, 2026

shmilygkd force-pushed the perf/vector-search-single-split branch from cbc1372 to 2edc065 Compare April 14, 2026 04:40

shmilygkd closed this Apr 14, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: use dataset-level scan for indexed vector search to avoid per-fragment redundancy#432

perf: use dataset-level scan for indexed vector search to avoid per-fragment redundancy#432
shmilygkd wants to merge 1 commit intolance-format:mainfrom
shmilygkd:perf/vector-search-single-split

shmilygkd commented Apr 14, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

shmilygkd commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

shmilygkd commented Apr 14, 2026 •

edited

Loading