Skip to content

perf: use dataset-level scan for indexed vector search to avoid per-fragment redundancy#432

Closed
shmilygkd wants to merge 1 commit intolance-format:mainfrom
shmilygkd:perf/vector-search-single-split
Closed

perf: use dataset-level scan for indexed vector search to avoid per-fragment redundancy#432
shmilygkd wants to merge 1 commit intolance-format:mainfrom
shmilygkd:perf/vector-search-single-split

Conversation

@shmilygkd
Copy link
Copy Markdown

@shmilygkd shmilygkd commented Apr 14, 2026

Problem

Lance's IVF index is a global structure: centroids are computed over the full dataset and cannot be restricted to individual fragment boundaries. As a result, each `Fragment.newScan()` with a nearest query executes a complete global index search rather than a fragment-local one. With N fragments this has two consequences:

  1. Incorrect recall: Each Spark partition retrieves globally-optimal candidates. Unioning N such result sets inflates recall relative to a single global search at the same `nprobes`, producing results inconsistent with a direct pylance query.
  2. N-fold overhead: Each task independently loads the same global index metadata and `nprobes` partition files. Index I/O and compute scale as N×.

Fix

Introduce `LanceSplit.isIndexedVectorSearch()` to distinguish indexed vector search (`nearest` with `useIndex=true`) from brute-force KNN (`useIndex=false`).

For indexed search, all fragments are merged into a single `LanceSplit` and `Dataset.newScan()` is used instead of `Fragment.newScan()`, executing one global index search on a single executor. Recall is now consistent with pylance.

For brute-force KNN (`useIndex=false`), per-fragment splits are preserved so each partition physically scans its own fragment in parallel.

Additional fixes:

  • Guard against empty datasets in the indexed path
  • Fix SPJ partition key computation: previously skipped for all `nearest` queries; now skipped only for indexed search, preserving valid partition keys for brute-force KNN

Testing

Added `LanceSplitVectorSearchTest` with integration tests verifying that `useIndex=true` produces exactly one split and `useIndex=false` produces one split per fragment.

@github-actions github-actions Bot added the performance Features that improves performance label Apr 14, 2026
…ragment redundancy

Lance's IVF index is built globally across all fragments. When each
fragment maps to a separate Spark partition, indexed vector search runs
once per fragment instead of once per query — incurring N-fold task
scheduling overhead and lower recall than a single global IVF search.

Changes:

- Add `LanceSplit.isIndexedVectorSearch()` to distinguish indexed vector
  search (nearest + useIndex=true) from brute-force KNN (useIndex=false).
- For indexed search, merge all fragments into a single split and use
  `Dataset.newScan()` instead of `Fragment.newScan()` to execute a single
  global index search. Guard against empty datasets (no fragments).
- For brute-force KNN, keep per-fragment splits for parallel scan and
  set `prefilter=true` on fragment scanners for correctness.
- Skip SPJ partition key computation only for indexed vector search;
  brute-force KNN retains per-fragment splits so its partition key
  remains valid and SPJ can proceed normally.
- Add tests covering planScan() split count: indexed search produces one
  split; brute-force KNN produces one split per fragment.
@shmilygkd shmilygkd force-pushed the perf/vector-search-single-split branch from cbc1372 to 2edc065 Compare April 14, 2026 04:40
@shmilygkd shmilygkd closed this Apr 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Features that improves performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant