perf: use dataset-level scan for indexed vector search to avoid per-fragment redundancy #432
Closed
shmilygkd wants to merge 1 commit into lance-format:main
…ragment redundancy

Lance's IVF index is built globally across all fragments. When each fragment maps to a separate Spark partition, indexed vector search runs once per fragment instead of once per query, incurring N-fold task scheduling overhead and lower recall than a single global IVF search.

Changes:
- Add `LanceSplit.isIndexedVectorSearch()` to distinguish indexed vector search (nearest + useIndex=true) from brute-force KNN (useIndex=false).
- For indexed search, merge all fragments into a single split and use `Dataset.newScan()` instead of `Fragment.newScan()` to execute a single global index search. Guard against empty datasets (no fragments).
- For brute-force KNN, keep per-fragment splits for parallel scan and set `prefilter=true` on fragment scanners for correctness.
- Skip SPJ partition key computation only for indexed vector search; brute-force KNN retains per-fragment splits so its partition key remains valid and SPJ can proceed normally.
- Add tests covering planScan() split count: indexed search produces one split; brute-force KNN produces one split per fragment.
Problem
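Lance's IVF index is a global structure: centroids are computed over the full dataset and cannot be restricted to individual fragment boundaries. As a result, each `Fragment.newScan()` with a nearest query executes a complete global index search rather than a fragment-local one. With N fragments this has two consequences, both noted in the commit message: N-fold task scheduling overhead, because the search runs once per fragment instead of once per query, and lower recall than a single global IVF search.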
Fix
Introduce `LanceSplit.isIndexedVectorSearch()` to distinguish indexed vector search (`nearest` with `useIndex=true`) from brute-force KNN (`useIndex=false`).
For indexed search, all fragments are merged into a single `LanceSplit` and `Dataset.newScan()` is used instead of `Fragment.newScan()`, executing one global index search on a single executor. Recall is now consistent with pylance.
For brute-force KNN (`useIndex=false`), per-fragment splits are preserved so each partition physically scans its own fragment in parallel.
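The split-planning rule described above can be sketched roughly as follows. This is a simplified model, not the connector's code: `SketchSplit`, `SketchNearest`, `SplitPlanningSketch`, and `planSplits` are hypothetical names, fragments are reduced to integer IDs, and returning no splits for an empty dataset is an assumption about what the guard does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the connector's split type; the real LanceSplit may differ.
final class SketchSplit {
    final List<Integer> fragmentIds;

    SketchSplit(List<Integer> fragmentIds) {
        this.fragmentIds = new ArrayList<>(fragmentIds);
    }
}

// Hypothetical stand-in for the nearest-neighbor options attached to a query.
final class SketchNearest {
    final boolean useIndex;

    SketchNearest(boolean useIndex) {
        this.useIndex = useIndex;
    }
}

final class SplitPlanningSketch {
    // Models the described predicate: a nearest query with useIndex=true.
    static boolean isIndexedVectorSearch(SketchNearest nearest) {
        return nearest != null && nearest.useIndex;
    }

    static List<SketchSplit> planSplits(List<Integer> fragmentIds, SketchNearest nearest) {
        List<SketchSplit> splits = new ArrayList<>();
        if (fragmentIds.isEmpty()) {
            // Assumed behavior of the empty-dataset guard: plan no splits at all.
            return splits;
        }
        if (isIndexedVectorSearch(nearest)) {
            // Indexed search: one split covering every fragment, so the
            // global index is searched once (dataset-level scan).
            splits.add(new SketchSplit(fragmentIds));
            return splits;
        }
        // Brute-force KNN or a plain scan: one split per fragment, scanned in parallel.
        for (Integer id : fragmentIds) {
            splits.add(new SketchSplit(Collections.singletonList(id)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Integer> fragments = java.util.Arrays.asList(0, 1, 2, 3);
        System.out.println(planSplits(fragments, new SketchNearest(true)).size());  // 1
        System.out.println(planSplits(fragments, new SketchNearest(false)).size()); // 4
        System.out.println(planSplits(new ArrayList<Integer>(), new SketchNearest(true)).size()); // 0
    }
}
```

In this model the executor handling the single merged split would open a dataset-level scanner, while each per-fragment split would open a fragment-level scanner; that mapping is described in the text above and is not modeled here.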
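Additional fixes, as listed in the commit message:
- Guard against empty datasets (no fragments) on the indexed search path.
- Set `prefilter=true` on fragment scanners for brute-force KNN, for correctness.
- Skip SPJ partition key computation only for indexed vector search; brute-force KNN retains per-fragment splits, so its partition key remains valid and SPJ can proceed normally.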
Testing
Added `LanceSplitVectorSearchTest` with integration tests verifying that `useIndex=true` produces exactly one split and `useIndex=false` produces one split per fragment.