feat(search): Hnsw Vector Search Plan Operator & Executor #2434

Merged 13 commits on Jul 24, 2024

Beihao-Zhou
Member

@Beihao-Zhou Beihao-Zhou commented Jul 18, 2024

This PR is part of #2426

Summary

  • "Pure" KNN Plan Operator + Executor (i.e. search on HNSW graph directly on disk)

  • "Pure" Range Query Plan Operator + Executor

    1. Run a KNN search to get the k nearest candidates as the initial mini-batch, checking that each one is within the expected range
      • If any candidate is out of range, return End.
    2. Expand the results from step 1 by fetching their unvisited neighbours
      • Return the results that are within the range, and use them as the starting candidates for the next batch

    Ref: Idea from Pgvector Hnsw Scan Code - bool hnswgettuple(IndexScanDesc scan, ScanDirection dir)
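
A minimal sketch of the two range-query steps above, assuming a toy in-memory graph with 1-D vectors instead of the on-disk HNSW graph. `ToyGraph`, `ToyKnn`, `RangeScanState` and `NextBatch` are hypothetical names for illustration only, not the PR's classes, and the handling of out-of-range candidates (return the in-range ones, then end) is one reading of step 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical in-memory stand-in for the on-disk HNSW base layer.
struct ToyGraph {
  std::vector<double> points;                   // 1-D vectors, for brevity
  std::vector<std::vector<size_t>> neighbours;  // adjacency list per node
};

// State kept between batches of one range scan.
struct RangeScanState {
  std::unordered_set<size_t> visited;
  std::vector<size_t> frontier;  // in-range results that seed the next batch
  bool initialized = false;
  bool ended = false;
};

// Brute-force KNN, a placeholder for the real HNSW KNN search.
std::vector<size_t> ToyKnn(const ToyGraph &g, double query, size_t k) {
  std::vector<size_t> ids(g.points.size());
  for (size_t i = 0; i < ids.size(); ++i) ids[i] = i;
  std::sort(ids.begin(), ids.end(), [&](size_t a, size_t b) {
    return std::abs(g.points[a] - query) < std::abs(g.points[b] - query);
  });
  ids.resize(std::min(k, ids.size()));
  return ids;
}

// Returns the next batch of in-range node ids; an empty batch means End.
std::vector<size_t> NextBatch(const ToyGraph &g, double query, double range, size_t k,
                              RangeScanState *state) {
  assert(state != nullptr && k > 0);
  std::vector<size_t> batch;
  if (state->ended) return batch;

  if (!state->initialized) {
    // Step 1: KNN search as the mini-batch initialization.
    state->initialized = true;
    for (size_t id : ToyKnn(g, query, k)) {
      state->visited.insert(id);
      if (std::abs(g.points[id] - query) <= range) {
        batch.push_back(id);
      } else {
        state->ended = true;  // an out-of-range candidate: no further batches
      }
    }
  } else {
    // Step 2: expand the previous results through their unvisited neighbours.
    for (size_t id : state->frontier) {
      for (size_t nb : g.neighbours[id]) {
        if (!state->visited.insert(nb).second) continue;  // already visited
        if (std::abs(g.points[nb] - query) <= range) batch.push_back(nb);
      }
    }
  }

  state->frontier = batch;
  if (batch.empty()) state->ended = true;
  return batch;
}
```

Calling `NextBatch` repeatedly with the same state yields one batch per call until it returns an empty vector, which roughly mirrors the one-tuple-at-a-time shape of the referenced `hnswgettuple`.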

Next

  • Hybrid Query with Filter (Potential Solution)

    1. KNN query as usual
    2. Post-process the results, keeping only those returned by the index that match the query filter

    Note that since both the "Pure" Range Query Executor and the "Pure" KNN Search Executor operate on disk, the "Pure" Range Query Executor cannot be reused for hybrid queries. I plan to add the new hybrid query executor after implementing the IR expressions for HNSW (e.g. reuse FilterExecutor for HnswRangeQueryExpr?). Also, I currently don't have a clear idea of how to continue fetching more nearest vectors after some vectors are filtered out in hybrid queries. So I'll make the hybrid query changes in future PRs, once the expression layer is clearer to me.
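
The post-filtering idea for hybrid queries could look roughly like the sketch below. It assumes a brute-force nearest-neighbour step in place of the HNSW search and a toy document type; `ToyDoc` and `KnnThenFilter` are hypothetical names, not part of the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <iterator>
#include <vector>

// Hypothetical document: one scalar field plus a 1-D "vector" for brevity.
struct ToyDoc {
  int id;
  int year;
  double vec;
};

// Step 1: take the k nearest docs (brute force stands in for the HNSW search).
// Step 2: keep only the docs that pass the query filter.
std::vector<ToyDoc> KnnThenFilter(std::vector<ToyDoc> docs, double query, size_t k,
                                  const std::function<bool(const ToyDoc &)> &filter) {
  assert(filter);
  k = std::min(k, docs.size());
  auto kth = docs.begin() + static_cast<std::ptrdiff_t>(k);
  std::partial_sort(docs.begin(), kth, docs.end(), [query](const ToyDoc &a, const ToyDoc &b) {
    return std::abs(a.vec - query) < std::abs(b.vec - query);
  });
  docs.erase(kth, docs.end());  // keep the k nearest only

  std::vector<ToyDoc> result;
  std::copy_if(docs.begin(), docs.end(), std::back_inserter(result), filter);
  return result;
}
```

Because the filter runs after the top-k cut, the result can hold fewer than k documents even when more matching documents exist further away, which is the "how to continue fetching" question raised above.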

Resources

@Beihao-Zhou
Member Author

Hiii @PragmaTwice, we discussed potential solutions for the vector search query executor on Slack, and this PR attempts to implement one. I'm wondering whether the solution above sounds good to you as an initial implementation, and whether you see any room for improvement. Thanks!

@Beihao-Zhou Beihao-Zhou changed the title from "feat(search): Hnsw Vector Search Plan Operator & Execution" to "feat(search): Hnsw Vector Search Plan Operator & Executor" Jul 18, 2024
@PragmaTwice
Member

Yeah I think it's good as an initial implementation. We can gradually refactor it to be more iterative.

src/search/ir_plan.h (outdated, resolved)
@Beihao-Zhou
Member Author

This PR is ready for review! @PragmaTwice
I probably want to add the hybrid executor after implementing the IR expressions (See the Next section in the first conversation block) if this sounds good to you. <3

@Beihao-Zhou Beihao-Zhou marked this pull request as ready for review July 23, 2024 07:04
@PragmaTwice
Member

PragmaTwice commented Jul 23, 2024

> This PR is ready for review! @PragmaTwice I probably want to add the hybrid executor after implementing the IR expressions (See the Next section in the first conversation block) if this sounds good to you. <3

Yeah we can then add new nodes in IR and support these new field scan operators in optimizer passes.

e.g. for RediSearch query @year:[2020 2022] @v:[VECTOR_RANGE ...], it's possible to generate a query plan that looks like filter (2020 <= @year and @year <= 2022) from (hnsw-vector-range-scan @v, ...).

As for which query plan is the best, we can leave it to the cost model to determine.
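
To make the plan shape concrete, here is a small sketch that builds a filter node on top of a vector range scan node and prints it in the notation used above. The `ToyPlanNode`, `ToyVectorRangeScan` and `ToyFilter` types are hypothetical and are not the IR classes in src/search/ir_plan.h; the scan parameters stay elided as in the example query.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical plan node interface; only dumps itself as text.
struct ToyPlanNode {
  virtual ~ToyPlanNode() = default;
  virtual std::string Dump() const = 0;
};

// Leaf node: a range scan over an HNSW vector field.
struct ToyVectorRangeScan : ToyPlanNode {
  std::string field;
  std::string params;  // left opaque here, as in the example query

  ToyVectorRangeScan(std::string field, std::string params)
      : field(std::move(field)), params(std::move(params)) {}

  std::string Dump() const override {
    return "(hnsw-vector-range-scan @" + field + ", " + params + ")";
  }
};

// Inner node: applies a predicate to the rows produced by its source.
struct ToyFilter : ToyPlanNode {
  std::string predicate;
  std::unique_ptr<ToyPlanNode> source;

  ToyFilter(std::string predicate, std::unique_ptr<ToyPlanNode> source)
      : predicate(std::move(predicate)), source(std::move(source)) {
    assert(this->source != nullptr);
  }

  std::string Dump() const override {
    return "filter (" + predicate + ") from " + source->Dump();
  }
};
```

An optimizer pass could produce several such trees for the same query (for example, a numeric field scan filtered by vector distance instead) and let the cost model pick one.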


IndexInfo *index;
redis::SearchKey search_key;
redis::HnswVectorFieldMetadata field_metadata;
Member

@PragmaTwice PragmaTwice Jul 23, 2024

seems here we can just keep a pointer e.g. const redis::HnswVectorFieldMetadata *field_metadata.

Member Author

Oh, this was because the HnswIndex constructor does not accept a const HnswVectorFieldMetadata*.

HnswIndex(const SearchKey& search_key, HnswVectorFieldMetadata* vector, engine::Storage* storage);

HnswIndex needs to modify HnswVectorFieldMetadata* in other functions, so I made a copy here. Do you have any ideas on how we can improve it?

Member

Let me think, if the metadata is modified in a certain vector query, it seems that the global metadata of the vector field (in server->index_mgr->index_map) will not change. Will this cause any problems?

Member Author

@Beihao-Zhou Beihao-Zhou Jul 23, 2024

HnswVectorFieldMetadata* is only modified when a node is inserted or deleted in a layer higher than all other nodes, in which case num_levels in HnswVectorFieldMetadata : IndexFieldMetadata changes. So the affected field is server->index_mgr->index_map[<index_key>]->fields[<field>]->metadata.

In indexer.cc, we do

auto *metadata = iter->second.metadata.get();
if (auto vector = dynamic_cast<HnswVectorFieldMetadata *>(metadata)) {
    GET_OR_RET(UpdateHnswVectorIndex(key, original, current, search_key, vector));
}

So *metadata changes correspondingly.

But since this PR doesn't modify the metadata, I guess copying will not cause actual problems. There may be a consistency issue, but that would be fixed later in line with #2310. The copy is also expensive because of the large size of HnswIndex, but I can fix that right after this PR if the solution using a static member variable sounds good to you <3.
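
The consistency concern can be shown with a toy example: a plan node that copies the metadata keeps the value it saw at copy time, while the indexer path updates the shared object through a pointer. All names here (`ToyFieldMetadata`, `ToyIndexByPointer`, `ToyScanNodeWithCopy`) are hypothetical, and the integer type of `num_levels` is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the vector field metadata.
struct ToyFieldMetadata {
  uint16_t num_levels = 0;  // assumed type, for illustration
};

// Indexer-side view: mutates the shared metadata through a pointer.
struct ToyIndexByPointer {
  ToyFieldMetadata *metadata;

  // Inserting a node above the current top layer raises num_levels.
  void InsertAtLevel(uint16_t level) {
    assert(metadata != nullptr);
    if (level > metadata->num_levels) metadata->num_levels = level;
  }
};

// Query-side plan node that holds its own copy of the metadata.
struct ToyScanNodeWithCopy {
  ToyFieldMetadata field_metadata;  // snapshot taken when the node is built
};
```

If an insert raises `num_levels` after the node is built, the snapshot still reports the old value, so a search through the copy would start from a stale top layer until the plan is rebuilt.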

Member

I see what you mean.

It's good for now, but we can enhance it later. For example, we could create a const version of HnswIndex that takes a const pointer to HnswVectorFieldMetadata.

Member Author

That sounds good!
I wanted to overload with a const version, but some nested calls also take HnswVectorFieldMetadata* as a parameter, so I eventually didn't implement it to avoid making this PR too messy. I'll make a note in tracking issue #2426 and improve it in the future.
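
One possible shape for the const variant discussed in this thread, as a sketch only: split the read-only part into a class that takes a const pointer, and let the mutable index build on it. `ToyHnswReader` and `ToyHnswWriter` are hypothetical names; the real HnswIndex and its nested calls would need their own const-correct signatures.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the vector field metadata.
struct ToyFieldMetadata {
  uint16_t num_levels = 0;  // assumed type, for illustration
};

// Read-only view: needs only const access, so a search executor can hold it
// without copying the metadata.
class ToyHnswReader {
 public:
  explicit ToyHnswReader(const ToyFieldMetadata *metadata) : metadata_(metadata) {
    assert(metadata != nullptr);
  }

  uint16_t EntryLevel() const { return metadata_->num_levels; }

 private:
  const ToyFieldMetadata *metadata_;
};

// Mutable index: reuses the reader for lookups and adds the write path.
class ToyHnswWriter : public ToyHnswReader {
 public:
  explicit ToyHnswWriter(ToyFieldMetadata *metadata)
      : ToyHnswReader(metadata), mutable_metadata_(metadata) {}

  void RaiseLevel(uint16_t level) {
    if (level > mutable_metadata_->num_levels) mutable_metadata_->num_levels = level;
  }

 private:
  ToyFieldMetadata *mutable_metadata_;
};
```

With this split, the scan operator could keep `const redis::HnswVectorFieldMetadata *field_metadata` as suggested earlier in the thread, and readers would observe updates made through the writer because both point at the same object.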

Member

@PragmaTwice PragmaTwice left a comment

The rest looks good to me. Thank you!

PragmaTwice
PragmaTwice previously approved these changes Jul 24, 2024
@PragmaTwice PragmaTwice merged commit 79a740c into apache:unstable Jul 24, 2024
29 checks passed
2 participants