[feat] Add sparse index support to knowhere#341
Conversation
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: zhengbuqian The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
| for (size_t i = 0; i < rows; ++i) { | ||
| raw_data_[current_rows + i] = data[i]; | ||
| for (auto [idx, val] : data[i].data) { | ||
| if (val < val_threshold) { |
There was a problem hiding this comment.
if drop_ratio_build_ == 0, then val_threshold is 0, so negative val values will be skipped. Is it by design? This implies that sparse values are expected to be positive only
There was a problem hiding this comment.
Thanks Alex! I have changed it so that now we ignore only values that are close enough to zero.
Also the drop strategy has been modified from "drop any value that is below ratio * global max" to "drop the smallest ratio% abs values".
3958730 to
884f2b6
Compare
345c673 to
950a375
Compare
950a375 to
4108734
Compare
include/knowhere/sparse_utils.h
Outdated
|
|
||
| SparseRow& | ||
| operator=(const SparseRow& other) { | ||
| copy_from(other); |
There was a problem hiding this comment.
Please do all these nasty C++ things like if (this == &other) here.
Also, typically it is recommended to create a clone object
SparseRow clone(other);
swap(*this, clone);
just in case of an exception during the cloning process
| } | ||
| } | ||
|
|
||
| size_t |
There was a problem hiding this comment.
move constructor and move assignment operator please :)
include/knowhere/sparse_utils.h
Outdated
|
|
||
| IdVal<T> | ||
| operator[](size_t i) const { | ||
| table_t index = *(table_t*)(data + i * (sizeof(table_t) + sizeof(T))); |
There was a problem hiding this comment.
all these raw pointers operations are a nightmare. Please introduce a helper struct, such as
template<typename T>
struct SparseRowElement {
table_t idx;
T value;
};
make sure that it is packed and just use reinterpret_cast<SparseRowElement<T>*>(data) or reinterpret_cast<const ...*>(data) for working .
Alternatively, just use IdVal<T> but make sure that it is packed.
There was a problem hiding this comment.
Done, added a private structure.
|
|
||
| // When pushing new elements into a MaxMinHeap, only `capacity` elements with the | ||
| // largest val are kept. pop()/top() returns the smallest element out of them. | ||
| template <typename T> |
There was a problem hiding this comment.
Don't we have any similar structures in our code already? I'm just curious, the implementation is fine
There was a problem hiding this comment.
I don't see one that matches exactly the use case.
| return Status::success; | ||
| } | ||
|
|
||
| std::vector<T> vals; |
There was a problem hiding this comment.
just compute the size that needs to be preallocated, such as
size_t amount = 0;
for (size_t i = 0; i < rows; i++) { amount += data[i].size(); }
std::vector<T> vals(amount);
| if (v < q_threshold || i >= n_cols_internal()) { | ||
| continue; | ||
| } | ||
| for (size_t j = 0; j < inverted_lut_[i].size(); j++) { |
There was a problem hiding this comment.
this loop can be improved via SIMD. Maybe, later
There was a problem hiding this comment.
great suggestion! I'll leave that as is for now but I left a TODO.
|
|
||
| // LUT supports size() and operator[] which returns an IdVal. | ||
| template <typename LUT> | ||
| class Cursor { |
There was a problem hiding this comment.
the design of the iterator is somewhat unusual, but it is fine
| void | ||
| Search(const SparseRow<T>& query, size_t k, float drop_ratio_search, float* distances, label_t* labels, | ||
| size_t refine_factor, const BitsetView& bitset) const { | ||
| std::shared_lock<std::shared_mutex> lock(mu_); |
There was a problem hiding this comment.
nit: you could move the lock of mu_ a little bit down further, because in the following lines you do nothing but std::fill and not using any of the fields of this struct
| table_t index; | ||
| T value; | ||
| ElementProxy() = delete; | ||
| ElementProxy(const ElementProxy&) = delete; |
There was a problem hiding this comment.
I would add ElementProxy& operator=(const ElementProxy&) = delete; as well
On the other hand, I don't understand why it is needed to delete the constructors
There was a problem hiding this comment.
deleting the default and the copy constructor(so no constructor will be generated) should be sufficient. if no instance can ever be made there is no way to call copy assignment operator.
the reason is this element is used solely to help access what's in the raw data_ pointer, we don't actually have the need to create such objects.
src/common/comp/brute_force.cc
Outdated
| for (auto& fut : futs) { | ||
| fut.wait(); | ||
| RETURN_IF_ERROR(fut.result().value()); | ||
| } |
There was a problem hiding this comment.
I'd use the following to make sure that no futs get unfinished
for (auto& fut : futs) fut.wait();
for (auto& fut : futs) RETURN_IF_ERROR(fut.result().value());
There was a problem hiding this comment.
that is correct. Considering the lambda does not return a Status other than success, I removed the return value and use fut.get() instead.
There was a problem hiding this comment.
there are actually other places that made the same mistake, creating another PR to fix
src/common/comp/brute_force.cc
Outdated
| return expected<DataSetPtr>::Err(status, msg); | ||
| } | ||
| int topk = cfg.k.value(); | ||
| auto labels = new sparse::label_t[nq * topk]; |
There was a problem hiding this comment.
A potential memory leak. I'd use something like
auto labels = std::make_unique<sparse::label_t>(nq * topk);
auto distances = std::make_unique<float>(nq * topk);
SearchSparseWithBuf(base_dataset, query_dataset, labels.get(), distances.get(), config, bitset);
return GenResultDataSet(nq, topk, labels.release(), distances.release());
If we're using C++ 17, of course
| auto refine_factor = cfg.refine_factor.value_or(10); | ||
| auto drop_ratio_search = cfg.drop_ratio_search.value_or(0.0f); | ||
|
|
||
| auto p_id = new sparse::label_t[nq * k]; |
There was a problem hiding this comment.
same for using std::make_unique<> as above
| for (auto& fut : futs) { | ||
| fut.wait(); | ||
| } | ||
| return GenResultDataSet(nq, k, p_id, p_dist); |
There was a problem hiding this comment.
for (auto& fut : futs) RETURN_IF_ERROR(fut.result().value()); is missing
There was a problem hiding this comment.
changed to fut.get()
There was a problem hiding this comment.
I've added comments to #355 . Basically, it does not work as desired.
b569ba7 to
1ee5dfb
Compare
|
/hold |
|
/unhold |
|
Please change for using |
… added related unit test Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
1ee5dfb to
c0f90d3
Compare
|
/lgtm |
/kind feature
#193
This PR adds the basic framework of supporting sparse index in knowhere, with 2 index types: inverted index and wand.
In this PR only Search and GetVectorByIds is supported.
Added basic unit tests:
Search without filtering, with different dataset sizes/recall/sparisity, etc.
Search with filtering, with different filter rate.