Shuffle ids for building graph index #322
|
@chasingegg 🔍 Important: PR Classification Needed! For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:
For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”. Thanks for your efforts and contribution to the community! |
Thanks |
Signed-off-by: chasingegg <chao.gao@zilliz.com>
cf5db0a to 5937628
|
We still need to understand why ordering combined with OOD (out-of-distribution) data can damage recall. |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: chasingegg, foxspy |
|
Fix #326 |
|
/kind improvement |
Signed-off-by: chasingegg <chao.gao@zilliz.com>
* Remove omp (#276)
  Signed-off-by: Yudong Cai <yudong.cai@zilliz.com>
* [cherry-pick]Pip3 install in requirements.txt order (#303)
  Signed-off-by: cqy123456 <qianya.cheng@zilliz.com>
* Cherry-pick #294 #298 from main (#300)
  Signed-off-by: Yudong Cai <yudong.cai@zilliz.com>
* Deprecate Invalid config checking (#304)
  Signed-off-by: Li Liu <li.liu@zilliz.com>
* Fix scann range search (#316)
  Signed-off-by: chasingegg <chao.gao@zilliz.com>
* Upgrade conan to 1.61.0 (#182) (#347)
  Signed-off-by: Enwei Jiao <enwei.jiao@zilliz.com>
  Co-authored-by: Enwei Jiao <enwei.jiao@zilliz.com>
* raft hasrawdata return false
  Signed-off-by: yusheng.ma <yusheng.ma@zilliz.com>
* switch knowhere-test branch (#384)
  Signed-off-by: cqy123456 <qianya.cheng@zilliz.com>
* Ensure topk results for IVF_FLAT_CC (#353) (#383)
  Signed-off-by: chasingegg <chao.gao@zilliz.com>
* make sure we rethrow exceptions in async tasks: make sure we do not crash due to uncaught exceptions when we called folly::Future::wait but not trying to get the values; use folly::collect to simplify code (#382)
  Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
* [2.2 fix] wrap IVF index train/build calls in lambdas passed to knowhere thread pool, so OMP threads spawned will have low nice values (#379)
  Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
* fix diskann async cache generation (#377)
  Signed-off-by: xianliang <xianliang.li@zilliz.com>
* fix:miss wait thread tasks finish in diskann. (#380)
  Signed-off-by: cqy123456 <qianya.cheng@zilliz.com>
* Shuffle ids for building hnsw index (#322) (#381)
  Signed-off-by: chasingegg <chao.gao@zilliz.com>
* sync knowherer 2.2.4
  Signed-off-by: xianliang <xianliang.li@zilliz.com>
---------
Signed-off-by: Yudong Cai <yudong.cai@zilliz.com>
Signed-off-by: cqy123456 <qianya.cheng@zilliz.com>
Signed-off-by: Li Liu <li.liu@zilliz.com>
Signed-off-by: chasingegg <chao.gao@zilliz.com>
Signed-off-by: Enwei Jiao <enwei.jiao@zilliz.com>
Signed-off-by: yusheng.ma <yusheng.ma@zilliz.com>
Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
Signed-off-by: xianliang <xianliang.li@zilliz.com>
Co-authored-by: Cai Yudong <yudong.cai@zilliz.com>
Co-authored-by: cqy123456 <39671710+cqy123456@users.noreply.github.com>
Co-authored-by: liliu-z <105927039+liliu-z@users.noreply.github.com>
Co-authored-by: Gao <chao.gao@zilliz.com>
Co-authored-by: Enwei Jiao <enwei.jiao@zilliz.com>
Co-authored-by: yusheng.ma <yusheng.ma@zilliz.com>
Co-authored-by: Buqian Zheng <zhengbuqian@gmail.com>
This PR shuffles ids when building a graph index. A user's data may be clustered by its insertion sequence, which can create islands in the graph and hurt recall significantly; shuffling can alleviate this.
Divide the base vector ids (not the raw data itself) into blocks (diskann already does this) and shuffle these blocks before building the index. Adding the points within each block is still memory friendly. Results are shown on a dataset with clustered data. This PR tries 8192 as the default block size in hnsw, the value currently used in diskann, to achieve almost the same build time and high recall in these cases.
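The block-level shuffle described above can be sketched roughly as follows. This is a minimal illustration in Python, not the knowhere implementation: the function name `block_shuffled_ids` and the `seed` parameter are hypothetical, and only the default block size of 8192 comes from the description. Only the order of the blocks is randomized, so ids inside a block stay consecutive and inserting a block still touches neighbouring vectors in memory.

```python
import random
from typing import List


def block_shuffled_ids(num_points: int, block_size: int = 8192, seed: int = 0) -> List[int]:
    """Return every id in [0, num_points) once, with the block order shuffled.

    Hypothetical helper for illustration only. Ids within a block remain
    consecutive; only the order in which blocks are visited is random.
    """
    if num_points < 0 or block_size <= 0:
        raise ValueError("num_points must be >= 0 and block_size must be > 0")

    # Each block is a half-open range [start, end) of consecutive ids.
    # The last block may be shorter than block_size.
    blocks = [
        range(start, min(start + block_size, num_points))
        for start in range(0, num_points, block_size)
    ]

    # Shuffle the blocks, not the individual ids (seeded for reproducibility).
    random.Random(seed).shuffle(blocks)

    # Flatten back into a single insertion order.
    return [point_id for block in blocks for point_id in block]


if __name__ == "__main__":
    # Small block size so the effect is visible; the exact order depends on the seed.
    print(block_shuffled_ids(10, block_size=4, seed=1))
```

An index builder would then add points in the returned order instead of `0, 1, 2, ...`. Shuffling blocks rather than single ids is the trade-off the description points at: it breaks up long runs of sequentially clustered data while keeping each run of insertions local in memory.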