Shuffle ids for building graph index #322

Merged: sre-ci-robot merged 1 commit into zilliztech:main from chasingegg:hnsw-shuffle on Jan 3, 2024
chasingegg (Collaborator) commented Jan 2, 2024

This PR shuffles ids when building a graph index. A user's data may be clustered by insertion order, which can create islands in the graph and significantly hurt recall; shuffling alleviates this.
The base vector ids (not the raw data itself) are divided into blocks (DiskANN already does this), and the blocks are shuffled before the index is built. Adding the points within each block remains memory friendly. The results below are from a dataset with clustered data. This PR uses 8192 as the default block size in HNSW, the value DiskANN currently uses, to keep almost the same build time while achieving high recall in these cases.

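| Strategy | hnsw build time (s) | hnsw recall (top 100) | diskann build time (s) | diskann recall (top 100) |
| --- | --- | --- | --- | --- |
| Full shuffle | 246 | 0.926 | 479 | 0.933 |
| Block size = 1024 | 147 | 0.9079 | 389 | 0.9203 |
| Block size = 8192 | 126 | 0.8988 | 376 | 0.8667 |
| No shuffle (current) | 117 | 0.4975 | 371 | 0.4752 |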
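
The block-shuffle idea described above can be sketched roughly as follows. This is a minimal Python illustration, not the code in this PR: the function name `block_shuffled_ids`, its parameters, and the `seed` argument are all hypothetical, and it assumes that only the order of the blocks is shuffled while ids inside a block stay consecutive.

```python
import random


def block_shuffled_ids(num_points, block_size=8192, seed=None):
    """Return every id in [0, num_points) with the block order shuffled.

    Hypothetical helper for illustration only. Ids inside a block stay
    consecutive, so inserting one block still touches a contiguous range
    of the base vectors.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    rng = random.Random(seed)  # a fixed seed makes the order reproducible

    # Start offset of each block; the last block may be shorter.
    starts = list(range(0, num_points, block_size))
    rng.shuffle(starts)

    order = []
    for start in starts:
        end = min(start + block_size, num_points)
        order.extend(range(start, end))
    return order


if __name__ == "__main__":
    # Tiny example: 20 ids in blocks of 8 (blocks start at 0, 8, 16).
    print(block_shuffled_ids(20, block_size=8, seed=0))
```

In this sketch, `block_size=1` degenerates to a full shuffle of individual ids, and any `block_size` of at least `num_points` leaves the original order unchanged, which loosely corresponds to the "Full shuffle" and "No shuffle" rows of the table.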

mergify bot commented Jan 2, 2024

@chasingegg 🔍 Important: PR Classification Needed!

For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:

  1. If you're fixing a bug, label it as kind/bug.
  2. For small tweaks (less than 20 lines without altering any functionality), please use kind/improvement.
  3. Significant changes that don't modify existing functionalities should be tagged as kind/enhancement.
  4. Adjusting APIs or changing functionality? Go with kind/feature.

For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”.

Thanks for your efforts and contribution to the community!

alexanderguzhva (Collaborator) commented

@chasingegg

  1. What problem does this commit solve exactly, and why does this change actually solve it?
  2. Is it worth making the shuffle configurable so that results are more reproducible, for example when debugging?

Thanks

@mergify mergify bot added the ci-passed label Jan 3, 2024
Signed-off-by: chasingegg <chao.gao@zilliz.com>
@sre-ci-robot sre-ci-robot added size/M and removed size/S labels Jan 3, 2024
@mergify mergify bot removed the ci-passed label Jan 3, 2024
@chasingegg chasingegg changed the title Shuffle ids for building hnsw index Shuffle ids for building graph index Jan 3, 2024
@mergify mergify bot added the ci-passed label Jan 3, 2024
liliu-z (Collaborator) commented Jan 3, 2024

We still need to understand why ordered data combined with OOD data can damage recall.

liliu-z (Collaborator) commented Jan 3, 2024

/lgtm

foxspy (Collaborator) left a comment

/lgtm

sre-ci-robot (Collaborator) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chasingegg, foxspy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

chasingegg (Collaborator, Author) commented

Fix #326

chasingegg (Collaborator, Author) commented

/kind improvement

@sre-ci-robot sre-ci-robot merged commit 157fdbc into zilliztech:main Jan 3, 2024
@chasingegg chasingegg deleted the hnsw-shuffle branch January 3, 2024 11:58
chasingegg added a commit to chasingegg/Knowhere that referenced this pull request Feb 4, 2024
Signed-off-by: chasingegg <chao.gao@zilliz.com>
liliu-z pushed a commit that referenced this pull request Feb 4, 2024
Signed-off-by: chasingegg <chao.gao@zilliz.com>
foxspy added a commit that referenced this pull request Feb 29, 2024
* Remove omp (#276)

Signed-off-by: Yudong Cai <yudong.cai@zilliz.com>

* [cherry-pick]Pip3 install in requirements.txt order (#303)

Signed-off-by: cqy123456 <qianya.cheng@zilliz.com>

* Cherry-pick #294 #298 from main (#300)

Signed-off-by: Yudong Cai <yudong.cai@zilliz.com>

* Deprecate Invalid config checking (#304)

Signed-off-by: Li Liu <li.liu@zilliz.com>

* Fix scann range search (#316)

Signed-off-by: chasingegg <chao.gao@zilliz.com>

* Upgrade conan to 1.61.0 (#182) (#347)

Signed-off-by: Enwei Jiao <enwei.jiao@zilliz.com>
Co-authored-by: Enwei Jiao <enwei.jiao@zilliz.com>

* raft hasrawdata return false

Signed-off-by: yusheng.ma <yusheng.ma@zilliz.com>

* switch knowhere-test branch (#384)

Signed-off-by: cqy123456 <qianya.cheng@zilliz.com>

* Ensure topk results for IVF_FLAT_CC (#353) (#383)

Signed-off-by: chasingegg <chao.gao@zilliz.com>

* make sure we rethrow exceptions in async tasks: make sure we do not crash due to uncaught exceptions when we called folly::Future::wait but not trying to get the values; use folly::collect to simplify code (#382)

Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>

* [2.2 fix] wrap IVF index train/build calls in lambdas passed to knowhere thread pool, so OMP threads spawned will have low nice values (#379)

Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>

* fix diskann async cache generation (#377)

Signed-off-by: xianliang <xianliang.li@zilliz.com>

* fix:miss wait thread tasks finish in diskann. (#380)

Signed-off-by: cqy123456 <qianya.cheng@zilliz.com>

* Shuffle ids for building hnsw index (#322) (#381)

Signed-off-by: chasingegg <chao.gao@zilliz.com>

* sync knowherer 2.2.4

Signed-off-by: xianliang <xianliang.li@zilliz.com>

---------

Signed-off-by: Yudong Cai <yudong.cai@zilliz.com>
Signed-off-by: cqy123456 <qianya.cheng@zilliz.com>
Signed-off-by: Li Liu <li.liu@zilliz.com>
Signed-off-by: chasingegg <chao.gao@zilliz.com>
Signed-off-by: Enwei Jiao <enwei.jiao@zilliz.com>
Signed-off-by: yusheng.ma <yusheng.ma@zilliz.com>
Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
Signed-off-by: xianliang <xianliang.li@zilliz.com>
Co-authored-by: Cai Yudong <yudong.cai@zilliz.com>
Co-authored-by: cqy123456 <39671710+cqy123456@users.noreply.github.com>
Co-authored-by: liliu-z <105927039+liliu-z@users.noreply.github.com>
Co-authored-by: Gao <chao.gao@zilliz.com>
Co-authored-by: Enwei Jiao <enwei.jiao@zilliz.com>
Co-authored-by: yusheng.ma <yusheng.ma@zilliz.com>
Co-authored-by: Buqian Zheng <zhengbuqian@gmail.com>