feat(vector/hnsw): add per‑query ef and distance_threshold to similar_to, fix early termination #17
Hugely appreciative of the Dgraph team's work. Native vector search integrated directly into a graph database is a no-brainer today. I've deployed Dgraph (both vanilla and customised) in systems with 1M+ vectors guiding deep traversal queries across 10M+ nodes. That tight coupling of vector search with graph traversal at scale gets us closer to something that could represent the fuzzy nuances of everything in an enterprise. It is certainly not the biggest deployment your team will have seen. This PR fixes an under-recall edge case in HNSW and introduces opt-in, per-query controls that let users dial recall vs latency safely and predictably. I've had this running in production for a while and thought it worth proposing to main.
Summary
- `ef` and `distance_threshold` (string or JSON-like fourth argument).

Motivation
- Lack of a per-query `ef` meant recall vs latency trade-offs required global tuning or inflating `k` (and downstream work).

Changes (key files)
- `tok/hnsw/persistent_hnsw.go`: fix early termination, add `SearchWithOptions`/`SearchWithUidAndOptions`, apply `ef` override at upper layers and `max(k, ef)` at bottom layer, apply `distance_threshold` in the metric domain (Euclidean squared internally, cosine as 1 − sim).
- `tok/index/index.go`: add `VectorIndexOptions` and `OptionalSearchOptions` (non-breaking).
- `worker/task.go`: parse optional fourth argument to `similar_to` (`ef`, `distance_threshold`), thread options, route to optional methods when provided, guard zero/negative `k`.
- `tok/index/search_path.go`: add `SearchPathResult` helper.
- `tok/hnsw/ef_recall_test.go` adds `TestHNSWSearchEfOverrideImprovesRecall`, `TestHNSWDistanceThreshold_Euclidean`, `TestHNSWDistanceThreshold_Cosine`.
- `CHANGELOG.md`: Unreleased entry for HNSW fix and per-query options.

Backwards compatibility
- `similar_to(attr, k, vector_or_uid)` is unchanged.
- `ef` and `distance_threshold` are optional; unsupported metrics safely ignore the threshold.

Performance
- With `ef`, the bottom-layer candidate size becomes `max(k, ef)` (as in HNSW), and cost scales accordingly.

Rationale and alignment
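To make the split between `ef` and `k` concrete, here is a minimal sketch of the `max(k, ef)` sizing and the final cut to `k` results. It is not the code in `persistent_hnsw.go`: the `candidate` type and the helpers `bottomLayerSize` and `topK` are hypothetical names for illustration, and treating `ef == 0` as "not provided" is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical (uid, distance) pair, not a Dgraph type.
type candidate struct {
	UID  uint64
	Dist float64
}

// bottomLayerSize returns how many candidates the bottom layer keeps
// while searching: max(k, ef). In this sketch ef == 0 means "no override".
func bottomLayerSize(k, ef int) int {
	if ef > k {
		return ef
	}
	return k
}

// topK sorts a copy of the candidates by ascending distance and returns
// at most k of them. A zero or negative k yields no results.
func topK(cands []candidate, k int) []candidate {
	if k <= 0 {
		return nil
	}
	sorted := append([]candidate(nil), cands...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Dist < sorted[j].Dist })
	if len(sorted) > k {
		sorted = sorted[:k]
	}
	return sorted
}

func main() {
	// ef widens exploration; k still bounds what is returned.
	fmt.Println(bottomLayerSize(10, 0))
	fmt.Println(bottomLayerSize(10, 64))

	explored := []candidate{{UID: 3, Dist: 0.9}, {UID: 1, Dist: 0.1}, {UID: 2, Dist: 0.4}}
	fmt.Println(topK(explored, 2))
}
```

A larger `ef` only changes how many candidates are explored; the caller still receives at most `k` results, which is why raising `ef` is cheaper downstream than inflating `k`.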
- `ef_search` controls exploration/recall, `k` controls output size.
- `ef` and `distance_threshold` semantics for familiarity.

Checklist
- `CHANGELOG.md` describing this PR
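For reviewers, the `distance_threshold` handling described under Changes ("Euclidean squared internally, cosine as 1 − sim") can be pictured with the sketch below. It reflects one reading of that description, not the actual implementation: the `metric` type, its constants, and `withinThreshold` are hypothetical names, and the threshold is assumed to be non-negative and expressed in the user-facing distance.

```go
package main

import "fmt"

// metric is a hypothetical label for the index's distance metric.
type metric string

const (
	euclidean metric = "euclidean"
	cosine    metric = "cosine"
)

// withinThreshold reports whether a candidate passes the user's threshold.
// internal is the value the index works with in this sketch:
//   - euclidean: the squared Euclidean distance
//   - cosine:    the cosine similarity
func withinThreshold(m metric, internal, threshold float64) bool {
	switch m {
	case euclidean:
		// Compare in the squared domain to avoid a square root per candidate.
		return internal <= threshold*threshold
	case cosine:
		// Cosine distance is taken as 1 - similarity.
		return 1-internal <= threshold
	default:
		// Unsupported metrics ignore the threshold.
		return true
	}
}

func main() {
	fmt.Println(withinThreshold(euclidean, 4.0, 2.0)) // squared distance 4 vs threshold 2
	fmt.Println(withinThreshold(cosine, 0.5, 0.2))    // similarity 0.5 vs threshold 0.2
	fmt.Println(withinThreshold("dotproduct", 123, 0.2))
}
```

Comparing squared distances against the squared threshold keeps the filter in the same domain the search already uses, so no extra conversion is needed per candidate.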