-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fuzzy match support #2916
Fuzzy match support #2916
Conversation
…ue-1883_fuzzy_match_support_in_dgraph
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 12 files reviewed, 7 unresolved discussions (waiting on @martinmr)
systest/queries_test.go, line 579 at r2 (raw file):
Previously, srfrog (Gus) wrote…
Move this down? Ok
Done.
tok/tokens.go, line 50 at r2 (raw file):
Previously, srfrog (Gus) wrote…
Because the caller carries the cost of the check, and typically we send a slice of arguments. But there's room for an optimization here.
I removed this func and added a general GetTokens func that uses tokenizer ID, and the args are variadic so don't need to send a slice. In another change I will refactor the code to use it as needed.
worker/tokens.go, line 76 at r2 (raw file):
Previously, srfrog (Gus) wrote…
actually I'll get rid of the switch and use if. thanks.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 12 files reviewed, 1 unresolved discussion (waiting on @srfrog)
systest/queries_test.go, line 579 at r2 (raw file):
Previously, srfrog (Gus) wrote…
Done.
Sorry, I meant having the code like
tests := []struct {in, out, failure string} {
cause otherwise there's a line with just }{
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 12 files reviewed, 1 unresolved discussion (waiting on @martinmr)
systest/queries_test.go, line 579 at r2 (raw file):
Previously, martinmr (Martin Martinez Rivera) wrote…
Sorry, I meant having the code like
tests := []struct {in, out, failure string} {
cause otherwise there's a line with just
}{
ah i see. i leave like that for clarity and consistency with the other tests.
…actor, which would cause early termination of the algo to save CPU resources. Remove the fuzzysearch lib.
This change allows setting a second integer argument in match() to set the max Levenshtein distance threshold. If no value is set, the default value of 8 is used.
Updated test for new matchFuzzy using threshold.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few more changes required.
Reviewed 3 of 8 files at r2, 3 of 6 files at r3, 1 of 6 files at r4, 3 of 3 files at r5, 1 of 1 files at r6.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @martinmr and @srfrog)
vendor/vendor.json, line 655 at r6 (raw file):
{ "checksumSHA1": "9fTIdD63nJT3Y4QvHtw9dCBhzzE=", "path": "github.com/lithammer/fuzzysearch/fuzzy",
Needs to be removed.
worker/task.go, line 1195 at r6 (raw file):
// convert data from binary to appropriate format strVal, err := types.Convert(val, types.StringID) if err == nil && matchFuzzy(matchQuery, strVal.Value.(string), int(arg.srcFn.threshold)) {
Convert the threshold to int once, and just use for all the uids.
worker/task.go, line 1507 at r6 (raw file):
l := len(q.SrcFunc.Args) if l == 0 || l > 2 { return nil, x.Errorf("Function '%s' requires at most 2 arguments, but got %d (%v)",
at most? Isn't it always 2 args?
worker/task.go, line 1516 at r6 (raw file):
fc.intersectDest = needsIntersect(f) // Max Levenshtein distance fc.threshold = 8
Don't add artificial limits. Remove this.
Note that the matching is only done on lists which have been returned by the trigram index. So, we're not matching against the universal data set. Additionally, the cost of reading these lists from the disk is a lot more than running a limited computation in memory.
In the worst case, we're returning all the uids that were returned by the trigram index, which is a manageable list.
worker/task.go, line 1524 at r6 (raw file):
return nil, x.Errorf("Levenshtein distance value must be an int, got %v", s) } if max > 0 && max < 8 {
if less than zero, return an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @manishrjain and @martinmr)
vendor/vendor.json, line 655 at r6 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Needs to be removed.
Done.
worker/task.go, line 1195 at r6 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Convert the threshold to int once, and just use for all the uids.
Done.
worker/task.go, line 1507 at r6 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
at most? Isn't it always 2 args?
No, I made the distance optional. But I will change it to expect 2 args.
worker/task.go, line 1516 at r6 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Don't add artificial limits. Remove this.
Note that the matching is only done on lists which have been returned by the trigram index. So, we're not matching against the universal data set. Additionally, the cost of reading these lists from the disk is a lot more than running a limited computation in memory.
In the worst case, we're returning all the uids that were returned by the trigram index, which is a manageable list.
Done.
worker/task.go, line 1524 at r6 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
if less than zero, return an error.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r7.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @martinmr)
* added "match" to list of valid funcs * added "match" to list of valid func names * added handleMatchFunction and MatchFn type * the match func needs a fulltext tokenizer * added ngram bleve analizer * added ngram tokenizer * verify match uses ngram tokenizer * get string tokens for match * added func to build match tokens * changed handleMatchFunction to use index * cherry-pick schema.HasTokenizer * configure bigram filter before the analyzer to update cache * added EncodeTokens convenience func to encapsulate encodeToken * we dont need to pre-get tokens, we do that when the task is running * handleMatchFunction updated for fuzzy match, filter optimizations, and code cleanups * matchFuzzy func using index and bigram (ngram) index * adding Bleve ngram support * added func comments * all fuzzy tokens must match * cp: test cases might not be mutex in the future, revert the order. * cp: dont try to match all posting values against match * cp: added comment * removed extra branch * switch to trigram for fuzzy match indexing * fixed typo * renamed ngram to match * fixed typo * added full posting value to search terms, minor cleanups * added fuzzy matching test * remove Bleve ngram pkg * revert this change * fixed needsIntersect * switched to a new fuzzy matching pkg * tweaked test with misspelling * added func GetTokenizerByID to search registered tokenizers by id * added tok.GetTokens for generating tokens by id * changed to use tok.GetTokens * fixed grammar in comment * replaced underused switch with if block * small test change * Pick up and modify Levenshtein distance to introduce a max distance factor, which would cause early termination of the algo to save CPU resources. Remove the fuzzysearch lib. * using threshold for lev distance max * worker/task.go: added match() argument for specifying max distance. This change allows setting a second integer argument in match() to set the max Levenshtein distance threshold. If no value is set, the default value of 8 is used. * systest/queries_test.go: updated match query tests Updated test for new matchFuzzy using threshold. * wiki/content/query-language/index.md: added section for match function * vendor/vendor.json: removed old fuzzy pkg * worker/task.go: match func enforce 2 args, max distance must be gt zero * wiki/content/query-language/index.md: updated syntax, example and fixed typos * systest/queries_test.go: updated syntax in tests * wiki/content/query-language/index.md: minior doc fixes
Is there a chance to cherry-pick support for the |
This PR adds support for fuzzy matching to root queries and filters. It requires a trigram index on the predicate.
match()
that uses trigram index to make fuzzy comparisons.Examples:
No breaking changes are expected.
Closes #1883
This change is