Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RankFind does not use Levenshtein distance to build results #10

Open
colelawrence opened this issue Sep 8, 2015 · 6 comments
Open

RankFind does not use Levenshtein distance to build results #10

colelawrence opened this issue Sep 8, 2015 · 6 comments

Comments

@colelawrence
Copy link

RankFind still uses plain Find to determine initial results, then measures Levenshtein distance afterwards. Maybe we can have another RankFindLevenshtein that uses Levenshtein distance to determine results.

@lithammer
Copy link
Owner

I actually played around with something like that for RankFind at one point. What I found was that unless you set a high threshold, most longer targets would never match.

Say you search for "aaa" in "aaabbbbbbbbbbbbbbbbbbb" and your threshold is 15 it wouldn't match. Perhaps that's exactly what you wanted, but I think it might have to be a bit smarter and take length into consideration as well or something... This is basically when I went with the current solution of RankFind 😃

Another approach would be to create a RankFindFunc(s, t string, fn func(t, s string) bool) where you can define your own criteria.

@lithammer
Copy link
Owner

Then you'd have something like this

const threshold = 15

func predicate(s, t string) bool {
    distance := LevenshteinDistance(s, t)  // 19
    return distance < threshold
}

fuzzy.RankFindFunc("aaa", "aaabbbbbbbbbbbbbbbbbbb", predicate)  // false

@colelawrence
Copy link
Author

Interesting, thank you for the explanations!

On Wed, Sep 9, 2015, 6:08 AM Peter Renström [email protected]
wrote:

Then you'd have something like this

const threshold = 15
func predicate(s, t string) bool {
distance := LevenshteinDistance(s, t)
return distance < threshold
}
RankFindFunc("aaa", "aaabbbbbbbbbbbbbbbbbbb", predicate)


Reply to this email directly or view it on GitHub
#10 (comment)
.

@lithammer
Copy link
Owner

Yeah, so I'm not sure the distance alone is enough to determine a match. Having "a" match "ab" but not "abc" is confusing, it should give at least as many hits as a plain old substring search.

Did you have a good use-case for this? Perhaps I'm just missing something here 😄

@colelawrence
Copy link
Author

Yeah, so I use it to create a search engine for my GoCourseSort project, and the reason I use Levenshtein instead of simple match, is so a word in the search query matches a keyword from the database of strings.

I think it is vaguely similar to the way BigTables work (don't quote me).

Imagine book titles: "Gone with the Wind", "Gone Girl", and "The Girls", we create an index of key words with references:

gonegirl := &Book{
  title: "Gone Girl",
}
gonewiththewind := &Book{
  title: "Gone with the Wind",
}
thegirls := &Book{
  title: "The Girls",
}
indexByKeywords := map[string][]*Book {
  "girls": { thegirls },
  "girl": { gonegirl },
  "gone": { gonegirl, gonewiththewind },
  "with": { gonewiththewind },
  "the": { gonewiththewind, thegirls },
  "wind": { gonewiththewind },
}

Now you enter the search term "the girls"

Then I'm Levenshtein ranking "the" against the keys of indexByKeywords, and "girls" against the keys of indexByKeywords, then using a ranking formula based on Levenshtein distance, index of word in search, and index of word in title, for each list of references I get back per search term.

This is important because in my search engine, I don't want order of words to be important in any way, but I need "CS" to match "CSC" and "131" to match "121", because I'm using it for a college course catalog.

@elazarl
Copy link

elazarl commented Jan 7, 2016

You might want to look at my implementation and the fuzzy find

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants