Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhanced Regexp Search | Permutations, Fuzzy matching #404

Closed
EmilianoCostantini opened this issue Jul 15, 2019 · 4 comments
Closed

Enhanced Regexp Search | Permutations, Fuzzy matching #404

EmilianoCostantini opened this issue Jul 15, 2019 · 4 comments

Comments

@EmilianoCostantini
Copy link

It would be great to enhance the search capabilities with features such as:

  • deterministic permutations
  • fuzzy searching

I'm not big on fuzzy searching, but let me make an example about permutations.

Say the user enters the search terms:

  • xxx yyy zzz www

That would be 4 different terms, that can be ordered in 4! (that is, 24) different ways:

  1. xxx yyy zzz www
  2. xxx yyy www zzz
  3. xxx www zzz yyy
  4. www yyy zzz xxx
  5. zzz yyy xxx www
  6. ...

(and so on and so forth, up till the 24th possible permutation.)

Each permutation should be converted to regexp:

  1. /([\s\S]*)xxx([\s\S]*)yyy([\s\S]*)zzz([\s\S]*)www([\s\S]*)/i
  2. /([\s\S]*)xxx([\s\S]*)yyy([\s\S]*)www([\s\S]*)zzz([\s\S]*)/i
  3. /([\s\S]*)xxx([\s\S]*)www([\s\S]*)zzz([\s\S]*)yyy([\s\S]*)/i
  4. /([\s\S]*)www([\s\S]*)yyy([\s\S]*)zzz([\s\S]*)xxx([\s\S]*)/i
  5. /([\s\S]*)zzz([\s\S]*)yyy([\s\S]*)xxx([\s\S]*)www([\s\S]*)/i
  6. ...

Then each bookmark matching (in either title or description/comments) at least one of the 24 regexps should be added to results.

As you can check on RegExr, such a bookmark would be —for instance— one containing the substrings:

  • foobar--xxXxX----foobar foobar foobar...YyYy...foobar foobar foobar****ZZZzzz**foobar foobar foobar__wWWW____foobar

regardless both the order and the case, just as long as each and every search term is there.

I have already proposed the feature on BukuBrow, but on second thought it would make more sense for it to be built directly in Buku.

@jarun
Copy link
Owner

jarun commented Jul 15, 2019

We are storing a lot of text and I know users having up to 40K bookmarks. I would leave fuzzy search out of the equation.

The search is case-insensitive - https://github.com/jarun/Buku/wiki/Operational-notes#search

Please go through the Search section. We support regex searches. And the searches are independent of the order of the tokens (try --sany and --sall). Am I missing something here?

@EmilianoCostantini
Copy link
Author

It may be the case.

Please, try adding the following three bookmarks:


# Lines beginning with "#" will be stripped.
# Add URL in next line (single line).
www.animals.com
# Add TITLE in next line (single line). Leave blank to web fetch, "-" for no title.
1 cat, 2 dogs, 2 monkeys, 2 lions, 2 tigers, 2 elephants
# Add comma-separated TAGS in next line (single line).

# Add COMMENTS in next line(s).
1 cat, 2 dogs, 2 monkeys, 2 lions, 2 tigers, 2 elephants


# Lines beginning with "#" will be stripped.
# Add URL in next line (single line).
www.justonecat.com
# Add TITLE in next line (single line). Leave blank to web fetch, "-" for no title.
Just 1 cat.
# Add comma-separated TAGS in next line (single line).

# Add COMMENTS in next line(s).
Just 1 cat.


# Lines beginning with "#" will be stripped.
# Add URL in next line (single line).
www.justdogs.com
# Add TITLE in next line (single line). Leave blank to web fetch, "-" for no title.
Just 2 dogs.
# Add comma-separated TAGS in next line (single line).

# Add COMMENTS in next line(s).
Just 2 dogs.


Now, say you'd like to extract the bookmarks containing each and everyone of the substrings:
elephant
cat
dog
lion
monkey
tiger
regardless both the order and the case (please note the search terms are singular).

  • buku --sall elephant cat dog lion monkey tiger won't return anything at all, because of the exact matching.
  • buku --sany elephant cat dog lion monkey tiger will return 1 cat, 2 dogs, 2 monkeys, 2 lions, 2 tigers, 2 elephants and Just 1 cat, also because of the exact matching.
    That's not good, though, since you're interested in bookmarks containing all of the search terms, not just one of them.
  • buku --deep elephant cat dog lion monkey tiger will return 1 cat, 2 dogs, 2 monkeys, 2 lions, 2 tigers, 2 elephants, along with Just 1 cat and Just 2 dogs.
    Still not good, for the same reason as before.

You might think this is not a big deal. After all both --sany and --deep return —among the others— the very bookmark you were looking for, right?
But the others actually are a problem:

  • when your database stores hundreds of bookmarks sharing lots of terms with each other, and you're trying to specifically locate one of them.
  • when common words (e.g. articles) are added to the search terms.

Unless something is eluding me, currently there's no way to get just the bookmarks containing all the requested substrings regardless both the order and the case.

If so, please consider the possibility to add a search option like --allDeep implementing the first post's permutations-and-regexps logic.

@jarun
Copy link
Owner

jarun commented Jul 17, 2019

Unless something is eluding me, currently there's no way to get just the bookmarks containing all the requested substrings regardless both the order and the case.

Yes, you didn't try regex.

$ buku -r monkey[s]

1. 1 cat, 2 dogs, 2 monkeys, 2 lions, 2 tigers, 2 elephants [496]
   > www.animals.com
   + 1 cat, 2 dogs, 2 monkeys, 2 lions, 2 tigers, 2 elephants

That was a simple one but I think you can have a more complex regex pattern that would match all the string and return just what you want. Probably you are looking for lookahead.

https://www.regular-expressions.info/completelines.html

Coming to the deep search results. I see:

$ buku --deep elephant cat dog lion monkey tiger

1. 1 cat, 2 dogs, 2 monkeys, 2 lions, 2 tigers, 2 elephants [496]
   > www.animals.com
   + 1 cat, 2 dogs, 2 monkeys, 2 lions, 2 tigers, 2 elephants

So the desired result is at the top because of the ranking algorithm (basically max matches). You can limit the number of results per page as well. So you can figure out the results you need from the topmost ones and discard the later ones.

At the current stage of the project, that's the best we can offer. There's literally no RoI in implementing anything too heavy.

@jarun jarun closed this as completed Jul 17, 2019
@jarun
Copy link
Owner

jarun commented Jul 21, 2019

I think I missed this in my earlier note - if you really need it, please feel free to implement this and raise a PR. We would be glad to help with anything you need.

@github-actions github-actions bot locked and limited conversation to collaborators Jun 15, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants