Enhanced Regexp Search | Permutations, Fuzzy matching #404

EmilianoCostantini · 2019-07-15T17:34:09Z

It would be great to enhance the search capabilities with features such as:

deterministic permutations
fuzzy searching

I'm not big on fuzzy searching, but let me make an example about permutations.

Say the user enters the search terms:

xxx yyy zzz www

That would be 4 different terms, that can be ordered in 4! (that is, 24) different ways:

xxx yyy zzz www
xxx yyy www zzz
xxx www zzz yyy
www yyy zzz xxx
zzz yyy xxx www
...

(and so on and so forth, up till the 24th possible permutation.)

Each permutation should be converted to regexp:

/([\s\S]*)xxx([\s\S]*)yyy([\s\S]*)zzz([\s\S]*)www([\s\S]*)/i
/([\s\S]*)xxx([\s\S]*)yyy([\s\S]*)www([\s\S]*)zzz([\s\S]*)/i
/([\s\S]*)xxx([\s\S]*)www([\s\S]*)zzz([\s\S]*)yyy([\s\S]*)/i
/([\s\S]*)www([\s\S]*)yyy([\s\S]*)zzz([\s\S]*)xxx([\s\S]*)/i
/([\s\S]*)zzz([\s\S]*)yyy([\s\S]*)xxx([\s\S]*)www([\s\S]*)/i
...

Then each bookmark matching (in either title or description/comments) at least one of the 24 regexps should be added to results.

As you can check on RegExr, such a bookmark would be —for instance— one containing the substrings:

foobar--xxXxX----foobar foobar foobar...YyYy...foobar foobar foobar****ZZZzzz**foobar foobar foobar__wWWW____foobar

regardless both the order and the case, just as long as each and every search term is there.

I have already proposed the feature on BukuBrow, but on second thought it would make more sense for it to be built directly in Buku.

The text was updated successfully, but these errors were encountered:

jarun · 2019-07-15T18:11:27Z

We are storing a lot of text and I know users having up to 40K bookmarks. I would leave fuzzy search out of the equation.

The search is case-insensitive - https://github.com/jarun/Buku/wiki/Operational-notes#search

Please go through the Search section. We support regex searches. And the searches are independent of the order of the tokens (try --sany and --sall). Am I missing something here?

EmilianoCostantini · 2019-07-17T18:58:03Z

It may be the case.

Please, try adding the following three bookmarks:

# Lines beginning with "#" will be stripped.
# Add URL in next line (single line).
www.animals.com
# Add TITLE in next line (single line). Leave blank to web fetch, "-" for no title.
1 cat, 2 dogs, 2 monkeys, 2 lions, 2 tigers, 2 elephants
# Add comma-separated TAGS in next line (single line).

# Add COMMENTS in next line(s).
1 cat, 2 dogs, 2 monkeys, 2 lions, 2 tigers, 2 elephants

# Lines beginning with "#" will be stripped.
# Add URL in next line (single line).
www.justonecat.com
# Add TITLE in next line (single line). Leave blank to web fetch, "-" for no title.
Just 1 cat.
# Add comma-separated TAGS in next line (single line).

# Add COMMENTS in next line(s).
Just 1 cat.

# Lines beginning with "#" will be stripped.
# Add URL in next line (single line).
www.justdogs.com
# Add TITLE in next line (single line). Leave blank to web fetch, "-" for no title.
Just 2 dogs.
# Add comma-separated TAGS in next line (single line).

# Add COMMENTS in next line(s).
Just 2 dogs.

Now, say you'd like to extract the bookmarks containing each and everyone of the substrings:
elephant
cat
dog
lion
monkey
tiger
regardless both the order and the case (please note the search terms are singular).

buku --sall elephant cat dog lion monkey tiger won't return anything at all, because of the exact matching.
buku --sany elephant cat dog lion monkey tiger will return 1 cat, 2 dogs, 2 monkeys, 2 lions, 2 tigers, 2 elephants and Just 1 cat, also because of the exact matching.
That's not good, though, since you're interested in bookmarks containing all of the search terms, not just one of them.
buku --deep elephant cat dog lion monkey tiger will return 1 cat, 2 dogs, 2 monkeys, 2 lions, 2 tigers, 2 elephants, along with Just 1 cat and Just 2 dogs.
Still not good, for the same reason as before.

You might think this is not a big deal. After all both --sany and --deep return —among the others— the very bookmark you were looking for, right?
But the others actually are a problem:

when your database stores hundreds of bookmarks sharing lots of terms with each other, and you're trying to specifically locate one of them.
when common words (e.g. articles) are added to the search terms.

Unless something is eluding me, currently there's no way to get just the bookmarks containing all the requested substrings regardless both the order and the case.

If so, please consider the possibility to add a search option like --allDeep implementing the first post's permutations-and-regexps logic.

jarun · 2019-07-17T19:24:19Z

Unless something is eluding me, currently there's no way to get just the bookmarks containing all the requested substrings regardless both the order and the case.

Yes, you didn't try regex.

$ buku -r monkey[s]

1. 1 cat, 2 dogs, 2 monkeys, 2 lions, 2 tigers, 2 elephants [496]
   > www.animals.com
   + 1 cat, 2 dogs, 2 monkeys, 2 lions, 2 tigers, 2 elephants

That was a simple one but I think you can have a more complex regex pattern that would match all the string and return just what you want. Probably you are looking for lookahead.

https://www.regular-expressions.info/completelines.html

Coming to the deep search results. I see:

$ buku --deep elephant cat dog lion monkey tiger

1. 1 cat, 2 dogs, 2 monkeys, 2 lions, 2 tigers, 2 elephants [496]
   > www.animals.com
   + 1 cat, 2 dogs, 2 monkeys, 2 lions, 2 tigers, 2 elephants

So the desired result is at the top because of the ranking algorithm (basically max matches). You can limit the number of results per page as well. So you can figure out the results you need from the topmost ones and discard the later ones.

At the current stage of the project, that's the best we can offer. There's literally no RoI in implementing anything too heavy.

jarun · 2019-07-21T04:58:24Z

I think I missed this in my earlier note - if you really need it, please feel free to implement this and raise a PR. We would be glad to help with anything you need.

jarun closed this as completed Jul 17, 2019

github-actions bot locked and limited conversation to collaborators Jun 15, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhanced Regexp Search | Permutations, Fuzzy matching #404

Enhanced Regexp Search | Permutations, Fuzzy matching #404

EmilianoCostantini commented Jul 15, 2019

jarun commented Jul 15, 2019

EmilianoCostantini commented Jul 17, 2019

jarun commented Jul 17, 2019

jarun commented Jul 21, 2019

Enhanced Regexp Search | Permutations, Fuzzy matching #404

Enhanced Regexp Search | Permutations, Fuzzy matching #404

Comments

EmilianoCostantini commented Jul 15, 2019

jarun commented Jul 15, 2019

EmilianoCostantini commented Jul 17, 2019

jarun commented Jul 17, 2019

jarun commented Jul 21, 2019