
Use optimal subsequence match to rank completions #4653

Closed · krobelus wants to merge 7 commits from the better-rank-heuristic branch

Conversation

@krobelus (Contributor)

When using fuzzy completion with candidates from "git ls-files" in
a project with many files, I often get unexpected ordering even when
one candidate is clearly the best.
This usually happens when both candidates are the same kind of match
(typically just a subsequence match), in which case the leftmost
match wins (via max_index), which is often wrong.
There are also other matching heuristics, like FirstCharMatch, that
can lead to unexpected ordering by dominating everything else
(for example, with query "remcc" we rank "README.asciidoc" over
"src/remote.cc").

Fix these two issues by

  1. switching to a restricted variant[*] of the Needleman-Wunsch
    algorithm to find an optimal match instead of the leftmost one, and
  2. dropping some of the old heuristics in favor of ones based on the
    optimal match.

Optimality is defined by a new matching heuristic which favors longer
subsequences by adding a penalty to each new gap.
On top of this, when comparing two matches, we first compare by the
number of gaps. This ensures that short queries can prefer long
candidates over shorter (but worse) candidates. Alternatively, we
could simply increase the penalty; I didn't do that yet because I
haven't encountered a real-world use case where the proposed
heuristic fails.
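
To make the scoring concrete, here is a minimal self-contained sketch of
the idea, not the PR's actual code: the constants, names, and the plain
`==` comparison (no smartcase folding) are all illustrative. Each matched
character earns a bonus and each new gap between matched characters costs
a fixed penalty; the dynamic program maximizes over all subsequence
alignments instead of taking the leftmost one.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

constexpr int match_bonus = 2;              // illustrative value
constexpr int gap_penalty = 3;              // charged once per new gap
constexpr int minus_inf   = -1'000'000'000;

// Best score of query as a subsequence of candidate, or minus_inf if it
// is not a subsequence. Leading and trailing gaps are free in this sketch.
int optimal_subsequence_score(const std::string& query, const std::string& candidate)
{
    const int n = query.size(), m = candidate.size();
    if (n == 0)
        return 0;
    // matched[i][j]: best score with i query chars consumed, the i-th
    //                matched at candidate[j-1]
    // gapped[i][j]:  best score with i query chars consumed and
    //                candidate[j-1] skipped; the gap penalty is charged
    //                when the gap is closed by the next match
    auto matched = std::vector(n + 1, std::vector<int>(m + 1, minus_inf));
    auto gapped  = matched;
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= m; ++j)
        {
            if (query[i-1] == candidate[j-1])
            {
                int prev = i == 1 ? 0 // free leading gap: start anywhere
                         : std::max(matched[i-1][j-1],
                                    gapped[i-1][j-1] == minus_inf
                                        ? minus_inf : gapped[i-1][j-1] - gap_penalty);
                if (prev != minus_inf)
                    matched[i][j] = prev + match_bonus;
            }
            gapped[i][j] = std::max(matched[i][j-1], gapped[i][j-1]);
        }
    return std::max(matched[n][m], gapped[n][m]);
}

int main()
{
    // "remcc": one gap in "src/remote.cc" vs three gaps in "readme.asciidoc"
    std::printf("%d\n", optimal_subsequence_score("remcc", "src/remote.cc"));   // 10 - 3 = 7
    std::printf("%d\n", optimal_subsequence_score("remcc", "readme.asciidoc")); // 10 - 9 = 1
}
```

With this scoring, "src/remote.cc" beats a lowercased "README.asciidoc"
for query "remcc" because it needs fewer gaps, independent of where the
leftmost match starts.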

Some heuristics are kept because they are exercised by some test cases;
for example, "a higher number of matched word boundaries is better".
Since we now compute this heuristic based on an optimal match rather
than a greedy one, this metric should be more accurate too, though
we could probably achieve even better matches by including word
boundaries in the optimal-matching heuristic. Again, I didn't do
this because I don't have a concrete example to improve.

The algorithm needs quadratic time/space. I believe this should be
mostly okay because we expect small inputs.
Insert completion candidates are usually below max_word_len=50.
The exception is line completion (which is off in the default
completers). With abnormally long lines this can cause lag:
for example, with a line of 2 million characters and a query of 5,
<c-x>l takes about half a second. I think that's harmless, since
other things are also slow at that line length.

Of course, there's plenty of room to optimize; for example, we could
skip matching prefixes.

Every successful match adds a temporary heap allocation. This feels
bad, but it's still far fewer than the number of string allocations,
so I didn't optimize that yet.

Closes #3806

[*]: Since we require a (smartcase) subsequence match, we need no
deletions or mismatches, which reduces the search space.

Alternative solution: another fuzzy-matching algorithm with
public-domain implementations is described at
https://github.com/tajmone/fuzzy-search/tree/8a0e231f756cccd20f7c57e9593df08fd2bb79ef
but I didn't really look into it.

@krobelus changed the title from "[better-rank-heuristic] Use optimal subsequence match to rank completions" to "Use optimal subsequence match to rank completions" on Jun 26, 2022
@Screwtapello (Contributor)

Does this new ranking algorithm affect #1709 (comment)?

@krobelus (Contributor, Author)

> Does this new ranking algorithm affect #1709 (comment)?

No, we still sort all completions.

I wonder what the best solution is; maybe a static-option=<option-name> is enough, or perhaps a user-defined sorting function.
Maybe a priority field in the completion options, but it's not clear how the priority should interact with other things in RankedMatch::operator<.
Custom filtering is a similar issue, although less important.

@mawww (Owner) commented Jun 30, 2022

That's a big one, I need to take some time to review, but I am happy to improve the fuzzy matching algorithm. I'll try to dedicate some time to go through that soon.

@krobelus (Contributor, Author)

Needleman-Wunsch is a nice and simple dynamic-programming algorithm, but it took me a while to get this adaptation right.
I'll try to add some more documentation on the weekend.

I realized that there is a small but noticeable slowdown. I'm testing with

define-command git-edit -params 1.. %{ edit -existing -- %arg{@} }
complete-command git-edit shell-script-candidates %{ git ls-files }

on the llvm-project monorepo (~100k files). When typing :git-edit clang
I can definitely tell that it's slower to update.

I don't think it's a huge deal but snappier behavior would feel way better.

Somehow fzf is much faster (also faster than skim). I'm not sure how much peeking at non-public-domain code is okay.

Wikipedia mentions Hirschberg's algorithm and the Method of Four Russians as improvements.
I'll try them but probably not in this PR because they make the implementation more complex.

@krobelus force-pushed the better-rank-heuristic branch 6 times, most recently from 33a9ef4 to 16af089, on July 8, 2022 13:12
@krobelus (Contributor, Author) commented Jul 8, 2022

I'm fairly happy with this now, though the heuristic could still be fine-tuned.
The last couple of commits are experimental and not so important; I can drop them.

@krobelus force-pushed the better-rank-heuristic branch 2 times, most recently from 947678a to e574fcb, on July 9, 2022 15:47
@krobelus (Contributor, Author) commented Jul 9, 2022

Fixed a regression where we failed to prioritize exact matches over smartcase matches (for some reason other fuzzy finders don't do this).

@krobelus (Contributor, Author) commented Jul 9, 2022

Recent versions also optimize the number of gaps and matched word boundaries.
I found some weird orderings which made me realize that the gaps/word-boundaries metrics need a slightly different algorithm; I'll work on that.

@krobelus (Contributor, Author)

Yeah, given query "ch", this change sorts "chgrp" before "chattr" because the former is shorter.
If the query is empty we fall back to lexicographic order, which is the other way around here.

@krobelus (Contributor, Author) commented Dec 24, 2022

> Is that the weight Screwtapello was referring to?

No, this PR makes the fuzzy-match scoring algorithm more robust and general.
The weight would be a way to let users influence the scoring algorithm; that's #4813.

> After the pull request, matching and sorting them would yield (in that order):
>
> 12.txt
> 101.txt

This exhibits the same behavior as the "chgrp"/"chattr" case.

  1. If the query is empty, the scores of "12.txt" and "101.txt" are equal, so we fall back to lexicographic order.
  2. If the query is non-empty, say "1":
    • without this PR, the scores of "12.txt" and "101.txt" are equal, so we still use lexicographic order.
    • with this PR, "12.txt" gets a better score than "101.txt" because the former is shorter.

This change is a consequence of the new fuzzy scoring algorithm:
every mismatched character adds a small penalty, hence the ordering.
Both fzf and fzy have the same behavior (an empty query means lexicographic order, while "1" picks the shorter candidate).

I think this behavior is good in general, but I have no strong opinion about this specific case where the query is a prefix of the candidate.
If we want to change this case, the way to do so would be to add a preprocessing step that removes matching prefixes (sketched below).
Then "match 1 against 101.txt and 12.txt" becomes "match the empty string against 01.txt and 2.txt", which would use lexicographic order as before.

@raiguard (Contributor) commented Sep 28, 2023

This algorithm is undoubtedly an improvement, but for file finding it still isn't as good as fzf. In my use case it keeps ranking the test file ahead of the src file.

Kakoune (bad): [screenshot]

fzf (good): [screenshot]

Note that fzf's best match is at the bottom while Kakoune's is at the top. Sorry for the visual confusion.

Here is the prompt command I am using in kak:

# Go to file (basic fuzzy finder)
prompt -shell-script-candidates %{ fd --type f --hidden } file: %{ edit %val{text} }

@krobelus (Contributor, Author)

@raiguard thanks for the test case, I have tweaked the heuristic to make it work.

FWIW, for buffer completion it already worked this way (even without this PR) because Kakoune knows that the completions are file paths. For those, we prioritize matches on the basenames (see 2df7b1f (In buffer name completion, give priority to filename (not dirname) matches, 2016-03-02)).
For your prompt command, we don't know whether the input is a file path or some other token, so it seems wrong to prioritize basenames.
What I did was change the word-boundary bonus so that matching "H" against "/Here" is worth twice as much as matching it against "OverHere".
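
For illustration, the bonus could look something like this simplified
stand-in (the PR's actual constants and boundary classification differ):

```cpp
#include <cctype>
#include <string_view>

// Simplified stand-in for the tweaked word-boundary bonus: a match at the
// start of a path component ("/Here") is worth twice as much as a match
// at a CamelCase or punctuation boundary ("OverHere", "over_here").
int word_boundary_bonus(std::string_view candidate, size_t i)
{
    if (i == 0 or candidate[i-1] == '/')
        return 4; // start of string or of a path component
    unsigned char prev = candidate[i-1], cur = candidate[i];
    if (std::islower(prev) and std::isupper(cur))
        return 2; // CamelCase boundary
    if (!std::isalnum(prev) and std::isalnum(cur))
        return 2; // after punctuation such as '_' or '.'
    return 0;     // mid-word match
}
```

With this shape, matching "H" in "/Here" scores 4 while the "H" in
"OverHere" scores 2, which is the two-to-one ratio described above.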

…tart

This implies a bunch of renamings, but the following commits will
replace count_word_boundaries_match() anyway, so I'll do that on the fly.

The only functional change in this patch is that something like "/"
at the start of a string no longer counts as a word start. That should
hardly matter.

I broke most of these while working on the new implementation.
Some are redundant though.

We use candidate/query in the constructor. The next patch adds the
same names in the subsequence scoring function. Let's use the same
terminology in the subsequence matching function.

While at it, rename the subsequence matching function to prepare
for adding an optimal subsequence matching.

This allows using CharCount in operator[] instead of casting to size_t
every time. I'm afraid that some other operations like end() are
not supported yet with types like CharCount; for some reason adding
a cast there didn't work. I got an error about not being able to
convert CharCount to size_t.
@krobelus force-pushed the better-rank-heuristic branch 2 times, most recently from 8cd981b to fd0c36f, on November 12, 2023 08:24
When using fuzzy completion with candidates from "git ls-files" in
a project with many files, I often get unexpected ordering even when
one candidate is clearly the best.
This usually happens when both candidates are the same kind of match
(typically just a subsequence match), in which case the leftmost
match wins (via max_index), which is often wrong.
There are also other matching heuristics, like FirstCharMatch, that
can lead to unexpected ordering by dominating everything else
(for example, with query "remcc" we rank "README.asciidoc" over
"src/remote.cc").

Fix these two issues by
1. switching to the Gotoh algorithm[1], a refinement of the
   Smith-Waterman algorithm that is optimized for affine gap
   penalties[2]. This makes us find an optimal match instead of
   the leftmost one.
2. dropping old heuristics that are obsoleted by the new one.

Optimality is defined by a new distance heuristic which favors longer
matching subsequences (as opposed to multiple shorter ones) by adding
a penalty to each new gap between matches.
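
Schematically, with a match bonus s, gap-open penalty o and
gap-extension penalty e (symbols are illustrative, not the actual
constants; boundary handling is omitted), the restricted recurrence is:

```latex
% M_{i,j}: best score with query char i matched at candidate char j
% G_{i,j}: best score with candidate char j skipped (inside a gap)
% A gap of length L between matches costs o + (L-1)e (affine).
M_{i,j} = \begin{cases}
            \max(M_{i-1,j-1},\; G_{i-1,j-1}) + s & \text{if } q_i \text{ matches } c_j \\
            -\infty                              & \text{otherwise}
          \end{cases}
\qquad
G_{i,j} = \max(M_{i,j-1} - o,\; G_{i,j-1} - e)
```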

For most of our RankedMatch test cases, we match the behavior of
other popular fuzzy matchers.

The algorithm needs quadratic time/space, but that's okay, since
our input is expected to be small. First, candidates for insert-mode
completion are usually below max_word_len=50. Second, if there's
ever a very large candidate, we only match against its first 1000
characters. Third, a follow-up commit will switch to the linear-space
variant.

Every successful match adds a temporary heap allocation. This feels
bad, but it's still dominated by the number of string allocations,
so there's not much reason to optimize that.

Besides, with the following patches, it's fast enough for me.

In src/diff.h we have an implementation of a similar algorithm, but
that one is optimized for matching strings with small differences.
It is also not guaranteed to find the optimal match.

Closes mawww#3806

[1]: as described in "An Improved Algorithm for Matching Biological Sequences", see
     https://courses.cs.duke.edu/spring21/compsci260/resources/AlignmentPapers/1982.gotoh.pdf
[2]: https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm#Affine
Our fuzzy-matching algorithm uses iswlower and towlower to implement
smartcase matching. Sadly, when using glibc these functions are quite
slow, even for ASCII inputs. With the new fuzzy-matching algorithm,
this shows up prominently in a CPU profile. We repeatedly call iswlower
and towlower on the query chars, so just cache the results in a
prepared query called "RankedMatchQuery", which allows us to implement
smartcase_eq() more efficiently.
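
A minimal sketch of the caching idea; the type and member names here are
illustrative, not the exact API:

```cpp
#include <cwctype>
#include <string>
#include <vector>

// Case-fold each query character once up front, instead of calling the
// (slow, on glibc) iswlower/towlower per candidate comparison. The
// candidate side still needs one towlower per comparison, but the
// query-side calls are hoisted out of the hot loop.
struct PreparedQuery
{
    explicit PreparedQuery(std::wstring query) : raw(std::move(query))
    {
        for (wchar_t c : raw)
        {
            folded.push_back((wchar_t)std::towlower((wint_t)c));
            is_lower.push_back(std::iswlower((wint_t)c) != 0);
        }
    }

    // Smartcase: a lowercase query char matches either case of the
    // candidate char; an uppercase query char matches only exactly.
    bool smartcase_eq(size_t i, wchar_t candidate_char) const
    {
        return is_lower[i]
             ? (wchar_t)std::towlower((wint_t)candidate_char) == folded[i]
             : candidate_char == raw[i];
    }

    std::wstring raw;
    std::vector<wchar_t> folded;
    std::vector<bool> is_lower;
};
```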

When matching query "clang" against 100k files in the LLVM monorepo
(of which 30k match), this commit takes us from 1.8 billion cycles
down to just 1.2 billion (the same as the old fuzzy-matching algorithm).

The implementation is a bit ugly because the RankedMatchQuery object
needs to outlive all uses of RankedMatch::operator<. I guess we
could create a type that contains both the query and the query
results to make this less ugly. We could use the same type to get
rid of the allocations in subsequence_match_scores(), though I don't
know if that matters.

A previous approach added a fast path for ASCII input to our wrappers
like is_lower and to_lower. This was not ideal because it doesn't
solve the problem for non-ASCII input.
We can compute the fuzzy-matching score without keeping the scores
of all prefixes in memory.
Currently we only use the full score matrix for debugging. In the
future I want to use it to recover the positions of matched characters
(which could be underlined in the UI).
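
A sketch of the linear-space idea, reusing the matched/gapped formulation
and illustrative constants from the earlier sketch: row i of the dynamic
program only reads row i-1, so two rows suffice when we want the score
but not the matched positions. The same two-row trick applies to the
affine-gap (Gotoh) recurrence, which just adds one more row pair.

```cpp
#include <algorithm>
#include <string>
#include <vector>

constexpr int match_bonus = 2, gap_penalty = 3, minus_inf = -1'000'000'000;

// Same scoring as the quadratic-space sketch, but keeping only the
// current and previous DP rows. Recovering matched positions would need
// the full matrix (or Hirschberg-style divide and conquer).
int optimal_subsequence_score(const std::string& query, const std::string& candidate)
{
    const int n = query.size(), m = candidate.size();
    if (n == 0)
        return 0;
    std::vector<int> matched(m + 1, minus_inf), gapped(m + 1, minus_inf);
    for (int i = 1; i <= n; ++i)
    {
        std::vector<int> next_matched(m + 1, minus_inf), next_gapped(m + 1, minus_inf);
        for (int j = 1; j <= m; ++j)
        {
            if (query[i-1] == candidate[j-1])
            {
                int prev = i == 1 ? 0 // free leading gap
                         : std::max(matched[j-1],
                                    gapped[j-1] == minus_inf
                                        ? minus_inf : gapped[j-1] - gap_penalty);
                if (prev != minus_inf)
                    next_matched[j] = prev + match_bonus;
            }
            next_gapped[j] = std::max(next_matched[j-1], next_gapped[j-1]);
        }
        matched = std::move(next_matched);
        gapped  = std::move(next_gapped);
    }
    return std::max(matched[m], gapped[m]);
}
```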
krobelus added a commit to krobelus/kakoune that referenced this pull request Nov 16, 2023
This enables the test case from
mawww#4653 (comment)
and a few others from that PR.
@krobelus (Contributor, Author)

Latest master works fine for almost all test cases; it seems to be good enough for my current use cases.

@krobelus closed this on Dec 28, 2023
@mawww (Owner) commented Jan 5, 2024

Thanks again for exploring this and pushing for fuzzy matching to be improved.

@krobelus deleted the better-rank-heuristic branch on October 27, 2024