Use optimal subsequence match to rank completions #4653
Conversation
Does this new ranking algorithm affect #1709 (comment)?
No, we still sort all completions. I wonder what the best solution is, maybe a …
That's a big one, I need to take some time to review, but I am happy to improve the fuzzy matching algorithm. I'll try to dedicate some time to go through that soon.
Needleman-Wunsch is a nice and simple dynamic-programming algorithm but it took me a while to get this adaptation right. I realized that there is a small but noticeable slowdown. I'm testing with
on the llvm-project monorepo (~100k files). When typing, I don't think it's a huge deal but snappier behavior would feel way better. Somehow fzf is much faster (also faster than skim). I'm not sure how much peeking at non-public-domain code is okay. Wikipedia mentions Hirschberg's algorithm and the Method of Four Russians as improvements.
Force-pushed from 33a9ef4 to 16af089
I'm fairly happy with this now, though the heuristic could still be fine-tuned.
Force-pushed from 947678a to e574fcb
Fixed a regression where we failed to prioritize exact matches over smartcase matches (for some reason other fuzzy finders don't do this).
Recent versions also optimize the number of gaps and matched word boundaries.
Force-pushed from ce7e44e to 88356a8
Force-pushed from 88356a8 to 1e5e378
Yeah, given query …
No, this PR makes the fuzzy match scoring algorithm more robust and general.
It exhibits the same behavior as the "chgrp"/"chattr" case.
This change is a consequence of the new fuzzy scoring algorithm. I think this behavior is good in general but I have no strong opinion about this specific case where the query is a prefix of the candidate.
Force-pushed from 1e5e378 to c795e78
Force-pushed from c795e78 to f11236a
Force-pushed from f11236a to f4f8717
@raiguard thanks for the test case, I have tweaked the heuristic to make it work. FWIW for buffer type completion it already worked this way (also without this PR) because Kakoune knows that the completions are file paths. For those we prioritize matches on the basenames (see 2df7b1f (In buffer name completion, give priority to filename (not dirname) matches, 2016-03-02)).
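As a rough illustration of that basename priority (this is not the code from 2df7b1f; `basename_of`, `is_subsequence`, and `ranks_before` are invented names for the sketch), a path-aware ranking could first check whether the query matches inside the basename:

```cpp
#include <string_view>

// Hypothetical sketch: when candidates are known to be file paths, rank
// candidates whose basename contains the query (as a subsequence) above
// candidates that only match somewhere in the dirname.
static std::string_view basename_of(std::string_view path)
{
    auto slash = path.rfind('/');
    return slash == std::string_view::npos ? path : path.substr(slash + 1);
}

// True if every query character appears in str, in order.
static bool is_subsequence(std::string_view query, std::string_view str)
{
    size_t i = 0;
    for (char c : str)
        if (i < query.size() and c == query[i])
            ++i;
    return i == query.size();
}

static bool ranks_before(std::string_view query,
                         std::string_view lhs, std::string_view rhs)
{
    const bool lhs_base = is_subsequence(query, basename_of(lhs));
    const bool rhs_base = is_subsequence(query, basename_of(rhs));
    if (lhs_base != rhs_base)
        return lhs_base;   // basename match wins over dirname-only match
    return lhs < rhs;      // placeholder tie-break, not the real scoring
}
```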
…tart
This implies a bunch of renamings, but the following commits will replace count_word_boundaries_match() anyway, so I'll do that on the fly. The only functional change in this patch is that something like "/" at the start of a string no longer counts as a word start. That should hardly matter.
I broke most of these while working on the new implementation. Some are redundant though.
We use candidate/query in the constructor, and the next patch adds the same names in the subsequence scoring function, so let's use the same terminology in the subsequence matching function. While at it, rename the subsequence matching function to prepare for adding an optimal subsequence match.
This allows using CharCount in operator[] instead of casting to size_t every time. I'm afraid that some other operations like "end()" are not supported yet with types like CharCount; for some reason adding a cast there didn't work. I got some error about not being able to convert CharCount to size_t.
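A minimal sketch of the idea, with a made-up strong index type standing in for Kakoune's CharCount and a made-up wrapper (not the actual containers touched by this patch):

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-in for Kakoune's CharCount strong typedef.
struct CharCount
{
    int value;
    explicit operator size_t() const { return static_cast<size_t>(value); }
};

// A wrapper whose operator[] takes the strong index directly, so call sites
// no longer cast to size_t at every access. As the commit message notes,
// iterator-based operations like end() still work on plain positions.
template<typename T>
struct CharIndexedVector
{
    std::vector<T> data;

    T& operator[](CharCount i) { return data[static_cast<size_t>(i)]; }
    const T& operator[](CharCount i) const { return data[static_cast<size_t>(i)]; }

    auto begin() { return data.begin(); }
    auto end() { return data.end(); }
};
```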
Force-pushed from 8cd981b to fd0c36f
When using fuzzy completion with candidates from "git ls-files" in a project with many files, I often get unexpected ordering even when one candidate is clearly the best. This usually happens if both candidates are the same kind of match (typically only a subsequence match), which means that the leftmost match wins (via max_index), which is often wrong. Also, there are some other matching heuristics like FirstCharMatch that can lead to unexpected ordering by dominating everything else (for example, with query "remcc" we rank "README.asciidoc" over "src/remote.cc").

Fix these two issues by
1. switching to the Gotoh algorithm[1], a refinement of the Smith-Waterman algorithm that is optimized for affine gap penalties[2]. This makes us find an optimal match instead of the leftmost one.
2. dropping old heuristics that are obsoleted by the new one.

Optimality is defined by a new distance heuristic which favors longer matching subsequences (as opposed to multiple shorter ones) by adding a penalty for each new gap between matches. For most of our RankedMatch test cases, we match the behavior of other popular fuzzy matchers.

The algorithm needs quadratic time and space, but that's okay since our input is expected to be small. First, candidates for insert-mode completion are usually below max_word_len=50. Second, if there's ever a very large input candidate, we only match against the first 1000 characters. Third, a follow-up commit will switch to the linear-space variant.

Every successful match adds a temporary heap allocation. This feels bad, but it's still dominated by the number of string allocations, so there's not much reason to optimize that. Besides, with the following patches, it's fast enough for me.

In src/diff.h we have an implementation of a similar algorithm, but that one is optimized for matching strings with small differences. It is also not guaranteed to find the optimal match.

Closes mawww#3806

[1]: as described in "An Improved Algorithm for Matching Biological Sequences", see https://courses.cs.duke.edu/spring21/compsci260/resources/AlignmentPapers/1982.gotoh.pdf
[2]: https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm#Affine
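A minimal sketch of the scoring idea, not Kakoune's implementation: an affine-gap subsequence scorer in the spirit of Gotoh, where every query character must match in order and opening a new gap costs more than extending one. The constants are made up, and the plain `==` stands in for the smartcase comparison and word-boundary bonuses of the actual patch.

```cpp
#include <algorithm>
#include <string_view>
#include <vector>

// Sketch only: quadratic time/space affine-gap subsequence scoring.
// matched[i][j]: best score with query[i-1] matched at candidate[j-1].
// gapped[i][j]:  best score with query[0..i) fully matched and candidate[j-1]
//                sitting inside an open gap after the last matched character.
int subsequence_score(std::string_view query, std::string_view candidate)
{
    constexpr int match_bonus = 16, gap_open = 5, gap_extend = 1, minus_inf = -1'000'000;
    if (query.empty())
        return 0;
    auto sub = [](int s, int penalty) { return s == minus_inf ? minus_inf : s - penalty; };

    const size_t m = query.size(), n = candidate.size();
    std::vector<std::vector<int>> matched(m + 1, std::vector<int>(n + 1, minus_inf));
    std::vector<std::vector<int>> gapped = matched;

    for (size_t i = 1; i <= m; ++i)
    {
        for (size_t j = 1; j <= n; ++j)
        {
            if (query[i-1] == candidate[j-1]) // smartcase comparison elided
            {
                // Unmatched candidate prefix before the first match is free.
                const int prev = i == 1 ? 0 : std::max(matched[i-1][j-1], gapped[i-1][j-1]);
                if (prev != minus_inf)
                    matched[i][j] = prev + match_bonus;
            }
            // Either open a new gap right after a match, or extend the current one.
            gapped[i][j] = std::max(sub(matched[i][j-1], gap_open),
                                    sub(gapped[i][j-1], gap_extend));
        }
    }
    int best = minus_inf; // unmatched candidate suffix is free as well
    for (size_t j = 1; j <= n; ++j)
        best = std::max(best, matched[m][j]);
    return best; // minus_inf means the query is not a subsequence of the candidate
}
```

Under this scoring, query "remcc" needs only one gap in "src/remote.cc" but three in "README.asciidoc", so the contiguous match wins, which is the ordering the commit message asks for.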
Our fuzzy-matching algorithm uses iswlower and towlower to implement smartcase matching. Sadly, when using glibc these functions are quite slow, even for ASCII inputs. With the new fuzzy-matching algorithm, this shows up prominently in a CPU profile. We repeatedly call iswlower and towlower on the query chars, so just cache them in a prepared query called "RankedMatchQuery", which allows implementing smartcase_eq() more efficiently.

When matching query "clang" against 100k files in the LLVM monorepo (of which 30k match), this commit makes us go from 1.8 billion cycles down to just 1.2 billion (the same as the old fuzzy-matching algorithm).

The implementation is a bit ugly because the RankedMatchQuery object needs to outlive all uses of RankedMatch::operator<. I guess we could try to create a type that contains both the query and the query results to make this less ugly. We could use the same type to get rid of the allocations in subsequence_match_scores(), though I don't know if that matters.

A previous approach added a fast path for ASCII input to our wrappers like is_lower and to_lower. This was not ideal because it doesn't solve the problem for non-ASCII input.
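A rough sketch of that prepared-query idea; RankedMatchQuery is the name from the commit message, but the fields and the smartcase_eq signature below are guesses rather than the actual interface:

```cpp
#include <cwctype>
#include <string_view>
#include <vector>

// Sketch only: cache iswlower/towlower results for the query once, so each
// candidate comparison avoids re-lowering the query characters.
struct RankedMatchQuery
{
    std::wstring_view input;        // must outlive the query object
    std::vector<wchar_t> lowered;   // std::towlower of each query char
    std::vector<bool> lower;        // std::iswlower of each query char

    explicit RankedMatchQuery(std::wstring_view query) : input(query)
    {
        lowered.reserve(query.size());
        lower.reserve(query.size());
        for (wchar_t c : query)
        {
            lowered.push_back(static_cast<wchar_t>(std::towlower(c)));
            lower.push_back(std::iswlower(c) != 0);
        }
    }
};

// Smartcase: a lowercase query character matches either case of the candidate
// character, an uppercase query character matches only exactly.
inline bool smartcase_eq(const RankedMatchQuery& q, size_t i, wchar_t candidate_char)
{
    if (q.lower[i])
        return q.lowered[i] == static_cast<wchar_t>(std::towlower(candidate_char));
    return q.input[i] == candidate_char;
}
```

The candidate side still pays one towlower call per compared character; the win the commit measures comes from no longer re-lowering the query characters for every candidate.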
We can compute the fuzzy matching score without keeping the scores of all prefixes in memory. Currently we only use the full score matrix for debugging. In the future I want to use it to get the positions of matched characters (which could be underlined in the UI).
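Assuming the same recurrence as the earlier affine-gap sketch, row i only reads row i-1, so two rows suffice when only the final score is needed (the full matrix would still be required to recover matched character positions). This is a hedged illustration of the linear-space idea, not the actual follow-up patch:

```cpp
#include <algorithm>
#include <string_view>
#include <vector>

// Sketch only: linear-space variant of the affine-gap subsequence scoring,
// keeping just the previous and current rows of the matched/gapped tables.
int subsequence_score_linear_space(std::string_view query, std::string_view candidate)
{
    constexpr int match_bonus = 16, gap_open = 5, gap_extend = 1, minus_inf = -1'000'000;
    if (query.empty())
        return 0;
    auto sub = [](int s, int penalty) { return s == minus_inf ? minus_inf : s - penalty; };

    const size_t n = candidate.size();
    std::vector<int> prev_matched(n + 1, minus_inf), prev_gapped(n + 1, minus_inf);
    std::vector<int> cur_matched(n + 1, minus_inf), cur_gapped(n + 1, minus_inf);

    for (size_t i = 1; i <= query.size(); ++i)
    {
        std::fill(cur_matched.begin(), cur_matched.end(), minus_inf);
        std::fill(cur_gapped.begin(), cur_gapped.end(), minus_inf);
        for (size_t j = 1; j <= n; ++j)
        {
            if (query[i-1] == candidate[j-1]) // smartcase comparison elided
            {
                const int prev = i == 1 ? 0 : std::max(prev_matched[j-1], prev_gapped[j-1]);
                if (prev != minus_inf)
                    cur_matched[j] = prev + match_bonus;
            }
            cur_gapped[j] = std::max(sub(cur_matched[j-1], gap_open),
                                     sub(cur_gapped[j-1], gap_extend));
        }
        prev_matched.swap(cur_matched);
        prev_gapped.swap(cur_gapped);
    }
    return *std::max_element(prev_matched.begin(), prev_matched.end());
}
```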
Force-pushed from fd0c36f to 02f8757
This enables the test case from mawww#4653 (comment) and a few others from that PR.
Latest master works fine for almost all test cases; it seems to be good enough for my current use cases.
Thanks again for exploring this and pushing for fuzzy matching to be improved.
When using fuzzy completion with candidates from "git ls-files" in
a project with many files I often get unexpected ordering even when
one candidate is clearly the best.
This usually happens if both candidates are the same kind of match
(typically only a subsequence match), which means that the leftmost
match wins (via max_index), which is often wrong.
Also there are some other matching heuristics like FirstCharMatch
that can lead to unexpected ordering by dominating everything else
(for example with query "remcc" we rank "README.asciidoc" over
"src/remote.cc").
Fix these two issues by
1. switching to the Gotoh algorithm[*] with affine gap penalties, a
   well-known algorithm to find an optimal match instead of the leftmost one.
2. dropping old heuristics that are obsoleted by the new optimal match.
Optimality is defined by a new matching heuristic which favors longer
subsequences by adding a penalty to each new gap.
On top of this, when comparing two matches, we first compare by the
number of gaps. This makes sure that short queries can prefer long
candidates over short (but worse) candidates. Alternatively, we could
just increase the penalty. I didn't do that yet because I didn't
encounter a real-world use case where the proposed heuristic fails.
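A tiny sketch of that ordering (field names invented; the real comparison lives in RankedMatch::operator<): the gap count is compared before the distance score, so a long candidate with one contiguous match beats a short candidate that needs several gaps:

```cpp
#include <tuple>

// Illustrative only, not Kakoune's actual fields.
struct MatchScore
{
    int gap_count;   // gaps between matched runs; fewer is better
    int distance;    // affine-gap distance; lower is better
};

inline bool better_than(const MatchScore& lhs, const MatchScore& rhs)
{
    // Compare by number of gaps first, then by the distance heuristic.
    return std::tie(lhs.gap_count, lhs.distance) < std::tie(rhs.gap_count, rhs.distance);
}
```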
Some heuristics are kept because they enable some test cases;
for example "higher number of matched word boundaries is better".
Since we now compute this heuristic based on an optimal match rather
than a greedy match, this metric should be more accurate too -
though we could probably achieve even better matches by including
word boundaries in the optimal-matching heuristic. Again I didn't
do this because I don't have a concrete example to improve.
The algorithm needs quadratic time/space. I believe this should be
mostly okay because we expect small inputs.
Insert completion candidates are usually below max_word_len=50.
The exception is line completion (which is off in the default
completers). With lines that are abnormally long, this can cause lag.
For example, if I have a line with 2 million characters and a query of
5 characters, then <c-x>l takes about half a second. I think that's no
harm since other things are also slow with lines of that length.
Of course, there's plenty of room to optimize, for example we can
skip matching prefixes.
Every successful match adds a temporary heap allocation. This feels
bad but it's still much lower than the number of string allocations,
so I didn't optimize that yet.
Closes #3806
[*]: Since we require a (smartcase) subsequence match, we need no
deletions or mismatches, which reduces the search space.
Alternative solution: another fuzzy-matching algorithm
with public domain implementations is described in
https://github.com/tajmone/fuzzy-search/tree/8a0e231f756cccd20f7c57e9593df08fd2bb79ef
but I didn't really look into it.