
Use optimal subsequence match to rank completions #4653

Closed · krobelus wants to merge 7 commits from the better-rank-heuristic branch

Conversation

@krobelus (Contributor)

When using fuzzy completion with candidates from "git ls-files" in
a project with many files, I often get unexpected ordering even when
one candidate is clearly the best.
This usually happens when both candidates are the same kind of match
(typically just a subsequence match), in which case the leftmost
match wins (via max_index), which is often wrong.
There are also other matching heuristics, like FirstCharMatch, that
can lead to unexpected ordering by dominating everything else
(for example, with query "remcc" we rank "README.asciidoc" over
"src/remote.cc").

Fix these two issues by

  1. switching to a restricted variant[*] of the Needleman-Wunsch
    algorithm to find an optimal match instead of the leftmost one, and
  2. dropping some of the old heuristics in favor of ones based on the
    optimal match.

Optimality is defined by a new matching heuristic which favors longer
subsequences by adding a penalty to each new gap.
On top of this, when comparing two matches, we first compare by the
number of gaps. This ensures that short queries can prefer long
candidates over shorter (but worse) candidates. Alternatively, we
could simply increase the penalty; I didn't do that yet because I
haven't encountered a real-world use case where the proposed
heuristic fails.
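
To make the scoring concrete, here is a minimal self-contained sketch of
the idea, not the PR's actual code: the constants, names, and the plain
`==` comparison (no smartcase folding) are all illustrative. Each matched
character earns a bonus and each new gap between matched characters costs
a fixed penalty; the dynamic program maximizes over all subsequence
alignments instead of taking the leftmost one.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

constexpr int match_bonus = 2;              // illustrative value
constexpr int gap_penalty = 3;              // charged once per new gap
constexpr int minus_inf   = -1'000'000'000;

// Best score of query as a subsequence of candidate, or minus_inf if it
// is not a subsequence. Leading and trailing gaps are free in this sketch.
int optimal_subsequence_score(const std::string& query, const std::string& candidate)
{
    const int n = query.size(), m = candidate.size();
    if (n == 0)
        return 0;
    // matched[i][j]: best score with i query chars consumed, the i-th
    //                matched at candidate[j-1]
    // gapped[i][j]:  best score with i query chars consumed and
    //                candidate[j-1] skipped; the gap penalty is charged
    //                when the gap is closed by the next match
    auto matched = std::vector(n + 1, std::vector<int>(m + 1, minus_inf));
    auto gapped  = matched;
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= m; ++j)
        {
            if (query[i-1] == candidate[j-1])
            {
                int prev = i == 1 ? 0 // free leading gap: start anywhere
                         : std::max(matched[i-1][j-1],
                                    gapped[i-1][j-1] == minus_inf
                                        ? minus_inf : gapped[i-1][j-1] - gap_penalty);
                if (prev != minus_inf)
                    matched[i][j] = prev + match_bonus;
            }
            gapped[i][j] = std::max(matched[i][j-1], gapped[i][j-1]);
        }
    return std::max(matched[n][m], gapped[n][m]);
}

int main()
{
    // "remcc": one gap in "src/remote.cc" vs three gaps in "readme.asciidoc"
    std::printf("%d\n", optimal_subsequence_score("remcc", "src/remote.cc"));   // 10 - 3 = 7
    std::printf("%d\n", optimal_subsequence_score("remcc", "readme.asciidoc")); // 10 - 9 = 1
}
```

With this scoring, "src/remote.cc" beats a lowercased "README.asciidoc"
for query "remcc" because it needs fewer gaps, independent of where the
leftmost match starts.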

Some heuristics are kept because they are exercised by some test cases;
for example, "a higher number of matched word boundaries is better".
Since we now compute this heuristic based on an optimal match rather
than a greedy one, this metric should be more accurate too, though
we could probably achieve even better matches by including word
boundaries in the optimal-matching heuristic. Again, I didn't do
this because I don't have a concrete example to improve.

The algorithm needs quadratic time/space. I believe this should be
mostly okay because we expect small inputs.
Insert completion candidates are usually below max_word_len=50.
The exception is line completion (which is off in the default
completers). With abnormally long lines this can cause lag:
for example, with a line of 2 million characters and a query of 5,
<c-x>l takes about half a second. I think that's harmless, since
other things are also slow at that line length.

Of course, there's plenty of room to optimize; for example, we could
skip matching prefixes.

Every successful match adds a temporary heap allocation. This feels
bad, but it's still far fewer than the number of string allocations,
so I didn't optimize that yet.

Closes #3806

[*]: Since we require a (smartcase) subsequence match, we need no
deletions or mismatches, which reduces the search space.

Alternative solution: another fuzzy-matching algorithm with
public-domain implementations is described at
https://github.com/tajmone/fuzzy-search/tree/8a0e231f756cccd20f7c57e9593df08fd2bb79ef
but I didn't really look into it.

@krobelus changed the title from "[better-rank-heuristic] Use optimal subsequence match to rank completions" to "Use optimal subsequence match to rank completions" on Jun 26, 2022
@Screwtapello (Contributor)

Does this new ranking algorithm affect #1709 (comment)?

@krobelus (Contributor, Author)

> Does this new ranking algorithm affect #1709 (comment)?

No, we still sort all completions.

I wonder what the best solution is; maybe a static-option=<option-name> is enough, or perhaps a user-defined sorting function.
Maybe a priority field in the completion options, but it's not clear how the priority should interact with other things in RankedMatch::operator<.
Custom filtering is a similar issue, although less important.

@mawww (Owner) commented Jun 30, 2022

That's a big one, I need to take some time to review, but I am happy to improve the fuzzy matching algorithm. I'll try to dedicate some time to go through that soon.

@krobelus (Contributor, Author)

Needleman-Wunsch is a nice and simple dynamic-programming algorithm, but it took me a while to get this adaptation right.
I'll try to add some more documentation on the weekend.

I realized that there is a small but noticeable slowdown. I'm testing with

define-command git-edit -params 1.. %{ edit -existing -- %arg{@} }
complete-command git-edit shell-script-candidates %{ git ls-files }

on the llvm-project monorepo (~100k files). When typing :git-edit clang
I can definitely tell that it's slower to update.

I don't think it's a huge deal but snappier behavior would feel way better.

Somehow fzf is much faster (also faster than skim). I'm not sure how much peeking at non-public-domain code is okay.

Wikipedia mentions Hirschberg's algorithm and the Method of Four Russians as improvements.
I'll try them but probably not in this PR because they make the implementation more complex.

@krobelus force-pushed the better-rank-heuristic branch 6 times, most recently from 33a9ef4 to 16af089, on July 8, 2022 13:12
@krobelus (Contributor, Author) commented Jul 8, 2022

I'm fairly happy with this now, though the heuristic could still be fine-tuned.
The last couple of commits are experimental and not so important; I can drop them.

@krobelus force-pushed the better-rank-heuristic branch 2 times, most recently from 947678a to e574fcb, on July 9, 2022 15:47
@krobelus (Contributor, Author) commented Jul 9, 2022

Fixed a regression where we failed to prioritize exact matches over smartcase matches (for some reason other fuzzy finders don't do this).

@krobelus (Contributor, Author) commented Jul 9, 2022

Recent versions also optimize the number of gaps and matched word boundaries.
I found some weird orderings which made me realize that the gaps/word-boundaries metrics need a slightly different algorithm; I'll work on that.

@krobelus (Contributor, Author)

Yeah, given query "ch", this change sorts "chgrp" before "chattr" because the former is shorter.
If the query is empty we fall back to lexicographic order, which is the other way around here.

@krobelus (Contributor, Author) commented Dec 24, 2022

> Is that the weight Screwtapello was referring to?

No, this PR makes the fuzzy-match scoring algorithm more robust and general.
The weight would be a way to let users influence the scoring algorithm; that's #4813.

> After the pull request, matching and sorting them would yield (in that order):
>
> 12.txt
> 101.txt

This exhibits the same behavior as the "chgrp"/"chattr" case.

  1. If the query is empty, the scores of "12.txt" and "101.txt" are equal, so we fall back to lexicographic order.
  2. If the query is non-empty, say "1":
    • without this PR, the scores of "12.txt" and "101.txt" are equal, so we still use lexicographic order.
    • with this PR, "12.txt" gets a better score than "101.txt" because the former is shorter.

This change is a consequence of the new fuzzy scoring algorithm:
every mismatched character adds a small penalty, hence the ordering.
Both fzf and fzy have the same behavior (an empty query means lexicographic order, while "1" picks the shorter candidate).

I think this behavior is good in general, but I have no strong opinion about this specific case where the query is a prefix of the candidate.
If we want to change this case, the way to do so would be to add a preprocessing step that removes matching prefixes (sketched below).
Then "match 1 against 101.txt and 12.txt" becomes "match the empty string against 01.txt and 2.txt", which would use lexicographic order as before.

@raiguard (Contributor) commented Sep 28, 2023

This algorithm is undoubtedly an improvement, but for file finding it still isn't as good as fzf. In my use case it keeps ranking the test file ahead of the src file.

Kakoune (bad): [screenshot]

fzf (good): [screenshot]

Note that fzf's best match is at the bottom while Kakoune's is at the top. Sorry for the visual confusion.

Here is the prompt command I am using in kak:

# Go to file (basic fuzzy finder)
prompt -shell-script-candidates %{ fd --type f --hidden } file: %{ edit %val{text} }

@krobelus (Contributor, Author)

@raiguard thanks for the test case, I have tweaked the heuristic to make it work.

FWIW, for buffer completion it already worked this way (even without this PR) because Kakoune knows that the completions are file paths. For those, we prioritize matches on the basenames (see 2df7b1f (In buffer name completion, give priority to filename (not dirname) matches, 2016-03-02)).
For your prompt command, we don't know whether the input is a file path or some other token, so it seems wrong to prioritize basenames.
What I did was change the word-boundary bonus so that matching "H" against "/Here" is worth twice as much as matching it against "OverHere".
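
For illustration, the bonus could look something like this simplified
stand-in (the PR's actual constants and boundary classification differ):

```cpp
#include <cctype>
#include <string_view>

// Simplified stand-in for the tweaked word-boundary bonus: a match at the
// start of a path component ("/Here") is worth twice as much as a match
// at a CamelCase or punctuation boundary ("OverHere", "over_here").
int word_boundary_bonus(std::string_view candidate, size_t i)
{
    if (i == 0 or candidate[i-1] == '/')
        return 4; // start of string or of a path component
    unsigned char prev = candidate[i-1], cur = candidate[i];
    if (std::islower(prev) and std::isupper(cur))
        return 2; // CamelCase boundary
    if (!std::isalnum(prev) and std::isalnum(cur))
        return 2; // after punctuation such as '_' or '.'
    return 0;     // mid-word match
}
```

With this shape, matching "H" in "/Here" scores 4 while the "H" in
"OverHere" scores 2, which is the two-to-one ratio described above.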

…tart

This implies a bunch of renamings, but the following commits will
replace count_word_boundaries_match() anyway, so I'll do that on the fly.

The only functional change in this patch is that something like "/"
at the start of a string no longer counts as a word start. That should
hardly matter.

I broke most of these while working on the new implementation.
Some are redundant though.

We use candidate/query in the constructor. The next patch adds the
same names in the subsequence scoring function. Let's use the same
terminology in the subsequence matching function.

While at it, rename the subsequence matching function to prepare
for adding an optimal subsequence matching.

This allows using CharCount in operator[] instead of casting to size_t
every time. I'm afraid that some other operations like end() are
not supported yet with types like CharCount; for some reason adding
a cast there didn't work. I got an error about not being able to
convert CharCount to size_t.
@krobelus force-pushed the better-rank-heuristic branch 2 times, most recently from 8cd981b to fd0c36f, on November 12, 2023 08:24
When using fuzzy completion with candidates from "git ls-files" in
a project with many files, I often get unexpected ordering even when
one candidate is clearly the best.
This usually happens when both candidates are the same kind of match
(typically just a subsequence match), in which case the leftmost
match wins (via max_index), which is often wrong.
There are also other matching heuristics, like FirstCharMatch, that
can lead to unexpected ordering by dominating everything else
(for example, with query "remcc" we rank "README.asciidoc" over
"src/remote.cc").

Fix these two issues by
1. switching to the Gotoh algorithm[1], a refinement of the
   Smith-Waterman algorithm that is optimized for affine gap
   penalties[2]. This makes us find an optimal match instead of
   the leftmost one.
2. dropping old heuristics that are obsoleted by the new one.

Optimality is defined by a new distance heuristic which favors longer
matching subsequences (as opposed to multiple shorter ones) by adding
a penalty to each new gap between matches.
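
Schematically, with a match bonus s, gap-open penalty o and
gap-extension penalty e (symbols are illustrative, not the actual
constants; boundary handling is omitted), the restricted recurrence is:

```latex
% M_{i,j}: best score with query char i matched at candidate char j
% G_{i,j}: best score with candidate char j skipped (inside a gap)
% A gap of length L between matches costs o + (L-1)e (affine).
M_{i,j} = \begin{cases}
            \max(M_{i-1,j-1},\; G_{i-1,j-1}) + s & \text{if } q_i \text{ matches } c_j \\
            -\infty                              & \text{otherwise}
          \end{cases}
\qquad
G_{i,j} = \max(M_{i,j-1} - o,\; G_{i,j-1} - e)
```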

For most of our RankedMatch test cases, we match the behavior of
other popular fuzzy matchers.

The algorithm needs quadratic time/space, but that's okay, since
our input is expected to be small. First, candidates for insert-mode
completion are usually below max_word_len=50. Second, if there's
ever a very large candidate, we only match against its first 1000
characters. Third, a follow-up commit will switch to the linear-space
variant.

Every successful match adds a temporary heap allocation. This feels
bad, but it's still dominated by the number of string allocations,
so there's not much reason to optimize that.

Besides, with the following patches, it's fast enough for me.

In src/diff.h we have an implementation of a similar algorithm, but
that one is optimized for matching strings with small differences.
It is also not guaranteed to find the optimal match.

Closes mawww#3806

[1]: as described in "An Improved Algorithm for Matching Biological Sequences", see
     https://courses.cs.duke.edu/spring21/compsci260/resources/AlignmentPapers/1982.gotoh.pdf
[2]: https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm#Affine
Our fuzzy-matching algorithm uses iswlower and towlower to implement
smartcase matching. Sadly, when using glibc these functions are quite
slow, even for ASCII inputs. With the new fuzzy-matching algorithm,
this shows up prominently in a CPU profile. We repeatedly call iswlower
and towlower on the query chars, so just cache the results in a
prepared query called "RankedMatchQuery", which allows us to implement
smartcase_eq() more efficiently.
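
A minimal sketch of the caching idea; the type and member names here are
illustrative, not the exact API:

```cpp
#include <cwctype>
#include <string>
#include <vector>

// Case-fold each query character once up front, instead of calling the
// (slow, on glibc) iswlower/towlower per candidate comparison. The
// candidate side still needs one towlower per comparison, but the
// query-side calls are hoisted out of the hot loop.
struct PreparedQuery
{
    explicit PreparedQuery(std::wstring query) : raw(std::move(query))
    {
        for (wchar_t c : raw)
        {
            folded.push_back((wchar_t)std::towlower((wint_t)c));
            is_lower.push_back(std::iswlower((wint_t)c) != 0);
        }
    }

    // Smartcase: a lowercase query char matches either case of the
    // candidate char; an uppercase query char matches only exactly.
    bool smartcase_eq(size_t i, wchar_t candidate_char) const
    {
        return is_lower[i]
             ? (wchar_t)std::towlower((wint_t)candidate_char) == folded[i]
             : candidate_char == raw[i];
    }

    std::wstring raw;
    std::vector<wchar_t> folded;
    std::vector<bool> is_lower;
};
```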

When matching query "clang" against 100k files in the LLVM monorepo
(of which 30k match), this commit takes us from 1.8 billion cycles
down to just 1.2 billion (the same as the old fuzzy-matching algorithm).

The implementation is a bit ugly because the RankedMatchQuery object
needs to outlive all uses of RankedMatch::operator<. I guess we
could create a type that contains both the query and the query
results to make this less ugly. We could use the same type to get
rid of the allocations in subsequence_match_scores(), though I don't
know if that matters.

A previous approach added a fast path for ASCII input to our wrappers
like is_lower and to_lower. This was not ideal because it doesn't
solve the problem for non-ASCII input.
We can compute the fuzzy-matching score without keeping the scores
of all prefixes in memory.
Currently we only use the full score matrix for debugging. In the
future I want to use it to recover the positions of matched characters
(which could be underlined in the UI).
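
A sketch of the linear-space idea, reusing the matched/gapped formulation
and illustrative constants from the earlier sketch: row i of the dynamic
program only reads row i-1, so two rows suffice when we want the score
but not the matched positions. The same two-row trick applies to the
affine-gap (Gotoh) recurrence, which just adds one more row pair.

```cpp
#include <algorithm>
#include <string>
#include <vector>

constexpr int match_bonus = 2, gap_penalty = 3, minus_inf = -1'000'000'000;

// Same scoring as the quadratic-space sketch, but keeping only the
// current and previous DP rows. Recovering matched positions would need
// the full matrix (or Hirschberg-style divide and conquer).
int optimal_subsequence_score(const std::string& query, const std::string& candidate)
{
    const int n = query.size(), m = candidate.size();
    if (n == 0)
        return 0;
    std::vector<int> matched(m + 1, minus_inf), gapped(m + 1, minus_inf);
    for (int i = 1; i <= n; ++i)
    {
        std::vector<int> next_matched(m + 1, minus_inf), next_gapped(m + 1, minus_inf);
        for (int j = 1; j <= m; ++j)
        {
            if (query[i-1] == candidate[j-1])
            {
                int prev = i == 1 ? 0 // free leading gap
                         : std::max(matched[j-1],
                                    gapped[j-1] == minus_inf
                                        ? minus_inf : gapped[j-1] - gap_penalty);
                if (prev != minus_inf)
                    next_matched[j] = prev + match_bonus;
            }
            next_gapped[j] = std::max(next_matched[j-1], next_gapped[j-1]);
        }
        matched = std::move(next_matched);
        gapped  = std::move(next_gapped);
    }
    return std::max(matched[m], gapped[m]);
}
```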
krobelus added a commit to krobelus/kakoune that referenced this pull request Nov 16, 2023
This enables the test case from
mawww#4653 (comment)
and a few others from that PR.
@krobelus (Contributor, Author)

Latest master works fine for almost all test cases; it seems to be good enough for my current use cases.

@krobelus closed this on Dec 28, 2023
@mawww (Owner) commented Jan 5, 2024

Thanks again for exploring this and pushing for fuzzy matching to be improved.

@krobelus deleted the better-rank-heuristic branch on October 27, 2024