Use optimal subsequence match to rank completions #4653

…tart This implies a bunch of renamings but the following commits will replace count_word_boundaries_match() anyway, so I'll do that on-the-fly. The only functional change in this patch is that something like "/" at the start of a string no longer counts as word start. That should hardly matter.

I broke most of these while working on the new implementation. Some are redundant though.

We use candidate/query in the constructor. The next patch adds the same names in the subsequence scoring function. Let's use the same terminology in the subsequence matching function. While at it, rename the subsequence matching function to prepare for adding an optimal subsequence matching.

This allows using CharCount in operator[] instead of casting to size_t every time. I'm afraid that some other operations like "end()" are not supported yet with types like CharCount; for some reason adding a cast there didn't work. I got some error about not being able to convert CharCount to size_t.

When using fuzzy completion with candidates from "git ls-files" in a project with many files, I often get unexpected ordering even when one candidate is clearly the best. This usually happens if both candidates are the same kind of match (typically only a subsequence match), which means that the leftmost match wins (via max_index), which is often wrong. Also, some other matching heuristics like FirstCharMatch can lead to unexpected ordering by dominating everything else (for example, with query "remcc" we rank "README.asciidoc" over "src/remote.cc").

Fix these two issues by

1. switching to the Gotoh algorithm[1], a refinement of the Smith-Waterman algorithm that is optimized for affine gap penalties[2]. This makes us find an optimal match instead of the leftmost one.
2. dropping old heuristics that are obsoleted by the new one.

Optimality is defined by a new distance heuristic which favors longer matching subsequences (as opposed to multiple shorter ones) by adding a penalty to each new gap between matches.

For most of our RankedMatch test cases, we match the behavior of other popular fuzzy matchers.

The algorithm needs quadratic time/space but that's okay, since our input is expected to be small. First, candidates for insert-mode completion are usually below max_word_len=50. Second, if there's ever a very large input candidate, we only match against the first 1000 characters. Third, a follow-up commit will switch to the linear space variant.

Every successful match adds a temporary heap allocation. This feels bad but it's still dominated by the number of string allocations, so there's not much reason to optimize that. Besides, with the following patches, it's fast enough for me.

In src/diff.h we have an implementation of a similar algorithm, but that one is optimized for matching strings with small differences. It is also not guaranteed to find the optimal match.

Closes mawww#3806

[1]: as described in "An Improved Algorithm for Matching Biological Sequences", see https://courses.cs.duke.edu/spring21/compsci260/resources/AlignmentPapers/1982.gotoh.pdf
[2]: https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm#Affine
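
To illustrate the idea of an optimal match with affine gap penalties, here is a minimal sketch of a quadratic dynamic program in the spirit of the Gotoh algorithm. It is not the code from this patch: the function name subsequence_distance, the penalty values gap_open and gap_extend, the case-sensitive comparison, and the choice to leave skipped characters before the first match and after the last match unpenalized are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical penalties, chosen only for this illustration.
constexpr int gap_open = 3;    // paid once for each new gap between matches
constexpr int gap_extend = 1;  // paid for every skipped candidate character in a gap
constexpr int infinity = std::numeric_limits<int>::max() / 2;

// Returns the minimal distance of `query` matched as a subsequence of
// `candidate`, or nullopt if `query` is not a subsequence at all.
std::optional<int> subsequence_distance(std::string_view candidate, std::string_view query)
{
    const std::size_t n = query.size(), m = candidate.size();
    if (n == 0)
        return 0;
    // match[i][j]: best cost where query[i-1] is matched to candidate[j-1].
    // gap[i][j]:   best cost where query[0..i) is already matched and
    //              candidate[j-1] is skipped as part of a gap.
    std::vector<std::vector<int>> match(n + 1, std::vector<int>(m + 1, infinity));
    std::vector<std::vector<int>> gap(n + 1, std::vector<int>(m + 1, infinity));
    for (std::size_t i = 1; i <= n; ++i)
    {
        for (std::size_t j = 1; j <= m; ++j)
        {
            // Either open a new gap right after a match, or extend an existing gap.
            gap[i][j] = std::min({match[i][j - 1] + gap_open + gap_extend,
                                  gap[i][j - 1] + gap_extend,
                                  infinity});
            if (query[i - 1] != candidate[j - 1])
                continue;
            if (i == 1)
                match[i][j] = 0;  // assumption: characters before the first match are free
            else
                match[i][j] = std::min(match[i - 1][j - 1], gap[i - 1][j - 1]);
        }
    }
    // Assumption: characters after the last match are free as well.
    const int best = *std::min_element(match[n].begin(), match[n].end());
    if (best >= infinity)
        return std::nullopt;
    return best;
}
```

Under these assumed penalties, the query "remcc" gets distance 7 against "src/remote.cc" (one gap of four characters) and 19 against the lower-cased "readme.asciidoc" (three gaps), so the candidate with the longer contiguous runs ranks first.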

Our fuzzy-matching algorithm uses iswlower and towlower to implement smartcase matching. Sadly, when using glibc these functions are quite slow, even for ASCII inputs. With the new fuzzy-matching algorithm, this shows up prominently in a CPU profile.

We repeatedly call iswlower and towlower on the same query chars, so just cache the results in a prepared query called "RankedMatchQuery", which allows implementing smartcase_eq() more efficiently.

When matching query "clang" against 100k files in the LLVM monorepo (of which 30k match), this commit makes us go from 1.8 billion cycles to just 1.2 billion (same as the old fuzzy-matching algorithm).

The implementation is a bit ugly because the RankedMatchQuery object needs to outlive all uses of RankedMatch::operator<. I guess we could try to create a type that contains both the query and the query results to make this less ugly. We could use the same type to get rid of the allocations in subsequence_match_scores(), though I don't know if that matters.

A previous approach added a fast path for ASCII input to our wrappers like is_lower and to_lower. This was not ideal because it doesn't solve the problem for non-ASCII input.
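
The caching idea can be sketched roughly as follows. This is not the actual RankedMatchQuery layout: the type PreparedQuery, its members, and the signature of smartcase_eq here are hypothetical, and the sketch assumes one particular way of caching (storing the upper-case alternative of each lower-case query character) so that comparing against a candidate character needs no wide-character classification calls at all.

```cpp
#include <cassert>
#include <cstddef>
#include <cwctype>
#include <string>
#include <string_view>

// Hypothetical stand-in for a prepared query: all calls to iswlower and
// towupper happen once per query character, at construction time.
struct PreparedQuery
{
    std::wstring chars;       // the query as typed
    std::wstring alternates;  // upper-case form of lower-case chars, else the char itself

    explicit PreparedQuery(std::wstring_view query)
        : chars(query), alternates(query)
    {
        for (std::size_t i = 0; i < chars.size(); ++i)
        {
            const auto c = static_cast<std::wint_t>(chars[i]);
            if (std::iswlower(c))
                alternates[i] = static_cast<wchar_t>(std::towupper(c));
        }
    }
};

// Smartcase comparison: a lower-case query char matches both cases of the
// candidate char, while any other query char only matches exactly.
bool smartcase_eq(wchar_t candidate_char, const PreparedQuery& query, std::size_t query_index)
{
    return candidate_char == query.chars[query_index]
        || candidate_char == query.alternates[query_index];
}
```

The trade-off in this sketch is the same as the one described above: the prepared object has to stay alive for as long as any comparison uses it.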

We can compute the fuzzy matching score without keeping the scores of all prefixes in memory. Currently we only use the full score matrix for debugging. In the future I want to use it to get the positions of matched characters (which could be underlined in the UI).
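
As a rough illustration of why the full matrix is not needed for the score alone, the sketch below keeps only the previous and the current row of a dynamic program. To stay short it assumes a simpler cost model than the affine one, namely a single hypothetical gap_penalty for each gap between matches, and the function name gap_count_distance is likewise made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <optional>
#include <string_view>
#include <utility>
#include <vector>

constexpr int gap_penalty = 1;  // hypothetical: flat cost per gap between matches
constexpr int infinity = std::numeric_limits<int>::max() / 2;

// Minimal number of (penalized) gaps needed to match `query` as a
// subsequence of `candidate`, using two rows instead of a full matrix.
std::optional<int> gap_count_distance(std::string_view candidate, std::string_view query)
{
    const std::size_t m = candidate.size();
    if (query.empty())
        return 0;
    // prev[j] / cur[j]: best cost where the previous / current query char
    // is matched to candidate[j-1]; index 0 is an unused sentinel.
    std::vector<int> prev(m + 1, infinity), cur(m + 1, infinity);
    for (std::size_t i = 1; i <= query.size(); ++i)
    {
        int best_before = infinity;  // min of prev[0..j-2]: non-adjacent predecessors
        for (std::size_t j = 1; j <= m; ++j)
        {
            if (j >= 2)
                best_before = std::min(best_before, prev[j - 2]);
            if (query[i - 1] != candidate[j - 1])
                cur[j] = infinity;
            else if (i == 1)
                cur[j] = 0;  // assumption: characters before the first match are free
            else
                cur[j] = std::min({prev[j - 1], best_before + gap_penalty, infinity});
        }
        std::swap(prev, cur);  // the row just computed becomes the previous row
    }
    const int best = *std::min_element(prev.begin(), prev.end());
    if (best >= infinity)
        return std::nullopt;
    return best;
}
```

Because each row depends only on the row before it, memory use is linear in the candidate length. The price is that the positions of the matched characters can no longer be read back from the table.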


Commits on Nov 11, 2023

Commits on Nov 12, 2023