Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Case insensitivity for non-latin #41

Closed
MahouShoujoMivutilde opened this issue Oct 13, 2022 · 23 comments
Closed

Case insensitivity for non-latin #41

MahouShoujoMivutilde opened this issue Oct 13, 2022 · 23 comments
Labels
enhancement New feature or request

Comments

@MahouShoujoMivutilde
Copy link

TL;DR

Currently tofi is only case insensitive for latin.

Would be really nice to have support for more locales (in my case - cyrillic).

...How it should map (but doesn't)

абвгдеёжзийклмнопрстуфхцчшщъыьэюя to АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ

Of course, unicode-aware is better than just another alphabet.

Use case

This is useful if e.g. tofi is used as a front end for clipboard manager.

What I did

I've tried to implement it myself.

Found utf8.h, which provides a bunch of utf8 aware functions as replacements for the ones from string.h.

Digging around tofi's source I've found that the main search stuff happens in src/fuzzy_match.c with two functions - fuzzy_match_simple_words() (space split and strcasestr()) and fuzzy_match_words() (space split and fuzzy search algo).

There are also uses of strcasestr() related to highlight of the found pattern, but that's beside the point.

Replacing strcasestr() with utf8casestr() from the utf8.h in fuzzy_match_simple_words() easily makes tofi case insensetive for cyrillic (and not only that), at least without --fuzzy-match true.

But with fuzzy_match_words() it is trickier.

Even if I replace string.h related functions in fuzzy_match_words() and things related to it (I can replace everything aside isalnum()), fuzzy search starts to behave weirdly.

It works fine on latin strings, still.

But if I enter а single cyrillic letter, it instantly shows no results.

I've narrowed it to

while ((match = strcasestr(match, search)) != NULL) {

Basically, if you replace strcasestr() with utf8casestr() it never matches non-latin, while loop never starts.

Idk why.

I suspect this line

const char search[2] = { *pattern, '\0' };

which is the only significantly different argument compared to the ones used for strcasestr() in fuzzy_match_simple_words().

This is where I am at.

Any insight?

Patch

Just so you know, I've tried replacing everything string.h everywhere with utf8.h alternative. It compiles, it seems to run fine for a brief test, but it doesn't make a difference to the problem with fuzzy search function, so I didn't include it.

Patch
diff --git a/src/entry_backend/harfbuzz.c b/src/entry_backend/harfbuzz.c
index c5fc07e..77869df 100644
--- a/src/entry_backend/harfbuzz.c
+++ b/src/entry_backend/harfbuzz.c
@@ -7,6 +7,8 @@
 #include "../nelem.h"
 #include "../xmalloc.h"

+#include "../utf8.h"
+
 /*
  * FreeType is normally compiled without error strings, so we have to do this
  * funky macro trick to get them. See <freetype/fterrors.h> for more details.
@@ -368,7 +370,7 @@ void entry_backend_harfbuzz_update(struct entry *entry)
 			char *postmatch = NULL;
 			cairo_text_extents_t subextents;
 			if (entry->input_mb_length > 0 && entry->selection_highlight_color.a != 0) {
-				char *match_pos = strcasestr(prematch, entry->input_mb);
+				char *match_pos = utf8casestr(prematch, entry->input_mb);
 				if (match_pos != NULL) {
 					match = xstrdup(result);
 					prematch_len = (match_pos - prematch);
diff --git a/src/entry_backend/pango.c b/src/entry_backend/pango.c
index 1f19bce..2439166 100644
--- a/src/entry_backend/pango.c
+++ b/src/entry_backend/pango.c
@@ -6,6 +6,8 @@
 #include "../nelem.h"
 #include "../xmalloc.h"

+#include "../utf8.h"
+
 #undef MAX
 #define MAX(a, b) ((a) > (b) ? (a) : (b))

@@ -181,7 +183,7 @@ void entry_backend_pango_update(struct entry *entry)
 			PangoRectangle ink_subrect;
 			PangoRectangle logical_subrect;
 			if (entry->input_mb_length > 0 && entry->selection_highlight_color.a != 0) {
-				char *match_pos = strcasestr(str, entry->input_mb);
+				char *match_pos = utf8casestr(str, entry->input_mb);
 				if (match_pos != NULL) {
 					prematch_len = (match_pos - str);
 					postmatch_len = strlen(str) - prematch_len - match_len;
diff --git a/src/fuzzy_match.c b/src/fuzzy_match.c
index 94c60e8..47aff1a 100644
--- a/src/fuzzy_match.c
+++ b/src/fuzzy_match.c
@@ -6,6 +6,9 @@

 #include "fuzzy_match.h"
 #include "xmalloc.h"
+#include "log.h"
+
+#include "utf8.h"

 #undef MAX
 #define MAX(a, b) ((a) > (b) ? (a) : (b))
@@ -33,7 +36,7 @@ int32_t fuzzy_match_simple_words(const char *restrict patterns, const char *rest
 	char *tmp = xstrdup(patterns);
 	char *pattern = strtok_r(tmp, " ", &saveptr);
 	while (pattern != NULL) {
-		char *c = strcasestr(str, pattern);
+		char *c = utf8casestr(str, pattern);
 		if (c == NULL) {
 			score = INT32_MIN;
 			break;
@@ -78,8 +81,8 @@ int32_t fuzzy_match_words(const char *restrict patterns, const char *restrict st
 int32_t fuzzy_match(const char *restrict pattern, const char *restrict str)
 {
 	const int unmatched_letter_penalty = -1;
-	const size_t slen = strlen(str);
-	const size_t plen = strlen(pattern);
+	const size_t slen = utf8len(str);
+	const size_t plen = utf8len(pattern);
 	int32_t score = 0;

 	if (*pattern == '\0') {
@@ -119,15 +122,22 @@ int32_t fuzzy_match_recurse(
 	}

 	const char *match = str;
+
+	// XXX: could be the culprit?
 	const char search[2] = { *pattern, '\0' };

 	int32_t best_score = INT32_MIN;
+    log_debug(">> Fuzzy match recurse 1!\n");

 	/*
 	 * Find all occurrences of the next pattern character in str, and
 	 * recurse on them.
 	 */
-	while ((match = strcasestr(match, search)) != NULL) {
+	// FIXME utf8casestr doesn't work here
+	while ((match = utf8casestr(match, search)) != NULL) {
+		// we never get there with utf8casestr
+		log_debug(">> Fuzzy match recurse 2!\n");
+
 		int32_t subscore = fuzzy_match_recurse(
 				pattern + 1,
 				match + 1,
@@ -174,23 +184,30 @@ int32_t compute_score(int32_t jump, bool first_char, const char *restrict match)

 	/* Apply bonuses. */
 	if (!first_char && jump == 0) {
+		log_debug(">> Compute score 1!\n");
 		score += adjacency_bonus;
 	}
 	if (!first_char || jump > 0) {
-		if (isupper(*match) && islower(*(match - 1))) {
+		log_debug(">> Compute score 2!\n");
+		if (utf8isupper(*match) && utf8islower(*(match - 1))) {
+			log_debug(">> Compute score 3!\n");
 			score += camel_bonus;
 		}
+		// XXX this is ascii only, but we never actually get there, wtf
 		if (isalnum(*match) && !isalnum(*(match - 1))) {
+			log_debug(">> Compute score 4!\n");
 			score += separator_bonus;
 		}
 	}
 	if (first_char && jump == 0) {
+		log_debug(">> Compute score 5!\n");
 		/* Match at start of string gets separator bonus. */
 		score += first_letter_bonus;
 	}

 	/* Apply penalties. */
 	if (first_char) {
+		log_debug(">> Compute score 6!\n");
 		score += MAX(leading_letter_penalty * jump,
 				max_leading_letter_penalty);
 	}

Assumes that utf8.h inside src/.

@philj56
Copy link
Owner

philj56 commented Oct 13, 2022

Hey, thanks for the detailed write up! Yeah, non-ascii stuff is generally a bit busted in C. Seems people use menus like this for much more than just running programs like I do, so I agree proper Unicode handling would be nice to have. On that note, I think I may also need to look into allowing multiple font files to be passed to --font as well at some point.

A general note: I'd like to have as few dependencies as possible, vendored or otherwise. Therefore, instead of using utf8.h (which does look pretty good), I'd rather use GLib as tofi already depends on it, and it includes all the necessary functions. According to the note in the documentation of g_strncasecmp, the correct way to do this with GLib is g_utf8_casefold followed by strcmp or similar. islower and isupper can be done with a combination of g_utf8_get_char and g_unichar_islower / g_unichar_isupper.

For your specific problem, I think you're exactly right in saying

I suspect this line

const char search[2] = { *pattern, '\0' };

That line's grabbing the first byte of pattern, and sticking a null terminator after it. For utf8 input, the characters are more than 1 byte long, which means that search is not a valid utf8 string, and utf8casestr probably balks at it. The correct way is to search for a whole Unicode character - with GLib, that'd be with g_utf8_strchr.

It's late here now, but I'll take a look at this tomorrow and see how it works in practice.

@MahouShoujoMivutilde
Copy link
Author

Cool, thank you for the best dmenu alternative I've used, btw.

@philj56 philj56 added the enhancement New feature or request label Oct 14, 2022
@philj56
Copy link
Owner

philj56 commented Oct 14, 2022

Ok, I think I've got this mostly working on the utf8 branch (at least I haven't broken Latin text). I don't have an easy way to type non-Latin text into tofi at the moment, so could you give that branch a go and let me know if it works alright (with and without fuzzy matching)? I'll also look into ways to automate typing text into tofi, which would help with testing stuff like this.

Cool, thank you for the best dmenu alternative I've used, btw.

No problem, I'm glad you like it!

@MahouShoujoMivutilde
Copy link
Author

So far it seems to work just fine with fuzzy matching and without, but I'll use it for a few days to make sure.

I'll also look into ways to automate typing text into tofi, which would help with testing stuff like this.

Wtype can do that.

01.08.15.2022.mp4
sleep 5; wtype 'Something about fox; Что-то там про лису; you can even autotype this 不規則性エントロピー'

But I think just headless test for search functions with predefined "input" strings, patterns and corresponding expected return values would be more robust in term of automation.

@philj56
Copy link
Owner

philj56 commented Oct 15, 2022

But I think just headless test for search functions with predefined "input" strings, patterns and corresponding expected return values would be more robust in term of automation.

Definitely, I was more thinking of something to quickly check it's roughly working.

Wtype can do that.

Nice, thanks!

@MahouShoujoMivutilde
Copy link
Author

To make sure I thoroughly tested everything I've also replaced fzf with tofi --fuzzy-match true (and I use it a lot when in terminal). (In addition to clipboard/drun/etc).

General notes

This is just a few remarks, not a critique.

  • Tofi works faster with larger number of shorter lines than fewer number of longer lines, even if the actual number of characters is the same. (Relevant and noticeable only when there really a lot of text in the input - e.g. when used for file search)

  • Tofi's interface is strongly tied to the search, so if search takes long enough, prompt can ignore your input.

  • There are some weird edge cases where fuzzy search will display the line with some obscure phrase that simply has the same letters as the prompt as the best match, and as the second place, right after it, - the expected result, exact match for the prompt. It happens once in a blue moon, so it doesn't matter that much, and it looks like the original algo's problem, rather than utf8 port problem (considering testing still matches expected results after initial score is set back to 100).

  • Would be nice if hotkeys like Ctrl-j/k also ignored keyboard layout (maybe relevant as an example: rofi wayland fork (written in C) does that).

UTF8

Overall, I haven't seen any practical problems with new utf8 search backend.

@philj56
Copy link
Owner

philj56 commented Oct 18, 2022

Nice! I've merged the utf8 handling into master.

To make sure I thoroughly tested everything I've also replaced fzf with tofi --fuzzy-match true (and I use it a lot when in terminal). (In addition to clipboard/drun/etc).

This is well beyond what I thought it would ever be used for 😅 Thanks for being so thorough! I just tried it myself and it's pretty bad with find * | tofi. It probably needs to work similarly to fzf and start drawing before all input is loaded for something like that to be usable.

Tofi works faster with larger number of shorter lines than fewer number of longer lines, even if the actual number of characters is the same. (Relevant and noticeable only when there really a lot of text in the input - e.g. when used for file search)

Is that just with fuzzy match, or for simple matching as well? For fuzzy matching it makes sense, as there's likely lots of potential matches for a long input string. Off the top of my head, I'd expect the time to find the best match to grow exponentially with string length. I guess there could be some heuristic, to say "Only ever try to find the n next characters", so accuracy can be traded for speed.

Tofi's interface is strongly tied to the search, so if search takes long enough, prompt can ignore your input.

Yes, it only draws a new frame every keypress, and will hang until it's finished. I can't get it to miss input with some quick testing, but I can get it to exhibit an old bug: rendering becomes out-of-sync with input somehow, so what's displayed lags behind your input by 1 keypress. Solving this probably requires some re-architecturing.

There are some weird edge cases ...

Interesting, I guess this is due to the camel-case / word separator bonuses being larger than the letter adjacency bonus:

tofi/src/fuzzy_match.c

Lines 170 to 173 in 3406922

const int adjacency_bonus = 15;
const int separator_bonus = 30;
const int camel_bonus = 30;
const int first_letter_bonus = 15;

I can reproduce this with something like:

printf "think\nTxHxIxNxK" | tofi --fuzzy-match true

typing "think" (or even just "th") then preferentially matches "TxHxIxNxK". This could be solved by making the fuzzy match heuristics customisable, but I don't want to add too many seldom-used options. Maybe this is worth it though, as the original algorithm's really just optimised for code searching.

Would be nice if hotkeys like Ctrl-j/k also ignored keyboard layout

I can look into it, but just to be clear, you mean selecting a keyboard layout with e.g. j in a different place shouldn't change which keyboard key you have to hit? (Also, keyboard handling in general needs sorting out, it's on my todo list)

Overall, I haven't seen any practical problems with new utf8 search backend.

👍 Great! After you suggested it, I actually added a tiny bit of unit testing in 3406922 and found one minor bug: an accented character will fuzzily match against a string containing the same character and accent, but not together, e.g.
will match against aọ (but not against just a).

Thanks again for all the discussion and feedback, it's very helpful!

@philj56
Copy link
Owner

philj56 commented Oct 18, 2022

Off the top of my head, I'd expect the time to find the best match to grow exponentially with string length.

Actually, I remembered I've thought about this before and did some benchmarking. The time for one match should be O(n*m), where n is the length of the search pattern and m is the length of the target string, so it should be the same. Rendering is likely to be slower with long lines though - I'll have to do some testing.

@MahouShoujoMivutilde
Copy link
Author

MahouShoujoMivutilde commented Oct 18, 2022

It probably needs to work similarly to fzf and start drawing before all input is loaded for something like that to be usable

It is already surprisingly usable for me, I use it as general media opener when I know exactly what I want to pull up and don't want to go down the directory tree.

But yeah, I can see it being slow to open with large enough collection of files.

Here is the script
#!/usr/bin/env sh

fd -t f -t l \
    -e '.bmp' \
    -e '.gif' \
    -e '.jpeg' \
    -e '.jpg' \
    -e '.jxr' \
    -e '.heif' \
    -e '.webp' \
    -e '.png' \
    -e '.svg' \
    -e '.tif' \
    -e '.tiff' \
    -e '.cbz' \
    -e '.cbr' \
    -e '.flac' \
    -e '.m4a' \
    -e '.mp3' \
    -e '.ogg' \
    -e '.opus' \
    -e '.wav' \
    -e '.wma' \
    -e '.m4v' \
    -e '.avi' \
    -e '.mkv' \
    -e '.mov' \
    -e '.mp4' \
    -e '.mpeg' \
    -e '.wmv' \
    -e '.webm' \
    -e '.djvu' \
    -e '.epub' \
    -e '.fb2' \
    -e '.pdf' \
    -e '.docx' \
    -e '.xlsx' \
    . ~ |
    sed "s#^$HOME#~#g" | tofi | sed "s#^~#$HOME#g" |
    xargs -d '\n' -r xdg-open

# and -E and --ignore-file for fd to not go in places you don't expect to find relevant files

Also fd is faster than find, so there is that.

Is that just with fuzzy match, or for simple matching as well?

Just the fuzzy match, yes.

Tested via tr -d '\n' < alice_in_wonderland.txt | fold -s -w1000 | tofi --fuzzy-match true.

Try with 1000 and more reasonable 100 - the difference is very noticeable.

alice_in_wonderland.txt.

old bug: rendering becomes out-of-sync with input somehow, so what's displayed lags behind your input by 1 keypress.

This is exactly what I saw and meant, yes.

Weird edge cases
TxHxIxNxK

Yeah, I guess that's distilled version of what I saw.

Maybe this is worth it though, as the original algorithm's really just optimised for code searching.

I think those kind of options would likely be configured once by user and left alone after, but it'll be definitely nice to have.

I'll play around with the values to see maybe there is a better default, idk.

but just to be clear, you mean selecting a keyboard layout with e.g. j in a different place shouldn't change which keyboard key you have to hit?

No, more like j is the same physical key as о (cyrillic) and k is л (cyrillic).

This is a long shot, but I know that GUI apps don't care about your keyboard layout when there is some mod mask, e.g. ctrl+t will open new tab in chromium regardless of my keyboard layout.

EDIT: That's seems to be called "control transformation":

If the Control modifier is active and was not consumed by the translation process, the string produced is transformed to its matching ASCII control character (if applicable). Keysyms are not affected.

Same for fzf (somehow) with ctrl-j/k.

Great! After you suggested it, I actually added a tiny bit of unit testing in 3406922

Nice!

@MahouShoujoMivutilde
Copy link
Author

In regards to ignoring layout:

Here is what wev thinks about ctrl-j in different layouts
# Pressed Ctrl-j
[14:     wl_keyboard] key: serial: 2930; time: 5602413; key: 37; state: 1 (pressed)
                      sym: Control_L    (65507), utf8: ''
[14:     wl_keyboard] modifiers: serial: 0; group: 16
                      depressed: 00000004: Control 
                      latched: 00000000
                      locked: 00000010: Mod2 
[14:     wl_keyboard] key: serial: 2932; time: 5603933; key: 44; state: 1 (pressed)
                      sym: j            (106), utf8: '
'
[14:     wl_keyboard] key: serial: 2933; time: 5604061; key: 44; state: 0 (released)
                      sym: j            (106), utf8: ''
[14:     wl_keyboard] key: serial: 2934; time: 5606685; key: 37; state: 0 (released)
                      sym: Control_L    (65507), utf8: ''
[14:     wl_keyboard] modifiers: serial: 0; group: 16
                      depressed: 00000000
                      latched: 00000000
                      locked: 00000010: Mod2 

# Switched layout
[14:     wl_keyboard] key: serial: 2936; time: 5608053; key: 108; state: 1 (pressed)
                      sym: ISO_Next_Group (65032), utf8: ''
[14:     wl_keyboard] modifiers: serial: 1; group: 16
                      depressed: 00000000
                      latched: 00000000
                      locked: 00000010: Mod2 
[14:     wl_keyboard] key: serial: 2938; time: 5608141; key: 108; state: 0 (released)
                      sym: ISO_Next_Group (65032), utf8: ''

# Pressed Ctrl-j again
[14:     wl_keyboard] key: serial: 2939; time: 5610821; key: 37; state: 1 (pressed)
                      sym: Control_L    (65507), utf8: ''
[14:     wl_keyboard] modifiers: serial: 1; group: 16
                      depressed: 00000004: Control 
                      latched: 00000000
                      locked: 00000010: Mod2 
[14:     wl_keyboard] key: serial: 2941; time: 5612221; key: 44; state: 1 (pressed)
                      sym: Cyrillic_o   (1743), utf8: '
'
[14:     wl_keyboard] key: serial: 2942; time: 5612381; key: 44; state: 0 (released)
                      sym: Cyrillic_o   (1743), utf8: ''
[14:     wl_keyboard] key: serial: 2943; time: 5614469; key: 37; state: 0 (released)
                      sym: Control_L    (65507), utf8: ''
[14:     wl_keyboard] modifiers: serial: 1; group: 16
                      depressed: 00000000
                      latched: 00000000
                      locked: 00000010: Mod2 

Notice that while sym is either j or Cyrillic_o, key number stays 44.

https://git.sr.ht/~sircmpwn/wev

j is 44,
k is 45.

@MahouShoujoMivutilde
Copy link
Author

Just quick and dirty

patch
diff --git a/src/main.c b/src/main.c
index 043f6a8..e02bd08 100644
--- a/src/main.c
+++ b/src/main.c
@@ -182,7 +182,7 @@ static void handle_keypress(struct tofi *tofi, xkb_keycode_t keycode)
 		}
 	} else if (entry->input_length > 0
 			&& (sym == XKB_KEY_BackSpace
-				|| (sym == XKB_KEY_w
+				|| (keycode == 25 /* `w` physical key */
 					&& xkb_state_mod_name_is_active(
 						tofi->xkb_state,
 						XKB_MOD_NAME_CTRL,
@@ -214,7 +214,7 @@ static void handle_keypress(struct tofi *tofi, xkb_keycode_t keycode)
 		} else {
 			entry->results = string_vec_filter(&entry->commands, entry->input_mb, tofi->fuzzy_match);
 		}
-	} else if (sym == XKB_KEY_u
+	} else if (keycode == 30 /* `u` physical key */
 			&& xkb_state_mod_name_is_active(
 				tofi->xkb_state,
 				XKB_MOD_NAME_CTRL,
@@ -233,7 +233,7 @@ static void handle_keypress(struct tofi *tofi, xkb_keycode_t keycode)
 			entry->results = string_vec_filter(&entry->commands, entry->input_mb, tofi->fuzzy_match);
 		}
 	} else if (sym == XKB_KEY_Escape
-			|| (sym == XKB_KEY_c
+			|| (keycode == 54 /* `c` physical key */
 				&& xkb_state_mod_name_is_active(
 					tofi->xkb_state,
 					XKB_MOD_NAME_CTRL,
@@ -250,7 +250,7 @@ static void handle_keypress(struct tofi *tofi, xkb_keycode_t keycode)
 
 	uint32_t nsel = MAX(MIN(entry->num_results_drawn, entry->results.count), 1);
 	if (sym == XKB_KEY_Up || sym == XKB_KEY_Left || sym == XKB_KEY_ISO_Left_Tab
-			|| (sym == XKB_KEY_k
+			|| (keycode == 45 /* `k` physical key */
 				&& xkb_state_mod_name_is_active(
 					tofi->xkb_state,
 					XKB_MOD_NAME_CTRL,
@@ -270,7 +270,7 @@ static void handle_keypress(struct tofi *tofi, xkb_keycode_t keycode)
 			}
 		}
 	} else if (sym == XKB_KEY_Down || sym == XKB_KEY_Right || sym == XKB_KEY_Tab
-			|| (sym == XKB_KEY_j
+			|| (keycode == 44 /* `j` physical key */
 				&& xkb_state_mod_name_is_active(
 					tofi->xkb_state,
 					XKB_MOD_NAME_CTRL,

...seems to actually work just fine! Now hotkeys are layout-agnostic.

Will test further.

@philj56
Copy link
Owner

philj56 commented Oct 20, 2022

Sorry for the slow response, been a busy couple of days.

Here is the script

Pretty cool - on my desktop, running that in my home folder gives ~15000 files, which takes 100ms to open. Not bad, but would definitely be improved by the early start to rendering I mentioned. Nice to know about fd.

Try with 1000 and more reasonable 100 - the difference is very noticeable.

Wow that's bad! Sway actually completely locked up for me. I tried checking just how many times fuzzy_match_recurse() is called when entering a pattern:

Pattern fold width fuzzy_match_recurse() calls
s 100 7996
ss 100 29859
sss 100 70713
s 1000 6648
ss 1000 159426
sss 1000 2512656

Safe to say, 2.5 million function calls is never going to be very quick - I'll do some more investigation. Maybe the "Only ever try to find the n next characters" strategy isn't a bad one to have as a failsafe.

This is exactly what I saw and meant, yes.

Cool, I'll look into it.

I think those kind of options would likely be configured once by user and left alone after, but it'll be definitely nice to have.

Yeah probably, it's more just about making the man page / command line help more awkward to read. I'll think about it at some point.

...seems to actually work just fine! Now hotkeys are layout-agnostic.

Nice! Yeah I've come across this in gamedev stuff before, keycodes vs. keysyms was never that clear to me. My first thought is, does this still work on a physical AZERTY/Dvorak/whatever keyboard? If so, then yeah looks like keycodes are the way to go.

@philj56
Copy link
Owner

philj56 commented Oct 20, 2022

Very briefly, here's the performance of fuzzy matching sss against Alice in Wonderland with different line lengths:

Figure_1

so yeah, some exponential / power-law is going on.

Edit: Did some more thinking, some theoretical background: asking how many ways $N$ there are to match $n$ copies of a character in a string with $k$ of that character is the same question as asking how many $k$-bit numbers there are with $n$ bits set, which turns out to be:

$$ N = \frac{k!}{n!(k-n)!} $$

For example, in the first 1000 characters of Alice in Wonderland, there are 39 s's, so

$$ \begin{aligned} N &= \frac{39!}{3! \cdot 36!} \\ &= 9139 \end{aligned}$$

This scales very poorly with $k$. For the full text wrapped to 1000 character lines, the total number of matches for sss is (excuse the overkill one-liner):

$ tr -d '\n' < alice_in_wonderland.txt | fold -s -w1000 | sed "s/[^s]//g" | awk '{ print length }' | sort -n | python -c 'import sys; import math; [print(math.factorial(int(line)) / (math.factorial(3) * math.factorial(int(line) - 3))) for line in sys.stdin]' | paste -s -d+ | bc -l
2000025.0

So yeah, 2 million possible matches to search (plus ~500000 function calls just to check the null byte at the end of the pattern I guess).

@MahouShoujoMivutilde
Copy link
Author

MahouShoujoMivutilde commented Oct 20, 2022

Sorry for the slow response, been a busy couple of days.

No worries!

Sway actually completely locked up for me.

I am not sure it was a real lock up.

I had similar effect (running Hyprland) with really long lines, but I was able to just click outside the tofi's window (which really became locked up), press hotkey to open the terminal, and pkill tofi.

It'll actually lock up with just 2 1000 character lines

tr -d '\n' < alice_in_wonderland.txt | fold -s -w1000 | head -n 2 | tofi --fuzzy-match true

The score is -700-something for alice...

Maybe the "Only ever try to find the n next characters" strategy isn't a bad one to have as a failsafe.

...So another way to approach this is that if penalty for unmatched characters is large enough, we could just drop fuzzy search for this particular line, and use regular search, or maybe find n characters, then give up and try regular search.

That way we won't miss a match at the end of the long line, but still avoid taking too long.

keycodes vs. keysyms, AZERTY/Dvorak

From arch wiki, keyboard_input:

  1. The keyboard sends a scancode to the computer.
  2. The Linux kernel maps the scancode to a keycode, see Map scancodes to keycodes.
  3. The keyboard layout maps the keycode to a symbol or keysym, depending on what modifier keys are pressed.
  • For the Linux console, see Linux console/Keyboard configuration.
  • For Xorg and Wayland, see Xorg/Keyboard configuration.

From map scancodes to keycodes:

Mapping scancodes to keycodes is achieved in a layer lower than Xorg and Linux console, which means that changes to this mapping will be effective in both
...
The preferred method is to use udev because it uses hardware information (which is a quite reliable source) to choose the keyboard model in a database. It means that if your keyboard model has been found in the database, your keys are recognized out of the box.

At the same time, keycode = scancode + 8, https://unix.stackexchange.com/a/364652

Or, more like keycode = mapping_stuff(scancode) + 8.

So, the way I understand it works is:

Case 1

  • You have QWERTY keyboard with QWERTY firmware that is known to udev.

  • You press physical key Q, which generates scancode 16 (maybe?), which is sent to udev, which knows it is a QWERTY keyboard, and so it translates it to 24 keycode, then your compositor looks at what layout you have, and translates the keycode to corresponding keysym, which is XKB_KEY_q.

Or

Case 2

  • You have AZERTY keyboard
  • You press A, which is in the same place as Q for QWERTY, which generates scancode (16? idk), but since udev knows this is AZERTY, it translates it to keycode 38, which gives you desired XKB_KEY_a in the end.

However, if you have layout level AZERTY or Dvorak, you'll get different keys for the same keycodes, so of course, hotkeys wont work as expected.

So my guess:

  • alternative hardware-based layouts will work fine
  • alternative software-based layouts (set in compositor) will have bindings work as if it was QWERTY

I could be wrong, I don't fully get how scancode to keycode translation happens, and AW covers it in terms of remapping, not deep dive "what will break when".

Ideally, we could use some virtual keyboard, that goes through all those layers, has its own /dev/input/... device, etc, to check.

Oh, there are also canonical key names, like AC07 for j. It's a thing. I don't know what to do with it, yet. Yeaaaah.

Maybe they are the proper way to do layout-agnostic stuff, idk.

e.g.

/* keycode = 44, `j` physical key */

/* this will always, independent of selected layout return "AC07" for keycode = 44 */
xkb_keymap_key_get_name(tofi->xkb_keymap, keycode);

@philj56
Copy link
Owner

philj56 commented Oct 20, 2022

I am not sure it was a real lock up.

Yeah haha, but foolishly I'd used a fullscreen tofi window, and after switching to a different tty to kill tofi, sway wouldn't show again on changing back 😅.

That way we won't miss a match at the end of the long line, but still avoid taking too long.

Ah I meant "the first n matches per character", so for each pattern character, you'd only look at the next "n" matching characters - you'd still find a match at the end of the string if that was the only one.

...So another way to approach this is that if penalty for unmatched characters is large enough, we could just drop fuzzy search for this particular line, and use regular search, or maybe find n characters, then give up and try regular search.

Yeah I started to think this was the best approach too. It can still be fuzzy, but just return only the first match if the line's too long, rather than trying to find the best possible match.

So, the way I understand it works is:

👍 I'll have to read this again tomorrow when I'm more awake, but this seems like a good understanding of things. I'll do some more searching first, but I guess the easiest solution is just to use keycodes for now, as that fixes software layout stuff, and worry about other things if someone brings it up in the future. I also keep meaning to overhaul the input stuff at some point, at which point I may add customisable keybinds, and then we could also make this behaviour customisable as well. May decide not to go that route though, who knows.

Oh, there are also canonical key names, like AC07 for j. It's a thing. I don't know what to do with it, yet. Yeaaaah.

Haha, oh dear.

@MahouShoujoMivutilde
Copy link
Author

so for each pattern character, you'd only look at the next "n" matching characters - you'd still find a match at the end of the string if that was the only one.

Oh, makes sense.

@philj56
Copy link
Owner

philj56 commented Oct 21, 2022

Alright, after some more reading I think you're right and keycodes are the way to go for the CTRL-J style shortcuts. These are defined in <linux/input-event-codes.h> (without the +8 that XKB uses). Initially I wasn't sure what this would mean for *BSD, but it seems that at least FreeBSD uses the same definitions. OpenBSD is different, but Wayland isn't even available there yet so it's not an issue for now.

I've made the change and done some rearranging in d1b94b4, hopefully it should work fine. This is off-topic, but I think seeing as we're already using GLib functions for UTF-8 stuff, soon I'll set about ripping out any mention of wchar_t and make everything explicitly either UTF-32 or UTF-8 (This is partly motivated by seeing that C23 is finally adding some standard library stuff for them, and partly just because I've always hated the ambiguity around wchar_t).

Maybe they are the proper way to do layout-agnostic stuff, idk.

Ah I didn't see this edit before. I think the keycode stuff is probably fine, but that's good to know about.

@MahouShoujoMivutilde
Copy link
Author

MahouShoujoMivutilde commented Oct 22, 2022

I've made the change and done some rearranging in d1b94b4, hopefully it should work fine.

It does, thanks, looks a lot cleaner too!

This is off-topic, but I think seeing as we're already using GLib functions for UTF-8 stuff, soon I'll set about ripping out any mention of wchar_t...

I am not a C programmer, but yeah, wide characters always seemed weird to me.

I think the keycode stuff is probably fine, but that's good to know about.

Yes. I've also found where AC07 for j comes from - /usr/share/X11/xkb/keycodes/evdev (which is owned by xkeyboard-config, which is a dependency of libxkbcommon).

// translation from evdev scancodes to something resembling xfree86 keycodes.

default xkb_keycodes "evdev" {
	minimum = 8;
	maximum = 255;
....
	<AC07> = 44;
...
}

> ... evdev scancodes ...

I don't see them in linux source (just the numbers from <linux/input-event-codes.h>), but I haven't looked very thoroughly, they could be defined in different ways, or it could be just a Xorg thing pulled in libxkbcommon.

The meaning of AC07 is defined in grep -r AC07 /usr/share/X11/xkb/symbols for different layouts.

Also

means the first key (01) in the third row up (C) of the alphanumeric area (A). So why does it use that?

(from here)

So you'll probably be fine with either of keycodes or "canonical names", idk.

@philj56
Copy link
Owner

philj56 commented Oct 22, 2022

It does, thanks, looks a lot cleaner too!

👍 Yeah that's the first step towards customisable keybindings, if that ends up being a thing.

means the first key (01) in the third row up (C) of the alphanumeric area (A). So why does it use that?

Nice find! So yeah, seems keycodes and "canonical names" are both fine.

I've limited the line length for "perfect" fuzzy matching to 100 characters in 820fb11, with a comment explaining the logic:

tofi/src/fuzzy_match.c

Lines 97 to 119 in 820fb11

/*
* If the string is more than 100 characters, just find the first fuzzy
* match rather than the best.
*
* This is required as the number of possible matches (for patterns and
* strings all consisting of one letter) scales something like:
*
* slen! / (plen! (slen - plen)!) ~ slen^plen for plen << slen
*
* This quickly grinds everything to a halt. 100 is chosen fairly
* arbitrarily from the following logic:
*
* - e is the most common character in English, at around 13% of
* letters. Depending on the context, let's say this be up to 20%.
* - 100 * 0.20 = 20 repeats of the same character.
* - In the worst case here, 20! / (10! 10!) ~200,000 possible matches,
* which is "slow but not frozen" for my machine.
*
* In reality, this worst case shouldn't be hit, and finding the "best"
* fuzzy match in lines of text > 100 characters isn't really in scope
* for a dmenu clone.
*/
bool first_match_only = slen > 100;

For the Alice in Wonderland text wrapped to 100 character lines, repeatedly typing e still causes tofi to slow down a lot, but not hang for more than a second (and as mentioned, that's by far the worst case - normal searches shouldn't see slowdowns now).

This issue's mostly off the original topic and getting a bit long to search through now, so I'm gonna mark it as resolved and open some new ones for the other things you mentioned. Thanks again for all the input, it's very useful!

@philj56 philj56 closed this as completed Oct 22, 2022
@MahouShoujoMivutilde
Copy link
Author

Thanks for being so responsive and making tofi even better!

@philj56
Copy link
Owner

philj56 commented Oct 23, 2022

One final word on this actually: I've done the conversion from wchar_t to explicit UTF-8 / UTF-32 in 3bbd8ff. This shouldn't change behaviour at all, and I'll make a new release in a couple of days if no problems present themselves, as there's been a lot of changes by now.

@estnml
Copy link

estnml commented Sep 25, 2023

Hello, i would be very happy if u could help me how can i launch tofi with clipboard content from terminal.

@MahouShoujoMivutilde
Copy link
Author

I use copyq as an actual clipboard manager, tofi is just an interface for it.

https://gist.github.com/MahouShoujoMivutilde/7113baadb13208d91f4271a28debc58b

Replace dmenu with tofi --fuzzy-match true.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants