Case insensitivity for non-latin #41
Comments
Hey, thanks for the detailed write up! Yeah, non-ascii stuff is generally a bit busted in C. Seems people use menus like this for much more than just running programs like I do, so I agree proper Unicode handling would be nice to have. On that note, I think I may also need to look into allowing multiple font files to be passed to A general note: I'd like to have as few dependencies as possible, vendored or otherwise. Therefore, instead of using For your specific problem, I think you're exactly right in saying
That line's grabbing the first byte of pattern, and sticking a null terminator after it. For utf8 input, the characters are more than 1 byte long, which means that It's late here now, but I'll take a look at this tomorrow and see how it works in practice. |
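To illustrate the first-byte problem described above, here is a minimal sketch (not tofi's actual code) of copying out the first whole UTF-8 character of a pattern rather than its first byte. The names `utf8_char_len` and `utf8_first_char` are hypothetical, and the sketch assumes the input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte length of the UTF-8 character starting at s,
 * derived from the lead byte. Assumes valid UTF-8; a stray continuation
 * or invalid lead byte is treated as a single byte. */
size_t utf8_char_len(const char *s)
{
    assert(s != NULL);
    unsigned char lead = (unsigned char)s[0];
    if (lead < 0x80) return 1;           /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx: e.g. Cyrillic */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 1;
}

/* Hypothetical helper: copy the first whole character of pattern into out
 * (which must hold at least 5 bytes) and null-terminate it.
 * Returns the number of bytes copied. */
size_t utf8_first_char(const char *pattern, char out[5])
{
    size_t len = utf8_char_len(pattern);
    size_t n = 0;
    /* Stop early at the terminator so truncated input is never over-read. */
    while (n < len && pattern[n] != '\0') {
        out[n] = pattern[n];
        n++;
    }
    out[n] = '\0';
    return n;
}
```

For a Cyrillic letter this copies two bytes, whereas taking only `pattern[0]` would produce half a character that can never match anything.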
Cool, thank you for the best dmenu alternative I've used, btw. |
Ok, I think I've got this mostly working on the utf8 branch (at least I haven't broken Latin text). I don't have an easy way to type non-Latin text into tofi at the moment, so could you give that branch a go and let me know if it works alright (with and without fuzzy matching)? I'll also look into ways to automate typing text into tofi, which would help with testing stuff like this.
No problem, I'm glad you like it! |
So far it seems to work just fine with fuzzy matching and without, but I'll use it for a few days to make sure.
Wtype can do that.
But I think just a headless test for search functions with predefined "input" strings, patterns and corresponding expected return values would be more robust in terms of automation. |
Definitely, I was more thinking of something to quickly check it's roughly working.
Nice, thanks! |
To make sure I thoroughly tested everything I've also replaced
General notes
This is just a few remarks, not a critique.
UTF8
Overall, I haven't seen any practical problems with the new utf8 search backend. |
Nice! I've merged the utf8 handling into master.
This is well beyond what I thought it would ever be used for 😅 Thanks for being so thorough! I just tried it myself and it's pretty bad with
Is that just with fuzzy match, or for simple matching as well? For fuzzy matching it makes sense, as there's likely lots of potential matches for a long input string. Off the top of my head, I'd expect the time to find the best match to grow exponentially with string length. I guess there could be some heuristic, to say "Only ever try to find the n next characters", so accuracy can be traded for speed.
Yes, it only draws a new frame every keypress, and will hang until it's finished. I can't get it to miss input with some quick testing, but I can get it to exhibit an old bug: rendering becomes out-of-sync with input somehow, so what's displayed lags behind your input by 1 keypress. Solving this probably requires some re-architecturing.
Interesting, I guess this is due to the camel-case / word separator bonuses being larger than the letter adjacency bonus: Lines 170 to 173 in 3406922
I can reproduce this with something like:
printf "think\nTxHxIxNxK" | tofi --fuzzy-match true
typing "think" (or even just "th") then preferentially matches "TxHxIxNxK". This could be solved by making the fuzzy match heuristics customisable, but I don't want to add too many seldom-used options. Maybe this is worth it though, as the original algorithm's really just optimised for code searching.
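As a rough illustration of why this happens, here is a toy scorer, assuming a camel-case bonus larger than the adjacency bonus. The weights and the name `toy_score` are made up for this sketch and are not tofi's real values or algorithm; it also matches greedily left to right instead of searching for the best match.

```c
#include <ctype.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical weights for illustration only; these are NOT tofi's values. */
enum {
    ADJACENCY_BONUS = 5,   /* match directly follows the previous match */
    CAMEL_BONUS = 10,      /* uppercase match right after a lowercase char */
    UNMATCHED_PENALTY = 1  /* per candidate character left unmatched */
};

/* Greedy, case-insensitive toy scorer (ASCII only).
 * Returns INT_MIN if pattern is not a subsequence of str. */
int toy_score(const char *pattern, const char *str)
{
    int score = 0;
    size_t i = 0;
    size_t prev = 0;
    int have_prev = 0;

    for (; *pattern != '\0'; pattern++) {
        int want = tolower((unsigned char)*pattern);
        /* Skip (and penalise) characters that don't match. */
        while (str[i] != '\0' && tolower((unsigned char)str[i]) != want) {
            score -= UNMATCHED_PENALTY;
            i++;
        }
        if (str[i] == '\0') {
            return INT_MIN;
        }
        if (have_prev && i == prev + 1) {
            score += ADJACENCY_BONUS;
        }
        if (i > 0 && isupper((unsigned char)str[i])
                && islower((unsigned char)str[i - 1])) {
            score += CAMEL_BONUS;
        }
        prev = i;
        have_prev = 1;
        i++;
    }
    /* Remaining characters after the last match are unmatched too. */
    for (; str[i] != '\0'; i++) {
        score -= UNMATCHED_PENALTY;
    }
    return score;
}
```

With these made-up weights, `toy_score("think", "think")` collects four adjacency bonuses, while `toy_score("think", "TxHxIxNxK")` collects four camel-case bonuses minus four skipped characters, so the second candidate ranks higher. Raising the adjacency bonus above the camel-case bonus flips the order.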
I can look into it, but just to be clear, you mean selecting a keyboard layout with e.g.
👍 Great! After you suggested it, I actually added a tiny bit of unit testing in 3406922 and found one minor bug: an accented character will fuzzily match against a string containing the same character and accent, but not together, e.g. Thanks again for all the discussion and feedback, it's very helpful! |
Actually, I remembered I've thought about this before and did some benchmarking. The time for one match should be |
It is already surprisingly usable for me, I use it as a general media opener when I know exactly what I want to pull up and don't want to go down the directory tree. But yeah, I can see it being slow to open with a large enough collection of files.
Here is the script
#!/usr/bin/env sh
fd -t f -t l \
-e '.bmp' \
-e '.gif' \
-e '.jpeg' \
-e '.jpg' \
-e '.jxr' \
-e '.heif' \
-e '.webp' \
-e '.png' \
-e '.svg' \
-e '.tif' \
-e '.tiff' \
-e '.cbz' \
-e '.cbr' \
-e '.flac' \
-e '.m4a' \
-e '.mp3' \
-e '.ogg' \
-e '.opus' \
-e '.wav' \
-e '.wma' \
-e '.m4v' \
-e '.avi' \
-e '.mkv' \
-e '.mov' \
-e '.mp4' \
-e '.mpeg' \
-e '.wmv' \
-e '.webm' \
-e '.djvu' \
-e '.epub' \
-e '.fb2' \
-e '.pdf' \
-e '.docx' \
-e '.xlsx' \
. ~ |
sed "s#^$HOME#~#g" | tofi | sed "s#^~#$HOME#g" |
xargs -d '\n' -r xdg-open
# and -E and --ignore-file for fd to not go in places you don't expect to find relevant files
Also fd is faster than
Just the fuzzy match, yes. Tested via Try with
This is exactly what I saw and meant, yes.
Yeah, I guess that's distilled version of what I saw.
I think those kinds of options would likely be configured once by the user and left alone after, but they'd definitely be nice to have. I'll play around with the values to see if there is a better default, idk.
No, more like This is a long shot, but I know that GUI apps don't care about your keyboard layout when there is some mod mask, e.g. ctrl+t will open new tab in chromium regardless of my keyboard layout. EDIT: That seems to be called "control transformation":
Same for fzf (somehow) with ctrl-j/k.
Nice! |
In regards to ignoring layout: Here is what wev thinks about ctrl-j in different layouts
Notice that while https://git.sr.ht/~sircmpwn/wev
|
Just quick and dirty patch
diff --git a/src/main.c b/src/main.c
index 043f6a8..e02bd08 100644
--- a/src/main.c
+++ b/src/main.c
@@ -182,7 +182,7 @@ static void handle_keypress(struct tofi *tofi, xkb_keycode_t keycode)
}
} else if (entry->input_length > 0
&& (sym == XKB_KEY_BackSpace
- || (sym == XKB_KEY_w
+ || (keycode == 25 /* `w` physical key */
&& xkb_state_mod_name_is_active(
tofi->xkb_state,
XKB_MOD_NAME_CTRL,
@@ -214,7 +214,7 @@ static void handle_keypress(struct tofi *tofi, xkb_keycode_t keycode)
} else {
entry->results = string_vec_filter(&entry->commands, entry->input_mb, tofi->fuzzy_match);
}
- } else if (sym == XKB_KEY_u
+ } else if (keycode == 30 /* `u` physical key */
&& xkb_state_mod_name_is_active(
tofi->xkb_state,
XKB_MOD_NAME_CTRL,
@@ -233,7 +233,7 @@ static void handle_keypress(struct tofi *tofi, xkb_keycode_t keycode)
entry->results = string_vec_filter(&entry->commands, entry->input_mb, tofi->fuzzy_match);
}
} else if (sym == XKB_KEY_Escape
- || (sym == XKB_KEY_c
+ || (keycode == 54 /* `c` physical key */
&& xkb_state_mod_name_is_active(
tofi->xkb_state,
XKB_MOD_NAME_CTRL,
@@ -250,7 +250,7 @@ static void handle_keypress(struct tofi *tofi, xkb_keycode_t keycode)
uint32_t nsel = MAX(MIN(entry->num_results_drawn, entry->results.count), 1);
if (sym == XKB_KEY_Up || sym == XKB_KEY_Left || sym == XKB_KEY_ISO_Left_Tab
- || (sym == XKB_KEY_k
+ || (keycode == 45 /* `k` physical key */
&& xkb_state_mod_name_is_active(
tofi->xkb_state,
XKB_MOD_NAME_CTRL,
@@ -270,7 +270,7 @@ static void handle_keypress(struct tofi *tofi, xkb_keycode_t keycode)
}
}
} else if (sym == XKB_KEY_Down || sym == XKB_KEY_Right || sym == XKB_KEY_Tab
- || (sym == XKB_KEY_j
+ || (keycode == 44 /* `j` physical key */
&& xkb_state_mod_name_is_active(
tofi->xkb_state,
XKB_MOD_NAME_CTRL,
...seems to actually work just fine! Now hotkeys are layout-agnostic. Will test further. |
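The magic numbers in the patch look like Linux evdev key codes shifted by 8, which is the usual convention for XKB keycodes on Wayland. A small sketch of that relationship follows, assuming that is where they come from. The enum names and `xkb_keycode_from_evdev` are hypothetical; the numeric values are those of `KEY_W`, `KEY_U`, `KEY_J`, `KEY_K` and `KEY_C` in `<linux/input-event-codes.h>`.

```c
#include <stdint.h>

/* Values as found in <linux/input-event-codes.h>; the names are prefixed
 * here (hypothetically) so the sketch does not depend on that header. */
enum {
    EVDEV_KEY_W = 17,
    EVDEV_KEY_U = 22,
    EVDEV_KEY_J = 36,
    EVDEV_KEY_K = 37,
    EVDEV_KEY_C = 46
};

/* XKB keycodes are evdev codes offset by 8 (a holdover from X11,
 * where keycodes below 8 are reserved). */
enum { XKB_EVDEV_OFFSET = 8 };

/* Hypothetical helper: convert an evdev key code to an XKB keycode. */
uint32_t xkb_keycode_from_evdev(uint32_t evdev_code)
{
    return evdev_code + XKB_EVDEV_OFFSET;
}
```

Under that assumption, `xkb_keycode_from_evdev(EVDEV_KEY_J)` gives the 44 used in the patch, and named constants like these could replace the bare numbers.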
Sorry for the slow response, been a busy couple of days.
Pretty cool - on my desktop, running that in my home folder gives ~15000 files, which takes 100ms to open. Not bad, but would definitely be improved by the early start to rendering I mentioned. Nice to know about fd.
Wow that's bad! Sway actually completely locked up for me. I tried checking just how many times
Safe to say, 2.5 million function calls is never going to be very quick - I'll do some more investigation. Maybe the "Only ever try to find the n next characters" strategy isn't a bad one to have as a failsafe.
Cool, I'll look into it.
Yeah probably, it's more just about making the man page / command line help more awkward to read. I'll think about it at some point.
Nice! Yeah I've come across this in gamedev stuff before, keycodes vs. keysyms was never that clear to me. My first thought is, does this still work on a physical AZERTY/Dvorak/whatever keyboard? If so, then yeah looks like keycodes are the way to go. |
Very briefly, here's the performance of fuzzy matching so yeah, some exponential / power-law is going on. Edit: Did some more thinking, some theoretical background: asking how many ways For example, in the first 1000 characters of Alice in Wonderland, there are 39 This scales very poorly with
$ tr -d '\n' < alice_in_wonderland.txt | fold -s -w1000 | sed "s/[^s]//g" | awk '{ print length }' | sort -n | python -c 'import sys; import math; [print(math.factorial(int(line)) / (math.factorial(3) * math.factorial(int(line) - 3))) for line in sys.stdin]' | paste -s -d+ | bc -l
2000025.0
So yeah, 2 million possible matches to search (plus ~500000 function calls just to check the null byte at the end of the pattern I guess). |
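The pipeline above sums "n choose 3" over the per-line counts of `s`. A more general way to get the same kind of number is to count how many distinct ways a pattern occurs as a subsequence of a line. The sketch below uses a standard dynamic-programming count; the function name `count_subsequence_matches` is hypothetical and this is not code from tofi.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: number of distinct ways pattern can be matched as a
 * (case-sensitive, byte-wise) subsequence of str. Returns 0 on allocation
 * failure. May overflow for very long inputs; fine for a rough estimate. */
uint64_t count_subsequence_matches(const char *pattern, const char *str)
{
    size_t m = strlen(pattern);
    /* ways[j] = number of ways to match the first j pattern characters
     * using the part of str seen so far. */
    uint64_t *ways = calloc(m + 1, sizeof *ways);
    if (ways == NULL) {
        return 0;
    }
    ways[0] = 1; /* the empty prefix matches exactly once */

    for (const char *c = str; *c != '\0'; c++) {
        /* Walk j downwards so each character of str is used at most once
         * per match. */
        for (size_t j = m; j > 0; j--) {
            if (pattern[j - 1] == *c) {
                ways[j] += ways[j - 1];
            }
        }
    }

    uint64_t total = ways[m];
    free(ways);
    return total;
}
```

For a pattern of three identical letters and a line containing that letter n times, this reduces to n choose 3, which is the quantity the Python one-liner computes per line.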
No worries!
I am not sure it was a real lock up. I had a similar effect (running Hyprland) with really long lines, but I was able to just click outside tofi's window (which really became locked up), press hotkey to open the terminal, and
It'll actually lock up with just 2 1000 character lines
tr -d '\n' < alice_in_wonderland.txt | fold -s -w1000 | head -n 2 | tofi --fuzzy-match true
The score is
...So another way to approach this is that if the penalty for unmatched characters is large enough, we could just drop fuzzy search for this particular line, and use regular search, or maybe find n characters, then give up and try regular search. That way we won't miss a match at the end of the long line, but still avoid taking too long.
From arch wiki, keyboard_input:
From map scancodes to keycodes:
At the same time, Or, more like So, the way I understand it works is: Case 1
Or Case 2
However, if you have layout level AZERTY or Dvorak, you'll get different keys for the same keycodes, so of course, hotkeys won't work as expected. So my guess:
I could be wrong, I don't fully get how scancode to keycode translation happens, and AW covers it in terms of remapping, not deep dive "what will break when". Ideally, we could use some virtual keyboard, that goes through all those layers, has its own Oh, there are also canonical key names, like Maybe they are the proper way to do layout-agnostic stuff, idk. e.g.
/* keycode = 44, `j` physical key */
/* this will always, independent of selected layout return "AC07" for keycode = 44 */
xkb_keymap_key_get_name(tofi->xkb_keymap, keycode); |
Yeah haha, but foolishly I'd used a fullscreen tofi window, and after switching to a different tty to kill tofi, sway wouldn't show again on changing back 😅.
Ah I meant "the first n matches per character", so for each pattern character, you'd only look at the next "n" matching characters - you'd still find a match at the end of the string if that was the only one.
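A rough sketch of that "first n matches per character" idea, to show how the cap bounds the search: it counts how many complete matches a recursive search would visit. The name `explored_matches` and its `limit` parameter are hypothetical, and it only counts candidates rather than scoring them as a real fuzzy matcher would.

```c
#include <stddef.h>

/* Hypothetical helper: number of complete matches visited when, for each
 * pattern character, only the next `limit` matching positions in str are
 * tried. limit == 0 means "no cap" (try every matching position). */
unsigned long explored_matches(const char *pattern, const char *str,
                               unsigned limit)
{
    if (*pattern == '\0') {
        return 1; /* whole pattern consumed: one complete match */
    }

    unsigned long total = 0;
    unsigned tried = 0;
    for (size_t i = 0; str[i] != '\0'; i++) {
        if (str[i] != *pattern) {
            continue; /* non-matching characters don't count towards the cap */
        }
        total += explored_matches(pattern + 1, str + i + 1, limit);
        if (limit != 0 && ++tried == limit) {
            break;
        }
    }
    return total;
}
```

Because only matching characters count towards the cap, a lone match at the very end of a long line is still found. With a cap of n and a pattern of length k, at most n^k candidates are visited regardless of line length, at the cost of possibly missing the best-scoring match.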
Yeah I started to think this was the best approach too. It can still be fuzzy, but just return only the first match if the line's too long, rather than trying to find the best possible match.
👍 I'll have to read this again tomorrow when I'm more awake, but this seems like a good understanding of things. I'll do some more searching first, but I guess the easiest solution is just to use keycodes for now, as that fixes software layout stuff, and worry about other things if someone brings it up in the future. I also keep meaning to overhaul the input stuff at some point, at which point I may add customisable keybinds, and then we could also make this behaviour customisable as well. May decide not to go that route though, who knows.
Haha, oh dear. |
Oh, makes sense. |
Alright, after some more reading I think you're right and keycodes are the way to go for the CTRL-J style shortcuts. These are defined in I've made the change and done some rearranging in d1b94b4, hopefully it should work fine. This is off-topic, but I think seeing as we're already using GLib functions for UTF-8 stuff, soon I'll set about ripping out any mention of
Ah I didn't see this edit before. I think the keycode stuff is probably fine, but that's good to know about. |
It does, thanks, looks a lot cleaner too!
I am not a C programmer, but yeah, wide characters always seemed weird to me.
Yes. I've also found where
> ... evdev scancodes ... I don't see them in linux source (just the numbers from The meaning of Also
(from here) So you'll probably be fine with either of keycodes or "canonical names", idk. |
👍 Yeah that's the first step towards customisable keybindings, if that ends up being a thing.
Nice find! So yeah, seems keycodes and "canonical names" are both fine. I've limited the line length for "perfect" fuzzy matching to 100 characters in 820fb11, with a comment explaining the logic: Lines 97 to 119 in 820fb11
For the Alice in Wonderland text wrapped to 100 character lines, repeatedly typing e still causes tofi to slow down a lot, but not hang for more than a second (and as mentioned, that's by far the worst case - normal searches shouldn't see slowdowns now). This issue's mostly off the original topic and getting a bit long to search through now, so I'm gonna mark it as resolved and open some new ones for the other things you mentioned. Thanks again for all the input, it's very useful! |
Thanks for being so responsive and making tofi even better! |
One final word on this actually: I've done the conversion from |
Hello, I would be very happy if you could help me: how can I launch tofi with clipboard content from the terminal? |
I use copyq as an actual clipboard manager, tofi is just an interface for it. https://gist.github.com/MahouShoujoMivutilde/7113baadb13208d91f4271a28debc58b Replace dmenu with |
TL;DR
Currently tofi is only case insensitive for latin.
Would be really nice to have support for more locales (in my case - cyrillic).
How it should map (but doesn't)
абвгдеёжзийклмнопрстуфхцчшщъыьэюя
to
АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ
Of course, unicode-aware is better than just another alphabet.
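For reference, the mapping above is regular at the code point level, which a narrow sketch can show. This covers only the Russian alphabet and is not a substitute for Unicode-aware case folding from a proper library; the name `cyrillic_toupper` is hypothetical and the function works on decoded code points, not UTF-8 bytes.

```c
#include <stdint.h>

/* Hypothetical helper: uppercase a single Russian Cyrillic code point.
 * Anything outside the Russian lowercase letters is returned unchanged. */
uint32_t cyrillic_toupper(uint32_t cp)
{
    if (cp >= 0x0430 && cp <= 0x044F) {
        return cp - 0x20; /* U+0430..U+044F (а..я) -> U+0410..U+042F (А..Я) */
    }
    if (cp == 0x0451) {
        return 0x0401;    /* ё -> Ё, which sits outside the main block */
    }
    return cp;
}
```

Byte-wise functions such as `tolower()` in the C locale cannot do this, because each of these letters is two bytes in UTF-8 and neither byte on its own is a letter.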
Use case
This is useful if e.g. tofi is used as a front end for clipboard manager.
What I did
I've tried to implement it myself.
Found utf8.h, which provides a bunch of utf8 aware functions as replacements for the ones from string.h.
Digging around tofi's source I've found that the main search stuff happens in src/fuzzy_match.c with two functions - fuzzy_match_simple_words() (space split and strcasestr()) and fuzzy_match_words() (space split and fuzzy search algo).
There are also uses of strcasestr() related to highlight of the found pattern, but that's beside the point.
Replacing strcasestr() with utf8casestr() from utf8.h in fuzzy_match_simple_words() easily makes tofi case insensitive for cyrillic (and not only that), at least without --fuzzy-match true.
But with fuzzy_match_words() it is trickier.
Even if I replace string.h related functions in fuzzy_match_words() and things related to it (I can replace everything aside from isalnum()), fuzzy search starts to behave weirdly.
It works fine on latin strings, still.
But if I enter a single cyrillic letter, it instantly shows no results.
I've narrowed it to tofi/src/fuzzy_match.c, line 130 in ef585fc.
Basically, if you replace strcasestr() with utf8casestr() it never matches non-latin, and the while loop never starts.
Idk why.
I suspect this line: tofi/src/fuzzy_match.c, line 122 in ef585fc,
which is the only significantly different argument compared to the ones used for strcasestr() in fuzzy_match_simple_words().
This is where I am at.
Any insight?
Patch
Just so you know, I've tried replacing everything string.h everywhere with the utf8.h alternative. It compiles, it seems to run fine for a brief test, but it doesn't make a difference to the problem with fuzzy search function, so I didn't include it.
Patch
Assumes that utf8.h is inside src/.