Ignore diacritics in searches #778

catdevnull · 2021-09-06T20:45:42Z

Is your feature request related to a problem? Please describe.

When searching for Spanish words with accents (más vs. mas) or names (Elío vs. Elio), Foliate only matches words with exactly the same accents.

Describe the solution you'd like

Similarly to how it ignores the casing of the search query, Foliate should also ignore diacritics in the query.

Foliate could also provide an option like Firefox's "Match Diacritics" for forcing Foliate to consider the diacritics in the query.

johnfactotum · 2021-09-13T14:33:57Z

Will need to patch Epub.js for this. Probably just need to change this bit: https://github.com/futurepress/epub.js/blob/5c7f21d648d9d20d44c6c365d164b16871847023/src/section.js#L197. It currently makes all text lowercase before matching. So we'll have to add an option to remove all diacritics before matching.
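A rough sketch of what such an option could look like, assuming the matching step is a plain substring test on lowercased text. `foldText` and `ignoreDiacritics` are hypothetical names, not part of Epub.js. It decomposes the text with `normalize('NFD')` and then drops the combining marks:

```javascript
// Hypothetical helper, not an Epub.js API: lowercases the text and,
// optionally, strips combining marks after canonical decomposition.
function foldText(text, { ignoreDiacritics = false } = {}) {
    const lower = text.toLowerCase()
    if (!ignoreDiacritics) return lower
    // NFD splits "á" into "a" + U+0301; \p{M} matches the combining mark.
    return lower.normalize('NFD').replace(/\p{M}/gu, '')
}

const sample = 'No hay m\u00E1s que decir'  // "No hay más que decir"
const opts = { ignoreDiacritics: true }

console.log(foldText(sample).includes(foldText('mas')))              // false
console.log(foldText(sample, opts).includes(foldText('mas', opts)))  // true
```

This only answers whether the text contains the query. It does not say where the match sits in the original text.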

catdevnull · 2021-09-13T16:43:23Z

Would be happy to send a patch if desired.

johnfactotum · 2022-01-15T17:54:05Z

I looked a bit more into this. It seems that it's not simple to implement at all.

First, it's not enough to test whether the text contains the query; you'd have to get the offset as well (so that it can be highlighted and navigated to). But if one simply removes diacritics, it can alter the length of the text, depending on whether the diacritic is a separate code point.
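To illustrate the length problem, here is a minimal sketch that strips combining marks while recording where each kept character came from, so that a match in the stripped text can be mapped back to offsets in the original. `stripWithOffsets` and `findIgnoringMarks` are hypothetical helpers, and case folding is left out for brevity:

```javascript
// The same word can have different lengths depending on its encoding.
const precomposed = 'm\u00E1s'  // "más" using U+00E1
const decomposed = 'ma\u0301s'  // "más" as "a" + combining acute U+0301
console.log(precomposed.length, decomposed.length)  // 3 4

// Strips combining marks and records, for every code unit kept,
// its offset in the original string.
function stripWithOffsets(text) {
    let stripped = ''
    const offsets = []
    let i = 0
    for (const ch of text) {  // iterates by code point
        const folded = ch.normalize('NFD').replace(/\p{M}/gu, '')
        for (let k = 0; k < folded.length; k++) offsets.push(i)
        stripped += folded
        i += ch.length
    }
    offsets.push(text.length)  // sentinel for matches that reach the end
    return { stripped, offsets }
}

// Returns { start, end } offsets into the original text, or null.
function findIgnoringMarks(text, query) {
    const { stripped, offsets } = stripWithOffsets(text)
    const q = stripWithOffsets(query).stripped
    const at = stripped.indexOf(q)
    if (at === -1) return null
    return { start: offsets[at], end: offsets[at + q.length] }
}

console.log(findIgnoringMarks('hola ma\u0301s', 'mas'))  // { start: 5, end: 9 }
```

The returned range covers the combining mark too, so highlighting `text.slice(start, end)` selects the whole accented word.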

Another problem is that the behavior can vary depending on the language, so the best way to see if two characters are equal is to use a Collator (which gives fine-grained control over sensitivity).
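For illustration, a short sketch of how `Intl.Collator` behaves with different `sensitivity` values and locales. The outputs assume an engine with full ICU locale data, which current browsers and official Node.js builds ship:

```javascript
// "base" sensitivity: letters that differ only by accent or case compare equal.
const base = new Intl.Collator('es', { sensitivity: 'base' })
console.log(base.compare('m\u00E1s', 'mas') === 0)    // true  ("más" vs "mas")
console.log(base.compare('El\u00EDo', 'elio') === 0)  // true  ("Elío" vs "elio")

// "accent" sensitivity: case is still ignored, diacritics are not.
const accent = new Intl.Collator('es', { sensitivity: 'accent' })
console.log(accent.compare('m\u00E1s', 'mas') === 0)  // false
console.log(accent.compare('M\u00C1S', 'm\u00E1s') === 0)  // true  ("MÁS" vs "más")

// The result depends on the language: German treats "ä" as a variant
// of "a", while Swedish treats it as a separate base letter.
const de = new Intl.Collator('de', { sensitivity: 'base' })
const sv = new Intl.Collator('sv', { sensitivity: 'base' })
console.log(de.compare('\u00E4', 'a') === 0)  // true
console.log(sv.compare('\u00E4', 'a') === 0)  // false
```

The "accent" level is roughly what a "Match Diacritics" toggle would need.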

These problems mean that a simple indexOf will not do. Instead one would probably need to manually traverse and match the strings.

Edit: found a locale-aware implementation of indexOf: https://github.com/arty-name/locale-index-of
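As a rough idea of what manual traversal with a collator could look like, here is a brute-force sketch. It is not the code of the linked library, and `collatorIndexOf` and its `maxExtra` bound are hypothetical. It compares slices of the original text against the query and returns both the offset and the matched length:

```javascript
// Hypothetical sketch of a collator-based indexOf. Returns the offset and
// length of the first match in the ORIGINAL string, or null if none.
function collatorIndexOf(haystack, needle, collator, maxExtra = 8) {
    if (needle === '') return null
    // A match can be longer than the needle (combining marks, ignorable
    // characters), so try windows of up to needle.length + maxExtra
    // code units. maxExtra is an arbitrary bound chosen for this sketch.
    const maxLen = needle.length + maxExtra
    for (let start = 0; start < haystack.length; start++) {
        const limit = Math.min(haystack.length, start + maxLen)
        for (let end = start + 1; end <= limit; end++) {
            if (collator.compare(haystack.slice(start, end), needle) === 0)
                return { index: start, length: end - start }
        }
    }
    return null
}

const collator = new Intl.Collator('es', { sensitivity: 'base' })
console.log(collatorIndexOf('hola m\u00E1s tarde', 'mas', collator))   // { index: 5, length: 3 }
console.log(collatorIndexOf('hola ma\u0301s tarde', 'mas', collator))  // { index: 5, length: 4 }
```

The number of comparisons grows with the text length times the window size, so a real implementation would want something smarter, but the result carries the offset and length needed for highlighting.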

johnfactotum · 2022-01-17T03:19:11Z

Another thing I just realized: apart from diacritics, one also needs to remove or ignore other kinds of characters, such as the various zero-width characters.

For example, if you use Calibre to insert soft hyphens into your book (which is necessary because Kindle doesn't support auto hyphens for KF7 and KF8), suddenly you can't find anything in Foliate anymore. One can observe this bug by opening any of the .azw files from Standard Ebooks.

Edit: it seems this is already handled by Intl.Collator, so that's yet another reason to use it.
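A small sketch of the soft-hyphen case, assuming the collation data that common engines ship, where U+00AD is treated as ignorable:

```javascript
// "hyphenation" with soft hyphens (U+00AD) inserted, as a hyphenation tool might do.
const hyphenated = 'hy\u00ADphen\u00ADation'

// A plain substring test fails because the soft hyphens are part of the string.
console.log(hyphenated.includes('hyphenation'))  // false

// The collator ignores them when comparing.
const softCollator = new Intl.Collator('en', { sensitivity: 'base' })
console.log(softCollator.compare(hyphenated, 'hyphenation') === 0)  // true
```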

johnfactotum · 2022-10-25T06:16:18Z

Fixed in the gtk4 branch.

catdevnull added the enhancement New feature or request label Sep 6, 2021

johnfactotum added this to the 3.0 milestone Sep 23, 2022

johnfactotum mentioned this issue Oct 19, 2022 in New renderer #962 (closed)

johnfactotum closed this as completed Oct 25, 2022
