Ignore diacritics in searches #778

Closed
catdevnull opened this issue Sep 6, 2021 · 5 comments
Labels
enhancement New feature or request

@catdevnull

Is your feature request related to a problem? Please describe.

When searching for Spanish words with accents (más vs mas) or names (Elío vs Elio), Foliate only matches words that have exactly the same accents as the query.

Describe the solution you'd like

Similarly to how it ignores the casing of the search query, Foliate should also ignore diacritics in the query.

Foliate could also provide an option like Firefox's "Match Diacritics" for forcing Foliate to consider the diacritics in the query.

@catdevnull catdevnull added the enhancement New feature or request label Sep 6, 2021
@johnfactotum
Owner

Will need to patch Epub.js for this. Probably just need to change this bit: https://github.com/futurepress/epub.js/blob/5c7f21d648d9d20d44c6c365d164b16871847023/src/section.js#L197. It currently makes all text lowercase before matching. So we'll have to add an option to remove all diacritics before matching.
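As a rough illustration of what "remove all diacritics before matching" could look like, here is a minimal sketch. The helper names `foldForSearch` and `containsIgnoringDiacritics` are made up for this example and are not part of Epub.js; the sketch only answers whether a match exists and says nothing about where it is.

```javascript
// Hypothetical helper (not an Epub.js API): fold case and strip combining marks.
function foldForSearch(text) {
  return text
    .normalize('NFD')        // split precomposed letters into base letter + combining marks
    .replace(/\p{M}/gu, '')  // drop the combining marks
    .toLowerCase();          // same case folding the existing search already does
}

// Hypothetical helper: true if `query` occurs in `text`, ignoring case and diacritics.
function containsIgnoringDiacritics(text, query) {
  return foldForSearch(text).includes(foldForSearch(query));
}

console.log(containsIgnoringDiacritics('Quiero más café', 'mas')); // true
console.log(containsIgnoringDiacritics('Elío', 'elio'));           // true
```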

@catdevnull
Author

Would be happy to send a patch if desired.

@johnfactotum
Owner

johnfactotum commented Jan 15, 2022

I looked a bit more into this. It seems that it's not simple to implement at all.

First, it's not enough to test whether the text contains the query; you'd have to get the offset as well (so that it can be highlighted and navigated to). But if one simply removes diacritics, it can alter the length of the text, depending on whether the diacritic is a separate code point.
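To make the offset problem concrete, here is a small sketch (`stripMarks` is a hypothetical helper, not something from Epub.js). When the accent is stored as a separate combining code point, stripping it shortens the text, so an index found in the stripped text no longer points at the same place in the original.

```javascript
// Hypothetical helper: remove combining marks after canonical decomposition.
const stripMarks = (s) => s.normalize('NFD').replace(/\p{M}/gu, '');

const text = 'Eli\u0301o y mas';    // "í" stored as "i" + U+0301 (combining acute)
const stripped = stripMarks(text);  // "Elio y mas", one code unit shorter

const indexInStripped = stripped.indexOf('mas'); // 7
const indexInOriginal = text.indexOf('mas');     // 8

console.log(text.length, stripped.length);       // 11 10
console.log(indexInStripped, indexInOriginal);   // 7 8

// With a precomposed "í" (U+00ED) the lengths happen to agree,
// which is why the bug depends on how the book encodes its text.
const precomposed = 'El\u00EDo';
console.log(precomposed.length, stripMarks(precomposed).length); // 4 4
```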

Another problem is that the behavior can vary depending on the language, so the best way to see if two characters are equal is to use a Collator (where you get fine-grained control over sensitivity).
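For reference, this is roughly how the `sensitivity` option of `Intl.Collator` behaves. The locales and sample words are chosen for illustration only, and the Spanish result assumes the runtime ships full ICU locale data.

```javascript
// 'base': only base letters matter, so accents and case are ignored.
const baseCollator = new Intl.Collator('es', { sensitivity: 'base' });
console.log(baseCollator.compare('más', 'mas'));   // 0 (treated as equal)
console.log(baseCollator.compare('Elío', 'elio')); // 0

// 'accent': accents matter, case does not (akin to a "Match Diacritics" mode).
const accentCollator = new Intl.Collator('es', { sensitivity: 'accent' });
console.log(accentCollator.compare('más', 'mas') === 0); // false
console.log(accentCollator.compare('Mas', 'mas'));       // 0

// Language dependence: English treats "ñ" as an accented "n".
const englishBase = new Intl.Collator('en', { sensitivity: 'base' });
console.log(englishBase.compare('ñ', 'n'));        // 0
// Spanish collation treats "ñ" as its own letter, so with full ICU data
// this comparison is expected to be non-zero.
console.log(baseCollator.compare('ñ', 'n') === 0);
```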

These problems mean that a simple indexOf will not do. Instead one would probably need to manually traverse and match the strings.

Edit: found a locale-aware implementation of indexOf: https://github.com/arty-name/locale-index-of
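A minimal sketch of the manual traversal idea, written from scratch for illustration (it is not the API of locale-index-of, and `findIgnoringDiacritics` is a hypothetical name). It slides a window over grapheme clusters and compares each window to the query with a collator, returning the offset and length in the original text. It assumes a runtime with `Intl.Segmenter`, and it assumes a match has the same number of graphemes as the query, so it would miss matches that contain soft hyphens or expansions such as "ß" vs "ss".

```javascript
// Hypothetical helper: first match of `query` in `text`, ignoring case and diacritics.
// Returns { index, length } in UTF-16 code units of the original text, or null.
function findIgnoringDiacritics(text, query, locale = 'es') {
  const collator = new Intl.Collator(locale, { sensitivity: 'base' });
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });

  const graphemes = [...segmenter.segment(text)];
  const queryLength = [...segmenter.segment(query)].length;
  if (queryLength === 0) return null;

  for (let start = 0; start + queryLength <= graphemes.length; start++) {
    const from = graphemes[start].index;
    // The window ends where the next grapheme begins, or at the end of the text.
    const next = graphemes[start + queryLength];
    const to = next ? next.index : text.length;
    if (collator.compare(text.slice(from, to), query) === 0) {
      return { index: from, length: to - from };
    }
  }
  return null;
}

const sample = 'Eli\u0301o dijo ma\u0301s'; // accents stored as combining marks
console.log(findIgnoringDiacritics(sample, 'mas'));  // { index: 11, length: 4 }
console.log(findIgnoringDiacritics(sample, 'elio')); // { index: 0, length: 5 }
```

Because the result is expressed in offsets of the original string, it can be used directly for highlighting, at the cost of one collator comparison per window.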

@johnfactotum
Owner

johnfactotum commented Jan 17, 2022

Another thing I just realized: apart from diacritics, one also needs to remove or ignore other kinds of characters, such as the various zero-width characters.

For example, if you use Calibre to insert soft hyphens into your book (which is necessary because Kindle doesn't support auto hyphens for KF7 and KF8), suddenly you can't find anything in Foliate anymore. One can observe this bug by opening any of the .azw files from Standard Ebooks.

Edit: it seems this is already handled by Intl.Collator, so that's yet another reason to use it.
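A quick way to check this, as a sketch: the sample word is made up, and the collator result relies on soft hyphens (U+00AD) being ignorable in the runtime's collation data, which is what the observation above suggests.

```javascript
const hyphenated = 'dia\u00ADcri\u00ADtics'; // "diacritics" with two soft hyphens
const plain = 'diacritics';

// Plain substring search fails because the soft hyphens are real code units.
console.log(hyphenated.includes(plain)); // false

// The collator treats soft hyphens as ignorable, so the strings compare as equal.
const softHyphenCollator = new Intl.Collator('en', { sensitivity: 'base' });
console.log(softHyphenCollator.compare(hyphenated, plain)); // 0
```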

@johnfactotum johnfactotum added this to the 3.0 milestone Sep 23, 2022
@johnfactotum
Owner

Fixed in the gtk4 branch.
