Index Sorting: Introduce Unicode Collation Instead of Asciibetical Sorting #928
Comments
I've started exploring this. It looks like we could delegate to functions provided by the ffi-icu gem to achieve this result. In the meantime, I think we should spin off a separate issue to make sorting case-insensitive.
Sounds good to rely on a dedicated Gem for this.
Done: #1406.
I was looking at the features offered by ICU; it provides many useful features besides Unicode collation, including:
So the introduction of the ffi-icu gem could also enable more features in Asciidoctor, especially formatting dates and numbers according to the document locale, as well as handling non-UTF-8 external assets (e.g. asciidoctor/asciidoctor#3248). Both Asciidoctor PDF and Asciidoctor might benefit from it.
I agree. If present, we should use it. If not, we should default to the current behavior (graceful degradation).
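A minimal sketch of the graceful degradation described above, assuming the ffi-icu gem's `ICU::Collation::Collator` class and its `collate` method as shown in that gem's README. The `IndexSorter` module and its method names are hypothetical and are not part of Asciidoctor PDF; the `'en'` default stands in for whatever locale the document declares.

```ruby
# Hypothetical helper module (not Asciidoctor PDF's actual code).
module IndexSorter
  # Returns true if the optional ffi-icu gem could be loaded, false otherwise.
  # The result is memoized so the require is attempted only once.
  def self.icu?
    return @icu unless @icu.nil?
    @icu = begin
      require 'ffi-icu'
      true
    rescue LoadError
      # The gem (or the native ICU library it binds to) is not available.
      false
    end
  end

  # Sorts terms with ICU collation when available; otherwise falls back to
  # the current behavior, which is plain codepoint ordering.
  def self.sort(terms, locale = 'en')
    if icu?
      ICU::Collation::Collator.new(locale).collate(terms)
    else
      terms.sort
    end
  end
end

p IndexSorter.sort(%w[b a c])
```

Because the fallback branch keeps the existing ordering, documents converted without the gem installed would produce the same index as before.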
…nd group terms under ASCII letter category
I've been able to successfully integrate the ffi-icu gem. I think it's a viable path forward. There's also the twitter_cldr gem, but I found it more difficult to use. Here's an article that helped me understand how to use ICU: https://bartvanraaij.dev/2020-10-17-converting-utf8-strings-to-ascii-using-icu-transliterator/
I'm not entirely sure whether this issue is related to Asciidoctor PDF or to Asciidoctor itself (but since I encountered the problem using Asciidoctor PDF, I'm reporting it here).
In the Index of a PDF document converted via Asciidoctor PDF, the entries for each letter are sorted Asciibetically instead of alphabetically.
Entries in uppercase are listed before entries in lowercase, so differently cased variants of the same entry end up separated by other entries that would normally come after them:
... where one would expect "aaa" to directly follow "AAA".
This makes looking up entries in the Index troublesome, as the reader is forced to jump up and down the entry list to check whether differently cased entries exist for a given term. Needless to say, readers could miss many entries that are actually indexed.
A quick solution could be to convert entries to lowercase (internally) before sorting them, while preserving their letter case in the actual output.
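The quick solution above could be sketched as follows. This is only an illustration: `sort_terms_ignoring_case` is a hypothetical helper, not part of Asciidoctor PDF, and the sample terms are made up.

```ruby
# Hypothetical helper: sorts on a lowercased key but returns the original strings.
def sort_terms_ignoring_case(terms)
  # The original term is used as a tie-breaker so that the order of
  # differently cased variants ("AAA" vs "aaa") is deterministic.
  terms.sort_by { |term| [term.downcase, term] }
end

terms = %w[AAA Zebra aaa apple]

p terms.sort                       # => ["AAA", "Zebra", "aaa", "apple"]
p sort_terms_ignoring_case(terms)  # => ["AAA", "aaa", "apple", "Zebra"]
```

The lowercased string exists only as a sort key, so the letter case of each entry in the rendered Index is untouched.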
Ideally, though, a true Unicode sorting algorithm would be better, as it would properly handle letters with accents, diacritics, etc. (so that, for example, the letters "è" and "È" would be treated as an "e" when sorting).
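As a rough illustration of treating "è" and "È" as "e", the sketch below builds a sort key by decomposing each term (NFD), dropping combining marks, and lowercasing. This is a simplification using only Ruby's standard `String#unicode_normalize`; it is not the Unicode Collation Algorithm and ignores locale-specific rules. The `folded_key` helper and the sample terms are hypothetical.

```ruby
# Hypothetical helper: builds an accent- and case-insensitive sort key.
def folded_key(term)
  term
    .unicode_normalize(:nfd) # split "È" into "E" plus a combining grave accent
    .gsub(/\p{Mn}/, '')      # drop the combining (nonspacing) marks
    .downcase
end

terms = %w[zebra Èze eagle étude apple]

# Plain codepoint order puts the accented entries after "zebra".
p terms.sort
# => ["apple", "eagle", "zebra", "Èze", "étude"]

# Sorting on the folded key files them under "e"; the original term is a tie-breaker.
p terms.sort_by { |term| [folded_key(term), term] }
# => ["apple", "eagle", "étude", "Èze", "zebra"]
```

A real collator such as ICU's would additionally apply the per-locale tailorings discussed in this issue, which this key-folding approach cannot express.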
The latter solution might require Asciidoctor to rely on a Unicode collation algorithm to handle the task properly:
This might introduce some complications, as different locales have different criteria for sorting special letters, but Asciidoctor could rely on the document's language specification (:lang:) to decide how to sort entries.
There are some ready-made collation charts that Asciidoctor could use to establish how entries should be sorted, based on the document's locale:
Since the Index is an important tool for readers, enabling them to look up and find topics in very large books, proper sorting of Index entries according to the document's locale is an important feature that affects the quality of the final document.