Index Sorting: Introduce Unicode Collation Instead of Asciibetical Sorting #928

Closed
tajmone opened this issue Aug 30, 2018 · 5 comments

@tajmone

tajmone commented Aug 30, 2018

I'm not entirely sure whether this issue belongs to Asciidoctor PDF or to Asciidoctor itself, but since I encountered the problem using Asciidoctor PDF, I'm reporting it here.

In the Index of a PDF document converted via Asciidoctor PDF, the entries for each letter are sorted asciibetically instead of alphabetically.

Uppercase entries are listed before lowercase entries, so entries that differ only in case end up separated by other entries that would normally come after them:

AAA
BBB
CCC
aaa

... where one would expect "aaa" to directly follow "AAA".

This makes looking up entries in the Index troublesome: the reader is forced to jump up and down the list to check whether differently cased entries exist for a given term. Needless to say, this could lead to readers missing many entries that are actually indexed.

A quick solution could be to convert entries to lowercase internally before sorting them, while preserving their letter case in the actual output.
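
A minimal Ruby sketch of this quick fix, assuming the index terms are plain strings. The `sort_index_terms` name is hypothetical and is not part of Asciidoctor PDF. It sorts on a lowercased key and leaves the original casing untouched in the output:

```ruby
# Hypothetical helper (not an Asciidoctor PDF API): sort terms
# case-insensitively while preserving their original letter case.
def sort_index_terms(terms)
  # Primary key: the lowercased term. Secondary key: the original term,
  # so differently cased duplicates come out in a deterministic order.
  terms.sort_by { |term| [term.downcase, term] }
end

puts sort_index_terms(%w[AAA BBB CCC aaa]).inspect
# prints ["AAA", "aaa", "BBB", "CCC"]
```

With this key, "aaa" directly follows "AAA" as expected, but accented letters are still ordered by code point.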

Ideally, though, a true Unicode sorting algorithm would be better, since it could properly handle letters with accents, diacritics, etc. (so that, for example, the letters "è" and "È" would be treated as an "e" in sorting).
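
As a rough illustration of the "è"/"È" example, here is a Ruby sketch that folds accents using only the standard library. This is an approximation, not the Unicode Collation Algorithm: it ignores locale rules, and letters that have no canonical decomposition (such as "ø") are left unchanged. The helper names `fold_key` and `sort_folded` are hypothetical:

```ruby
# Hypothetical helper: build a sort key that ignores case and diacritics.
def fold_key(term)
  # NFD splits "è" into "e" plus a combining grave accent; removing the
  # combining marks (\p{Mn}) leaves only the base letter.
  term.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').downcase
end

# Sort by the folded key, using the original term as a tie-breaker.
def sort_folded(terms)
  terms.sort_by { |term| [fold_key(term), term] }
end

puts sort_folded(%w[zebra Èdith apple èdam]).inspect
# prints ["apple", "èdam", "Èdith", "zebra"]
```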

The latter solution might require Asciidoctor to rely on a Unicode collation algorithm to handle the task properly:

This might introduce some complications, as different locales have different criteria for sorting special letters, but Asciidoctor could rely on the document's language specification (:lang:) to decide how to sort entries.

There are some ready-made collation charts that Asciidoctor could use to establish how entries should be sorted, based on the document's locale:

Since the Index is an important tool for readers, enabling them to look up and find topics in very large books, proper sorting of the Index entries according to the document's locale is an important feature that affects the quality of the final document.

@mojavelinux
Member

I've started exploring this. It looks like we could delegate to functions provided by the ffi-icu gem to achieve this result.

In the meantime, I think we should spin off a separate issue to make sorting case insensitive.

@tajmone
Author

tajmone commented Nov 24, 2019

> I've started exploring this. It looks like we could delegate to functions provided by the ffi-icu gem to achieve this result.

Relying on a dedicated gem for this sounds good.

> In the meantime, I think we should spin off a separate issue to make sorting case insensitive.

Done: #1406.

@tajmone
Author

tajmone commented Nov 24, 2019

I was looking at the features offered by ICU; it looks like it offers tons of cool features besides Unicode collation, including:

  • Code Page Conversion.
  • Formatting numbers, dates, times and currency amounts according to the conventions of a chosen locale.
  • Time Calculations beyond the traditional Gregorian calendar.
  • Bidi support for handling text containing a mixture of left to right (English) and right to left (Arabic or Hebrew) data.

So introducing the ffi-icu gem could also bring more features to Asciidoctor, especially formatting dates and numbers according to the document locale, as well as handling non-UTF-8 external assets (e.g. asciidoctor/asciidoctor#3248).

Maybe both Asciidoctor PDF and Asciidoctor could benefit from it.

@mojavelinux
Member

mojavelinux commented Jan 19, 2020

> Sounds good to rely on a dedicated Gem for this.

I agree. If present, we should use it. If not, we should default to the current behavior (graceful degradation).
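
A hedged sketch of what such graceful degradation could look like. It assumes the ffi-icu gem is loaded with `require 'ffi-icu'` and exposes `ICU::Collation::Collator#collate` as shown in that gem's README. The `collate_terms` helper and the `ICU_AVAILABLE` constant are hypothetical names for illustration, not the actual Asciidoctor PDF implementation:

```ruby
# Try to load ffi-icu; remember whether it is available.
begin
  require 'ffi-icu'
  ICU_AVAILABLE = true
rescue LoadError
  ICU_AVAILABLE = false
end

# Hypothetical helper: sort terms for a locale (e.g. the value of :lang:).
def collate_terms(terms, locale = 'en')
  if ICU_AVAILABLE
    # Assumption: Collator#collate returns the terms sorted for the locale.
    ICU::Collation::Collator.new(locale).collate(terms)
  else
    # Fallback: the current behavior, i.e. plain code point ordering.
    terms.sort
  end
end

puts collate_terms(%w[banana apple cherry]).inspect
```

The gem stays optional this way: documents still convert without it, only with the existing asciibetical ordering.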

@mojavelinux mojavelinux added this to the v2.0.0 milestone Feb 4, 2020
@graphitefriction graphitefriction modified the milestones: v2.0.0, v2.1.x Apr 25, 2022
@mojavelinux mojavelinux modified the milestones: v2.1.x, v2.2.x Jun 21, 2022
@mojavelinux mojavelinux modified the milestones: v2.2.x, v2.3.x Jul 28, 2022
@mojavelinux mojavelinux modified the milestones: v2.3.x, v3.0.x Aug 16, 2022
mojavelinux added a commit to mojavelinux/asciidoctor-pdf that referenced this issue Sep 4, 2022
…nd group terms under ASCII letter category
mojavelinux added a commit to mojavelinux/asciidoctor-pdf that referenced this issue Sep 5, 2022
…nd group terms under ASCII letter category
@mojavelinux
Member

I've been able to successfully integrate the ffi-icu gem. I think it's a viable path forward. There's also the twitter_cldr gem, but I found it more difficult to use.

Here's an article I found that helped me to understand how to use ICU: https://bartvanraaij.dev/2020-10-17-converting-utf8-strings-to-ascii-using-icu-transliterator/

@mojavelinux mojavelinux self-assigned this Sep 5, 2022
mojavelinux added a commit to mojavelinux/asciidoctor-pdf that referenced this issue Sep 7, 2022
…nd group terms under ASCII letter category