Index Sorting: Introduce Unicode Collation Instead of Asciibetical Sorting #928

tajmone · 2018-08-30T10:15:04Z

I'm not entirely sure whether this issue is related to Asciidoctor PDF or to Asciidoctor itself (but since I've encountered the problem using Asciidoctor PDF, I'm reporting it here).

In the Index of a PDF document converted via Asciidoctor PDF, the entries for each letter are sorted Asciibetically instead of alphabetically.

Entries in uppercase are listed before entries in lowercase, so entries that differ only in letter case end up separated by other entries that would normally come after them:

AAA
BBB
CCC
aaa

... where one would expect "aaa" to directly follow "AAA".

This makes looking up entries in the Index troublesome, as the reader is forced to jump up and down the entries list to check whether differently cased entries are available for a given term. Needless to say, this could lead to missing many entries that are actually indexed.

A quick solution could be to convert entries to lowercase (internally) before sorting them, while preserving their letter case in the actual output.
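A minimal sketch of that quick fix in plain Ruby, assuming the index terms are available as an array of strings. The helper name `sort_terms_ignoring_case` is hypothetical and is not part of Asciidoctor PDF:

```ruby
# Hypothetical helper: sort index terms case-insensitively while keeping
# the original spelling of each term for the output.
def sort_terms_ignoring_case(terms)
  # Sort on the lowercased form; the original string is used as a
  # tie-breaker so that the result is deterministic.
  terms.sort_by { |term| [term.downcase, term] }
end

terms = %w[AAA BBB CCC aaa]
p terms.sort                      # plain codepoint (Asciibetical) order
p sort_terms_ignoring_case(terms) # "aaa" now directly follows "AAA"
```

Only the sort key is lowercased; the strings returned are the untouched originals.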

But, ideally, a true Unicode sorting algorithm would be better, as it would properly handle letters with accents, diacritics, etc. (so that, for example, the letters "è" and "È" would be treated as an "e" in sorting).
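As a rough approximation only (not a real Unicode collation algorithm, and not locale-aware), accented letters can be folded to their base letter using Ruby's built-in `String#unicode_normalize`. The helper names below are hypothetical:

```ruby
# Hypothetical helper: build a sort key that ignores case and diacritics.
def accent_folded_key(term)
  # NFD splits "è" into "e" plus a combining grave accent; removing the
  # nonspacing marks (\p{Mn}) leaves the base letter.
  term.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').downcase
end

# Hypothetical helper: sort by the folded key, keeping the original terms.
def sort_terms_folding_accents(terms)
  terms.sort_by { |term| [accent_folded_key(term), term] }
end

p sort_terms_folding_accents(%w[zebra Èze eagle école Echo])
```

This treats "è" and "È" as "e" for sorting purposes, but it ignores language-specific rules, which is where a proper collation library comes in.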

The latter solution might require Asciidoctor to rely on a Unicode collation algorithm to handle the task properly:

This might introduce some complications, as different locales have different criteria for sorting special letters, but Asciidoctor could rely on the document's language specification (:lang:) to decide how to sort entries.

There are some ready made collation charts that Asciidoctor could use to establish how entries should be sorted, based on the document's locale:

Since the Index is an important tool for readers, enabling them to look up and find topics in very large books, proper sorting of the Index entries according to the document's locale is an important feature, affecting the quality of the final document.

mojavelinux · 2019-11-24T06:51:23Z

I've started exploring this. It looks like we could delegate to functions provided by the ffi-icu gem to achieve this result.

In the meantime, I think we should spin off a separate issue to make sorting case insensitive.

tajmone · 2019-11-24T09:23:12Z

> I've started exploring this. It looks like we could delegate to functions provided by the ffi-icu gem to achieve this result.

Sounds good to rely on a dedicated Gem for this.

> In the meantime, I think we should spin off a separate issue to make sorting case insensitive.

Done: #1406.

tajmone · 2019-11-24T09:34:04Z

I was looking at the features offered by ICU; it looks like it offers tons of cool features besides Unicode collation, including:

Code Page Conversion.
Formatting numbers, dates, times and currency amounts according to the conventions of a chosen locale.
Time Calculations beyond the traditional Gregorian calendar.
Bidi support for handling text containing a mixture of left to right (English) and right to left (Arabic or Hebrew) data.

So the introduction of the ffi-icu gem could possibly also bring more features to Asciidoctor, especially formatting dates and numbers according to the document locale, as well as handling non-UTF-8 external assets (e.g. asciidoctor/asciidoctor#3248).

Maybe both Asciidoctor PDF and Asciidoctor could benefit from it.

mojavelinux · 2020-01-19T09:36:37Z

> Sounds good to rely on a dedicated Gem for this.

I agree. If present, we should use it. If not, we should default to the current behavior (graceful degradation).
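A hedged sketch of what such graceful degradation could look like. This is not the actual Asciidoctor PDF code; the helper names `icu_collator` and `sort_index_terms` and the default locale are assumptions, and the only ffi-icu calls used are `ICU::Collation::Collator.new` and `#compare`:

```ruby
# Hypothetical helper: return an ICU collator for the locale, or nil if the
# ffi-icu gem (or the underlying ICU library) cannot be loaded.
def icu_collator(locale)
  require 'ffi-icu'
  ICU::Collation::Collator.new(locale)
rescue LoadError
  nil
end

# Hypothetical helper: use ICU collation when available, otherwise fall back
# to the current behavior (plain codepoint sorting).
def sort_index_terms(terms, locale = 'en')
  collator = icu_collator(locale)
  if collator
    terms.sort { |a, b| collator.compare(a, b) }
  else
    terms.sort
  end
end

p sort_index_terms(%w[pear apple mango])
```

The locale argument could be derived from the document's `lang` attribute, so the same call would sort according to the document's language when ICU is installed.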

mojavelinux · 2022-09-05T20:52:01Z

I've been able to successfully integrate the ffi-icu gem. I think it's a viable path forward. There's also the twitter_cldr gem, but I found it more difficult to use.

Here's an article I found that helped me to understand how to use ICU: https://bartvanraaij.dev/2020-10-17-converting-utf8-strings-to-ascii-using-icu-transliterator/

This was referenced Aug 30, 2018

Index Reconstruction alan-if/alan-docs#5

Closed

PDF Conversion Problems alan-if/alan-docs#9

Closed

tajmone mentioned this issue Nov 24, 2019

Index Sorting: Case Insensitive #1406

Closed

mojavelinux added the enhancement label Jan 19, 2020

mojavelinux added this to the v2.0.0 milestone Feb 4, 2020

graphitefriction modified the milestones: v2.0.0, v2.1.x Apr 25, 2022

mojavelinux modified the milestones: v2.1.x, v2.2.x Jun 21, 2022

mojavelinux modified the milestones: v2.2.x, v2.3.x Jul 28, 2022

mojavelinux modified the milestones: v2.3.x, v3.0.x Aug 16, 2022

mojavelinux added a commit to mojavelinux/asciidoctor-pdf that referenced this issue Sep 4, 2022

resolves asciidoctor#928 use ICU if available to sort index entries a…

2dc60ec

…nd group terms under ASCII letter category

mojavelinux added a commit to mojavelinux/asciidoctor-pdf that referenced this issue Sep 4, 2022

resolves asciidoctor#928 use ICU if available to sort index entries a…

338c3d3

…nd group terms under ASCII letter category

mojavelinux added a commit to mojavelinux/asciidoctor-pdf that referenced this issue Sep 4, 2022

resolves asciidoctor#928 use ICU if available to sort index entries a…

4c3263d

…nd group terms under ASCII letter category

mojavelinux added a commit to mojavelinux/asciidoctor-pdf that referenced this issue Sep 5, 2022

resolves asciidoctor#928 use ICU if available to sort index entries a…

0ae24ca

…nd group terms under ASCII letter category

mojavelinux added a commit to mojavelinux/asciidoctor-pdf that referenced this issue Sep 5, 2022

resolves asciidoctor#928 use ICU if available to sort index entries a…

180b4d1

…nd group terms under ASCII letter category

mojavelinux self-assigned this Sep 5, 2022

mojavelinux added a commit to mojavelinux/asciidoctor-pdf that referenced this issue Sep 7, 2022

resolves asciidoctor#928 use ICU if available to sort index entries a…

de8d799

…nd group terms under ASCII letter category

mojavelinux closed this as completed in 7d7a1a9 Sep 7, 2022

mojavelinux added the v3.0.0 label Sep 7, 2022
