Support for Hebrew diacritics and other grapheme extenders #5

noomorph · 2014-04-18T10:05:02Z

Hello, I've used your online demo http://mothereff.in/reverse-string and tried entering some Hebrew with niqqud (diacritics) there. Here is what I got:

Actual result: שָׁלוֹם (shalom) got reversed to םֹולָׁש, which is nonsense because the lamed ל picked up the diacritics from ש (compare שָׁ -> לָׁ).
Expected: שָׁלוֹם should at least be reversed to םוֹלשָׁ, so that each letter keeps its diacritics.

What do you think?

mathiasbynens · 2014-04-18T10:23:30Z

Why this doesn’t work right now: U+05B8 HEBREW POINT QAMATS is not strictly a combining mark (as in, it is not strictly assigned to any of the combining mark blocks), but it does act like one. As CodePoints.net says:

In text U+05B8 behaves as Combining Mark regarding line breaks. It has type Extend for sentence and Extend for word breaks. The Grapheme Cluster Break is Extend.

You’re right: Esrever should probably support grapheme extenders, and not just combining marks.

As per http://www.unicode.org/reports/tr44/#Grapheme_Extend:

Grapheme_Extend property = Me category + Mn category + Other_Grapheme_Extend property

@Boldewyn Can you confirm this is the correct way to get all Grapheme Extenders?
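
For readers on a current engine, the relationship quoted above can be probed directly. This is a minimal sketch, assuming a runtime with Unicode property escapes (ES2018 or later, which postdates this thread); `describeCodePoint` is a hypothetical helper and not part of Esrever. Note that `Other_Grapheme_Extend` itself is not exposed to JavaScript regular expressions, so only the derived `Grapheme_Extend` property and the two general categories are checked.

```javascript
// Hypothetical helper, for illustration only.
// Assumes Unicode property escapes are available (ES2018+).
function describeCodePoint(ch) {
  return {
    enclosingMark: /^\p{Me}$/u.test(ch),      // General_Category = Me
    nonspacingMark: /^\p{Mn}$/u.test(ch),     // General_Category = Mn
    graphemeExtend: /^\p{Grapheme_Extend}$/u.test(ch),
  };
}

// U+05B8 HEBREW POINT QAMATS, the character from the report above.
console.log(describeCodePoint('\u05B8'));
// U+05E9 HEBREW LETTER SHIN, a base letter for comparison.
console.log(describeCodePoint('\u05E9'));
```

The result depends on the Unicode data the engine was built with, so a very old runtime may classify newer code points differently.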

noomorph · 2014-04-18T10:57:04Z

Thanks for the very quick response. For my needs (just a tiny demo) I've decided to use a RegExp instead of String.prototype.split.

"שָׁבּת שָׁלוֹם".match(/.[\u0591-\u05C7]*/g).reverse().join('')

If you run this in a browser, it gives you: "םוֹלשָׁ תבּשָׁ".
The idea is to greedily include the diacritics in every match until another (non-niqqud) character is met.

Of course, this approach is neither universal nor accurate: it works only for Hebrew, and some of the U+05CX symbols in that range are not diacritics. I'm just saying that it worked for me.

If there is any way I can be helpful for you, just tell me. Thanks!
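
The same idea can be made script-independent by matching on the general category instead of a hard-coded Hebrew range. The following is a sketch only, assuming an engine with Unicode property escapes (ES2018+); `reverseByMarks` is a hypothetical name, and "one non-mark followed by any marks" is an approximation of a grapheme cluster, not the full segmentation rules.

```javascript
// Hypothetical helper: groups each non-mark code point with the marks
// (General_Category M*) that follow it, then reverses the groups.
// Assumes Unicode property escapes (ES2018+).
function reverseByMarks(input) {
  // The second alternative keeps stray leading marks that have no base.
  const clusters = input.match(/\P{M}\p{M}*|\p{M}+/gu) || [];
  return clusters.reverse().join('');
}

// shin + qamats + shin dot, lamed, vav + holam, final mem ("shalom")
const shalom = '\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD';
console.log(reverseByMarks(shalom));
```

Because the `u` flag makes the pattern work on code points, surrogate pairs stay intact as well.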

Boldewyn · 2014-04-18T11:55:52Z

About getting the Grapheme Extenders: Well, that's the definition ;-) The UCD maintains the Grapheme_Extend property separately, so you should be fine using that directly.

The glossary of UAX44 for diacritics also suggests that combining chars alone are not sufficient:

[...] Some diacritics are not combining characters, and some combining characters are not diacritics.

noomorph · 2014-04-18T13:19:51Z

Do you mean that esrever works as expected?
If yes, then I also agree: sha*-l-o-m gets reversed to m-o-la*-sh, and this is phonetically correct. I've marked the letters that get the qamatz with an asterisk.

The only problem I see here is the characters 0x05C1 and 0x05C2, the shin and sin dots for שׁ (shin) and שׂ (sin). They do not make any sense when they move to another letter, because their only purpose is to specify which ש letter is meant.

I think it's better to keep them together with ש.

The other thing is the final forms of Hebrew letters (sofit).
מ (not at the end of a word) -> ם (at the end), e.g. שלום - מולש

I think this is also worth a note when reversing words.
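
To make the sofit point concrete, here is a rough sketch of a post-processing step that could run after reversal. It is an illustration under simplifying assumptions, not something Esrever does: `fixFinalForms` is a hypothetical name, words are split on whitespace only, and the rule "final form exactly at the end of a word" ignores exceptions such as loanwords.

```javascript
// Final form -> regular form for the five Hebrew letters that have one.
const TO_REGULAR = {
  '\u05DA': '\u05DB', // final kaf   -> kaf
  '\u05DD': '\u05DE', // final mem   -> mem
  '\u05DF': '\u05E0', // final nun   -> nun
  '\u05E3': '\u05E4', // final pe    -> pe
  '\u05E5': '\u05E6', // final tsadi -> tsadi
};
const TO_FINAL = Object.fromEntries(
  Object.entries(TO_REGULAR).map(([finalForm, regular]) => [regular, finalForm])
);

// Hypothetical helper: normalizes sofit letters word by word.
function fixFinalForms(text) {
  return text
    .split(/(\s+)/) // the capture group keeps the whitespace separators
    .map((word) => {
      // 1. Turn every final form into its regular form.
      const regular = word.replace(/[\u05DA\u05DD\u05DF\u05E3\u05E5]/g, (c) => TO_REGULAR[c]);
      // 2. Turn the last letter (ignoring trailing marks) into its final form.
      return regular.replace(
        /([\u05DB\u05DE\u05E0\u05E4\u05E6])(\p{M}*)$/u,
        (_, letter, marks) => TO_FINAL[letter] + marks
      );
    })
    .join('');
}

// Reversed "shalom" without niqqud: final mem, vav, lamed, shin.
console.log(fixFinalForms('\u05DD\u05D5\u05DC\u05E9'));
```

Whether a reversed word should be "respelled" like this at all is a design question; a pure string reversal would leave the letters untouched.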

mathiasbynens · 2014-04-18T14:59:08Z

@noomorph I was explaining that Esrever works as currently advertised, in that it only takes care of combining marks.

But I agree we should change that and also take care of grapheme extenders.

mathiasbynens · 2014-04-24T07:57:06Z

Do Grapheme_Extend characters only apply to Grapheme_Base characters?

noomorph · 2014-04-24T10:24:37Z

Thank you, I will be watching this thread. Unfortunately, I have never dived into the depths of Unicode, so I cannot help. =(

patch · 2014-04-24T22:11:40Z

The job would be much easier if JavaScript supported \X in regular expressions for matching a Unicode extended grapheme cluster.

$ perl -CS -Mutf8 -E 'say join("", reverse("שָׁלוֹם" =~ /\X/g))'
םוֹלשָׁ
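
JavaScript regular expressions still have no `\X`, but newer runtimes ship `Intl.Segmenter`, which did not exist when this thread was written. A minimal sketch of the same one-liner, assuming a runtime that provides `Intl.Segmenter` (for example a recent Node.js built with ICU); `reverseGraphemes` is a hypothetical name.

```javascript
// Hypothetical helper: reverses a string by extended grapheme clusters.
// Assumes Intl.Segmenter is available in the runtime.
function reverseGraphemes(input, locale = 'he') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  // Each item yielded by segment() has a `segment` string property.
  const clusters = Array.from(segmenter.segment(input), (part) => part.segment);
  return clusters.reverse().join('');
}

// shin + qamats + shin dot, lamed, vav + holam, final mem ("shalom")
console.log(reverseGraphemes('\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD'));
```

The segmentation comes from the Unicode data bundled with the runtime, so results can differ between engine versions.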

Boldewyn · 2014-04-25T06:49:05Z

@patch not necessarily for this project. When the browser is built against an old Unicode version, the results are outdated and incorrect for newer codepoints. With the major Unicode 7 update on the horizon, this is not only an academic problem.

(E.g., for the API in codepoints.net I use PHP's implementation of NFC/NFD transformations. The PHP version uses some Unicode 5.X data internally, therefore some newer Unicode 6 codepoints get incorrect transformations.)
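
One way to see which Unicode data a runtime was built against, at least in Node.js, is `process.versions.unicode`. This is a small sketch under that assumption; browsers do not expose an equivalent value, so `reportedUnicodeVersion` (a hypothetical helper) returns `null` there.

```javascript
// Hypothetical helper: returns the Unicode version reported by Node.js,
// or null when the runtime does not expose one (e.g. in a browser).
function reportedUnicodeVersion() {
  const versions = typeof process !== 'undefined' ? process.versions : undefined;
  return (versions && versions.unicode) || null;
}

console.log(reportedUnicodeVersion());
```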

mathiasbynens · 2014-05-02T06:57:53Z

The correct (well, the most correct) way to do this is to implement text segmentation as per TR29 and then reverse each grapheme cluster (as well as swapping surrogate pairs) before further processing the string as usual.
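
The two-pass idea described above can be sketched as follows. This is not Esrever's actual code: `twoPassReverse` is a hypothetical name, and the pattern "one non-mark plus following marks" is only a stand-in for real TR29 segmentation. It assumes an engine with Unicode property escapes (ES2018+).

```javascript
// Reverses a string by UTF-16 code units, with no Unicode awareness.
const reverseCodeUnits = (s) => s.split('').reverse().join('');

// Hypothetical sketch of the two-pass technique.
function twoPassReverse(input) {
  // Pass 1: reverse the code units inside every cluster. This also swaps
  // the two halves of each surrogate pair.
  const prepared = input.replace(/\P{M}\p{M}*|\p{M}+/gu, (cluster) => reverseCodeUnits(cluster));
  // Pass 2: a naive reversal of the whole string puts each cluster back
  // into its original internal order while reversing the cluster order.
  return reverseCodeUnits(prepared);
}

// shin + qamats + shin dot, lamed, vav + holam, final mem ("shalom")
console.log(twoPassReverse('\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD'));
```

The appeal of this design is that the second pass stays a plain reversal; all Unicode knowledge lives in the pattern used by the first pass.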

mathiasbynens mentioned this issue Apr 18, 2014

Add “Derived Core Properties” node-unicode/node-unicode-data#10

Closed

mathiasbynens changed the title ~~support for hebrew unicode diacritics~~ Support for Hebrew diacritics and other grapheme extenders Apr 23, 2014

mathiasbynens mentioned this issue Dec 2, 2014

String reversal works on code points krakjoe/ustring#19

Open

adishavit mentioned this issue May 21, 2015

Hebrew Fonts opentypejs/opentype.js#119

Open

lionel-rowe mentioned this issue Feb 17, 2025

feat(text/unstable): add reverse function denoland/std#6410

Merged
