Support for Hebrew diacritics and other grapheme extenders #5

Open
noomorph opened this issue Apr 18, 2014 · 10 comments

Comments

@noomorph

Hello, I tried your online demo at http://mothereff.in/reverse-string with some Hebrew text that includes niqqud (diacritics). Here is what I got:

Actual result: שָׁלוֹם (shalom) is reversed to םֹולָׁש, which is nonsense because lamed (ל) picks up the diacritics that belong to ש (compare שָׁ -> לָׁ).
Expected: שָׁלוֹם should at least be reversed to םוֹלשָׁ, so that each letter keeps its diacritics.
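
The mismatch can be reproduced with a naive reversal by code point. This is only a sketch of the failure mode, not Esrever's actual implementation; `naiveReverse` is a hypothetical helper, and the escape sequence assumes that "shalom" is encoded as shin, qamats, shin dot, lamed, vav, holam, final mem.

```javascript
// Hypothetical helper: reverses code points without any notion of clusters.
function naiveReverse(str) {
  return [...str].reverse().join('');
}

// Assumed encoding of "shalom" with niqqud:
// shin (U+05E9), qamats (U+05B8), shin dot (U+05C1),
// lamed (U+05DC), vav (U+05D5), holam (U+05B9), final mem (U+05DD)
const shalom = '\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD';

const reversed = naiveReverse(shalom);

// After reversal each mark follows a different base letter than before:
// holam now follows final mem, and shin dot + qamats now follow lamed.
console.log([...reversed].map(c => 'U+' + c.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')).join(' '));
```

Because a mark always attaches to the letter before it, reversing the code points moves each mark onto the neighbouring letter.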

What do you think?

@mathiasbynens
Owner

Why this doesn’t work right now: U+05B8 HEBREW POINT QAMATS is not strictly a combining mark (as in, it is not strictly assigned to any of the combining mark blocks), but it does act like one. As CodePoints.net says:

In text U+05B8 behaves as Combining Mark regarding line breaks. It has type Extend for sentence and Extend for word breaks. The Grapheme Cluster Break is Extend.

You’re right: Esrever should probably support grapheme extenders, and not just combining marks.

As per http://www.unicode.org/reports/tr44/#Grapheme_Extend:

Grapheme_Extend property = Me category + Mn category + Other_Grapheme_Extend property
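
As a rough illustration of that formula: engines that implement ES2018 Unicode property escapes (which did not exist when this thread was written) expose both the general categories and the `Grapheme_Extend` binary property in regular expressions. The sketch below uses a hypothetical `describe` helper to compare the two; the results depend on the Unicode version the engine was built against.

```javascript
// Assumes an engine with ES2018 Unicode property escapes (the `u` flag and \p{...}).
const isMeOrMn = /^[\p{Me}\p{Mn}]$/u;               // Me category + Mn category
const isGraphemeExtend = /^\p{Grapheme_Extend}$/u;  // Me + Mn + Other_Grapheme_Extend

// Hypothetical helper: reports how a single code point is classified.
function describe(ch) {
  return {
    meOrMn: isMeOrMn.test(ch),
    graphemeExtend: isGraphemeExtend.test(ch),
  };
}

console.log(describe('\u05B8')); // HEBREW POINT QAMATS
console.log(describe('\u200C')); // ZERO WIDTH NON-JOINER (category Cf)
console.log(describe('\u05E9')); // HEBREW LETTER SHIN
```

U+200C is listed under Other_Grapheme_Extend, so it is a grapheme extender without being a Me or Mn character, which is why matching on the mark categories alone is not enough.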

@Boldewyn Can you confirm this is the correct way to get all Grapheme Extenders?

@noomorph
Author

Thanks for the very quick response. For my needs (just a tiny demo) I've decided to use a RegExp instead of String.prototype.split:

"שָׁבּת שָׁלוֹם".match(/.[\u0591-\u05C7]*/g).reverse().join('')

If you run this in a browser, it gives you "םוֹלשָׁ תבּשָׁ".
The idea is to greedily include the diacritics in every match until another (non-niqqud) character is met.

Of course, this approach is neither universal nor fully accurate: it only works for Hebrew, and some of the U+05Cx characters in that range are not diacritics. I'm just saying that it worked for me.

If there is any way I can help, just tell me. Thanks!
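
For reference, the one-liner above could be wrapped in a small helper along these lines. This is a sketch only: `reverseHebrewClusters` is a hypothetical name, and compared with the original expression it uses `[\s\S]` so that newlines are matched too, adds the `u` flag so that astral code points are not split, and guards against `match` returning `null` for an empty string.

```javascript
// Hypothetical helper based on the regular expression from the comment above.
// Each match is one character followed by any run of code points in
// U+0591..U+05C7 (Hebrew accents and points, plus a few punctuation marks).
function reverseHebrewClusters(str) {
  const clusters = str.match(/[\s\S][\u0591-\u05C7]*/gu) || [];
  return clusters.reverse().join('');
}

// Assumed encoding of "shalom": shin, qamats, shin dot, lamed, vav, holam, final mem.
const input = '\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD';
const output = reverseHebrewClusters(input);
console.log(output === '\u05DD\u05D5\u05B9\u05DC\u05E9\u05B8\u05C1'); // each letter keeps its marks
```

The same limitations apply: the range is Hebrew-specific and also covers non-diacritic characters such as U+05BE (maqaf) and U+05C3 (sof pasuq).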

@Boldewyn

About getting the Grapheme Extenders: Well, that's the definition ;-) The UCD maintains the Grapheme_Extend property separately, so you should be fine using that directly.

The UAX44 glossary entry for diacritics also suggests that combining characters alone are not sufficient:

[...] Some diacritics are not combining characters, and some combining characters are not diacritics.

@noomorph
Author

Do you mean that Esrever works as expected?
If so, then I agree: sha*-l-o-m gets reversed to m-o-la*-sh, and this is phonetically correct. I've marked the letters that get the qamatz with an asterisk.

The only problem I see here is with U+05C1 (shin dot) and U+05C2 (sin dot), as used in שׁ (shin) and שׂ (sin). They do not make any sense when moved to another letter, because their only purpose is to specify which letter ש stands for.

I think it's better to keep them together with ש.

The other issue is the final forms of Hebrew letters (sofit):
מ (not at the end of a word) -> ם (at the end of a word), e.g. שלום -> מולש

I think this is also worth noting when reversing words.
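
To make the sofit point concrete, here is a minimal sketch of how final forms could be adjusted when reversing a single word. It is not something Esrever does; `reverseWordWithFinalForms` and both lookup tables are hypothetical, and the sketch assumes one word without niqqud.

```javascript
// The five Hebrew letters with a distinct final (sofit) form.
const FINAL_TO_REGULAR = {
  '\u05DA': '\u05DB', // final kaf   -> kaf
  '\u05DD': '\u05DE', // final mem   -> mem
  '\u05DF': '\u05E0', // final nun   -> nun
  '\u05E3': '\u05E4', // final pe    -> pe
  '\u05E5': '\u05E6', // final tsadi -> tsadi
};
const REGULAR_TO_FINAL = Object.fromEntries(
  Object.entries(FINAL_TO_REGULAR).map(([fin, reg]) => [reg, fin])
);

// Hypothetical helper: reverses one word (assumed to carry no diacritics),
// turns final forms into regular ones, then gives the new last letter
// its final form if it has one.
function reverseWordWithFinalForms(word) {
  const letters = [...word].reverse().map(ch => FINAL_TO_REGULAR[ch] || ch);
  const last = letters.length - 1;
  if (last >= 0 && REGULAR_TO_FINAL[letters[last]]) {
    letters[last] = REGULAR_TO_FINAL[letters[last]];
  }
  return letters.join('');
}

// shin, lamed, vav, final mem  ->  mem, vav, lamed, shin
console.log(reverseWordWithFinalForms('\u05E9\u05DC\u05D5\u05DD') === '\u05DE\u05D5\u05DC\u05E9');
```

Whether a reversing library should rewrite letters at all is a design question; this only shows what the suggested adjustment would involve.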

@mathiasbynens
Owner

@noomorph I was explaining that Esrever works as currently advertised, in that it only takes care of combining marks.

But I agree we should change that and also take care of grapheme extenders.

@mathiasbynens changed the title from "support for hebrew unicode diacritics" to "Support for Hebrew diacritics and other grapheme extenders" on Apr 23, 2014
@mathiasbynens
Owner

@noomorph
Author

Thank you, I will be watching this thread. Unfortunately, I have never delved into the depths of Unicode, so I cannot help. =(

@patch

patch commented Apr 24, 2014

The job would be much easier if JavaScript supported \X in regular expressions for matching a Unicode extended grapheme cluster.

$ perl -CS -Mutf8 -E 'say join("", reverse("שָׁלוֹם" =~ /\X/g))'
םוֹלשָׁ

@Boldewyn
Copy link

@patch not necessarily for this project. When the browser is built against an old Unicode version, the results are outdated and incorrect for newer codepoints. With the major Unicode 7 update on the horizon, this is not only an academic problem.

(E.g., for the API in codepoints.net I use PHP's implementation of NFC/NFD transformations. The PHP version uses some Unicode 5.X data internally, therefore some newer Unicode 6 codepoints get incorrect transformations.)

@mathiasbynens
Copy link
Owner

The correct (well, the most correct) way to do this is to implement text segmentation as per TR29 and then reverse each grapheme cluster (as well as swapping surrogate pairs) before further processing the string as usual.
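
As a hedged sketch of what segmentation-based reversal can look like: newer JavaScript engines ship `Intl.Segmenter`, which splits a string into grapheme clusters following UAX #29. It did not exist when this thread was written, and, as noted above, its results depend on the Unicode data the engine was built with. The helper name `reverseGraphemes` is hypothetical, and instead of reversing inside each cluster before reversing the whole string, it simply reverses the order of the clusters, which should give the same result.

```javascript
// Hypothetical helper: assumes an engine that provides Intl.Segmenter.
function reverseGraphemes(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  // Each segment is one grapheme cluster: a base character plus its extenders,
  // or a whole surrogate pair.
  const clusters = Array.from(segmenter.segment(str), part => part.segment);
  return clusters.reverse().join('');
}

// Assumed encoding of "shalom": shin, qamats, shin dot, lamed, vav, holam, final mem.
const word = '\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD';
console.log(reverseGraphemes(word) === '\u05DD\u05D5\u05B9\u05DC\u05E9\u05B8\u05C1');

// Surrogate pairs stay intact as well.
console.log(reverseGraphemes('a\u{1F600}b') === 'b\u{1F600}a');
```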
