Support for Hebrew diacritics and other grapheme extenders #5
Comments
Why this doesn’t work right now: U+05B8 HEBREW POINT QAMATS is not strictly a combining mark (as in, it is not strictly assigned to any of the combining mark blocks), but it does act like one. As CodePoints.net says:
You’re right: Esrever should probably support grapheme extenders, and not just combining marks. As per http://www.unicode.org/reports/tr44/#Grapheme_Extend:
@Boldewyn Can you confirm this is the correct way to get all Grapheme Extenders?
Thanks for the very quick response. For my needs (just a tiny demo) I've decided to use a RegExp instead of String.prototype.split: "שָׁבּת שָׁלוֹם".match(/.[\u0591-\u05C7]*/g).reverse().join(''). If you run it in a browser, it will give you: Of course, this approach is neither universal nor accurate, partly because some of the U+05Cx characters are not diacritics, and it works only for Hebrew. I'm just saying that it worked for me. If there is any way I can be helpful, just tell me. Thanks!
About getting the Grapheme Extenders: Well, that's the definition ;-) The UCD maintains the Grapheme_Extend property separately, so you should be fine using that directly. The UAX #44 glossary entry for diacritics also suggests that combining characters alone are not sufficient:
Do you mean that Esrever works as expected? The only problem I see here is the 0x05C1 and 0x05C2 characters, the shin and sin dots for שׁ (shin) and שׂ (sin). They make no sense when reversed on their own, because their only purpose is to indicate which ש letter is meant. I think it's better to keep them together with ש. The other thing is the final forms of Hebrew letters (sofit). I think these are also worth a note when reversing words.
@noomorph I was explaining that Esrever works as currently advertised, in that it only takes care of combining marks. But I agree we should change that and also take care of grapheme extenders.
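For illustration, here is a minimal sketch of what "taking care of grapheme extenders" could look like in an engine that supports Unicode property escapes (`\p{…}` with the `u` flag). The helper name `reverseKeepingExtenders` is hypothetical and is not part of Esrever's API. The result also depends on the Unicode version the engine was built against.

```javascript
// Sketch only: `reverseKeepingExtenders` is a hypothetical helper, not Esrever's API.
// The pattern matches one code point followed by any run of code points
// that carry the Grapheme_Extend property.
const baseWithExtenders = /[\s\S]\p{Grapheme_Extend}*/gu;

function reverseKeepingExtenders(input) {
  // Split into "base + extenders" chunks, then reverse the chunk order.
  const chunks = input.match(baseWithExtenders) || [];
  return chunks.reverse().join('');
}

// The word from this issue, written with escapes so the code points are unambiguous:
// shin, qamats, shin dot, lamed, vav, holam, final mem.
const shalom = '\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD';
console.log(reverseKeepingExtenders(shalom));
```

Because the `u` flag makes `[\s\S]` match a whole code point, surrogate pairs stay intact as well. This does not implement full grapheme cluster segmentation; it only keeps extenders attached to the preceding code point.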
Thank you, I will be watching this thread. Unfortunately, I have never dived into the depths of Unicode, so I cannot help. =(
The job would be much easier if JavaScript supported `\X` in regular expressions, as Perl does:
$ perl -CS -Mutf8 -E 'say join("", reverse("שָׁלוֹם" =~ /\X/g))'
םוֹלשָׁ
@patch not necessarily for this project. When the browser is built against an old Unicode version, the results are outdated and incorrect for newer codepoints. With the major Unicode 7 update on the horizon, this is not only an academic problem. (E.g., for the API in codepoints.net I use PHP's implementation of NFC/NFD transformations. The PHP version uses some Unicode 5.X data internally, therefore some newer Unicode 6 codepoints get incorrect transformations.)
The correct (well, the most correct) way to do this is to implement text segmentation as per TR29 and then reverse each grapheme cluster (as well as swapping surrogate pairs) before further processing the string as usual.
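As a rough sketch of the segmentation-based approach: engines that ship `Intl.Segmenter` expose grapheme cluster segmentation directly, so the clusters can be collected and their order reversed. This relies on the runtime's built-in Unicode data rather than on data bundled with a library, and the function name `reverseByGraphemeCluster` is hypothetical.

```javascript
// Sketch only: `reverseByGraphemeCluster` is a hypothetical helper.
// Assumes a runtime that provides Intl.Segmenter.
function reverseByGraphemeCluster(input) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  // Each item yielded by segment() has a `segment` string holding one cluster.
  const clusters = Array.from(segmenter.segment(input), (part) => part.segment);
  return clusters.reverse().join('');
}

// shin, qamats, shin dot, lamed, vav, holam, final mem
const word = '\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD';
console.log(reverseByGraphemeCluster(word));
```

Since whole clusters are moved as units, marks stay with their base letters and surrogate pairs are never split.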
Hello, I've used your online demo http://mothereff.in/reverse-string and tried entering some Hebrew with niqqud (diacritics) there. Here is what I got:
Actual result: שָׁלוֹם (shalom) got reversed to םֹולָׁש, which is nonsense because the lamed ל took the diacritics from ש (compare שָׁ -> לָׁ).
Expected: שָׁלוֹם should at least be reversed to םוֹלשָׁ (so that each letter keeps its diacritics).
What do you think?