Improve Hebrew and Yiddish #68
thanks! you beat me to it. i gave you edit permission, and replied to some of your comments. perhaps you can add a column to make it easier to see which ones you took/ignored/changed?

I think the main open issue is the caps. afaiu, the reason you suggested that eg SH be capitalized is to differentiate SH (ש) from s+h (סה). when letters retain their regular meaning, as in OY=o+y, there is no reason for caps. unidecode does not seem to have a policy requiring all substitutions longer than one character to be in caps. the transformation is not reversible. I think caps should be reserved for special cases of:
Thanks, I will fill in the file.

I believe the difference between us is that you think we should be phonetically compatible, while I believe this is wrong. The main use case for unidecode and similar transliteration tools is not for a non-native speaker to read the text and "vocally" transmit it to a native speaker, but for native speakers to be able to understand the transformed text when the native character set is not supported. For example, a car mp3 player which does not support Hebrew: a native speaker has no choice but to rely on a transformation to read the titles.

If there is redundancy in the transliteration, it is very difficult to grasp the original word. Believe me, I've tried... Our entire family played a competition over who gets it first... and there were some titles for which we could not figure out the original term. Once we had it, it was obvious, but still we failed.

This is the reason why 'צ' MUST be 'TS' and not 'ts': it takes a hell of a lot of time to try all the combinations to figure out what the transformation was. This is the reason why 'א' cannot be the same as 'ע', and also the reason for the difference between 'ק' and 'כ', and so on.

I hope you agree with me about the pattern; this will settle most of the differences.

Regards,
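To make the "try all combinations" cost concrete, here is a minimal sketch. The `PREIMAGES` table and the `candidates` helper are hypothetical and written only for illustration; they are not Unidecode's data or API. The sketch assumes a lossy table in which alef/ayin, kaf/qof and tet/tav each share one ASCII symbol, and lists every Hebrew spelling a reader would have to consider for a given ASCII string.

```python
from itertools import product

# Hypothetical lossy table (illustration only, not Unidecode's data):
# each ASCII symbol maps back to every Hebrew letter that could have produced it.
PREIMAGES = {
    "'": ["\u05d0", "\u05e2"],  # alef, ayin
    "k": ["\u05db", "\u05e7"],  # kaf, qof
    "t": ["\u05d8", "\u05ea"],  # tet, tav
    "l": ["\u05dc"],            # lamed (unambiguous)
}


def candidates(ascii_word):
    """Return every Hebrew spelling that the lossy table maps to ascii_word."""
    options = [PREIMAGES[ch] for ch in ascii_word]
    return ["".join(combo) for combo in product(*options)]


spellings = candidates("k't")
print(len(spellings))  # 2 * 2 * 2 = 8 possible source spellings
```

The number of candidates is the product of the per-symbol ambiguities, so it grows quickly with word length; a table with one distinct output per letter keeps it at one.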
1516f01 to eaa2c16
I updated the document[1]; filter by the first column to see all open issues.

[1] https://docs.google.com/spreadsheets/d/1fvQtyDxiVbz4Yp2FY1fSvZ9qVugo2KKC_yX8LofAUGU/edit#gid=0
if you are willing, i would be happy to discuss over the phone - i think it would be more efficient.

of course this is a transliteration (as opposed to a phonetic transcription), since you cannot know the pronunciation from a single letter. where we presumably differ is that i do NOT believe the transformation should be reversible; it cannot be, by definition, since unidecode maps everything to the ascii 127 range. therefore it is fine afaiac that two hebrew chars map to the same ascii char. and you already have other examples: both kamats and patah map to "a". both samekh and sin map to "s" (would you suggest S for sin?). both vav and vet now map to v (would you suggest V for vet? or better, the academy's "exact" taatik using w for vav). both final letter and middle letter map to the same symbol (do you suggest M for mem sofit?). and more.

the official "simple" hebrew academy taatik also does this for kaf/kof->k, aleph/ain->' (geresh), tet/tav->t and tsadi/tav+samekh->ts. therefore i don't believe we should try here to be "holier than the pope", and i would be wary of trying to invent a new standard that is a mishmash between the official "simple" and "exact" taatik [1]. (btw, if we did want a 1-1 mapping, i actually recently developed a 1-1 hebrew leet version (based on [2]). this puts emphasis on graphics instead of phonetics. but i dont think this is what we want here, mainly because the normalization changes the rendering direction to LTR.)

therefore my suggestion is:
1. follow the academy simple taatik as closely as possible.
2. possibly use caps only to differentiate between different phonetic readings (H, SH and KH).
3. make a special exception for alef and ain.

Otherwise, we should go over the hebrew block and the hebrew presentation forms block again and fix other ambiguities, and if following the "exact" taatik as you suggest for ain (backtick) and kof (q), we should also prefer its other choices, such as w for vav, and then use caps to address the many remaining ambiguities.

[1] https://hebrew-academy.org.il/wp-content/uploads/taatik-ivrit-latinit-1-1.pdf
You are focused on the special cases, punctuation and such, that do not actually exist in the modern Hebrew language. Even if modern Hebrew were written with punctuation, the logic of transforming fully punctuated text to a phonetic form would be much more complex than a 1:1 character transformation. This is not the use case of tools such as unidecode.

For the record, in the current implementation samech is 's' and Shin is 'SH', to allow distinguishing between the two.

I clearly stated the use case: the ability to read Hebrew text in a Latin charset. If you use the links web browser on Linux in a text terminal and browse google.com, you will notice it also performs a conversion, for example:

I believe you are in the wrong project if you are trying to make the punctuation work with a tool which is a character-to-character processor. Even if you had offered such logic for fully punctuated text, you should also have provided the option of the simplified Hebrew conversion as it exists now.

Thank you for helping improve the Hebrew special cases, the cleanup of invalid chars, and Yiddish. I believe this patch can be merged as-is and this discussion may continue elsewhere.
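As a rough illustration of what "character to character processor" means here, the following sketch maps each code point independently through a lookup table. The `TABLE` contents and the `transliterate` function are assumptions made up for this example and are not Unidecode's actual tables or API; only the samekh -> 's', shin -> 'SH', vav -> 'v' and mem/final mem -> same symbol choices are taken from the comments above, and lamed -> 'l' is assumed.

```python
# Hypothetical per-character table (illustration only, not Unidecode's data).
TABLE = {
    "\u05e1": "s",   # samekh
    "\u05e9": "SH",  # shin
    "\u05dc": "l",   # lamed (assumed)
    "\u05d5": "v",   # vav
    "\u05de": "m",   # mem
    "\u05dd": "m",   # final mem, same output as mem
}


def transliterate(text, table=TABLE, unknown="?"):
    """Map each character on its own; neighbouring characters are never consulted."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                      # ASCII passes through unchanged
        else:
            out.append(table.get(ch, unknown))  # context-free table lookup
    return "".join(out)


# shin, lamed, vav, final mem
print(transliterate("\u05e9\u05dc\u05d5\u05dd"))  # SHlvm
```

Because the lookup never sees neighbouring characters, anything that depends on context (vowel marks, position in the word, a dictionary) is outside what such a table can express.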
i am not trying to make punctuation work. i just want unidecode hebrew transformations to be self-consistent and as close as possible to official standards.
It is impossible to match the standards without an AI. Let's take [1], and let's agree we are using the precise model and agree we handle transcript without punctuation. How can we know if

[1] https://hebrew-academy.org.il/wp-content/uploads/taatik-ivrit-latinit-1-1.pdf
In other words, these standards are incomplete and are meant for human translators, not for machines.
I added the |
כ should not be kh (that is only for כ rafa). if this is not indicated, כ is always transliterated as k (although for final ך it would probably be ok to also use KH, and the same goes for final ף, which should be f, btw)
I truly do not understand how you can distinguish between

I can accept a translation of SIN (vs SHIN) to

I updated the document[2] to match the output. The remaining issues, apart from those that are not marked as Closed, are related to capitalization of conflicts, which I do not consider a major issue compared to the standard, and I have shown you that other transformations do the same.

I believe most of the items are translated correctly now. Please review this patch and let's agree on whether it is progress compared to the existing implementation or not. If it is, we can merge it and then discuss the rest later.

[1] https://hebrew-academy.org.il/wp-content/uploads/taatik-ivrit-latinit-1-1.pdf
all letters are treated as consonants unless there is a special indication of a vowel.

note that if you use S for sin, we would have a new issue of distinguishing SH from SHIN vs S+H from SIN+HET.

i think it is progress, apart from the recent changes of כ and ט, which i consider to be a step back. כ should be k: this is the more representative transliteration, as it is used at the beginning of words (compare to BET). a new issue with the recent T for ט is that now we cannot know if TSH is TET+SHIN or TSADI+HET... (by the way there is also FB38 טּ)

so i suggest for now: revert all כ back to k and ט back to t. change YY to YYY (unidecode does have 3 letter representations). and let's submit it as a new hebrew baseline. Further PRs can then be limited to specific issues.

thanks!
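The TSH ambiguity can be checked mechanically. Below is a small sketch that enumerates every way an ASCII string can be split into table outputs. The two token tables and the `parses` helper are hypothetical, reduced to the four letters named in this comment, and do not reproduce the patch's actual tables.

```python
# Hypothetical token tables (illustration only): output symbol -> Hebrew letter name.
WITH_UPPER_T = {"T": "tet", "TS": "tsadi", "SH": "shin", "H": "het"}
WITH_LOWER_T = {"t": "tet", "TS": "tsadi", "SH": "shin", "H": "het"}


def parses(s, tokens):
    """Return every sequence of letter names whose outputs concatenate to s."""
    if not s:
        return [[]]
    results = []
    for tok, name in tokens.items():
        if s.startswith(tok):
            for rest in parses(s[len(tok):], tokens):
                results.append([name] + rest)
    return results


print(parses("TSH", WITH_UPPER_T))  # [['tet', 'shin'], ['tsadi', 'het']]
print(parses("tSH", WITH_LOWER_T))  # [['tet', 'shin']]
```

With T for tet the string TSH has two readings, while with lower-case t the tet+shin output becomes tSH and has a single reading in this reduced table.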
Hi. I would be happy to merge this if you think it is an improvement compared to existing replacements.
Clean up special, rarely used characters. Bring regular characters closer to the formal document[1].

[1] https://hebrew-academy.org.il/wp-content/uploads/taatik-ivrit-latinit-1-1.pdf

Clean up invalid characters and typos. Fix up special characters.
Hi @avian2, I believe this is progress that is worth merging. The discussion of special characters, and of the ability to apply logic that is not a 1:1 character translation but is based on a dictionary or some other rules, can be held later. Thanks!
Merged and released in Unidecode 1.3.0. Thanks!