Improve Hebrew conversion
Convert double-letter transliterations to capital letters, because duplicated
letters make the output very hard to parse, for example:

   kh - is it k and h or kh?
   tskh - is it t,s,kh or ts,k,h or ts,kh, etc...
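
The ambiguity can be made concrete with a small sketch. The per-letter values below are copied from this commit's diff (0xda, 0xd4, 0xd7); the tables `OLD` and `NEW` and the helper `romanize` are hypothetical names used only for illustration and are not part of unidecode.

```python
# Hypothetical illustration; only the per-letter values come from the diff.
OLD = {"\u05da": "k", "\u05d4": "h", "\u05d7": "kh"}  # before the commit
NEW = {"\u05da": "k", "\u05d4": "h", "\u05d7": "KH"}  # after the commit


def romanize(text, table):
    """Replace each character via `table`; leave unmapped characters as they are."""
    return "".join(table.get(ch, ch) for ch in text)


two_letters = "\u05da\u05d4"  # artificial pair: 0xda ('k') followed by 0xd4 ('h')
one_letter = "\u05d7"         # 0xd7, a single letter

# Old table: both inputs collapse to the same string "kh".
print(romanize(two_letters, OLD), romanize(one_letter, OLD))
# New table: the single letter is now distinguishable as "KH".
print(romanize(two_letters, NEW), romanize(one_letter, NEW))
```

With the capitalized digraph, a reader can tell from the output alone whether one source letter or two produced it.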

0xa2
    Hebrew Bible punctuation mark, should be ignored.

0xc6
    Opposite Nun, same as 'n'.

0xba
    Holam Haser, vowel as 'o'.

0xbf
    Makaf Raphe, same as Makaf (0xbe).

0xc5
    Hebrew Bible punctuation mark, should be ignored.

0xc7
    Qamats Qatan, vowel as 'o'.

0xd0
    Aleph, sounds as AHA; must exist to make the string readable.
    Use capital A to distinguish it from '`' and from the 'a' vowel.

0xf5
    Split Vav, same as 'v'.

0xf6
    Opposite Nun, same as 'n'.

0xf7
    Small Kuf, same as 'q'.

Signed-off-by: Alon Bar-Lev <[email protected]>
alonbl committed Mar 10, 2018
1 parent 17aecae commit 81f938d
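
The hex values in the commit message (0xa2, 0xc6, and so on) are offsets into the table for U+0500 to U+05FF, so 0xd0 corresponds to U+05D0. Below is a minimal sketch of that indexing. It assumes a dict-based stand-in rather than the real table; `TABLE_05` and `lookup` are hypothetical names, and only a few post-commit entries copied from the diff are filled in.

```python
# Stand-in for a few post-commit entries of unidecode/x005.py (values from the diff).
TABLE_05 = {
    0xA2: "",    # ignored punctuation mark
    0xBE: "-",
    0xBF: "-",
    0xD0: "A",
    0xD1: "b",
    0xD7: "KH",
    0xE9: "SH",
}


def lookup(ch):
    """Split a code point into (section, offset) and look the offset up in section 0x05."""
    codepoint = ord(ch)
    section, offset = codepoint >> 8, codepoint & 0xFF
    if section != 0x05:
        return ch  # this sketch only covers one section
    return TABLE_05.get(offset, "[?]")  # "[?]" marks entries left out of this sketch


print("".join(lookup(c) for c in "\u05e9\u05d1\u05d7"))  # 0xe9, 0xd1, 0xd7 -> SHbKH
```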
Showing 1 changed file with 15 additions and 15 deletions.
30 changes: 15 additions & 15 deletions unidecode/x005.py
@@ -161,7 +161,7 @@
 '', # 0x9f
 '', # 0xa0
 '', # 0xa1
-'[?]', # 0xa2
+'', # 0xa2
 '', # 0xa3
 '', # 0xa4
 '', # 0xa5
@@ -185,20 +185,20 @@
 'a', # 0xb7
 'a', # 0xb8
 'o', # 0xb9
-'[?]', # 0xba
+'o', # 0xba
 'u', # 0xbb
 '\'', # 0xbc
 '', # 0xbd
 '-', # 0xbe
-'', # 0xbf
+'-', # 0xbf
 '|', # 0xc0
 '', # 0xc1
 '', # 0xc2
 ':', # 0xc3
 '', # 0xc4
-'[?]', # 0xc5
-'[?]', # 0xc6
-'[?]', # 0xc7
+'', # 0xc5
+'n', # 0xc6
+'o', # 0xc7
 '[?]', # 0xc8
 '[?]', # 0xc9
 '[?]', # 0xca
@@ -207,14 +207,14 @@
 '[?]', # 0xcd
 '[?]', # 0xce
 '[?]', # 0xcf
-'', # 0xd0
+'A', # 0xd0
 'b', # 0xd1
 'g', # 0xd2
 'd', # 0xd3
 'h', # 0xd4
 'v', # 0xd5
 'z', # 0xd6
-'kh', # 0xd7
+'KH', # 0xd7
 't', # 0xd8
 'y', # 0xd9
 'k', # 0xda
@@ -228,25 +228,25 @@
 '`', # 0xe2
 'p', # 0xe3
 'p', # 0xe4
-'ts', # 0xe5
-'ts', # 0xe6
+'TS', # 0xe5
+'TS', # 0xe6
 'q', # 0xe7
 'r', # 0xe8
-'sh', # 0xe9
+'SH', # 0xe9
 't', # 0xea
 '[?]', # 0xeb
 '[?]', # 0xec
 '[?]', # 0xed
 '[?]', # 0xee
 '[?]', # 0xef
 'V', # 0xf0
-'oy', # 0xf1
+'OY', # 0xf1
 'i', # 0xf2
 '\'', # 0xf3
 '"', # 0xf4
-'[?]', # 0xf5
-'[?]', # 0xf6
-'[?]', # 0xf7
+'v', # 0xf5
+'n', # 0xf6
+'q', # 0xf7
 '[?]', # 0xf8
 '[?]', # 0xf9
 '[?]', # 0xfa

3 comments on commit 81f938d

eyaler commented on 81f938d Jul 22, 2021

@alonbl reverse-nun is a punctuation mark and it should not be transliterated to n
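
One way to check this claim is Python's standard `unicodedata` module, which reports the Unicode name and general category of a character. A minimal sketch, assuming the reverse nun meant here is U+05C6 (offset 0xc6 in the diff); the variable names are illustrative:

```python
import unicodedata

reverse_nun = "\u05c6"   # offset 0xc6 in the diff
regular_nun = "\u05e0"   # the ordinary letter Nun, for comparison

# name() gives the official character name from the bundled Unicode database.
print(unicodedata.name(reverse_nun))
# category() gives the general category; values starting with "P" are punctuation,
# values starting with "L" are letters.
print(unicodedata.category(reverse_nun))
print(unicodedata.category(regular_nun))
```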

alonbl (Contributor, Author) commented on 81f938d Jul 23, 2021

> @alonbl reverse-nun is a punctuation mark and it should not be transliterated to n

Thanks for the notice, not that I thought anyone will actually use it :)

Feel free to submit a pull request as simple as:

- 'n',    # 0xc6
+ '',    # 0xc6

eyaler commented on 81f938d Jul 25, 2021

Originally I believed it should be marked as unknown (None) rather than ignored (''), but now I see that the paragraph sign ¶ is transliterated to P, so your choice seems to be consistent. However, I do believe that these choices could be an issue, and I would like to have an option to avoid replacing punctuation marks with regular letters. I opened an issue for the more general case.
Also, @alonbl, could you kindly refer me to why you added 05f5, 05f6, 05f7? I could not find these in the Unicode specification.
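
The question about 05f5 to 05f7 can be checked against the Unicode database bundled with Python. The sketch below assumes that an unassigned code point shows up with general category "Cn" and no name; `is_assigned` is a hypothetical helper, and the result depends on the Unicode version reported by `unicodedata.unidata_version`.

```python
import unicodedata


def is_assigned(codepoint):
    """Return True if the bundled Unicode database assigns a character to this code point."""
    return unicodedata.category(chr(codepoint)) != "Cn"  # "Cn" = not assigned


print("Unicode database version:", unicodedata.unidata_version)
for cp in (0x05F4, 0x05F5, 0x05F6, 0x05F7):
    # name() returns the supplied default (None) when the code point has no name.
    print(f"U+{cp:04X}", is_assigned(cp), unicodedata.name(chr(cp), None))
```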