wrong default mapping of some Romanian diacritics #37
Comments
Where is the input image, or something else that would demonstrate the problem?
The Romanian typographical convention is that the diacritic on s and t is a comma below, not a cedilla (Unicode also encodes them separately: the cedilla forms are in Latin Extended-A, the comma-below forms in Latin Extended-B). Ideally, any s or t with a diacritic under the -ron (Romanian) option should be mapped to the Latin Extended-B code points; that is, Tesseract's ron unicharset should contain no trace of [15e], [15f], [162] or [163], only [218]-[21b]. The wrong mapping appears everywhere once the -ron option is selected. Let me quote Unicode 10 (chapter 7) on this:
The same goes for ȘțȚ. So the -ron option means șțȚȘ [U+0218-U+021B] with no ambiguity, and should nowhere involve şŞŢţ [U+015E-U+015F][U+0162-U+0163].
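To check whether a given text (for example OCR output, or the contents of a unicharset read as UTF-8) still contains the cedilla forms, a small scan like the one below could help. It is only a sketch: `find_cedilla_forms` is a name made up for illustration, and it looks only for the four precomposed characters named in this issue, not for sequences built with combining marks.

```python
# Cedilla forms of s and t (Latin Extended-A) that should not appear
# in Romanian text, keyed by character, with a printable label.
CEDILLA_FORMS = {
    "\u015E": "U+015E",  # S with cedilla
    "\u015F": "U+015F",  # s with cedilla
    "\u0162": "U+0162",  # T with cedilla
    "\u0163": "U+0163",  # t with cedilla
}


def find_cedilla_forms(text):
    """Return (line_number, column, label) for every cedilla s/t in text.

    Line numbers and columns are 1-based.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in CEDILLA_FORMS:
                hits.append((line_no, col, CEDILLA_FORMS[ch]))
    return hits
```

An empty result means none of the four cedilla characters were found.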
This issue is not caused by Tesseract itself. It should be moved to another repo (not sure which one).
I think langdata_lstm is a good fit; let's transfer the issue there.
@latrau, so each of the wrong characters should be replaced? Do you want to send a pull request that fixes ron.training_text, and maybe also ron.singles_text and ron.wordlist?
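A replacement pass over those three files could look roughly like this. It is a sketch under assumptions, not a tested fix for the repo: the helper names `fix_file` and `fix_langdata` are hypothetical, the files are assumed to be UTF-8 and to sit together in one directory, and only the four precomposed cedilla characters from this issue are rewritten.

```python
from pathlib import Path

# Cedilla forms -> comma-below forms, the four pairs from this issue.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # S
    "\u015F": "\u0219",  # s
    "\u0162": "\u021A",  # T
    "\u0163": "\u021B",  # t
})


def fix_file(path):
    """Rewrite one UTF-8 file in place; return True if it changed."""
    original = path.read_text(encoding="utf-8")
    fixed = original.translate(CEDILLA_TO_COMMA)
    if fixed == original:
        return False
    path.write_text(fixed, encoding="utf-8")
    return True


def fix_langdata(ron_dir,
                 names=("ron.training_text", "ron.singles_text", "ron.wordlist")):
    """Fix the named files under ron_dir; return the names that changed.

    Files that do not exist are skipped silently.
    """
    changed = []
    for name in names:
        path = Path(ron_dir) / name
        if path.is_file() and fix_file(path):
            changed.append(name)
    return changed
```

One thing to watch: if a word list already contains both spellings of a word, the replacement leaves duplicate entries, so a de-duplication step may be needed afterwards.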
@latrau, were the cedilla forms used in historic Romanian texts? If so, it might be a good idea to keep both forms (with cedilla for the historic characters and with comma for the modern ones).
Environment
Debian Linux
Tesseract Version: tesseract 4.00.00alpha
Platform: Linux 4.15.0 SMP PREEMPT 2018 x86_64 GNU/Linux
Current Behavior:
Using the ron option (Romanian):
Romanian diacritics șȘțȚ are mapped to the wrong Unicode code points, namely:
Ș -> Ş=U+015E
ș -> ş=U+015F
Ț -> Ţ=U+0162
ț -> ţ=U+0163
Expected Behavior:
Ș -> Ș=U+0218
ș -> ș=U+0219
Ț -> Ț=U+021A
ț -> ț=U+021B
Suggested Fix:
Edit the map accordingly.
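The expected mapping above can be written down as a translation table. The following is a minimal sketch in Python, not the actual change to Tesseract or its language data: `fix_romanian_diacritics` is a hypothetical helper name, and it covers only the four precomposed characters listed in this issue.

```python
# Wrong (cedilla) code point -> expected (comma below) code point.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # Ş -> Ș
    "\u015F": "\u0219",  # ş -> ș
    "\u0162": "\u021A",  # Ţ -> Ț
    "\u0163": "\u021B",  # ţ -> ț
})


def fix_romanian_diacritics(text):
    """Replace cedilla s/t with the comma-below forms; leave the rest as is."""
    return text.translate(CEDILLA_TO_COMMA)
```

Applied to OCR output as a post-processing step, this would work around the problem until the language data itself is corrected.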